What Is a Voice Provider?
A voice provider wraps a speech service behind a simple interface. It can do one or more of three things:
- Speak: convert text to audio (text-to-speech / TTS)
- Listen: convert audio to text (speech-to-text / STT)
- Realtime: bidirectional audio streaming over a WebSocket (speech-to-speech)
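One way to picture the three capabilities is as optional methods on a single interface. The sketch below is an illustration only: `VoiceProviderSketch`, `capabilitiesOf`, `connect`, and the method signatures are hypothetical names chosen for this example and are not taken from the Smithers API.

```typescript
// Hypothetical shape of a voice provider; not the actual Smithers types.
type Audio = Uint8Array;

interface VoiceProviderSketch {
  name: string;
  // Speak: text in, audio out (TTS).
  speak?: (text: string) => Promise<Audio>;
  // Listen: audio in, text out (STT).
  listen?: (audio: Audio) => Promise<string>;
  // Realtime: open a bidirectional streaming session (speech-to-speech).
  connect?: () => Promise<void>;
}

type Capability = "speak" | "listen" | "realtime";

// Report which of the three capabilities a provider implements.
function capabilitiesOf(provider: VoiceProviderSketch): Capability[] {
  const caps: Capability[] = [];
  if (provider.speak) caps.push("speak");
  if (provider.listen) caps.push("listen");
  if (provider.connect) caps.push("realtime");
  return caps;
}

// A stand-in provider that can only speak. It returns placeholder audio
// (one zero byte per character) instead of calling a speech service.
const ttsOnly: VoiceProviderSketch = {
  name: "tts-only",
  speak: async (text) => new Uint8Array(text.length),
};
```

Modeling each capability as optional lets a provider implement only what its underlying service supports, which is the "one or more of three things" idea from the list above.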
The <Voice> Component
Wrap a subtree with <Voice> and every task inside inherits that voice provider.
The <Voice> component does not execute anything itself. It annotates the tasks beneath it, the same way <Worktree> annotates tasks with a filesystem path or <Parallel> annotates them with concurrency limits.
Tasks inside a <Voice> scope receive voice and voiceSpeaker on their descriptors. The engine uses these to call voice.listen() when the task needs audio input or voice.speak() when it produces audio output.
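The annotation step can be sketched as a tree walk that copies the enclosing scope onto each task descriptor. This is a minimal sketch under stated assumptions: the `TreeNode` shape, the `describeTasks` function, the string stand-in for the provider, and the rule that the nearest enclosing scope wins are all hypothetical. Only the descriptor field names `voice` and `voiceSpeaker` come from the text above.

```typescript
// Hypothetical tree and descriptor shapes; the real Smithers types may differ.
type TreeNode =
  | { kind: "task"; id: string }
  | { kind: "group"; children: TreeNode[] }
  | { kind: "voice"; provider: string; speaker: string; children: TreeNode[] };

interface VoiceScope {
  provider: string;
  speaker: string;
}

interface TaskDescriptor {
  id: string;
  voice?: string; // stand-in for the provider object
  voiceSpeaker?: string;
}

// Walk the tree and annotate each task with the voice scope around it.
// The voice node runs nothing itself; it only changes the scope that is
// passed down to its children.
function describeTasks(node: TreeNode, scope?: VoiceScope): TaskDescriptor[] {
  if (node.kind === "task") {
    return [
      scope
        ? { id: node.id, voice: scope.provider, voiceSpeaker: scope.speaker }
        : { id: node.id },
    ];
  }
  // Assumption: the nearest enclosing voice scope replaces any outer one.
  const inner =
    node.kind === "voice"
      ? { provider: node.provider, speaker: node.speaker }
      : scope;
  const out: TaskDescriptor[] = [];
  for (const child of node.children) {
    out.push(...describeTasks(child, inner));
  }
  return out;
}

// One task outside any voice scope, one task inside it.
const tree: TreeNode = {
  kind: "group",
  children: [
    { kind: "task", id: "plan" },
    {
      kind: "voice",
      provider: "tts-stub",
      speaker: "narrator",
      children: [{ kind: "task", id: "announce" }],
    },
  ],
};

const descriptors = describeTasks(tree);
```

In this sketch only the `announce` task ends up with `voice` and `voiceSpeaker` set. The `plan` task sits outside the scope and its descriptor carries neither field.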
Batch vs Realtime
There are two fundamentally different modes, and batch is what most people need.
Batch: send a blob of text, get a blob of audio back (or vice versa). One request, one response. The AI SDK’s experimental_generateSpeech and experimental_transcribe handle this. It works with OpenAI, ElevenLabs, Deepgram, and any other provider the AI SDK supports.
Realtime: open a persistent WebSocket, stream audio in both directions simultaneously. OpenAI’s Realtime API does this. Latency is low, but the protocol is more complex. Smithers provides createOpenAIRealtimeVoice() for this case because the AI SDK does not cover it.
Most workflows should start with batch. Reach for realtime only when you need live conversation.
Composite Voice
What if you want Deepgram for transcription but ElevenLabs for speech? Composite voice mixes providers.
When the engine calls voice.listen(), the composite routes the call to Deepgram. When it calls voice.speak(), the call goes to ElevenLabs. If you also set a realtime provider, it takes priority for both operations.
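The routing rule can be sketched without any real provider. The following is an illustration under assumptions: `CompositeConfig`, its `listen` / `speak` / `realtime` keys, `compositeSketch`, and the stub providers are hypothetical and do not reflect the actual Smithers composite API or the Deepgram and ElevenLabs SDKs.

```typescript
// Hypothetical types for illustration; not the actual Smithers API.
type Audio = Uint8Array;
type SpeakFn = (text: string) => Promise<Audio>;
type ListenFn = (audio: Audio) => Promise<string>;

interface CompositeConfig {
  listen?: { name: string; listen: ListenFn };
  speak?: { name: string; speak: SpeakFn };
  realtime?: { name: string; listen: ListenFn; speak: SpeakFn };
}

// The realtime provider, when set, takes priority for both operations.
function listenTarget(config: CompositeConfig) {
  return config.realtime ?? config.listen;
}

function speakTarget(config: CompositeConfig) {
  return config.realtime ?? config.speak;
}

// Build a single voice object that delegates each call to its target.
function compositeSketch(config: CompositeConfig) {
  return {
    listen(audio: Audio): Promise<string> {
      const target = listenTarget(config);
      return target
        ? target.listen(audio)
        : Promise.reject(new Error("no provider configured for listen"));
    },
    speak(text: string): Promise<Audio> {
      const target = speakTarget(config);
      return target
        ? target.speak(text)
        : Promise.reject(new Error("no provider configured for speak"));
    },
  };
}

// Stand-in providers that make no network calls.
const sttStub = {
  name: "stt-stub",
  listen: async (_audio: Audio) => "hello",
};
const ttsStub = {
  name: "tts-stub",
  speak: async (text: string) => new Uint8Array(text.length),
};

const voice = compositeSketch({ listen: sttStub, speak: ttsStub });
```

Keeping the priority rule in two small functions makes it easy to check in isolation: with only `listen` and `speak` set, each operation goes to its own provider, and adding `realtime` redirects both.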
Effect Service Layer
For power users who build with Effect.ts directly, voice exposes an Effect service.
The VoiceService tag lets you inject a voice provider into any Effect pipeline. The speak() and listen() functions pull it from context automatically.
Events and Observability
Voice operations emit structured events:
- VoiceStarted: a voice operation began (speak or listen)
- VoiceFinished: it completed successfully
- VoiceError: something went wrong
The smithers.voice.operations_total counter and the smithers.voice.duration_ms histogram track volume and latency, respectively.
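To show how the events and metrics relate to a single operation, here is a minimal sketch of an instrumented wrapper. Everything except the three event names is an assumption: the payload fields, the `withVoiceEvents` helper, the in-memory `events` and `metrics` stores, and the choice to count an operation when it starts and record its duration on both success and failure are hypothetical, not the Smithers implementation.

```typescript
// Hypothetical event payloads; only the event names come from the docs.
type VoiceOp = "speak" | "listen";

type VoiceEvent =
  | { type: "VoiceStarted"; operation: VoiceOp }
  | { type: "VoiceFinished"; operation: VoiceOp; durationMs: number }
  | { type: "VoiceError"; operation: VoiceOp; message: string };

// In-memory stand-ins for an event bus and a metrics backend.
const events: VoiceEvent[] = [];
const metrics = {
  operationsTotal: 0, // stand-in for smithers.voice.operations_total
  durationsMs: [] as number[], // stand-in for smithers.voice.duration_ms
};

// Run one voice operation and record its lifecycle.
async function withVoiceEvents<T>(
  operation: VoiceOp,
  run: () => Promise<T>,
): Promise<T> {
  const startedAt = Date.now();
  events.push({ type: "VoiceStarted", operation });
  metrics.operationsTotal += 1; // assumption: counted at start
  try {
    const result = await run();
    events.push({
      type: "VoiceFinished",
      operation,
      durationMs: Date.now() - startedAt,
    });
    return result;
  } catch (err) {
    events.push({
      type: "VoiceError",
      operation,
      message: err instanceof Error ? err.message : String(err),
    });
    throw err; // the caller still sees the failure
  } finally {
    // Assumption: duration is recorded for successes and failures alike.
    metrics.durationsMs.push(Date.now() - startedAt);
  }
}
```

A call such as `withVoiceEvents("speak", () => provider.speak(text))`, where `provider` is any object with a `speak` method, would then produce a VoiceStarted event followed by either VoiceFinished or VoiceError.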