What Is a Voice Provider?
A voice provider wraps a speech service behind a simple interface. It can do one or more of three things:
- Speak — convert text to audio (text-to-speech / TTS)
- Listen — convert audio to text (speech-to-text / STT)
- Realtime — bidirectional audio streaming over a WebSocket (speech-to-speech)
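In TypeScript terms, a provider's shape is roughly the following (a sketch only; the real interface in smithers-orchestrator may name and type these methods differently):

```typescript
// Hypothetical shape of a voice provider; illustrative only.
interface VoiceProvider {
  // Speak: text in, audio out (batch TTS).
  speak(text: string, options?: { speaker?: string; format?: string }): Promise<Uint8Array>
  // Listen: audio in, text out (batch STT).
  listen(audio: Uint8Array): Promise<string>
  // Realtime providers also manage a persistent WebSocket session.
  connect?(): Promise<void>
  close?(): Promise<void>
}
```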
The <Voice> Component
Wrap a subtree with <Voice> and every task inside inherits that voice provider.
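A sketch of the pattern (the <Task> element and its prompt prop are illustrative):

```tsx
const voice = createOpenAIRealtimeVoice()

// Every task in this subtree inherits the provider and speaker.
const workflow = (
  <Voice provider={voice} speaker="coral">
    <Task prompt="Greet the caller and ask what they need." />
  </Voice>
)
```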
The <Voice> component does not execute anything itself. It annotates the tasks beneath it, the same way <Worktree> annotates tasks with a filesystem path or <Parallel> annotates them with concurrency limits.
Tasks inside a <Voice> scope receive voice and voiceSpeaker on their descriptors. The engine uses these to call voice.listen() when the task needs audio input or voice.speak() when it produces audio output.
Batch vs Realtime
There are two fundamentally different modes, and batch is what most people need. Batch: send a blob of text, get a blob of audio back (or vice versa). One request, one response. The AI SDK’s experimental_generateSpeech and experimental_transcribe handle this, and they work with OpenAI, ElevenLabs, Deepgram, and others — any provider the AI SDK supports.
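For example, a batch TTS call through the AI SDK might look like this (the model choice is illustrative):

```typescript
import { experimental_generateSpeech as generateSpeech } from "ai"
import { openai } from "@ai-sdk/openai"

// One request, one response: text in, an audio blob out.
const { audio } = await generateSpeech({
  model: openai.speech("tts-1"),
  text: "Your build finished successfully.",
})
```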
Realtime: open a persistent WebSocket, stream audio in both directions simultaneously. OpenAI’s Realtime API does this. Latency is low, but the protocol is more complex. Smithers provides createOpenAIRealtimeVoice() for this case because the AI SDK does not cover it.
Most workflows should start with batch. Reach for realtime only when you need live conversation.
Composite Voice
What if you want Deepgram for transcription but ElevenLabs for speech? Composite voice mixes providers. When a task calls voice.listen(), it routes to Deepgram. When it calls voice.speak(), it routes to ElevenLabs. If you also set a realtime provider, it takes priority for both operations.
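A sketch, assuming a composite factory that takes separate input (STT) and output (TTS) providers; the actual factory name and option keys may differ:

```typescript
// Hypothetical factory and option names.
const voice = createCompositeVoice({
  input: deepgramVoice,    // voice.listen() routes here
  output: elevenLabsVoice, // voice.speak() routes here
  // realtime: realtimeVoice, // if set, takes priority for both operations
})
```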
Effect Service Layer
For power users who build with Effect.ts directly, voice exposes an Effect service. The VoiceService tag lets you inject a voice provider into any Effect pipeline, and the speak() and listen() functions pull it from context automatically.
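A sketch of the idea (the import path is assumed):

```typescript
import { Effect } from "effect"
import { speak } from "smithers-orchestrator/voice" // assumed export path

const program = Effect.gen(function* () {
  // speak() resolves the provider from context via the VoiceService tag.
  yield* speak("Deployment complete.")
})
```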
For scoped lifecycle management (automatic connect() and close()), use createVoiceServiceLayer().
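A sketch, assuming the layer factory wraps a provider instance:

```typescript
import { Effect } from "effect"

// Build the layer from a provider instance (names assumed).
const VoiceLive = createVoiceServiceLayer(createOpenAIRealtimeVoice())

// Providing the layer scopes the provider's lifecycle to the program.
await Effect.runPromise(program.pipe(Effect.provide(VoiceLive)))
```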
The layer calls connect() when the scope opens and close() when it closes.
Listing Available Speakers
Every voice provider exposes getSpeakers(), which returns the list of voices that provider supports.
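For example (assuming getSpeakers() resolves to an array of speaker IDs):

```typescript
const speakers = await voice.getSpeakers()
console.log(speakers) // e.g. ["alloy", "ash", "ballad", ...]
```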
For the OpenAI Realtime provider, the speakers are alloy, ash, ballad, coral, echo, sage, shimmer, and verse.
For composite voice, getSpeakers() delegates to the realtime provider if one is set, otherwise to the output (TTS) provider. If neither is configured, it returns an empty array.
Updating Voice Config at Runtime
You can update voice session parameters after initialization without reconnecting. Call updateConfig with any session-level settings the provider understands.
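For example, with the OpenAI Realtime provider (the exact fields depend on the session schema the provider accepts; these are illustrative):

```typescript
// Adjust session-level settings mid-session; no reconnect required.
voice.updateConfig({
  instructions: "Respond concisely and in a friendly tone.",
  turn_detection: { type: "server_vad" },
})
```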
updateConfig sends a session.update event over the existing WebSocket. Changes take effect for subsequent interactions in the same session. For composite voice, updateConfig delegates to the realtime provider.
Manually Triggering a Realtime Response
In realtime (speech-to-speech) mode, OpenAI’s server can detect speech automatically. But you can also trigger a response explicitly with answer().
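For example (the option shown is an illustrative response property):

```typescript
// Trigger a response now instead of waiting for voice activity detection.
voice.answer({ instructions: "Recap the caller's last request." })
```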
answer() sends a response.create event to the WebSocket. Any options you pass are forwarded as response properties. Call it when you want the model to respond immediately without waiting for voice activity detection.
Overriding the WebSocket URL
The default WebSocket endpoint for OpenAI Realtime is wss://api.openai.com/v1/realtime. Override it with the url config option.
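A sketch (the proxy hostname is illustrative):

```typescript
const voice = createOpenAIRealtimeVoice({
  url: "wss://my-proxy.example.com/realtime",
})
```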
The provider still appends the model query parameter (?model=...), so the full connection URL becomes wss://my-proxy.example.com/realtime?model=gpt-4o-mini-realtime-preview-2024-12-17. Use this for proxies, local development stubs, or alternative endpoints.
Configuring the Transcription Model
By default, the OpenAI Realtime provider transcribes incoming audio with whisper-1. Change the transcription model with the transcriber config option.
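For example (the replacement model ID is illustrative):

```typescript
const voice = createOpenAIRealtimeVoice({
  transcriber: "gpt-4o-mini-transcribe",
})
```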
The transcriber setting is sent in the session.update call immediately after connection. It controls how the realtime API transcribes user audio via the input_audio_transcription session property.
Audio Format Support
When calling speak(), you can request a specific audio format via the format option:
| Format | Description |
|---|---|
| mp3 | MPEG Layer 3 — widely compatible, lossy |
| wav | Waveform Audio — uncompressed, lossless |
| pcm | Raw PCM — no header, lowest overhead |
| opus | Opus codec — low latency, good for streaming |
| flac | Free Lossless Audio Codec |
| aac | Advanced Audio Coding — good compression |
The AudioFormat type is exported from smithers-orchestrator/voice for type-safe usage.
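For example (assuming speak() accepts the format option as described above):

```typescript
import type { AudioFormat } from "smithers-orchestrator/voice"

// Request Opus output, a good fit for streaming per the table above.
const format: AudioFormat = "opus"
const audio = await voice.speak("Hello!", { format })
```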
Provider-Level Event Callbacks
Realtime voice providers emit events that you can subscribe to with on() and unsubscribe from with off():
| Event | Payload | When |
|---|---|---|
| speaking | { audio, response_id } | Each chunk of audio output from the model |
| writing | { text, role, response_id } | Each chunk of text transcription |
| error | { message, code?, details? } | A provider-level error occurred |
| speaker | ReadableStream | A new audio response stream was created |
These provider-level events are distinct from the workflow events (VoiceStarted, VoiceFinished, VoiceError), which track operation lifecycle at the workflow level.
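For example (the handler shape follows the payload column above):

```typescript
// Log transcription chunks as they stream in.
const onWriting = ({ text, role }: { text: string; role: string }) => {
  console.log(`[${role}] ${text}`)
}

voice.on("writing", onWriting)
// ...later, stop listening:
voice.off("writing", onWriting)
```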
Default Speaker Selection
If you don’t specify a speaker prop on <Voice> or a speaker option in the provider config, the default depends on the provider:
- OpenAI Realtime: defaults to "alloy"
- AI SDK Voice: no default — you must pass a speaker via SpeakOptions or the provider config, or the underlying model’s default is used
- Composite Voice: delegates to whichever sub-provider handles the operation
You can set the speaker at three levels:
- Per-call: voice.speak("text", { speaker: "shimmer" })
- Per-component: <Voice provider={voice} speaker="coral">
- Per-provider: createOpenAIRealtimeVoice({ speaker: "echo" })
OpenAI Realtime: API Key and Environment Fallback
The OpenAI Realtime provider resolves API keys in this order:
- The apiKey config option passed to createOpenAIRealtimeVoice()
- The OPENAI_API_KEY environment variable

If neither is set, connect() throws an error.
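For example (the environment variable holding your key is illustrative):

```typescript
// An explicit key wins over the OPENAI_API_KEY fallback.
const voice = createOpenAIRealtimeVoice({ apiKey: process.env.VOICE_API_KEY })
```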
OpenAI Realtime: Model Override
Override the realtime model with the model config option.
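For example (any realtime-capable model ID works here; this one is illustrative):

```typescript
const voice = createOpenAIRealtimeVoice({
  model: "gpt-4o-realtime-preview-2024-12-17",
})
```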
The default is gpt-4o-mini-realtime-preview-2024-12-17. The model name is appended as a query parameter to the WebSocket URL.
OpenAI Realtime: Session Management
The OpenAI Realtime provider manages WebSocket session lifecycle automatically:
- connect() opens a WebSocket, waits for the session.created event, then sends an initial session.update to configure the transcription model and default voice.
- While connected, any calls to send(), speak(), listen(), or answer() use the active session.
- close() tears down the connection, cleans up speaker streams, and releases resources.
You never need to wait for session.created yourself — connect() returns only after the session is fully initialized.
If you call connect() while already connected, it returns immediately. Concurrent calls to connect() are deduplicated — only one connection attempt runs at a time.
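A sketch of the typical lifecycle:

```typescript
const voice = createOpenAIRealtimeVoice()

await voice.connect() // resolves after session.created and the initial session.update
await voice.speak("Hello!")
await voice.close()   // tears down the socket and speaker streams
```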
Events and Observability
Voice operations emit structured events:
- VoiceStarted — a voice operation began (speak or listen)
- VoiceFinished — it completed successfully
- VoiceError — something went wrong
The smithers.voice.operations_total counter and smithers.voice.duration_ms histogram track volume and latency.