Your workflow orchestrates code reviews, generates reports, analyzes data — all in text. But some tasks start with an audio recording or need to produce spoken output. Maybe you have a meeting recording to transcribe and analyze, or you want your pipeline to read results aloud. That is what voice providers are for.

What Is a Voice Provider?

A voice provider wraps a speech service behind a simple interface. It can do one or more of three things:
  1. Speak — convert text to audio (text-to-speech / TTS)
  2. Listen — convert audio to text (speech-to-text / STT)
  3. Realtime — bidirectional audio streaming over a WebSocket (speech-to-speech)
You pick the provider, configure it once, and hand it to your tasks. Smithers handles the wiring.
import { createAiSdkVoice } from "smithers-orchestrator/voice";
import { openai } from "@ai-sdk/openai";

const voice = createAiSdkVoice({
  speechModel: openai.speech("tts-1"),
  transcriptionModel: openai.transcription("whisper-1"),
});
That single object now speaks and listens. The AI SDK handles the actual API calls; Smithers gives you the integration layer.
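As a quick sketch of using it directly (reading a local file here is illustrative, and the exact audio input and return types depend on the provider):
import { createReadStream } from "node:fs";

// listen() takes an audio source and returns the transcript text.
const transcript = await voice.listen(createReadStream("./meeting.wav"));

// speak() converts text to audio you can persist or stream onward.
const audio = await voice.speak(`Transcript captured: ${transcript.length} characters`);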

The <Voice> Component

Wrap a subtree with <Voice> and every task inside inherits that voice provider:
<Voice provider={voice} speaker="alloy">
  <Task id="transcribe" output={outputs.transcript} agent={myAgent}>
    Transcribe the uploaded audio file.
  </Task>
  <Task id="summarize" output={outputs.summary} agent={myAgent}>
    Summarize the transcript.
  </Task>
</Voice>
The <Voice> component does not execute anything itself. It annotates the tasks beneath it, the same way <Worktree> annotates tasks with a filesystem path or <Parallel> annotates them with concurrency limits. Tasks inside a <Voice> scope receive voice and voiceSpeaker on their descriptors. The engine uses these to call voice.listen() when the task needs audio input or voice.speak() when it produces audio output.
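Conceptually, the annotation amounts to something like this (a rough sketch: voice and voiceSpeaker are the documented fields; the VoiceProvider type name and the id field are assumptions for illustration):
// Sketch of what a task descriptor carries inside a <Voice> scope.
interface AnnotatedTaskDescriptor {
  id: string;                // e.g. "transcribe" (assumed field)
  voice?: VoiceProvider;     // provider inherited from the enclosing <Voice>
  voiceSpeaker?: string;     // speaker prop from <Voice>, e.g. "alloy"
}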

Batch vs Realtime

These are two fundamentally different modes, and batch is what most people need.
  • Batch: send a blob of text, get a blob of audio back (or vice versa). One request, one response. The AI SDK’s experimental_generateSpeech and experimental_transcribe handle this, and it works with OpenAI, ElevenLabs, Deepgram, and any other provider the AI SDK supports.
  • Realtime: open a persistent WebSocket and stream audio in both directions simultaneously. OpenAI’s Realtime API does this. Latency is low, but the protocol is more complex. Smithers provides createOpenAIRealtimeVoice() for this case because the AI SDK does not cover it.
Most workflows should start with batch. Reach for realtime only when you need live conversation.
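A minimal sketch of the two setups, using the model names shown elsewhere on this page (treat the exact options as illustrative):
// Batch: one request, one response, via the AI SDK.
const batchVoice = createAiSdkVoice({
  speechModel: openai.speech("tts-1"),
  transcriptionModel: openai.transcription("whisper-1"),
});
const report = await batchVoice.speak("Report finished.");

// Realtime: a persistent WebSocket session you connect and close explicitly.
const realtimeVoice = createOpenAIRealtimeVoice({ speaker: "alloy" });
await realtimeVoice.connect();
await realtimeVoice.send(audioChunk); // audioChunk is a placeholder for captured audio
realtimeVoice.close();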

Composite Voice

What if you want Deepgram for transcription but ElevenLabs for speech? Composite voice mixes providers:
import { createCompositeVoice, createAiSdkVoice } from "smithers-orchestrator/voice";

const listener = createAiSdkVoice({
  transcriptionModel: deepgram.transcription("nova-3"),
});
const speaker = createAiSdkVoice({
  speechModel: elevenlabs.speech("eleven_multilingual_v2"),
});

const voice = createCompositeVoice({
  input: listener,
  output: speaker,
});
When a task calls voice.listen(), it routes to Deepgram. When it calls voice.speak(), it routes to ElevenLabs. If you also set a realtime provider, it takes priority for both operations.
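The sketch below assumes the realtime provider is passed under a realtime key; the input and output names come from the example above, but the realtime option name is an assumption:
// Assumption: composite voice accepts a `realtime` provider alongside input/output.
const mixed = createCompositeVoice({
  input: listener,                         // Deepgram STT
  output: speaker,                         // ElevenLabs TTS
  realtime: createOpenAIRealtimeVoice(),   // would take priority for both operations
});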

Effect Service Layer

For power users who build with Effect.ts directly, voice exposes an Effect service:
import { VoiceService, speak, listen } from "smithers-orchestrator/voice";
import { Effect } from "effect";

const program = Effect.gen(function* () {
  const transcript = yield* listen(audioStream);
  const audio = yield* speak(`Summary: ${transcript}`);
  return audio;
}).pipe(Effect.provideService(VoiceService, myVoice));
The VoiceService tag lets you inject a voice provider into any Effect pipeline. The speak() and listen() functions pull it from context automatically. For scoped lifecycle management (automatic connect() and close()), use createVoiceServiceLayer():
import { createVoiceServiceLayer, speak } from "smithers-orchestrator/voice";
import { Effect, Layer } from "effect";

const voiceLayer = createVoiceServiceLayer(realtimeVoice);

const program = Effect.gen(function* () {
  const audio = yield* speak("Hello from Effect");
  return audio;
}).pipe(Effect.provide(voiceLayer));
The layer handles calling connect() when the scope opens and close() when it closes.
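Running such a program is plain Effect execution; for example:
// The layer connects before speak() runs and closes the session when the
// program's scope ends, so no manual connect()/close() calls are needed.
const audio = await Effect.runPromise(program);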

Listing Available Speakers

Every voice provider exposes getSpeakers(), which returns the list of voices that provider supports:
const speakers = await voice.getSpeakers();
// [{ voiceId: "alloy" }, { voiceId: "echo" }, ...]
For the OpenAI Realtime provider, this returns the eight built-in voices: alloy, ash, ballad, coral, echo, sage, shimmer, and verse. For composite voice, getSpeakers() delegates to the realtime provider if one is set, otherwise to the output (TTS) provider. If neither is configured, it returns an empty array.
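One practical use is validating a configured speaker before speaking (a small sketch; only getSpeakers(), the voiceId field, and the per-call speaker option are documented here):
const speakers = await voice.getSpeakers();

// Fall back to the first available voice if the preferred one is not offered.
const preferred = "coral";
const speakerId = speakers.some((s) => s.voiceId === preferred)
  ? preferred
  : speakers[0]?.voiceId;

const audio = await voice.speak("Build complete.", { speaker: speakerId });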

Updating Voice Config at Runtime

You can update voice session parameters after initialization without reconnecting. Call updateConfig with any session-level settings the provider understands:
voice.updateConfig({
  voice: "shimmer",
  turn_detection: { type: "server_vad" },
});
For the OpenAI Realtime provider, updateConfig sends a session.update event over the existing WebSocket. Changes take effect for subsequent interactions in the same session. For composite voice, updateConfig delegates to the realtime provider.

Manually Triggering a Realtime Response

In realtime (speech-to-speech) mode, OpenAI’s server can detect speech automatically. But you can also trigger a response explicitly with answer():
await voice.answer({
  modalities: ["audio"],
  instructions: "Summarize what was just said",
});
answer() sends a response.create event to the WebSocket. Any options you pass are forwarded as response properties. Call it when you want the model to respond immediately without waiting for voice activity detection.
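Combined with updateConfig, this lets you take over turn-taking entirely. A sketch, assuming that setting turn_detection to null disables server-side voice activity detection (as in OpenAI’s Realtime API):
// Disable server-side VAD so the model never responds on its own.
voice.updateConfig({ turn_detection: null });

// Stream audio in, then explicitly request a spoken answer.
await voice.send(recordedAudio); // recordedAudio is a placeholder for captured audio
await voice.answer({
  modalities: ["audio"],
  instructions: "Answer the question the user just asked.",
});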

Overriding the WebSocket URL

The default WebSocket endpoint for OpenAI Realtime is wss://api.openai.com/v1/realtime. Override it with the url config option:
const voice = createOpenAIRealtimeVoice({
  url: "wss://my-proxy.example.com/realtime",
  model: "gpt-4o-mini-realtime-preview-2024-12-17",
});
The model name is appended as a query parameter (?model=...), so the full connection URL becomes wss://my-proxy.example.com/realtime?model=gpt-4o-mini-realtime-preview-2024-12-17. Use this for proxies, local development stubs, or alternative endpoints.

Configuring the Transcription Model

By default, the OpenAI Realtime provider transcribes incoming audio with whisper-1. Change the transcription model with the transcriber config option:
const voice = createOpenAIRealtimeVoice({
  transcriber: "gpt-4o-transcribe",
});
The transcriber is sent to the server as part of the session.update call immediately after connection. It controls how the realtime API transcribes user audio for the input_audio_transcription session property.

Audio Format Support

When calling speak(), you can request a specific audio format via the format option:
const audio = await voice.speak("Hello, world", { format: "opus" });
Supported formats:
Format | Description
mp3    | MPEG Layer 3 — widely compatible, lossy
wav    | Waveform Audio — uncompressed, lossless
pcm    | Raw PCM — no header, lowest overhead
opus   | Opus codec — low latency, good for streaming
flac   | Free Lossless Audio Codec
aac    | Advanced Audio Coding — good compression
Not every provider supports every format; if the requested format is unsupported, the provider falls back to its default. The AudioFormat type is exported from smithers-orchestrator/voice for type-safe usage.
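Because AudioFormat is exported, format choices can stay type-safe in your own helpers (a small sketch; the speakAs wrapper is hypothetical):
import type { AudioFormat } from "smithers-orchestrator/voice";

// Thin wrapper that pins the formats this pipeline is willing to produce.
async function speakAs(text: string, format: AudioFormat) {
  return voice.speak(text, { format });
}

const streamable = await speakAs("Deploy finished.", "opus");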

Provider-Level Event Callbacks

Realtime voice providers emit events that you can subscribe to with on() and unsubscribe from with off():
const handler = (data) => console.log(data);

voice.on("speaking", handler);   // audio output chunks
voice.on("writing", handler);    // text transcription chunks
voice.on("error", handler);      // provider errors
voice.on("speaker", handler);    // new audio output stream

voice.off("speaking", handler);  // remove a listener
Event    | Payload                       | When
speaking | { audio, response_id }        | Each chunk of audio output from the model
writing  | { text, role, response_id }   | Each chunk of text transcription
error    | { message, code?, details? }  | A provider-level error occurred
speaker  | ReadableStream                | A new audio response stream was created
These are provider-level events on the voice instance. They are separate from the Smithers event bus events (VoiceStarted, VoiceFinished, VoiceError) which track operation lifecycle at the workflow level.
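For example, collecting the audio chunks of a single spoken response might look like this (a sketch; the concrete type of each audio chunk, e.g. a Buffer or base64 string, depends on the provider):
const audioChunks: unknown[] = [];

// Collect every audio chunk the model emits while we listen for "speaking".
const onSpeaking = (data: { audio: unknown; response_id: string }) => {
  audioChunks.push(data.audio);
};

voice.on("speaking", onSpeaking);
await voice.answer({ modalities: ["audio"], instructions: "Read the summary aloud." });
voice.off("speaking", onSpeaking); // stop collecting once the response is handled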

Default Speaker Selection

If you don’t specify a speaker prop on <Voice> or a speaker option in the provider config, the default depends on the provider:
  • OpenAI Realtime: defaults to "alloy"
  • AI SDK Voice: no default of its own; pass a speaker via SpeakOptions or the provider config, or the underlying model’s default is used
  • Composite Voice: delegates to whichever sub-provider handles the operation
You can override the speaker at three levels (highest priority first):
  1. Per-call: voice.speak("text", { speaker: "shimmer" })
  2. Per-component: <Voice provider={voice} speaker="coral">
  3. Per-provider: createOpenAIRealtimeVoice({ speaker: "echo" })

OpenAI Realtime: API Key and Environment Fallback

The OpenAI Realtime provider resolves API keys in this order:
  1. The apiKey config option passed to createOpenAIRealtimeVoice()
  2. The OPENAI_API_KEY environment variable
// Explicit key
const voice = createOpenAIRealtimeVoice({ apiKey: "sk-..." });

// Or rely on the environment variable — no config needed
const voice = createOpenAIRealtimeVoice();
// Uses process.env.OPENAI_API_KEY automatically
If neither is set, connect() throws an error.

OpenAI Realtime: Model Override

Override the realtime model with the model config option:
const voice = createOpenAIRealtimeVoice({
  model: "gpt-4o-realtime-preview",
});
The default is gpt-4o-mini-realtime-preview-2024-12-17. The model name is appended as a query parameter to the WebSocket URL.

OpenAI Realtime: Session Management

The OpenAI Realtime provider manages WebSocket session lifecycle automatically:
  1. connect() opens a WebSocket, waits for the session.created event, then sends an initial session.update to configure the transcription model and default voice.
  2. While connected, any calls to send(), speak(), listen(), or answer() use the active session.
  3. close() tears down the connection, cleans up speaker streams, and releases resources.
Messages sent before the session is ready are automatically queued and flushed once the connection opens. You don’t need to wait for session.created yourself — connect() returns only after the session is fully initialized.
const voice = createOpenAIRealtimeVoice({ speaker: "coral" });

await voice.connect();    // waits for session.created + session.update
await voice.send(audio);  // uses the active session
voice.close();            // tears down cleanly
If you call connect() while already connected, it returns immediately. Concurrent calls to connect() are deduplicated — only one connection attempt runs at a time.
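Because connect() is idempotent and deduplicated, it is safe to call from multiple code paths without coordination:
// Both calls resolve once the same underlying session is ready.
await Promise.all([voice.connect(), voice.connect()]);

// A later call on an already-connected instance returns immediately.
await voice.connect();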

Events and Observability

Voice operations emit structured events:
  • VoiceStarted — a voice operation began (speak or listen)
  • VoiceFinished — it completed successfully
  • VoiceError — something went wrong
These flow through the same event bus as all other Smithers events. The smithers.voice.operations_total counter and smithers.voice.duration_ms histogram track volume and latency.
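How you consume these depends on how you run the workflow; as a purely illustrative sketch, with a hypothetical subscribe API (only the event names and metrics are documented):
// Hypothetical subscription API; only the event names come from this page.
orchestrator.events.subscribe((event: { type: string }) => {
  switch (event.type) {
    case "VoiceStarted":
      console.log("voice operation started", event);
      break;
    case "VoiceFinished":
      console.log("voice operation finished", event);
      break;
    case "VoiceError":
      console.error("voice operation failed", event);
      break;
  }
});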