Chapter 16: Voice Agents #
"The next interface is not a screen -- it is a conversation."
Text-based agents are powerful, but many real-world applications demand voice interaction: customer service hotlines, in-car assistants, accessibility tools, hands-free operation in industrial settings, and smart home devices. Building a voice agent traditionally requires stitching together separate speech-to-text (STT), language model, and text-to-speech (TTS) services, each with its own API, audio format, and error handling.
Neam makes voice a first-class construct. With the voice and realtime_voice
declarations, you define complete voice pipelines in a few lines. The runtime handles
audio encoding, API calls, format conversion, and event streaming. This chapter covers
both batch pipelines (process an audio file end-to-end) and real-time streaming
(full-duplex WebSocket voice with sub-second latency).
16.1 Voice Agent Architecture #
A voice agent consists of three stages connected in series:
STT (Speech-to-Text): Converts audio into a text transcription.
Agent (LLM): Processes the transcribed text and generates a response. This is a standard Neam agent -- it can use tools, knowledge bases, guardrails, and all other agent features.
TTS (Text-to-Speech): Converts the agent's text response into audio.
The beauty of this architecture is that each stage is independently configurable. You can use OpenAI Whisper for STT, Anthropic Claude for the agent, and a local Kokoro instance for TTS -- all in the same pipeline.
16.2 Batch Voice Pipelines #
Batch pipelines process audio files through the STT -> Agent -> TTS flow. They are ideal for processing recorded audio, building voice message systems, or testing voice interactions without a live microphone.
Defining a Batch Pipeline #
// Step 1: Define the agent
agent assistant {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a helpful voice assistant. Keep responses brief and conversational."
}
// Step 2: Define the voice pipeline
voice my_pipeline {
agent: assistant
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
The voice declaration connects an existing agent to STT and TTS providers. The agent
itself is unchanged -- it still processes text. The voice pipeline wraps it with audio
I/O.
Voice Pipeline Fields #
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| agent | identifier | Yes | -- | The agent to process transcribed text |
| stt_provider | string | No | "whisper" | STT provider name |
| stt_model | string | No | "whisper-1" | STT model identifier |
| stt_endpoint | string | No | -- | Custom endpoint URL (local providers) |
| stt_language | string | No | -- | ISO-639-1 language code (e.g., "en", "es") |
| stt_format | string | No | "json" | Response format |
| tts_provider | string | No | "openai" | TTS provider name |
| tts_model | string | No | "tts-1" | TTS model identifier |
| tts_voice | string | No | "alloy" | Voice name |
| tts_endpoint | string | No | -- | Custom endpoint URL (local providers) |
| tts_format | string | No | "mp3" | Output audio format |
| tts_speed | string | No | "1.0" | Playback speed (0.25-4.0, OpenAI only) |
| tts_instructions | string | No | -- | Tone/style instructions (gpt-4o-mini-tts only) |
16.3 Voice Native Functions #
Neam provides three native functions for batch voice pipelines:
voice_transcribe(pipeline, audio_path) #
Transcribes an audio file to text using the pipeline's STT provider.
{
let text = voice_transcribe("my_pipeline", "/tmp/input.wav");
emit "You said: " + text;
}
Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- audio_path (string): Path to the input audio file (WAV, MP3, FLAC, etc.).
Returns: A string containing the transcription.
voice_synthesize(pipeline, text, output_path) #
Synthesizes text into an audio file using the pipeline's TTS provider.
{
let audio_path = voice_synthesize("my_pipeline", "Hello there!", "/tmp/greeting.mp3");
emit "Audio saved to: " + audio_path;
}
Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- text (string): The text to synthesize.
- output_path (string): Path where the output audio file will be saved.
Returns: A string containing the output file path.
voice_pipeline_run(pipeline, input_path, output_path) #
Runs the complete STT -> Agent -> TTS pipeline on an audio file.
{
let result = voice_pipeline_run("my_pipeline", "/tmp/input.wav", "/tmp/output.mp3");
emit "Input: " + result["input_text"];
emit "Response: " + result["response_text"];
emit "Audio: " + result["output_audio"];
}
Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- input_path (string): Path to the input audio file.
- output_path (string): Path where the output audio file will be saved.
Returns: A map with three keys:
- "input_text": The transcribed user speech.
- "response_text": The agent's text response.
- "output_audio": The path to the generated audio file.
16.4 STT Providers #
Neam supports three speech-to-text providers.
OpenAI Whisper (Cloud) #
The default STT provider. Uses OpenAI's Whisper API.
voice whisper_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
stt_language: "en"
stt_format: "json"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
Available models:
| Model | Cost | Notes |
|---|---|---|
| whisper-1 | $0.006/min | Standard model, 25 MB file limit |
| gpt-4o-transcribe | $0.006/min | Lower word error rate, fewer hallucinations |
| gpt-4o-mini-transcribe | $0.003/min | Half the cost of standard |
Supported formats: WAV, MP3, M4A, FLAC, OGG, WebM (max 25 MB).
Requires: OPENAI_API_KEY environment variable.
Gemini STT (Cloud) #
Uses Google's Gemini models for audio understanding via the generateContent API.
voice gemini_stt_pipeline {
agent: my_agent
stt_provider: "gemini"
stt_model: "gemini-2.0-flash"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
Gemini handles audio natively -- it does not use a separate STT model. The audio is sent as part of the multimodal input, and Gemini produces a text transcription.
Supported formats: WAV, MP3, AIFF, AAC, OGG, FLAC (max 20 MB inline).
Requires: GEMINI_API_KEY environment variable.
Local whisper.cpp #
Uses a locally running whisper.cpp server with an OpenAI-compatible API. No cloud dependency, no API key, and your audio data never leaves your machine.
voice local_stt_pipeline {
agent: my_agent
stt_provider: "whisper-local"
stt_model: "base.en"
stt_endpoint: "http://localhost:8080"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
Setting up whisper.cpp:
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp
cmake -B build && cmake --build build --parallel
# Download a model
sh ./models/download-ggml-model.sh base.en
# Start the server
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080
Available models:
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny.en | 75 MB | Fastest | Good for short commands |
| base.en | 142 MB | Fast | Good general purpose |
| small.en | 466 MB | Medium | Better accuracy |
| medium.en | 1.5 GB | Slow | High accuracy |
| large-v3 | 3.1 GB | Slowest | Best accuracy, multilingual |
16.5 TTS Providers #
Neam supports five text-to-speech providers.
OpenAI TTS (Cloud) #
voice openai_tts_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "nova"
tts_format: "mp3"
tts_speed: "1.0"
}
Models:
| Model | Cost | Notes |
|---|---|---|
| tts-1 | $15/1M chars | Low latency, good for real-time |
| tts-1-hd | $30/1M chars | Higher audio quality |
| gpt-4o-mini-tts | ~$0.015/min | Best quality, supports tts_instructions |
Voices: alloy, ash, ballad, cedar, coral, echo, fable, marin, nova, onyx, sage, shimmer, verse
Output formats: MP3, Opus, AAC, FLAC, WAV, PCM
The gpt-4o-mini-tts model supports a tts_instructions field for controlling tone
and style:
voice expressive_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "gpt-4o-mini-tts"
tts_voice: "coral"
tts_instructions: "Speak in a warm, friendly tone with natural pauses."
}
Requires: OPENAI_API_KEY
Gemini TTS (Cloud) #
voice gemini_tts_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "gemini"
tts_model: "gemini-2.5-flash-preview-tts"
tts_voice: "Kore"
}
Voices: 30 voices, including Zephyr, Puck, Charon, and Kore.
Supports 24 languages and multi-speaker synthesis.
Output: PCM 24kHz 16-bit mono.
Requires: GEMINI_API_KEY
Kokoro (Local) #
Kokoro-FastAPI provides high-quality local TTS with an OpenAI-compatible API.
voice kokoro_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
Starting Kokoro:
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi
No API key required. Audio data stays local.
Piper (Local) #
Piper is a lightweight, fast TTS engine designed for edge deployment.
voice piper_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "piper"
tts_model: "tts-1"
tts_voice: "alloy"
tts_endpoint: "http://localhost:5000"
}
Starting Piper:
# Piper runs as a local HTTP server
# See https://github.com/rhasspy/piper for installation instructions
No API key required.
ElevenLabs (Cloud) #
ElevenLabs provides premium voice synthesis with highly realistic voices.
voice elevenlabs_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "elevenlabs"
tts_model: "eleven_multilingual_v2"
tts_voice: "Rachel"
tts_endpoint: "https://api.elevenlabs.io"
}
Requires: ELEVENLABS_API_KEY
16.6 Cross-Provider Pipelines #
One of Neam's strengths is the ability to mix and match any STT provider, agent (LLM), and TTS provider in a single pipeline. This lets you optimize each stage independently.
// Gemini STT (good multilingual) -> OpenAI Agent -> Kokoro TTS (free, local)
agent smart_agent {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a multilingual voice assistant."
}
voice cross_provider {
agent: smart_agent
stt_provider: "gemini"
stt_model: "gemini-2.0-flash"
stt_language: "es"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
{
let result = voice_pipeline_run("cross_provider", "/tmp/spanish.wav", "/tmp/response.wav");
emit "Transcription: " + result["input_text"];
emit "Response: " + result["response_text"];
}
Fully Local Pipeline (No Cloud) #
For maximum privacy and offline operation:
agent local_bot {
provider: "ollama"
model: "llama3"
system: "You are a helpful local assistant. Keep responses under 50 words."
}
voice fully_local {
agent: local_bot
stt_provider: "whisper-local"
stt_model: "base.en"
stt_endpoint: "http://localhost:8080"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
{
let result = voice_pipeline_run("fully_local", "/tmp/mic.wav", "/tmp/response.wav");
emit "You said: " + result["input_text"];
emit "Response: " + result["response_text"];
}
Prerequisites for the fully local pipeline:
# 1. Start Ollama
ollama pull llama3
ollama serve
# 2. Start whisper.cpp
cd whisper.cpp
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080
# 3. Start Kokoro
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi
16.7 Real-Time Voice Streaming #
Batch pipelines process complete audio files. Real-time streaming provides full-duplex WebSocket communication for live, interactive voice conversations with sub-second latency.
Defining a Real-Time Voice Agent #
agent rt_assistant {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a helpful voice assistant. Keep responses concise."
}
realtime_voice RealtimeAssistant {
agent: rt_assistant
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "24000"
rt_speed: "1.0"
}
Real-Time Configuration Fields #
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| agent | identifier | Yes | -- | The agent for this voice session |
| rt_provider | string | Yes | -- | Provider: "openai", "gemini", "local" |
| rt_model | string | Yes | -- | Real-time model identifier |
| rt_voice | string | No | "alloy" | Output voice name |
| rt_vad | string | No | "server" | VAD mode: "server", "client", "manual" |
| vad_threshold | string | No | "0.5" | VAD sensitivity (0.0-1.0) |
| silence_duration_ms | string | No | "500" | Silence before end-of-turn (ms) |
| input_format | string | No | "pcm16" | Input audio format |
| output_format | string | No | "pcm16" | Output format: pcm16, g711_ulaw, g711_alaw |
| sample_rate | string | No | "24000" | Audio sample rate in Hz |
| rt_speed | string | No | "1.0" | Playback speed (OpenAI: 0.25-1.5) |
16.8 Real-Time Functions #
Connection Management #
{
// Open a WebSocket session
let session = realtime_connect("RealtimeAssistant");
// ... interact with the session ...
// Close the session
realtime_close(session);
}
Sending Data #
{
let session = realtime_connect("RealtimeAssistant");
// Send a text message (agent responds with audio + text)
realtime_send_text(session, "Hello, how are you?");
// Send audio data (base64-encoded PCM)
let audio_data = crypto_base64_encode(file_read_string("/tmp/audio.pcm"));
realtime_send_audio(session, audio_data);
// Signal end of user's turn (client/manual VAD mode)
voice_commit_audio(session);
// Clear the audio buffer without committing
voice_clear_audio(session);
// Explicitly request a model response
voice_request_response(session);
realtime_close(session);
}
Event Handling #
Register callbacks to handle events from the server:
{
let session = realtime_connect("RealtimeAssistant");
// Register event handlers (use 0 for default print handler)
realtime_on(session, "transcript", 0); // User speech transcription
realtime_on(session, "response_text", 0); // Agent text response
realtime_on(session, "response_audio", 0); // Agent audio chunks
realtime_on(session, "tool_call", 0); // Function calls from model
realtime_on(session, "error", 0); // Error messages
realtime_send_text(session, "Tell me a joke");
time_sleep(5000); // Wait for response
realtime_close(session);
}
Event output format with default handler:
[transcript] What the user said
[response] Here is a joke for you...
[tool_call] get_weather({"city": "Paris"}) id=call_abc123
[error] Connection timeout
Interruption (Barge-In) #
{
let session = realtime_connect("RealtimeAssistant");
realtime_on(session, "response_text", 0);
// Ask for a long response
realtime_send_text(session, "Tell me a long story about a dragon");
time_sleep(1000);
// Interrupt mid-response
realtime_interrupt(session);
// Send a new message
realtime_send_text(session, "Actually, just say hello");
time_sleep(5000);
realtime_close(session);
}
Tool Call Handling #
When the model invokes a function during a real-time session:
{
let session = realtime_connect("RealtimeAssistant");
realtime_on(session, "tool_call", 0);
// When a tool_call event fires, send the result back
// tool_call format: function_name(args) id=call_id
realtime_tool_result(session, "call_abc123", "{\"temperature\": 22}");
realtime_close(session);
}
Session Status #
{
let session = realtime_connect("RealtimeAssistant");
let status = voice_session_status(session);
emit "Session status: " + str(status);
// Returns: "connected", "disconnected", or error details
realtime_close(session);
}
Complete Real-Time Functions Reference #
| Function | Args | Description |
|---|---|---|
| realtime_connect(config_name) | 1 | Open WebSocket session, returns session ID |
| realtime_send_text(session, text) | 2 | Send text message to agent |
| realtime_send_audio(session, b64_pcm) | 2 | Send base64-encoded PCM audio chunk |
| realtime_on(session, event, callback) | 3 | Register event handler (use 0 for default) |
| realtime_interrupt(session) | 1 | Cancel current response (barge-in) |
| realtime_close(session) | 1 | Close WebSocket session |
| realtime_tool_result(session, call_id, result) | 3 | Send function call output |
| voice_commit_audio(session) | 1 | Commit audio buffer as user turn |
| voice_clear_audio(session) | 1 | Clear audio buffer |
| voice_request_response(session) | 1 | Explicitly request a model response |
| voice_session_status(session) | 1 | Get session status |
16.9 VAD (Voice Activity Detection) #
Voice Activity Detection determines when the user has finished speaking and the agent should respond. Neam supports three VAD modes:
Server VAD (Recommended) #
The provider's server detects speech start and end automatically. This is the simplest and most reliable mode.
realtime_voice server_vad_example {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
}
Tuning parameters:
vad_threshold (0.0-1.0): How sensitive the detector is to speech.
- 0.3 = Very sensitive. Picks up quiet speech but may trigger on background noise.
- 0.5 = Balanced. Good default for most environments.
- 0.8 = Less sensitive. Requires clear, loud speech. Good for noisy environments.
silence_duration_ms: How long the system waits after silence before treating the turn as complete.
- 300 = Fast response. May cut off users who pause mid-sentence.
- 500 = Balanced. Good default.
- 1000 = Patient. Waits for longer pauses. Good for complex questions.
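Putting this guidance together, a configuration tuned for a noisy environment might raise the threshold and lengthen the silence window. The specific values below are illustrative choices, not defaults:

```neam
realtime_voice noisy_room {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
// Less sensitive: ignore background chatter
vad_threshold: "0.8"
// Patient: wait longer before closing the user's turn
silence_duration_ms: "1000"
}
```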
Client VAD #
The client application manages turn detection. Use voice_commit_audio() to signal
when the user has finished speaking.
realtime_voice client_vad_example {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "echo"
rt_vad: "client"
}
{
let session = realtime_connect("client_vad_example");
realtime_on(session, "response_text", 0);
// Send audio chunks
let audio = crypto_base64_encode(file_read_string("/tmp/speech.pcm"));
realtime_send_audio(session, audio);
// Signal end of turn (client decides when user is done)
voice_commit_audio(session);
time_sleep(5000);
realtime_close(session);
}
Manual VAD #
No automatic detection. You must explicitly commit audio and request responses.
realtime_voice manual_vad_example {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "sage"
rt_vad: "manual"
}
{
let session = realtime_connect("manual_vad_example");
// Send audio
realtime_send_audio(session, audio_data);
// Commit the audio buffer
voice_commit_audio(session);
// Explicitly request a response
voice_request_response(session);
time_sleep(5000);
realtime_close(session);
}
16.10 Real-Time Provider Comparison #
OpenAI Realtime #
realtime_voice openai_rt {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "24000"
rt_speed: "1.0"
}
Models:
| Model | Text Input | Audio Input | Audio Output |
|---|---|---|---|
| gpt-4o-realtime-preview | $4/1M tok | $32/1M tok (~$0.06/min) | $64/1M tok (~$0.24/min) |
| gpt-realtime-mini | $0.60/1M tok | $10/1M tok | $20/1M tok |
Audio format: PCM16, 24kHz, mono, little-endian (base64-encoded).
Session duration: Unlimited.
Gemini Live #
realtime_voice gemini_live {
agent: my_agent
rt_provider: "gemini"
rt_model: "gemini-2.0-flash-live-001"
rt_voice: "Puck"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "24000"
rt_speed: "1.0"
}
Audio format: PCM16, 16kHz mono input / 24kHz mono output.
Session limit: Approximately 10 minutes per WebSocket connection (auto-resumable with handle tokens).
Voices: 30 voices, including Zephyr, Puck, Charon, and Kore. Supports 24 languages.
Local Streaming #
Combines whisper.cpp (STT) + Ollama (LLM) + Kokoro/Piper (TTS) for fully local real-time voice.
agent local_llm {
provider: "ollama"
model: "llama3"
system: "You are a helpful assistant. Keep responses concise."
}
realtime_voice local_stream {
agent: local_llm
rt_provider: "local"
rt_model: "llama3"
rt_voice: "af_heart"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "16000"
rt_speed: "1.0"
rt_stt_endpoint: "http://localhost:8080"
rt_tts_endpoint: "http://localhost:8880"
rt_llm_endpoint: "http://localhost:11434"
}
How local streaming works:
- Audio segments are sent to whisper.cpp for transcription.
- LLM response tokens stream from Ollama.
- Complete sentences are synthesized via Kokoro/Piper incrementally.
- Audio chunks are delivered as they become ready.
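The local provider is driven with the same real-time functions as the cloud providers. A minimal sketch, assuming the local_stream configuration above, all three local services running, and that the local provider accepts the same commit flow:

```neam
{
let session = realtime_connect("local_stream");
realtime_on(session, "transcript", 0);
realtime_on(session, "response_text", 0);
// Send a recorded PCM segment for transcription
let audio = crypto_base64_encode(file_read_string("/tmp/segment.pcm"));
realtime_send_audio(session, audio);
voice_commit_audio(session);
time_sleep(5000); // Wait for STT -> LLM -> TTS to complete
realtime_close(session);
}
```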
Provider Comparison Matrix #
| Feature | OpenAI Realtime | Gemini Live | Local Pipeline |
|---|---|---|---|
| Protocol | WebSocket | WebSocket | REST (chunked) |
| Latency | Ultra-low | Ultra-low | Low-medium |
| Input Audio | PCM16 24kHz | PCM16 16kHz | PCM16 16kHz |
| Output Audio | PCM16/g711 24kHz | PCM16 24kHz | WAV/MP3 |
| VAD | Server (configurable) | Auto activity detection | Client-side |
| Barge-in | Yes (interrupt) | Yes (automatic) | Manual |
| Function Calling | During stream | During stream | Via LLM |
| Session Duration | Unlimited | ~10 min (resumable) | Per-connection |
| Multi-speaker | No | Yes (via config) | No |
| Cost | $0.06-0.24/min | Varies by model | Free (local) |
| API Key | Yes | Yes | No |
16.11 Barge-In and Interruption Handling #
Barge-in is the ability for a user to interrupt the agent while it is speaking. This is essential for natural conversation flow.
OpenAI Realtime #
Use realtime_interrupt() to cancel the current response:
{
let session = realtime_connect("openai_rt");
realtime_on(session, "response_text", 0);
realtime_send_text(session, "Explain quantum physics in detail");
time_sleep(2000); // Let agent start responding
// User interrupts
realtime_interrupt(session);
emit "[Interrupted]";
// New question
realtime_send_text(session, "Just give me a one-sentence summary");
time_sleep(5000);
realtime_close(session);
}
Gemini Live #
Gemini handles barge-in automatically via activity detection. When the server detects new user speech while the agent is responding, it stops the current response and processes the new input.
Local Pipeline #
Barge-in in local streaming is manual. Send a new audio segment while a response is being generated, and the system will interrupt the current TTS output.
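A sketch of this manual barge-in against the local_stream configuration from Section 16.10, assuming (as described above) that newly committed audio supersedes the in-flight response. The first_chunk and second_chunk variables are placeholders for base64-encoded PCM segments captured elsewhere:

```neam
{
let session = realtime_connect("local_stream");
realtime_on(session, "response_text", 0);
// First utterance
realtime_send_audio(session, first_chunk);
voice_commit_audio(session);
time_sleep(1000); // Response generation begins
// New speech arrives: interrupts the current TTS output
realtime_send_audio(session, second_chunk);
voice_commit_audio(session);
time_sleep(5000);
realtime_close(session);
}
```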
16.12 Practical Example: Voice Assistant with macOS Microphone #
This example demonstrates a complete voice assistant workflow on macOS:
agent voice_bot {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a voice assistant. Keep responses under 50 words.
Be conversational and friendly."
}
voice assistant {
agent: voice_bot
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "nova"
}
{
emit "=== Voice Assistant ===";
emit "Recording from microphone...";
emit "(Record audio to /tmp/mic.wav first, then run this program)";
emit "";
let result = voice_pipeline_run("assistant", "/tmp/mic.wav", "/tmp/response.wav");
emit "You said: " + result["input_text"];
emit "Response: " + result["response_text"];
emit "Audio saved to: " + result["output_audio"];
}
Recording audio on macOS:
# Install SoX for command-line recording (one-time)
brew install sox
# Record (press Ctrl+C to stop)
rec -r 16000 -c 1 -b 16 /tmp/mic.wav
# Compile and run the voice assistant
./neamc voice_assistant.neam -o voice_assistant.neamb
./neam voice_assistant.neamb
# Play the response (afplay is built into macOS)
afplay /tmp/response.wav
16.13 Practical Example: Knowledge-Augmented Voice Agent #
Combine voice with RAG for a voice-enabled documentation assistant:
knowledge docs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "file", path: "./README.md" }
]
retrieval_strategy: "basic"
top_k: 3
}
agent doc_bot {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a documentation assistant. Answer using only the provided
context. Keep responses under 3 sentences for voice."
connected_knowledge: [docs]
}
voice doc_voice {
agent: doc_bot
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
{
let result = voice_pipeline_run("doc_voice", "/tmp/question.wav", "/tmp/answer.wav");
emit "Question: " + result["input_text"];
emit "Answer: " + result["response_text"];
}
16.14 Voice Standard Library Modules #
Beyond the native functions and declarations, Neam's standard library provides a rich set of modules for building production voice systems. These modules handle concerns that go beyond basic STT/TTS, including session management, transcripts, budgets, and compliance.
Voice Agent Builder (std.voice.agent) #
The voice agent builder provides a higher-level API for constructing voice agents with configurable policies and behaviors:
import std::voice::agent;
let config = agent::create_voice_agent({
"agent": "voice_bot",
"pipeline": "assistant",
"max_turns": 20,
"greeting": "Hello! How can I help you today?",
"farewell": "Goodbye! Have a great day.",
"idle_timeout_ms": 30000,
"error_message": "I'm sorry, I didn't catch that. Could you repeat?"
});
Session Management (std.voice.session) #
Voice sessions track the lifecycle of a conversation, including connection state, turn history, and accumulated context:
import std::voice::session;
let sess = session::create_session("RealtimeAssistant");
let status = session::get_status(sess); // "connected", "idle", "active"
let duration = session::get_duration(sess); // elapsed time in ms
let turn_count = session::get_turn_count(sess);
session::end_session(sess);
Transcript Handling (std.voice.transcript) #
The transcript module captures and structures the full conversation history, useful for logging, analysis, and compliance:
import std::voice::transcript;
let tx = transcript::create_transcript();
tx = transcript::add_turn(tx, "user", "What's the weather today?");
tx = transcript::add_turn(tx, "agent", "It's sunny and 72 degrees.");
let full_text = transcript::to_text(tx);
let json_export = transcript::to_json(tx);
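The transcript module composes naturally with batch pipelines. A sketch that records both sides of a single exchange, assuming the assistant pipeline from Section 16.12 (the file paths are illustrative):

```neam
import std::voice::transcript;
{
let tx = transcript::create_transcript();
let result = voice_pipeline_run("assistant", "/tmp/mic.wav", "/tmp/reply.mp3");
// Log both sides of the exchange
tx = transcript::add_turn(tx, "user", result["input_text"]);
tx = transcript::add_turn(tx, "agent", result["response_text"]);
// Export the structured history for logging or analysis
emit transcript::to_json(tx);
}
```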
Voice Budgets (std.voice.budget) #
Voice interactions can be expensive. The budget module tracks costs and enforces limits across token usage, API calls, and wall-clock time:
import std::voice::budget;
let voice_budget = budget::create_budget({
"max_cost_usd": 5.00,
"max_turns": 50,
"max_duration_ms": 600000
});
let remaining = budget::get_remaining(voice_budget);
let is_exceeded = budget::is_exceeded(voice_budget);
Additional Voice Modules #
| Module | Purpose |
|---|---|
| std.voice.turn | Turn-level management (start, end, metadata) |
| std.voice.policy | Conversation policies (max silence, retry limits) |
| std.voice.metrics | Latency, word count, turn duration tracking |
| std.voice.audit | Audit logging for compliance and debugging |
| std.voice.consent | User consent management for recording |
| std.voice.taint | Data tainting for sensitive audio content |
| std.voice.tracing | Distributed tracing for voice pipelines |
| std.voice.playback | Audio playback utilities |
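As one illustration of how these modules are meant to compose, the sketch below checks recording consent before opening a session. The consent API shown (consent::request, consent::is_granted) is hypothetical, extrapolated from the module's stated purpose, and is not part of the documented surface:

```neam
import std::voice::consent;
{
// Hypothetical API: ask the user to approve recording
let c = consent::request("recording");
emit "Consent granted: " + str(consent::is_granted(c));
// A real agent would open the session only after consent is granted
let session = realtime_connect("RealtimeAssistant");
realtime_close(session);
}
```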
16.15 Audio Utilities #
The standard library provides low-level audio utilities (std.audio) for working
with audio data directly, which is useful for custom voice pipelines and pre/post
processing.
Audio Resampling #
When mixing providers that use different sample rates (e.g., 16kHz whisper.cpp input with 24kHz OpenAI Realtime output), use the built-in resample function:
{
let resampled = voice_audio_resample(pcm_data, 16000, 24000);
}
Audio Buffer and Codec Modules #
| Module | Purpose |
|---|---|
| std.audio.buffer | Ring buffers for streaming audio, append/read/drain operations |
| std.audio.codec | Encode and decode between PCM, WAV, MP3, Opus, FLAC |
| std.audio.stream | Streaming audio source/sink abstractions |
| std.audio.meter | Real-time audio level metering (RMS, peak, silence detection) |
import std::audio::meter;
let level = meter::get_rms(audio_chunk);
let is_silent = meter::is_silent(audio_chunk, -40.0); // threshold in dB
Audio metering is particularly useful for implementing custom VAD logic when the built-in server/client/manual modes do not fit your use case.
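For example, a client-VAD session could use metering to decide when the user has stopped speaking. A sketch assuming the client_vad_example configuration from Section 16.9; the chunk variable (a captured audio segment) and the -40 dB threshold are illustrative:

```neam
import std::audio::meter;
{
let session = realtime_connect("client_vad_example");
realtime_on(session, "response_text", 0);
// Stream the captured chunk, then meter it
realtime_send_audio(session, crypto_base64_encode(chunk));
let silent = meter::is_silent(chunk, -40.0);
emit "Silent: " + str(silent);
// Once a chunk is silent, treat the user's turn as finished
voice_commit_audio(session);
time_sleep(5000);
realtime_close(session);
}
```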
16.16 Speech Configuration in the Standard Library #
The std.speech modules provide detailed configuration for STT and TTS beyond what the
voice and realtime_voice declarations expose directly.
STT Configuration (std.speech.stt) #
import std::speech::stt;
let stt_config = stt::create_config({
"language": "en-US",
"punctuation": true,
"profanity_filter": false,
"diarization": true, // speaker identification
"word_timestamps": true
});
Endpointing modes control how the STT engine detects the end of an utterance:
| Mode | Description |
|---|---|
| "aggressive" | Quick end-of-speech detection, responsive but may cut off |
| "normal" | Balanced detection (default) |
| "conservative" | Waits longer before ending, good for complex speech |
| "manual" | No automatic detection; caller controls |
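The mode is presumably supplied through the same configuration map as the other STT settings; the "endpointing" key below is an assumption based on the table above, not a documented field:

```neam
import std::speech::stt;
let stt_config = stt::create_config({
"language": "en-US",
// Assumed key: wait longer before closing an utterance
"endpointing": "conservative"
});
```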
TTS Configuration (std.speech.tts) #
import std::speech::tts;
let tts_config = tts::create_config({
"voice": "nova",
"language": "en-US",
"speed": 1.0,
"pitch": 0.0,
"volume": 1.0,
"style": "conversational"
});
The TTS module also supports SSML (Speech Synthesis Markup Language) for fine-grained control over pronunciation, pauses, and emphasis:
let ssml_text = "<speak>Welcome to <emphasis level='strong'>Neam</emphasis>. " +
"<break time='500ms'/> How can I help you today?</speak>";
SSML support varies by provider. OpenAI TTS uses tts_instructions
instead of SSML markup. Gemini and ElevenLabs support SSML natively.
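Where the provider accepts SSML, the marked-up string can be passed straight through the batch synthesis function. A sketch assuming the elevenlabs_pipeline from Section 16.5 and that the pipeline forwards the markup untouched:

```neam
{
let ssml = "<speak>Welcome. <break time='500ms'/> How can I help?</speak>";
// ElevenLabs supports SSML natively, so the tags control pauses directly
let path = voice_synthesize("elevenlabs_pipeline", ssml, "/tmp/welcome.mp3");
emit "Audio saved to: " + path;
}
```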
Summary #
In this chapter you learned:
- Voice agent architecture: The three-stage STT -> Agent -> TTS pipeline.
- Batch pipelines: Processing audio files end-to-end with the voice declaration.
- Voice functions: voice_transcribe(), voice_synthesize(), and voice_pipeline_run() for batch operations.
- STT providers: OpenAI Whisper, Gemini, and local whisper.cpp.
- TTS providers: OpenAI, Gemini, Kokoro, Piper, and ElevenLabs.
- Cross-provider pipelines: Mixing any STT, agent, and TTS provider.
- Real-time streaming: Full-duplex WebSocket voice with realtime_voice.
- VAD configuration: Server, client, and manual voice activity detection modes.
- Barge-in handling: Interrupting the agent mid-response.
- Provider comparison: Trade-offs between OpenAI Realtime, Gemini Live, and local streaming.
- Voice standard library modules: Session management, transcript handling, voice budgets, turn management, policies, metrics, audit logging, and consent tracking.
- Audio utilities: Resampling between sample rates, audio buffers, codecs, streaming abstractions, and real-time metering for custom VAD.
- Speech configuration: Fine-grained STT settings (diarization, endpointing modes) and TTS settings (pitch, speed, style, SSML support).
Exercises #
Exercise 16.1: Basic Voice Pipeline #
Create a batch voice pipeline using OpenAI Whisper for STT and OpenAI TTS. Record a short audio clip (or use a sample WAV file), run it through the pipeline, and verify that the transcription, response, and synthesized audio are all correct.
Exercise 16.2: Local Voice Setup #
Set up a fully local voice pipeline using whisper.cpp for STT, Ollama for the agent, and Kokoro for TTS. Document the setup process, including starting each service. Run the same audio clip from Exercise 16.1 and compare the results.
Exercise 16.3: Cross-Provider Pipeline #
Create a pipeline that uses Gemini for STT, OpenAI for the agent, and Kokoro for TTS. Test it with audio in a non-English language (e.g., Spanish, French, or Japanese). How does the transcription quality compare to OpenAI Whisper for that language?
Exercise 16.4: VAD Sensitivity Tuning #
Create two real-time voice configurations with different VAD settings:
- "Sensitive": vad_threshold: "0.3", silence_duration_ms: "300"
- "Relaxed": vad_threshold: "0.8", silence_duration_ms: "1000"
Test both with the same conversation. Document the differences in user experience. Under what conditions would you prefer each setting?
Exercise 16.5: Voice-Enabled RAG Bot #
Build a voice agent that uses a RAG knowledge base to answer questions about a topic of your choice. The pipeline should: (a) transcribe a question from audio, (b) retrieve relevant context, (c) generate an answer, and (d) synthesize the answer to audio. Test with at least three different questions.
Exercise 16.6: Real-Time Conversation #
Using OpenAI Realtime or Gemini Live, implement a real-time voice session that: 1. Connects to the provider. 2. Sends a greeting via text. 3. Handles the response events. 4. Sends a follow-up question. 5. Demonstrates barge-in (interrupt the agent mid-response). 6. Cleanly closes the session.
Document the latency you observe at each step.