Chapter 16: Voice Agents #
"The next interface is not a screen -- it is a conversation."
Text-based agents are powerful, but many real-world applications demand voice interaction: customer service hotlines, in-car assistants, accessibility tools, hands-free operation in industrial settings, and smart home devices. Building a voice agent traditionally requires stitching together separate speech-to-text (STT), language model, and text-to-speech (TTS) services, each with its own API, audio format, and error handling.
Neam makes voice a first-class construct. With the voice and realtime_voice
declarations, you define complete voice pipelines in a few lines. The runtime handles
audio encoding, API calls, format conversion, and event streaming. This chapter covers
both batch pipelines (process an audio file end-to-end) and real-time streaming
(full-duplex WebSocket voice with sub-second latency).
16.1 Voice Agent Architecture #
A voice agent consists of three stages connected in series:
STT (Speech-to-Text): Converts audio into a text transcription.
Agent (LLM): Processes the transcribed text and generates a response. This is a standard Neam agent -- it can use tools, knowledge bases, guardrails, and all other agent features.
TTS (Text-to-Speech): Converts the agent's text response into audio.
The beauty of this architecture is that each stage is independently configurable. You can use OpenAI Whisper for STT, Anthropic Claude for the agent, and a local Kokoro instance for TTS -- all in the same pipeline.
16.2 Batch Voice Pipelines #
Batch pipelines process audio files through the STT -> Agent -> TTS flow. They are ideal for processing recorded audio, building voice message systems, or testing voice interactions without a live microphone.
Defining a Batch Pipeline #
// Step 1: Define the agent
agent assistant {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a helpful voice assistant. Keep responses brief and conversational."
}
// Step 2: Define the voice pipeline
voice my_pipeline {
agent: assistant
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
The voice declaration connects an existing agent to STT and TTS providers. The agent
itself is unchanged -- it still processes text. The voice pipeline wraps it with audio
I/O.
Voice Pipeline Fields #
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| agent | identifier | Yes | -- | The agent to process transcribed text |
| stt_provider | string | No | "whisper" | STT provider name |
| stt_model | string | No | "whisper-1" | STT model identifier |
| stt_endpoint | string | No | -- | Custom endpoint URL (local providers) |
| stt_language | string | No | -- | ISO-639-1 language code (e.g., "en", "es") |
| stt_format | string | No | "json" | Response format |
| tts_provider | string | No | "openai" | TTS provider name |
| tts_model | string | No | "tts-1" | TTS model identifier |
| tts_voice | string | No | "alloy" | Voice name |
| tts_endpoint | string | No | -- | Custom endpoint URL (local providers) |
| tts_format | string | No | "mp3" | Output audio format |
| tts_speed | string | No | "1.0" | Playback speed (0.25-4.0, OpenAI only) |
| tts_instructions | string | No | -- | Tone/style instructions (gpt-4o-mini-tts only) |
16.3 Voice Native Functions #
Neam provides three native functions for batch voice pipelines:
voice_transcribe(pipeline, audio_path) #
Transcribes an audio file to text using the pipeline's STT provider.
{
let text = voice_transcribe("my_pipeline", "/tmp/input.wav");
emit "You said: " + text;
}
Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- audio_path (string): Path to the input audio file (WAV, MP3, FLAC, etc.).
Returns: A string containing the transcription.
voice_synthesize(pipeline, text, output_path) #
Synthesizes text into an audio file using the pipeline's TTS provider.
{
let audio_path = voice_synthesize("my_pipeline", "Hello there!", "/tmp/greeting.mp3");
emit "Audio saved to: " + audio_path;
}
Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- text (string): The text to synthesize.
- output_path (string): Path where the output audio file will be saved.
Returns: A string containing the output file path.
voice_pipeline_run(pipeline, input_path, output_path) #
Runs the complete STT -> Agent -> TTS pipeline on an audio file.
{
let result = voice_pipeline_run("my_pipeline", "/tmp/input.wav", "/tmp/output.mp3");
emit "Input: " + result["input_text"];
emit "Response: " + result["response_text"];
emit "Audio: " + result["output_audio"];
}
Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- input_path (string): Path to the input audio file.
- output_path (string): Path where the output audio file will be saved.
Returns: A map with three keys:
- "input_text": The transcribed user speech.
- "response_text": The agent's text response.
- "output_audio": The path to the generated audio file.
16.4 STT Providers #
Neam supports three speech-to-text providers.
OpenAI Whisper (Cloud) #
The default STT provider. Uses OpenAI's Whisper API.
voice whisper_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
stt_language: "en"
stt_format: "json"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
Available models:
| Model | Cost | Notes |
|---|---|---|
| whisper-1 | $0.006/min | Standard model, 25 MB file limit |
| gpt-4o-transcribe | $0.006/min | Lower word error rate, fewer hallucinations |
| gpt-4o-mini-transcribe | $0.003/min | Half the cost of standard |
Supported formats: WAV, MP3, M4A, FLAC, OGG, WebM (max 25 MB).
Requires: OPENAI_API_KEY environment variable.
Gemini STT (Cloud) #
Uses Google's Gemini models for audio understanding via the generateContent API.
voice gemini_stt_pipeline {
agent: my_agent
stt_provider: "gemini"
stt_model: "gemini-2.0-flash"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
Gemini handles audio natively -- it does not use a separate STT model. The audio is sent as part of the multimodal input, and Gemini produces a text transcription.
Supported formats: WAV, MP3, AIFF, AAC, OGG, FLAC (max 20 MB inline).
Requires: GEMINI_API_KEY environment variable.
Local whisper.cpp #
Uses a locally running whisper.cpp server with an OpenAI-compatible API. No cloud dependency, no API key, and your audio data never leaves your machine.
voice local_stt_pipeline {
agent: my_agent
stt_provider: "whisper-local"
stt_model: "base.en"
stt_endpoint: "http://localhost:8080"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
Setting up whisper.cpp:
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp
cmake -B build && cmake --build build --parallel
# Download a model
sh ./models/download-ggml-model.sh base.en
# Start the server
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080
Available models:
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny.en | 75 MB | Fastest | Good for short commands |
| base.en | 142 MB | Fast | Good general purpose |
| small.en | 466 MB | Medium | Better accuracy |
| medium.en | 1.5 GB | Slow | High accuracy |
| large-v3 | 3.1 GB | Slowest | Best accuracy, multilingual |
16.5 TTS Providers #
Neam supports five text-to-speech providers.
OpenAI TTS (Cloud) #
voice openai_tts_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "nova"
tts_format: "mp3"
tts_speed: "1.0"
}
Models:
| Model | Cost | Notes |
|---|---|---|
| tts-1 | $15/1M chars | Low latency, good for real-time |
| tts-1-hd | $30/1M chars | Higher audio quality |
| gpt-4o-mini-tts | ~$0.015/min | Best quality, supports tts_instructions |
Voices: alloy, ash, ballad, cedar, coral, echo, fable, marin, nova, onyx, sage, shimmer, verse
Output formats: MP3, Opus, AAC, FLAC, WAV, PCM
The gpt-4o-mini-tts model supports a tts_instructions field for controlling tone
and style:
voice expressive_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "gpt-4o-mini-tts"
tts_voice: "coral"
tts_instructions: "Speak in a warm, friendly tone with natural pauses."
}
Requires: OPENAI_API_KEY
Gemini TTS (Cloud) #
voice gemini_tts_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "gemini"
tts_model: "gemini-2.5-flash-preview-tts"
tts_voice: "Kore"
}
Voices: 30 voices, including Zephyr, Puck, Charon, and Kore.
Supports 24 languages and multi-speaker synthesis.
Output: PCM 24kHz 16-bit mono.
Requires: GEMINI_API_KEY
Kokoro (Local) #
Kokoro-FastAPI provides high-quality local TTS with an OpenAI-compatible API.
voice kokoro_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
Starting Kokoro:
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi
No API key required. Audio data stays local.
Piper (Local) #
Piper is a lightweight, fast TTS engine designed for edge deployment.
voice piper_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "piper"
tts_model: "tts-1"
tts_voice: "alloy"
tts_endpoint: "http://localhost:5000"
}
Starting Piper:
# Piper runs as a local HTTP server
# See https://github.com/rhasspy/piper for installation instructions
No API key required.
ElevenLabs (Cloud) #
ElevenLabs provides premium voice synthesis with highly realistic voices.
voice elevenlabs_pipeline {
agent: my_agent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "elevenlabs"
tts_model: "eleven_multilingual_v2"
tts_voice: "Rachel"
tts_endpoint: "https://api.elevenlabs.io"
}
Requires: ELEVENLABS_API_KEY
16.6 Cross-Provider Pipelines #
One of Neam's strengths is the ability to mix and match any STT provider, agent (LLM), and TTS provider in a single pipeline. This lets you optimize each stage independently.
// Gemini STT (good multilingual) -> OpenAI Agent -> Kokoro TTS (free, local)
agent smart_agent {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a multilingual voice assistant."
}
voice cross_provider {
agent: smart_agent
stt_provider: "gemini"
stt_model: "gemini-2.0-flash"
stt_language: "es"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
{
let result = voice_pipeline_run("cross_provider", "/tmp/spanish.wav", "/tmp/response.wav");
emit "Transcription: " + result["input_text"];
emit "Response: " + result["response_text"];
}
Fully Local Pipeline (No Cloud) #
For maximum privacy and offline operation:
agent local_bot {
provider: "ollama"
model: "llama3"
system: "You are a helpful local assistant. Keep responses under 50 words."
}
voice fully_local {
agent: local_bot
stt_provider: "whisper-local"
stt_model: "base.en"
stt_endpoint: "http://localhost:8080"
tts_provider: "kokoro"
tts_model: "tts-1"
tts_voice: "af_heart"
tts_endpoint: "http://localhost:8880"
}
{
let result = voice_pipeline_run("fully_local", "/tmp/mic.wav", "/tmp/response.wav");
emit "You said: " + result["input_text"];
emit "Response: " + result["response_text"];
}
Prerequisites for the fully local pipeline:
# 1. Start Ollama
ollama pull llama3
ollama serve
# 2. Start whisper.cpp
cd whisper.cpp
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080
# 3. Start Kokoro
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi
16.7 Real-Time Voice Streaming #
Batch pipelines process complete audio files. Real-time streaming provides full-duplex WebSocket communication for live, interactive voice conversations with sub-second latency.
Defining a Real-Time Voice Agent #
agent rt_assistant {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a helpful voice assistant. Keep responses concise."
}
realtime_voice RealtimeAssistant {
agent: rt_assistant
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "24000"
rt_speed: "1.0"
}
Real-Time Configuration Fields #
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| agent | identifier | Yes | -- | The agent for this voice session |
| rt_provider | string | Yes | -- | Provider: "openai", "gemini", "local" |
| rt_model | string | Yes | -- | Real-time model identifier |
| rt_voice | string | No | "alloy" | Output voice name |
| rt_vad | string | No | "server" | VAD mode: "server", "client", "manual" |
| vad_threshold | string | No | "0.5" | VAD sensitivity (0.0-1.0) |
| silence_duration_ms | string | No | "500" | Silence before end-of-turn (ms) |
| input_format | string | No | "pcm16" | Input audio format |
| output_format | string | No | "pcm16" | Output format: pcm16, g711_ulaw, g711_alaw |
| sample_rate | string | No | "24000" | Audio sample rate in Hz |
| rt_speed | string | No | "1.0" | Playback speed (OpenAI: 0.25-1.5) |
16.8 Real-Time Functions #
Connection Management #
{
// Open a WebSocket session
let session = realtime_connect("RealtimeAssistant");
// ... interact with the session ...
// Close the session
realtime_close(session);
}
Sending Data #
{
let session = realtime_connect("RealtimeAssistant");
// Send a text message (agent responds with audio + text)
realtime_send_text(session, "Hello, how are you?");
// Send audio data (base64-encoded PCM)
let audio_data = crypto_base64_encode(file_read_string("/tmp/audio.pcm"));
realtime_send_audio(session, audio_data);
// Signal end of user's turn (client/manual VAD mode)
voice_commit_audio(session);
// Clear the audio buffer without committing
voice_clear_audio(session);
// Explicitly request a model response
voice_request_response(session);
realtime_close(session);
}
Event Handling #
Register callbacks to handle events from the server:
{
let session = realtime_connect("RealtimeAssistant");
// Register event handlers (use 0 for default print handler)
realtime_on(session, "transcript", 0); // User speech transcription
realtime_on(session, "response_text", 0); // Agent text response
realtime_on(session, "response_audio", 0); // Agent audio chunks
realtime_on(session, "tool_call", 0); // Function calls from model
realtime_on(session, "error", 0); // Error messages
realtime_send_text(session, "Tell me a joke");
time_sleep(5000); // Wait for response
realtime_close(session);
}
Event output format with default handler:
[transcript] What the user said
[response] Here is a joke for you...
[tool_call] get_weather({"city": "Paris"}) id=call_abc123
[error] Connection timeout
Interruption (Barge-In) #
{
let session = realtime_connect("RealtimeAssistant");
realtime_on(session, "response_text", 0);
// Ask for a long response
realtime_send_text(session, "Tell me a long story about a dragon");
time_sleep(1000);
// Interrupt mid-response
realtime_interrupt(session);
// Send a new message
realtime_send_text(session, "Actually, just say hello");
time_sleep(5000);
realtime_close(session);
}
Tool Call Handling #
When the model invokes a function during a real-time session:
{
let session = realtime_connect("RealtimeAssistant");
realtime_on(session, "tool_call", 0);
// When a tool_call event fires, send the result back
// tool_call format: function_name(args) id=call_id
realtime_tool_result(session, "call_abc123", "{\"temperature\": 22}");
realtime_close(session);
}
Session Status #
{
let session = realtime_connect("RealtimeAssistant");
let status = voice_session_status(session);
emit "Session status: " + str(status);
// Returns: "connected", "disconnected", or error details
realtime_close(session);
}
Complete Real-Time Functions Reference #
| Function | Args | Description |
|---|---|---|
| realtime_connect(config_name) | 1 | Open WebSocket session, returns session ID |
| realtime_send_text(session, text) | 2 | Send text message to agent |
| realtime_send_audio(session, b64_pcm) | 2 | Send base64-encoded PCM audio chunk |
| realtime_on(session, event, callback) | 3 | Register event handler (use 0 for default) |
| realtime_interrupt(session) | 1 | Cancel current response (barge-in) |
| realtime_close(session) | 1 | Close WebSocket session |
| realtime_tool_result(session, call_id, result) | 3 | Send function call output |
| voice_commit_audio(session) | 1 | Commit audio buffer as user turn |
| voice_clear_audio(session) | 1 | Clear audio buffer |
| voice_request_response(session) | 1 | Explicitly request a model response |
| voice_session_status(session) | 1 | Get session status |
16.9 VAD (Voice Activity Detection) #
Voice Activity Detection determines when the user has finished speaking and the agent should respond. Neam supports three VAD modes:
Server VAD (Recommended) #
The provider's server detects speech start and end automatically. This is the simplest and most reliable mode.
realtime_voice server_vad_example {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
}
Tuning parameters:
vad_threshold (0.0-1.0): How sensitive the detector is to speech.
- 0.3 = Very sensitive. Picks up quiet speech but may trigger on background noise.
- 0.5 = Balanced. Good default for most environments.
- 0.8 = Less sensitive. Requires clear, loud speech. Good for noisy environments.
silence_duration_ms: How long the system waits after silence before treating the turn as complete.
- 300 = Fast response. May cut off users who pause mid-sentence.
- 500 = Balanced. Good default.
- 1000 = Patient. Waits for longer pauses. Good for complex questions.
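Putting this guidance together, a configuration tuned for a noisy environment might raise the threshold and lengthen the silence window. The specific values below are illustrative choices, not defaults:

```neam
realtime_voice noisy_room {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
// Less sensitive: ignore background chatter
vad_threshold: "0.8"
// Patient: wait longer before closing the user's turn
silence_duration_ms: "1000"
}
```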
Client VAD #
The client application manages turn detection. Use voice_commit_audio() to signal
when the user has finished speaking.
realtime_voice client_vad_example {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "echo"
rt_vad: "client"
}
{
let session = realtime_connect("client_vad_example");
realtime_on(session, "response_text", 0);
// Send audio chunks
let audio = crypto_base64_encode(file_read_string("/tmp/speech.pcm"));
realtime_send_audio(session, audio);
// Signal end of turn (client decides when user is done)
voice_commit_audio(session);
time_sleep(5000);
realtime_close(session);
}
Manual VAD #
No automatic detection. You must explicitly commit audio and request responses.
realtime_voice manual_vad_example {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "sage"
rt_vad: "manual"
}
{
let session = realtime_connect("manual_vad_example");
// Send audio
realtime_send_audio(session, audio_data);
// Commit the audio buffer
voice_commit_audio(session);
// Explicitly request a response
voice_request_response(session);
time_sleep(5000);
realtime_close(session);
}
16.10 Real-Time Provider Comparison #
OpenAI Realtime #
realtime_voice openai_rt {
agent: my_agent
rt_provider: "openai"
rt_model: "gpt-4o-realtime-preview"
rt_voice: "coral"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "24000"
rt_speed: "1.0"
}
Models:
| Model | Text Input | Audio Input | Audio Output |
|---|---|---|---|
| gpt-4o-realtime-preview | $4/1M tok | $32/1M tok (~$0.06/min) | $64/1M tok (~$0.24/min) |
| gpt-realtime-mini | $0.60/1M tok | $10/1M tok | $20/1M tok |
Audio format: PCM16, 24kHz, mono, little-endian (base64-encoded).
Session duration: Unlimited.
Gemini Live #
realtime_voice gemini_live {
agent: my_agent
rt_provider: "gemini"
rt_model: "gemini-2.0-flash-live-001"
rt_voice: "Puck"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "24000"
rt_speed: "1.0"
}
Audio format: PCM16, 16kHz mono input / 24kHz mono output.
Session limit: Approximately 10 minutes per WebSocket connection (auto-resumable with handle tokens).
Voices: 30 voices, including Zephyr, Puck, Charon, and Kore. Supports 24 languages.
Local Streaming #
Combines whisper.cpp (STT) + Ollama (LLM) + Kokoro/Piper (TTS) for fully local real-time voice.
agent local_llm {
provider: "ollama"
model: "llama3"
system: "You are a helpful assistant. Keep responses concise."
}
realtime_voice local_stream {
agent: local_llm
rt_provider: "local"
rt_model: "llama3"
rt_voice: "af_heart"
rt_vad: "server"
vad_threshold: "0.5"
silence_duration_ms: "500"
input_format: "pcm16"
output_format: "pcm16"
sample_rate: "16000"
rt_speed: "1.0"
rt_stt_endpoint: "http://localhost:8080"
rt_tts_endpoint: "http://localhost:8880"
rt_llm_endpoint: "http://localhost:11434"
}
How local streaming works:
- Audio segments are sent to whisper.cpp for transcription.
- LLM response tokens stream from Ollama.
- Complete sentences are synthesized via Kokoro/Piper incrementally.
- Audio chunks are delivered as they become ready.
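The local provider is driven with the same real-time functions as the cloud providers. A minimal sketch, assuming the local_stream configuration above, all three local services running, and that the local provider accepts the same commit flow:

```neam
{
let session = realtime_connect("local_stream");
realtime_on(session, "transcript", 0);
realtime_on(session, "response_text", 0);
// Send a recorded PCM segment for transcription
let audio = crypto_base64_encode(file_read_string("/tmp/segment.pcm"));
realtime_send_audio(session, audio);
voice_commit_audio(session);
time_sleep(5000); // Wait for STT -> LLM -> TTS to complete
realtime_close(session);
}
```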
Provider Comparison Matrix #
| Feature | OpenAI Realtime | Gemini Live | Local Pipeline |
|---|---|---|---|
| Protocol | WebSocket | WebSocket | REST (chunked) |
| Latency | Ultra-low | Ultra-low | Low-medium |
| Input Audio | PCM16 24kHz | PCM16 16kHz | PCM16 16kHz |
| Output Audio | PCM16/g711 24kHz | PCM16 24kHz | WAV/MP3 |
| VAD | Server (configurable) | Auto activity detection | Client-side |
| Barge-in | Yes (interrupt) | Yes (automatic) | Manual |
| Function Calling | During stream | During stream | Via LLM |
| Session Duration | Unlimited | ~10 min (resumable) | Per-connection |
| Multi-speaker | No | Yes (via config) | No |
| Cost | $0.06-0.24/min | Varies by model | Free (local) |
| API Key | Yes | Yes | No |
16.11 Barge-In and Interruption Handling #
Barge-in is the ability for a user to interrupt the agent while it is speaking. This is essential for natural conversation flow.
OpenAI Realtime #
Use realtime_interrupt() to cancel the current response:
{
let session = realtime_connect("openai_rt");
realtime_on(session, "response_text", 0);
realtime_send_text(session, "Explain quantum physics in detail");
time_sleep(2000); // Let agent start responding
// User interrupts
realtime_interrupt(session);
emit "[Interrupted]";
// New question
realtime_send_text(session, "Just give me a one-sentence summary");
time_sleep(5000);
realtime_close(session);
}
Gemini Live #
Gemini handles barge-in automatically via activity detection. When the server detects new user speech while the agent is responding, it stops the current response and processes the new input.
Local Pipeline #
Barge-in in local streaming is manual. Send a new audio segment while a response is being generated, and the system will interrupt the current TTS output.
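A sketch of this manual barge-in against the local_stream configuration from Section 16.10, assuming (as described above) that newly committed audio supersedes the in-flight response. The first_chunk and second_chunk variables are placeholders for base64-encoded PCM segments captured elsewhere:

```neam
{
let session = realtime_connect("local_stream");
realtime_on(session, "response_text", 0);
// First utterance
realtime_send_audio(session, first_chunk);
voice_commit_audio(session);
time_sleep(1000); // Response generation begins
// New speech arrives: interrupts the current TTS output
realtime_send_audio(session, second_chunk);
voice_commit_audio(session);
time_sleep(5000);
realtime_close(session);
}
```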
16.12 Practical Example: Voice Assistant with macOS Microphone #
This example demonstrates a complete voice assistant workflow on macOS:
agent voice_bot {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a voice assistant. Keep responses under 50 words.
Be conversational and friendly."
}
voice assistant {
agent: voice_bot
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "nova"
}
{
emit "=== Voice Assistant ===";
emit "Recording from microphone...";
emit "(Record audio to /tmp/mic.wav first, then run this program)";
emit "";
let result = voice_pipeline_run("assistant", "/tmp/mic.wav", "/tmp/response.wav");
emit "You said: " + result["input_text"];
emit "Response: " + result["response_text"];
emit "Audio saved to: " + result["output_audio"];
}
Recording audio on macOS:
# Install SoX for command-line recording (one-time)
brew install sox
# Record (press Ctrl+C to stop)
rec -r 16000 -c 1 -b 16 /tmp/mic.wav
# Compile and run the voice assistant
./neamc voice_assistant.neam -o voice_assistant.neamb
./neam voice_assistant.neamb
# Play the response (afplay is built into macOS)
afplay /tmp/response.wav
16.13 Practical Example: Knowledge-Augmented Voice Agent #
Combine voice with RAG for a voice-enabled documentation assistant:
knowledge docs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "file", path: "./README.md" }
]
retrieval_strategy: "basic"
top_k: 3
}
agent doc_bot {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a documentation assistant. Answer using only the provided
context. Keep responses under 3 sentences for voice."
connected_knowledge: [docs]
}
voice doc_voice {
agent: doc_bot
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "alloy"
}
{
let result = voice_pipeline_run("doc_voice", "/tmp/question.wav", "/tmp/answer.wav");
emit "Question: " + result["input_text"];
emit "Answer: " + result["response_text"];
}
16.14 Voice Standard Library Modules #
Beyond the native functions and declarations, Neam's standard library provides a rich set of modules for building production voice systems. These modules handle concerns that go beyond basic STT/TTS, including session management, transcripts, budgets, and compliance.
Voice Agent Builder (std.voice.agent) #
The voice agent builder provides a higher-level API for constructing voice agents with configurable policies and behaviors:
import std::voice::agent;
let config = agent::create_voice_agent({
"agent": "voice_bot",
"pipeline": "assistant",
"max_turns": 20,
"greeting": "Hello! How can I help you today?",
"farewell": "Goodbye! Have a great day.",
"idle_timeout_ms": 30000,
"error_message": "I'm sorry, I didn't catch that. Could you repeat?"
});
Session Management (std.voice.session) #
Voice sessions track the lifecycle of a conversation, including connection state, turn history, and accumulated context:
import std::voice::session;
let sess = session::create_session("RealtimeAssistant");
let status = session::get_status(sess); // "connected", "idle", "active"
let duration = session::get_duration(sess); // elapsed time in ms
let turn_count = session::get_turn_count(sess);
session::end_session(sess);
Transcript Handling (std.voice.transcript) #
The transcript module captures and structures the full conversation history, useful for logging, analysis, and compliance:
import std::voice::transcript;
let tx = transcript::create_transcript();
tx = transcript::add_turn(tx, "user", "What's the weather today?");
tx = transcript::add_turn(tx, "agent", "It's sunny and 72 degrees.");
let full_text = transcript::to_text(tx);
let json_export = transcript::to_json(tx);
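The transcript module composes naturally with batch pipelines. A sketch that records both sides of a single exchange, assuming the assistant pipeline from Section 16.12 (the file paths are illustrative):

```neam
import std::voice::transcript;
{
let tx = transcript::create_transcript();
let result = voice_pipeline_run("assistant", "/tmp/mic.wav", "/tmp/reply.mp3");
// Log both sides of the exchange
tx = transcript::add_turn(tx, "user", result["input_text"]);
tx = transcript::add_turn(tx, "agent", result["response_text"]);
// Export the structured history for logging or analysis
emit transcript::to_json(tx);
}
```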
Voice Budgets (std.voice.budget) #
Voice interactions can be expensive. The budget module tracks costs and enforces limits across token usage, API calls, and wall-clock time:
import std::voice::budget;
let voice_budget = budget::create_budget({
"max_cost_usd": 5.00,
"max_turns": 50,
"max_duration_ms": 600000
});
let remaining = budget::get_remaining(voice_budget);
let is_exceeded = budget::is_exceeded(voice_budget);
Additional Voice Modules #
| Module | Purpose |
|---|---|
| std.voice.turn | Turn-level management (start, end, metadata) |
| std.voice.policy | Conversation policies (max silence, retry limits) |
| std.voice.metrics | Latency, word count, turn duration tracking |
| std.voice.audit | Audit logging for compliance and debugging |
| std.voice.consent | User consent management for recording |
| std.voice.taint | Data tainting for sensitive audio content |
| std.voice.tracing | Distributed tracing for voice pipelines |
| std.voice.playback | Audio playback utilities |
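As one illustration of how these modules are meant to compose, the sketch below checks recording consent before opening a session. The consent API shown (consent::request, consent::is_granted) is hypothetical, extrapolated from the module's stated purpose, and is not part of the documented surface:

```neam
import std::voice::consent;
{
// Hypothetical API: ask the user to approve recording
let c = consent::request("recording");
emit "Consent granted: " + str(consent::is_granted(c));
// A real agent would open the session only after consent is granted
let session = realtime_connect("RealtimeAssistant");
realtime_close(session);
}
```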
16.15 Audio Utilities #
The standard library provides low-level audio utilities (std.audio) for working
with audio data directly, which is useful for custom voice pipelines and pre/post
processing.
Audio Resampling #
When mixing providers that use different sample rates (e.g., 16kHz whisper.cpp input with 24kHz OpenAI Realtime output), use the built-in resample function:
{
let resampled = voice_audio_resample(pcm_data, 16000, 24000);
}
Audio Buffer and Codec Modules #
| Module | Purpose |
|---|---|
| std.audio.buffer | Ring buffers for streaming audio, append/read/drain operations |
| std.audio.codec | Encode and decode between PCM, WAV, MP3, Opus, FLAC |
| std.audio.stream | Streaming audio source/sink abstractions |
| std.audio.meter | Real-time audio level metering (RMS, peak, silence detection) |
import std::audio::meter;
let level = meter::get_rms(audio_chunk);
let is_silent = meter::is_silent(audio_chunk, -40.0); // threshold in dB
Audio metering is particularly useful for implementing custom VAD logic when the built-in server/client/manual modes do not fit your use case.
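For example, a client-VAD session could use metering to decide when the user has stopped speaking. A sketch assuming the client_vad_example configuration from Section 16.9; the chunk variable (a captured audio segment) and the -40 dB threshold are illustrative:

```neam
import std::audio::meter;
{
let session = realtime_connect("client_vad_example");
realtime_on(session, "response_text", 0);
// Stream the captured chunk, then meter it
realtime_send_audio(session, crypto_base64_encode(chunk));
let silent = meter::is_silent(chunk, -40.0);
emit "Silent: " + str(silent);
// Once a chunk is silent, treat the user's turn as finished
voice_commit_audio(session);
time_sleep(5000);
realtime_close(session);
}
```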
16.16 Speech Configuration in the Standard Library #
The std.speech modules provide detailed configuration for STT and TTS beyond what the
voice and realtime_voice declarations expose directly.
STT Configuration (std.speech.stt) #
import std::speech::stt;
let stt_config = stt::create_config({
"language": "en-US",
"punctuation": true,
"profanity_filter": false,
"diarization": true, // speaker identification
"word_timestamps": true
});
Endpointing modes control how the STT engine detects the end of an utterance:
| Mode | Description |
|---|---|
| "aggressive" | Quick end-of-speech detection, responsive but may cut off |
| "normal" | Balanced detection (default) |
| "conservative" | Waits longer before ending, good for complex speech |
| "manual" | No automatic detection; caller controls |
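The mode is presumably supplied through the same configuration map as the other STT settings; the "endpointing" key below is an assumption based on the table above, not a documented field:

```neam
import std::speech::stt;
let stt_config = stt::create_config({
"language": "en-US",
// Assumed key: wait longer before closing an utterance
"endpointing": "conservative"
});
```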
TTS Configuration (std.speech.tts) #
import std::speech::tts;
let tts_config = tts::create_config({
"voice": "nova",
"language": "en-US",
"speed": 1.0,
"pitch": 0.0,
"volume": 1.0,
"style": "conversational"
});
The TTS module also supports SSML (Speech Synthesis Markup Language) for fine-grained control over pronunciation, pauses, and emphasis:
let ssml_text = "<speak>Welcome to <emphasis level='strong'>Neam</emphasis>. " +
"<break time='500ms'/> How can I help you today?</speak>";
SSML support varies by provider. OpenAI TTS uses tts_instructions
instead of SSML markup. Gemini and ElevenLabs support SSML natively.
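Where the provider accepts SSML, the marked-up string can be passed straight through the batch synthesis function. A sketch assuming the elevenlabs_pipeline from Section 16.5 and that the pipeline forwards the markup untouched:

```neam
{
let ssml = "<speak>Welcome. <break time='500ms'/> How can I help?</speak>";
// ElevenLabs supports SSML natively, so the tags control pauses directly
let path = voice_synthesize("elevenlabs_pipeline", ssml, "/tmp/welcome.mp3");
emit "Audio saved to: " + path;
}
```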
Summary #
In this chapter you learned:
- Voice agent architecture: The three-stage STT -> Agent -> TTS pipeline.
- Batch pipelines: Processing audio files end-to-end with the voice declaration.
- Voice functions: voice_transcribe(), voice_synthesize(), and voice_pipeline_run() for batch operations.
- STT providers: OpenAI Whisper, Gemini, and local whisper.cpp.
- TTS providers: OpenAI, Gemini, Kokoro, Piper, and ElevenLabs.
- Cross-provider pipelines: Mixing any STT, agent, and TTS provider.
- Real-time streaming: Full-duplex WebSocket voice with realtime_voice.
- VAD configuration: Server, client, and manual voice activity detection modes.
- Barge-in handling: Interrupting the agent mid-response.
- Provider comparison: Trade-offs between OpenAI Realtime, Gemini Live, and local streaming.
- Voice standard library modules: Session management, transcript handling, voice budgets, turn management, policies, metrics, audit logging, and consent tracking.
- Audio utilities: Resampling between sample rates, audio buffers, codecs, streaming abstractions, and real-time metering for custom VAD.
- Speech configuration: Fine-grained STT settings (diarization, endpointing modes) and TTS settings (pitch, speed, style, SSML support).
Exercises #
Exercise 16.1: Basic Voice Pipeline #
Create a batch voice pipeline using OpenAI Whisper for STT and OpenAI TTS. Record a short audio clip (or use a sample WAV file), run it through the pipeline, and verify that the transcription, response, and synthesized audio are all correct.
Exercise 16.2: Local Voice Setup #
Set up a fully local voice pipeline using whisper.cpp for STT, Ollama for the agent, and Kokoro for TTS. Document the setup process, including starting each service. Run the same audio clip from Exercise 16.1 and compare the results.
Exercise 16.3: Cross-Provider Pipeline #
Create a pipeline that uses Gemini for STT, OpenAI for the agent, and Kokoro for TTS. Test it with audio in a non-English language (e.g., Spanish, French, or Japanese). How does the transcription quality compare to OpenAI Whisper for that language?
Exercise 16.4: VAD Sensitivity Tuning #
Create two real-time voice configurations with different VAD settings:
- "Sensitive": vad_threshold: "0.3", silence_duration_ms: "300"
- "Relaxed": vad_threshold: "0.8", silence_duration_ms: "1000"
Test both with the same conversation. Document the differences in user experience. Under what conditions would you prefer each setting?
Exercise 16.5: Voice-Enabled RAG Bot #
Build a voice agent that uses a RAG knowledge base to answer questions about a topic of your choice. The pipeline should: (a) transcribe a question from audio, (b) retrieve relevant context, (c) generate an answer, and (d) synthesize the answer to audio. Test with at least three different questions.
Exercise 16.6: Real-Time Conversation #
Using OpenAI Realtime or Gemini Live, implement a real-time voice session that: 1. Connects to the provider. 2. Sends a greeting via text. 3. Handles the response events. 4. Sends a follow-up question. 5. Demonstrates barge-in (interrupt the agent mid-response). 6. Cleanly closes the session.
Document the latency you observe at each step.