Programming Neam

Chapter 16: Voice Agents #

"The next interface is not a screen -- it is a conversation."

Text-based agents are powerful, but many real-world applications demand voice interaction: customer service hotlines, in-car assistants, accessibility tools, hands-free operation in industrial settings, and smart home devices. Building a voice agent traditionally requires stitching together separate speech-to-text (STT), language model, and text-to-speech (TTS) services, each with its own API, audio format, and error handling.

Neam makes voice a first-class construct. With the voice and realtime_voice declarations, you define complete voice pipelines in a few lines. The runtime handles audio encoding, API calls, format conversion, and event streaming. This chapter covers both batch pipelines (process an audio file end-to-end) and real-time streaming (full-duplex WebSocket voice with sub-second latency).


16.1 Voice Agent Architecture #

A voice agent consists of three stages connected in series:

Audio Input -> STT (Speech-to-Text) -> Agent (LLM) -> TTS (Text-to-Speech) -> Audio Output

STT (Speech-to-Text): Converts audio into a text transcription.

Agent (LLM): Processes the transcribed text and generates a response. This is a standard Neam agent -- it can use tools, knowledge bases, guardrails, and all other agent features.

TTS (Text-to-Speech): Converts the agent's text response into audio.

The beauty of this architecture is that each stage is independently configurable. You can use OpenAI Whisper for STT, Anthropic Claude for the agent, and a local Kokoro instance for TTS -- all in the same pipeline.


16.2 Batch Voice Pipelines #

Batch pipelines process audio files through the STT -> Agent -> TTS flow. They are ideal for processing recorded audio, building voice message systems, or testing voice interactions without a live microphone.

Defining a Batch Pipeline #

neam
// Step 1: Define the agent
agent assistant {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "You are a helpful voice assistant. Keep responses brief and conversational."
}

// Step 2: Define the voice pipeline
voice my_pipeline {
  agent: assistant
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "openai"
  tts_model: "tts-1"
  tts_voice: "alloy"
}

The voice declaration connects an existing agent to STT and TTS providers. The agent itself is unchanged -- it still processes text. The voice pipeline wraps it with audio I/O.

Voice Pipeline Fields #

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| agent | identifier | Yes | -- | The agent to process transcribed text |
| stt_provider | string | No | "whisper" | STT provider name |
| stt_model | string | No | "whisper-1" | STT model identifier |
| stt_endpoint | string | No | -- | Custom endpoint URL (local providers) |
| stt_language | string | No | -- | ISO-639-1 language code (e.g., "en", "es") |
| stt_format | string | No | "json" | Response format |
| tts_provider | string | No | "openai" | TTS provider name |
| tts_model | string | No | "tts-1" | TTS model identifier |
| tts_voice | string | No | "alloy" | Voice name |
| tts_endpoint | string | No | -- | Custom endpoint URL (local providers) |
| tts_format | string | No | "mp3" | Output audio format |
| tts_speed | string | No | "1.0" | Playback speed (0.25-4.0, OpenAI only) |
| tts_instructions | string | No | -- | Tone/style instructions (gpt-4o-mini-tts only) |

16.3 Voice Native Functions #

Neam provides three native functions for batch voice pipelines:

voice_transcribe(pipeline, audio_path) #

Transcribes an audio file to text using the pipeline's STT provider.

neam
{
  let text = voice_transcribe("my_pipeline", "/tmp/input.wav");
  emit "You said: " + text;
}

Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- audio_path (string): Path to the input audio file (WAV, MP3, FLAC, etc.).

Returns: A string containing the transcription.

voice_synthesize(pipeline, text, output_path) #

Synthesizes text into an audio file using the pipeline's TTS provider.

neam
{
  let audio_path = voice_synthesize("my_pipeline", "Hello there!", "/tmp/greeting.mp3");
  emit "Audio saved to: " + audio_path;
}

Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- text (string): The text to synthesize.
- output_path (string): Path where the output audio file will be saved.

Returns: A string containing the output file path.

voice_pipeline_run(pipeline, input_path, output_path) #

Runs the complete STT -> Agent -> TTS pipeline on an audio file.

neam
{
  let result = voice_pipeline_run("my_pipeline", "/tmp/input.wav", "/tmp/output.mp3");
  emit "Input:    " + result["input_text"];
  emit "Response: " + result["response_text"];
  emit "Audio:    " + result["output_audio"];
}

Parameters:
- pipeline (string): The name of the voice pipeline declaration.
- input_path (string): Path to the input audio file.
- output_path (string): Path where the output audio file will be saved.

Returns: A map with three keys:
- "input_text": The transcribed user speech.
- "response_text": The agent's text response.
- "output_audio": The path to the generated audio file.


16.4 STT Providers #

Neam supports three speech-to-text providers.

OpenAI Whisper (Cloud) #

The default STT provider. Uses OpenAI's Whisper API.

neam
voice whisper_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  stt_language: "en"
  stt_format: "json"
  tts_provider: "openai"
  tts_model: "tts-1"
  tts_voice: "alloy"
}

Available models:

| Model | Cost | Notes |
| --- | --- | --- |
| whisper-1 | $0.006/min | Standard model, 25 MB file limit |
| gpt-4o-transcribe | $0.006/min | Lower word error rate, fewer hallucinations |
| gpt-4o-mini-transcribe | $0.003/min | Half the cost of standard |

Supported formats: WAV, MP3, M4A, FLAC, OGG, WebM (max 25 MB).

Requires: OPENAI_API_KEY environment variable.
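The 25 MB cap and the format list above are easy to trip over with long recordings. A minimal pre-flight check, sketched here in Python (the helper name and the extension-based format check are illustrative, not part of Neam):

```python
import os

# Formats and size cap taken from the Whisper table above.
WHISPER_FORMATS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB API limit

def check_whisper_upload(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in WHISPER_FORMATS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if os.path.exists(path) and os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds the 25 MB limit")
    return problems
```

Running a check like this before voice_transcribe avoids paying for a round trip that the API will reject anyway.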

Gemini STT (Cloud) #

Uses Google's Gemini models for audio understanding via the generateContent API.

neam
voice gemini_stt_pipeline {
  agent: my_agent
  stt_provider: "gemini"
  stt_model: "gemini-2.0-flash"
  tts_provider: "openai"
  tts_model: "tts-1"
  tts_voice: "alloy"
}

Gemini handles audio natively -- it does not use a separate STT model. The audio is sent as part of the multimodal input, and Gemini produces a text transcription.

Supported formats: WAV, MP3, AIFF, AAC, OGG, FLAC (max 20 MB inline).

Requires: GEMINI_API_KEY environment variable.

Local whisper.cpp #

Uses a locally running whisper.cpp server with an OpenAI-compatible API. No cloud dependency, no API key, and your audio data never leaves your machine.

neam
voice local_stt_pipeline {
  agent: my_agent
  stt_provider: "whisper-local"
  stt_model: "base.en"
  stt_endpoint: "http://localhost:8080"
  tts_provider: "kokoro"
  tts_model: "tts-1"
  tts_voice: "af_heart"
  tts_endpoint: "http://localhost:8880"
}

Setting up whisper.cpp:

bash
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp
cmake -B build && cmake --build build --parallel

# Download a model
sh ./models/download-ggml-model.sh base.en

# Start the server
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080

Available models:

| Model | Size | Speed | Accuracy |
| --- | --- | --- | --- |
| tiny.en | 75 MB | Fastest | Good for short commands |
| base.en | 142 MB | Fast | Good general purpose |
| small.en | 466 MB | Medium | Better accuracy |
| medium.en | 1.5 GB | Slow | High accuracy |
| large-v3 | 3.1 GB | Slowest | Best accuracy, multilingual |

16.5 TTS Providers #

Neam supports five text-to-speech providers.

OpenAI TTS (Cloud) #

neam
voice openai_tts_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "openai"
  tts_model: "tts-1"
  tts_voice: "nova"
  tts_format: "mp3"
  tts_speed: "1.0"
}

Models:

| Model | Cost | Notes |
| --- | --- | --- |
| tts-1 | $15/1M chars | Low latency, good for real-time |
| tts-1-hd | $30/1M chars | Higher audio quality |
| gpt-4o-mini-tts | ~$0.015/min | Best quality, supports tts_instructions |

Voices: alloy, ash, ballad, cedar, coral, echo, fable, marin, nova, onyx, sage, shimmer, verse

Output formats: MP3, Opus, AAC, FLAC, WAV, PCM

The gpt-4o-mini-tts model supports a tts_instructions field for controlling tone and style:

neam
voice expressive_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "openai"
  tts_model: "gpt-4o-mini-tts"
  tts_voice: "coral"
  tts_instructions: "Speak in a warm, friendly tone with natural pauses."
}

Requires: OPENAI_API_KEY

Gemini TTS (Cloud) #

neam
voice gemini_tts_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "gemini"
  tts_model: "gemini-2.5-flash-preview-tts"
  tts_voice: "Kore"
}

Voices: 30 voices, including Zephyr, Puck, Charon, and Kore. Supports 24 languages and multi-speaker synthesis.

Output: PCM 24kHz 16-bit mono.

Requires: GEMINI_API_KEY
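Because the output is headerless PCM, most players cannot open it directly. One way to make it playable is to wrap it in a WAV container; a Python sketch using the stdlib wave module (the helper is illustrative, not a Neam API):

```python
import wave

# Gemini TTS returns raw PCM (24 kHz, 16-bit, mono, per the note above).
# Wrapping it in a WAV container makes it playable by standard tools.
def pcm_to_wav(pcm_bytes: bytes, out_path: str, rate: int = 24000) -> None:
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(rate)   # 24 kHz
        wav.writeframes(pcm_bytes)
```

After wrapping, the file plays with any WAV-capable tool (e.g., afplay on macOS).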

Kokoro (Local) #

Kokoro-FastAPI provides high-quality local TTS with an OpenAI-compatible API.

neam
voice kokoro_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "kokoro"
  tts_model: "tts-1"
  tts_voice: "af_heart"
  tts_endpoint: "http://localhost:8880"
}

Starting Kokoro:

bash
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi

No API key required. Audio data stays local.

Piper (Local) #

Piper is a lightweight, fast TTS engine designed for edge deployment.

neam
voice piper_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "piper"
  tts_model: "tts-1"
  tts_voice: "alloy"
  tts_endpoint: "http://localhost:5000"
}

Starting Piper:

bash
# Piper runs as a local HTTP server
# See https://github.com/rhasspy/piper for installation instructions

No API key required.

ElevenLabs (Cloud) #

ElevenLabs provides premium voice synthesis with highly realistic voices.

neam
voice elevenlabs_pipeline {
  agent: my_agent
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "elevenlabs"
  tts_model: "eleven_multilingual_v2"
  tts_voice: "Rachel"
  tts_endpoint: "https://api.elevenlabs.io"
}

Requires: ELEVENLABS_API_KEY


16.6 Cross-Provider Pipelines #

One of Neam's strengths is the ability to mix and match any STT provider, agent (LLM), and TTS provider in a single pipeline. This lets you optimize each stage independently.

neam
// Gemini STT (good multilingual) -> OpenAI Agent -> Kokoro TTS (free, local)
agent smart_agent {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "You are a multilingual voice assistant."
}

voice cross_provider {
  agent: smart_agent
  stt_provider: "gemini"
  stt_model: "gemini-2.0-flash"
  stt_language: "es"
  tts_provider: "kokoro"
  tts_model: "tts-1"
  tts_voice: "af_heart"
  tts_endpoint: "http://localhost:8880"
}

{
  let result = voice_pipeline_run("cross_provider", "/tmp/spanish.wav", "/tmp/response.wav");
  emit "Transcription: " + result["input_text"];
  emit "Response: " + result["response_text"];
}

Fully Local Pipeline (No Cloud) #

For maximum privacy and offline operation:

neam
agent local_bot {
  provider: "ollama"
  model: "llama3"
  system: "You are a helpful local assistant. Keep responses under 50 words."
}

voice fully_local {
  agent: local_bot
  stt_provider: "whisper-local"
  stt_model: "base.en"
  stt_endpoint: "http://localhost:8080"
  tts_provider: "kokoro"
  tts_model: "tts-1"
  tts_voice: "af_heart"
  tts_endpoint: "http://localhost:8880"
}

{
  let result = voice_pipeline_run("fully_local", "/tmp/mic.wav", "/tmp/response.wav");
  emit "You said: " + result["input_text"];
  emit "Response: " + result["response_text"];
}

Prerequisites for the fully local pipeline:

bash
# 1. Start Ollama
ollama pull llama3
ollama serve

# 2. Start whisper.cpp
cd whisper.cpp
./build/bin/whisper-server -m models/ggml-base.en.bin --port 8080

# 3. Start Kokoro
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi

16.7 Real-Time Voice Streaming #

Batch pipelines process complete audio files. Real-time streaming provides full-duplex WebSocket communication for live, interactive voice conversations with sub-second latency.

The client streams microphone audio through a PCM encoder and plays back decoded PCM from the server, all over a single WebSocket connection:

Client events: audio chunk, text message, commit_audio, interrupt
Server events: transcript, response_text, response_audio, tool_call, error

Three backends implement this interface: the OpenAI Realtime API, the Gemini Live API, and a local pipeline (whisper.cpp + Ollama + Kokoro).

Defining a Real-Time Voice Agent #

neam
agent rt_assistant {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "You are a helpful voice assistant. Keep responses concise."
}

realtime_voice RealtimeAssistant {
  agent: rt_assistant
  rt_provider: "openai"
  rt_model: "gpt-4o-realtime-preview"
  rt_voice: "coral"
  rt_vad: "server"
  vad_threshold: "0.5"
  silence_duration_ms: "500"
  input_format: "pcm16"
  output_format: "pcm16"
  sample_rate: "24000"
  rt_speed: "1.0"
}

Real-Time Configuration Fields #

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| agent | identifier | Yes | -- | The agent for this voice session |
| rt_provider | string | Yes | -- | Provider: "openai", "gemini", "local" |
| rt_model | string | Yes | -- | Real-time model identifier |
| rt_voice | string | No | "alloy" | Output voice name |
| rt_vad | string | No | "server" | VAD mode: "server", "client", "manual" |
| vad_threshold | string | No | "0.5" | VAD sensitivity (0.0-1.0) |
| silence_duration_ms | string | No | "500" | Silence before end-of-turn (ms) |
| input_format | string | No | "pcm16" | Input audio format |
| output_format | string | No | "pcm16" | Output format: pcm16, g711_ulaw, g711_alaw |
| sample_rate | string | No | "24000" | Audio sample rate in Hz |
| rt_speed | string | No | "1.0" | Playback speed (OpenAI: 0.25-1.5) |

16.8 Real-Time Functions #

Connection Management #

neam
{
  // Open a WebSocket session
  let session = realtime_connect("RealtimeAssistant");

  // ... interact with the session ...

  // Close the session
  realtime_close(session);
}

Sending Data #

neam
{
  let session = realtime_connect("RealtimeAssistant");

  // Send a text message (agent responds with audio + text)
  realtime_send_text(session, "Hello, how are you?");

  // Send audio data (base64-encoded PCM)
  let audio_data = crypto_base64_encode(file_read_string("/tmp/audio.pcm"));
  realtime_send_audio(session, audio_data);

  // Signal end of user's turn (client/manual VAD mode)
  voice_commit_audio(session);

  // Clear the audio buffer without committing
  voice_clear_audio(session);

  // Explicitly request a model response
  voice_request_response(session);

  realtime_close(session);
}

Event Handling #

Register callbacks to handle events from the server:

neam
{
  let session = realtime_connect("RealtimeAssistant");

  // Register event handlers (use 0 for default print handler)
  realtime_on(session, "transcript", 0);      // User speech transcription
  realtime_on(session, "response_text", 0);   // Agent text response
  realtime_on(session, "response_audio", 0);  // Agent audio chunks
  realtime_on(session, "tool_call", 0);       // Function calls from model
  realtime_on(session, "error", 0);           // Error messages

  realtime_send_text(session, "Tell me a joke");
  time_sleep(5000);  // Wait for response

  realtime_close(session);
}

Event output format with default handler:

text
[transcript] What the user said
[response] Here is a joke for you...
[tool_call] get_weather({"city": "Paris"}) id=call_abc123
[error] Connection timeout

Interruption (Barge-In) #

neam
{
  let session = realtime_connect("RealtimeAssistant");
  realtime_on(session, "response_text", 0);

  // Ask for a long response
  realtime_send_text(session, "Tell me a long story about a dragon");
  time_sleep(1000);

  // Interrupt mid-response
  realtime_interrupt(session);

  // Send a new message
  realtime_send_text(session, "Actually, just say hello");
  time_sleep(5000);

  realtime_close(session);
}

Tool Call Handling #

When the model invokes a function during a real-time session:

neam
{
  let session = realtime_connect("RealtimeAssistant");
  realtime_on(session, "tool_call", 0);

  // When a tool_call event fires, send the result back
  // tool_call format: function_name(args) id=call_id
  realtime_tool_result(session, "call_abc123", "{\"temperature\": 22}");

  realtime_close(session);
}

Session Status #

neam
{
  let session = realtime_connect("RealtimeAssistant");
  let status = voice_session_status(session);
  emit "Session status: " + str(status);
  // Returns: "connected", "disconnected", or error details
  realtime_close(session);
}

Complete Real-Time Functions Reference #

| Function | Args | Description |
| --- | --- | --- |
| realtime_connect(config_name) | 1 | Open WebSocket session, returns session ID |
| realtime_send_text(session, text) | 2 | Send text message to agent |
| realtime_send_audio(session, b64_pcm) | 2 | Send base64-encoded PCM audio chunk |
| realtime_on(session, event, callback) | 3 | Register event handler (use 0 for default) |
| realtime_interrupt(session) | 1 | Cancel current response (barge-in) |
| realtime_close(session) | 1 | Close WebSocket session |
| realtime_tool_result(session, call_id, result) | 3 | Send function call output |
| voice_commit_audio(session) | 1 | Commit audio buffer as user turn |
| voice_clear_audio(session) | 1 | Clear audio buffer |
| voice_request_response(session) | 1 | Explicitly request a model response |
| voice_session_status(session) | 1 | Get session status |

16.9 VAD (Voice Activity Detection) #

Voice Activity Detection determines when the user has finished speaking and the agent should respond. Neam supports three VAD modes: server, client, and manual.

Server VAD #

The provider's server detects speech start and end automatically. This is the simplest and most reliable mode.

neam
realtime_voice server_vad_example {
  agent: my_agent
  rt_provider: "openai"
  rt_model: "gpt-4o-realtime-preview"
  rt_voice: "coral"
  rt_vad: "server"
  vad_threshold: "0.5"
  silence_duration_ms: "500"
}

Tuning parameters: raise vad_threshold to make speech detection less sensitive to background noise, and increase silence_duration_ms to wait longer before treating a pause as the end of the user's turn.

Client VAD #

The client application manages turn detection. Use voice_commit_audio() to signal when the user has finished speaking.

neam
realtime_voice client_vad_example {
  agent: my_agent
  rt_provider: "openai"
  rt_model: "gpt-4o-realtime-preview"
  rt_voice: "echo"
  rt_vad: "client"
}

{
  let session = realtime_connect("client_vad_example");
  realtime_on(session, "response_text", 0);

  // Send audio chunks
  let audio = crypto_base64_encode(file_read_string("/tmp/speech.pcm"));
  realtime_send_audio(session, audio);

  // Signal end of turn (client decides when user is done)
  voice_commit_audio(session);

  time_sleep(5000);
  realtime_close(session);
}
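In client VAD mode, deciding when to call voice_commit_audio is up to you. A common approach is to commit after a run of quiet frames; a Python sketch of that decision (frame length, threshold, and the peak-amplitude measure are all assumptions):

```python
# Client-side end-of-turn detection: commit the turn once the trailing run
# of quiet frames is long enough. With 20 ms frames, 25 frames = 500 ms of
# silence, matching the silence_duration_ms default used elsewhere.
def end_of_turn(frame_peaks: list[int], silence_frames: int = 25,
                threshold: int = 500) -> bool:
    """frame_peaks: per-frame peak amplitude of 16-bit PCM (assumed 20 ms frames)."""
    quiet = 0
    for peak in reversed(frame_peaks):
        if peak < threshold:
            quiet += 1
        else:
            break
    return quiet >= silence_frames
```

When end_of_turn returns True, the client would stop sending audio and call voice_commit_audio on the session.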

Manual VAD #

No automatic detection. You must explicitly commit audio and request responses.

neam
realtime_voice manual_vad_example {
  agent: my_agent
  rt_provider: "openai"
  rt_model: "gpt-4o-realtime-preview"
  rt_voice: "sage"
  rt_vad: "manual"
}

{
  let session = realtime_connect("manual_vad_example");

  // Send audio
  realtime_send_audio(session, audio_data);

  // Commit the audio buffer
  voice_commit_audio(session);

  // Explicitly request a response
  voice_request_response(session);

  time_sleep(5000);
  realtime_close(session);
}

16.10 Real-Time Provider Comparison #

OpenAI Realtime #

neam
realtime_voice openai_rt {
  agent: my_agent
  rt_provider: "openai"
  rt_model: "gpt-4o-realtime-preview"
  rt_voice: "coral"
  rt_vad: "server"
  vad_threshold: "0.5"
  silence_duration_ms: "500"
  input_format: "pcm16"
  output_format: "pcm16"
  sample_rate: "24000"
  rt_speed: "1.0"
}

Models:

| Model | Text Input | Audio Input | Audio Output |
| --- | --- | --- | --- |
| gpt-4o-realtime-preview | $4/1M tok | $32/1M tok (~$0.06/min) | $64/1M tok (~$0.24/min) |
| gpt-realtime-mini | $0.60/1M tok | $10/1M tok | $20/1M tok |

Audio format: PCM16, 24kHz, mono, little-endian (base64-encoded).

Session duration: Unlimited.
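A client feeding realtime_send_audio must deliver base64-encoded PCM16 chunks. A Python sketch of the chunking and encoding step (the 100 ms chunk size is an assumption; any reasonable size works):

```python
import base64

# Slice a 16-bit mono PCM buffer into fixed-duration chunks and base64-encode
# each one, matching the format realtime_send_audio expects.
def pcm_chunks_b64(pcm: bytes, chunk_ms: int = 100, rate: int = 24000):
    """Yield base64-encoded chunks, each holding chunk_ms of audio."""
    bytes_per_chunk = rate * 2 * chunk_ms // 1000  # 2 bytes per 16-bit sample
    for i in range(0, len(pcm), bytes_per_chunk):
        yield base64.b64encode(pcm[i:i + bytes_per_chunk]).decode("ascii")
```

Each yielded string would be passed to realtime_send_audio as it is produced.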

Gemini Live #

neam
realtime_voice gemini_live {
  agent: my_agent
  rt_provider: "gemini"
  rt_model: "gemini-2.0-flash-live-001"
  rt_voice: "Puck"
  rt_vad: "server"
  vad_threshold: "0.5"
  silence_duration_ms: "500"
  input_format: "pcm16"
  output_format: "pcm16"
  sample_rate: "24000"
  rt_speed: "1.0"
}

Audio format: PCM16, 16kHz mono input / 24kHz mono output.

Session limit: Approximately 10 minutes per WebSocket connection (auto-resumable with handle tokens).

Voices: 30 voices, including Zephyr, Puck, Charon, and Kore. Supports 24 languages.

Local Streaming #

Combines whisper.cpp (STT) + Ollama (LLM) + Kokoro/Piper (TTS) for fully local real-time voice.

neam
agent local_llm {
  provider: "ollama"
  model: "llama3"
  system: "You are a helpful assistant. Keep responses concise."
}

realtime_voice local_stream {
  agent: local_llm
  rt_provider: "local"
  rt_model: "llama3"
  rt_voice: "af_heart"
  rt_vad: "server"
  vad_threshold: "0.5"
  silence_duration_ms: "500"
  input_format: "pcm16"
  output_format: "pcm16"
  sample_rate: "16000"
  rt_speed: "1.0"
  rt_stt_endpoint: "http://localhost:8080"
  rt_tts_endpoint: "http://localhost:8880"
  rt_llm_endpoint: "http://localhost:11434"
}

How local streaming works:

  1. Audio segments are sent to whisper.cpp for transcription.
  2. LLM response tokens stream from Ollama.
  3. Complete sentences are synthesized via Kokoro/Piper incrementally.
  4. Audio chunks are delivered as they become ready.
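Step 3 is the interesting part: synthesis starts as soon as a sentence boundary appears in the token stream, rather than waiting for the full response. A Python sketch of that sentence-flushing logic (the boundary regex is an assumption; the actual runtime may segment differently):

```python
import re

# As LLM tokens stream in, flush complete sentences to TTS so audio
# synthesis can begin before the full response has been generated.
def flush_sentences(buffer: str) -> tuple[list[str], str]:
    """Split buffer into complete sentences plus the unfinished remainder."""
    parts = re.split(r"(?<=[.!?])\s+", buffer)
    if buffer and not re.search(r"[.!?]\s*$", buffer):
        return parts[:-1], parts[-1]   # last piece is still in progress
    return [p for p in parts if p], ""
```

The caller keeps the remainder, appends the next tokens to it, and calls flush_sentences again, sending each completed sentence to Kokoro/Piper.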

Provider Comparison Matrix #

| Feature | OpenAI Realtime | Gemini Live | Local Pipeline |
| --- | --- | --- | --- |
| Protocol | WebSocket | WebSocket | REST (chunked) |
| Latency | Ultra-low | Ultra-low | Low-medium |
| Input Audio | PCM16 24kHz | PCM16 16kHz | PCM16 16kHz |
| Output Audio | PCM16/g711 24kHz | PCM16 24kHz | WAV/MP3 |
| VAD | Server (configurable) | Auto activity detection | Client-side |
| Barge-in | Yes (interrupt) | Yes (automatic) | Manual |
| Function Calling | During stream | During stream | Via LLM |
| Session Duration | Unlimited | ~10 min (resumable) | Per-connection |
| Multi-speaker | No | Yes (via config) | No |
| Cost | $0.06-0.24/min | Varies by model | Free (local) |
| API Key | Yes | Yes | No |

16.11 Barge-In and Interruption Handling #

Barge-in is the ability for a user to interrupt the agent while it is speaking. This is essential for natural conversation flow.

OpenAI Realtime #

Use realtime_interrupt() to cancel the current response:

neam
{
  let session = realtime_connect("openai_rt");
  realtime_on(session, "response_text", 0);

  realtime_send_text(session, "Explain quantum physics in detail");
  time_sleep(2000);  // Let agent start responding

  // User interrupts
  realtime_interrupt(session);
  emit "[Interrupted]";

  // New question
  realtime_send_text(session, "Just give me a one-sentence summary");
  time_sleep(5000);

  realtime_close(session);
}

Gemini Live #

Gemini handles barge-in automatically via activity detection. When the server detects new user speech while the agent is responding, it stops the current response and processes the new input.

Local Pipeline #

Barge-in in local streaming is manual. Send a new audio segment while a response is being generated, and the system will interrupt the current TTS output.


16.12 Practical Example: Voice Assistant with macOS Microphone #

This example demonstrates a complete voice assistant workflow on macOS:

neam
agent voice_bot {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "You are a voice assistant. Keep responses under 50 words.
           Be conversational and friendly."
}

voice assistant {
  agent: voice_bot
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "openai"
  tts_model: "tts-1"
  tts_voice: "nova"
}

{
  emit "=== Voice Assistant ===";
  emit "Recording from microphone...";
  emit "(Record audio to /tmp/mic.wav first, then run this program)";
  emit "";

  let result = voice_pipeline_run("assistant", "/tmp/mic.wav", "/tmp/response.wav");
  emit "You said: " + result["input_text"];
  emit "Response: " + result["response_text"];
  emit "Audio saved to: " + result["output_audio"];
}

Recording audio on macOS:

bash
# Install SoX for command-line recording (one-time)
brew install sox

# Record (press Ctrl+C to stop)
rec -r 16000 -c 1 -b 16 /tmp/mic.wav

# Compile and run the voice assistant
./neamc voice_assistant.neam -o voice_assistant.neamb
./neam voice_assistant.neamb

# Play the response (afplay is built into macOS)
afplay /tmp/response.wav

16.13 Practical Example: Knowledge-Augmented Voice Agent #

Combine voice with RAG for a voice-enabled documentation assistant:

neam
knowledge docs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [
    { type: "file", path: "./README.md" }
  ]
  retrieval_strategy: "basic"
  top_k: 3
}

agent doc_bot {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "You are a documentation assistant. Answer using only the provided
           context. Keep responses under 3 sentences for voice."
  connected_knowledge: [docs]
}

voice doc_voice {
  agent: doc_bot
  stt_provider: "whisper"
  stt_model: "whisper-1"
  tts_provider: "openai"
  tts_model: "tts-1"
  tts_voice: "alloy"
}

{
  let result = voice_pipeline_run("doc_voice", "/tmp/question.wav", "/tmp/answer.wav");
  emit "Question: " + result["input_text"];
  emit "Answer: " + result["response_text"];
}

16.14 Voice Standard Library Modules #

Beyond the native functions and declarations, Neam's standard library provides a rich set of modules for building production voice systems. These modules handle concerns that go beyond basic STT/TTS, including session management, transcripts, budgets, and compliance.

Voice Agent Builder (std.voice.agent) #

The voice agent builder provides a higher-level API for constructing voice agents with configurable policies and behaviors:

neam
import std::voice::agent;

let config = agent::create_voice_agent({
  "agent": "voice_bot",
  "pipeline": "assistant",
  "max_turns": 20,
  "greeting": "Hello! How can I help you today?",
  "farewell": "Goodbye! Have a great day.",
  "idle_timeout_ms": 30000,
  "error_message": "I'm sorry, I didn't catch that. Could you repeat?"
});

Session Management (std.voice.session) #

Voice sessions track the lifecycle of a conversation, including connection state, turn history, and accumulated context:

neam
import std::voice::session;

let sess = session::create_session("RealtimeAssistant");
let status = session::get_status(sess);       // "connected", "idle", "active"
let duration = session::get_duration(sess);   // elapsed time in ms
let turn_count = session::get_turn_count(sess);
session::end_session(sess);

Transcript Handling (std.voice.transcript) #

The transcript module captures and structures the full conversation history, useful for logging, analysis, and compliance:

neam
import std::voice::transcript;

let tx = transcript::create_transcript();
tx = transcript::add_turn(tx, "user", "What's the weather today?");
tx = transcript::add_turn(tx, "agent", "It's sunny and 72 degrees.");

let full_text = transcript::to_text(tx);
let json_export = transcript::to_json(tx);

Voice Budgets (std.voice.budget) #

Voice interactions can be expensive. The budget module tracks costs and enforces limits across token usage, API calls, and wall-clock time:

neam
import std::voice::budget;

let voice_budget = budget::create_budget({
  "max_cost_usd": 5.00,
  "max_turns": 50,
  "max_duration_ms": 600000
});

let remaining = budget::get_remaining(voice_budget);
let is_exceeded = budget::is_exceeded(voice_budget);
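The semantics implied above are that a budget is exceeded as soon as any single limit is hit. A Python sketch of that check (field names mirror the Neam example; the implementation itself is an assumption):

```python
# A budget trips when ANY limit is reached: cost, turn count, or duration.
# Usage counters would be updated by the runtime after each turn.
def is_exceeded(budget: dict, usage: dict) -> bool:
    return (usage["cost_usd"] >= budget["max_cost_usd"]
            or usage["turns"] >= budget["max_turns"]
            or usage["duration_ms"] >= budget["max_duration_ms"])
```

A voice loop would consult this check before starting each new turn and end the session gracefully once it returns True.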

Additional Voice Modules #

| Module | Purpose |
| --- | --- |
| std.voice.turn | Turn-level management (start, end, metadata) |
| std.voice.policy | Conversation policies (max silence, retry limits) |
| std.voice.metrics | Latency, word count, turn duration tracking |
| std.voice.audit | Audit logging for compliance and debugging |
| std.voice.consent | User consent management for recording |
| std.voice.taint | Data tainting for sensitive audio content |
| std.voice.tracing | Distributed tracing for voice pipelines |
| std.voice.playback | Audio playback utilities |

16.15 Audio Utilities #

The standard library provides low-level audio utilities (std.audio) for working with audio data directly, which is useful for custom voice pipelines and pre/post processing.

Audio Resampling #

When mixing providers that use different sample rates (e.g., 16kHz whisper.cpp input with 24kHz OpenAI Realtime output), use the built-in resample function:

neam
{
  let resampled = voice_audio_resample(pcm_data, 16000, 24000);
}
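Conceptually, resampling maps each output sample back to a fractional position in the input and interpolates. A Python sketch using linear interpolation (production resamplers use filtered polyphase interpolation, so treat this as illustrative only):

```python
# Linear-interpolation resampling of PCM samples, e.g. 16 kHz -> 24 kHz.
def resample_linear(samples: list[int], src_rate: int, dst_rate: int) -> list[int]:
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(round(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out
```

Going from 16 kHz to 24 kHz produces 1.5x as many samples, which is exactly what the whisper.cpp-to-Realtime scenario above requires.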

Audio Buffer and Codec Modules #

| Module | Purpose |
| --- | --- |
| std.audio.buffer | Ring buffers for streaming audio, append/read/drain operations |
| std.audio.codec | Encode and decode between PCM, WAV, MP3, Opus, FLAC |
| std.audio.stream | Streaming audio source/sink abstractions |
| std.audio.meter | Real-time audio level metering (RMS, peak, silence detection) |

For example, metering an audio chunk:

neam
import std::audio::meter;

let level = meter::get_rms(audio_chunk);
let is_silent = meter::is_silent(audio_chunk, -40.0);  // threshold in dB

Audio metering is particularly useful for implementing custom VAD logic when the built-in server/client/manual modes do not fit your use case.
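To make the -40 dB threshold concrete, here is a Python sketch of the kind of RMS-in-dBFS computation that meter::get_rms and meter::is_silent imply for 16-bit PCM (the full-scale convention of 32768 is an assumption):

```python
import math
import struct

# RMS level of little-endian 16-bit PCM, expressed in dBFS
# (0 dBFS = full scale; quieter signals are more negative).
def rms_dbfs(pcm: bytes) -> float:
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9) / 32768.0)

def is_silent(pcm: bytes, threshold_db: float = -40.0) -> bool:
    return rms_dbfs(pcm) < threshold_db
```

Feeding consecutive chunks through is_silent and counting quiet chunks is the core of a custom VAD.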


16.16 Speech Configuration in the Standard Library #

The std.speech modules provide detailed configuration for STT and TTS beyond what the voice and realtime_voice declarations expose directly.

STT Configuration (std.speech.stt) #

neam
import std::speech::stt;

let stt_config = stt::create_config({
  "language": "en-US",
  "punctuation": true,
  "profanity_filter": false,
  "diarization": true,      // speaker identification
  "word_timestamps": true
});

Endpointing modes control how the STT engine detects the end of an utterance:

| Mode | Description |
| --- | --- |
| "aggressive" | Quick end-of-speech detection, responsive but may cut off |
| "normal" | Balanced detection (default) |
| "conservative" | Waits longer before ending, good for complex speech |
| "manual" | No automatic detection; caller controls |

TTS Configuration (std.speech.tts) #

neam
import std::speech::tts;

let tts_config = tts::create_config({
  "voice": "nova",
  "language": "en-US",
  "speed": 1.0,
  "pitch": 0.0,
  "volume": 1.0,
  "style": "conversational"
});

The TTS module also supports SSML (Speech Synthesis Markup Language) for fine-grained control over pronunciation, pauses, and emphasis:

neam
let ssml_text = "<speak>Welcome to <emphasis level='strong'>Neam</emphasis>. " +
                "<break time='500ms'/> How can I help you today?</speak>";
📝 Note

SSML support varies by provider. OpenAI TTS uses tts_instructions instead of SSML markup. Gemini and ElevenLabs support SSML natively.


Summary #

In this chapter you learned:

- How a voice agent chains STT, an LLM agent, and TTS, and how the voice declaration wires an existing agent to audio I/O.
- The batch native functions voice_transcribe, voice_synthesize, and voice_pipeline_run.
- How to configure cloud STT/TTS providers (OpenAI, Gemini, ElevenLabs) and local ones (whisper.cpp, Kokoro, Piper), and mix them freely in one pipeline.
- How realtime_voice sessions stream full-duplex audio over WebSockets, with event handlers, barge-in, and tool calls.
- The three VAD modes (server, client, manual) and how to tune vad_threshold and silence_duration_ms.
- The std.voice, std.audio, and std.speech standard library modules for sessions, transcripts, budgets, metering, and speech configuration.


Exercises #

Exercise 16.1: Basic Voice Pipeline #

Create a batch voice pipeline using OpenAI Whisper for STT and OpenAI TTS. Record a short audio clip (or use a sample WAV file), run it through the pipeline, and verify that the transcription, response, and synthesized audio are all correct.

Exercise 16.2: Local Voice Setup #

Set up a fully local voice pipeline using whisper.cpp for STT, Ollama for the agent, and Kokoro for TTS. Document the setup process, including starting each service. Run the same audio clip from Exercise 16.1 and compare the results.

Exercise 16.3: Cross-Provider Pipeline #

Create a pipeline that uses Gemini for STT, OpenAI for the agent, and Kokoro for TTS. Test it with audio in a non-English language (e.g., Spanish, French, or Japanese). How does the transcription quality compare to OpenAI Whisper for that language?

Exercise 16.4: VAD Sensitivity Tuning #

Create two real-time voice configurations with different VAD settings:
- "Sensitive": vad_threshold: "0.3", silence_duration_ms: "300"
- "Relaxed": vad_threshold: "0.8", silence_duration_ms: "1000"

Test both with the same conversation. Document the differences in user experience. Under what conditions would you prefer each setting?

Exercise 16.5: Voice-Enabled RAG Bot #

Build a voice agent that uses a RAG knowledge base to answer questions about a topic of your choice. The pipeline should: (a) transcribe a question from audio, (b) retrieve relevant context, (c) generate an answer, and (d) synthesize the answer to audio. Test with at least three different questions.

Exercise 16.6: Real-Time Conversation #

Using OpenAI Realtime or Gemini Live, implement a real-time voice session that:
1. Connects to the provider.
2. Sends a greeting via text.
3. Handles the response events.
4. Sends a follow-up question.
5. Demonstrates barge-in (interrupt the agent mid-response).
6. Cleanly closes the session.

Document the latency you observe at each step.
