How does a real-time voice AI assistant handle the latency requirements of natural conversation?

MicrocosmWorks engineered a bidirectional WebSocket audio pipeline that streams user speech to the ASR engine in real-time chunks, begins LLM inference before the user finishes speaking using streaming transcription, and starts text-to-speech synthesis on the first tokens of the response. This pipelining approach achieves response latencies under 800ms from end-of-speech to first audio output, which users perceive as natural conversational turn-taking.

How does function calling work in a voice AI assistant, and what kinds of actions can it perform?

MicrocosmWorks integrated structured function calling where the LLM can invoke predefined APIs like booking appointments, querying databases, or triggering workflows based on the conversation context, with the results spoken back to the caller naturally. The system includes confirmation flows for high-stakes actions like payments or cancellations, where the assistant verbally confirms the details and waits for the caller's explicit approval before executing.

Can the voice AI assistant handle interruptions, background noise, and accented speech reliably?

Yes. MicrocosmWorks implemented barge-in detection that lets callers interrupt the assistant mid-response: audio playback stops immediately and the new utterance is processed. The ASR pipeline includes noise cancellation preprocessing and supports models fine-tuned on diverse accents, achieving over 90% transcription accuracy in noisy environments typical of phone calls from cars, offices, or public spaces.

What telephony integration options are available for deploying a voice AI assistant on existing phone systems?

MicrocosmWorks built the voice assistant with SIP trunk integration and Twilio connectivity, supporting deployment on existing business phone numbers, IVR systems, and contact center platforms without requiring callers to install any app or use a special interface. The platform handles call routing, queue management, and warm transfers to human agents when the AI determines a conversation requires human expertise.

What does it cost to build a custom real-time voice AI assistant compared to using platforms like Dialogflow or Amazon Lex?

MicrocosmWorks develops custom voice AI assistants at rates between $30 and $50 per hour. The upfront build cost exceeds managed platform setup fees, but a custom solution avoids the usage-based charges (billed per request or per unit of audio processed) that platforms like Dialogflow CX or Amazon Lex impose, which become significant at high call volumes. Custom builds also give you full control over the LLM, voice persona, and function calling logic, which managed platforms constrain with rigid dialog flow paradigms.

Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming

A fitness and nutrition platform needed a voice-first AI assistant that could respond to users in real-time with natural conversation, execute domain-specific calculations (meal adjustments, calorie tracking), and speak responses back — all with sub-second latency for a truly conversational experience.

The Challenge

Building a production-grade voice AI assistant presented unique real-time engineering challenges:

Latency — Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
Function Calling — The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
Audio Streaming — Bidirectional audio needed to flow continuously without buffering gaps or echo issues
Context Awareness — The assistant needed to maintain conversation context across turns while handling interruptions
Multi-Language — Users spoke in different languages and expected responses in the same language
Session Isolation — Each voice session needed independent state management without cross-talk

Our Solution

We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.

Architecture

AI Model: Gemini with native audio input/output and function calling
Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
Frontend: React with Vite and Tailwind CSS for session control UI
Communication: WebSocket for low-latency JSON messaging and binary audio transport
Multimodal: Optional camera and screen capture for visual context

Real-Time Audio Pipeline

Bidirectional Streaming

The system maintains continuous audio streams in both directions:

Input: Microphone audio captured at 16kHz mono, chunked into small frames, and streamed to the AI model in real-time
Output: AI-generated speech received at 24kHz and played through speakers immediately
No Batching: Audio chunks are sent as captured — no accumulation delays
Interrupt Handling: User can interrupt the assistant mid-response naturally
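
The two directions can run as independent asyncio tasks, roughly as sketched below. This is an illustration only: `FakeModel` is a hypothetical stand-in for the live model connection (it just echoes audio back), and the queue and function names are assumptions, not the project's actual code.

```python
import asyncio


class FakeModel:
    """Hypothetical stand-in for a live model session: echoes each chunk."""

    def __init__(self) -> None:
        self._replies = asyncio.Queue()

    async def send_audio(self, chunk: bytes) -> None:
        await self._replies.put(chunk)

    async def close(self) -> None:
        await self._replies.put(None)  # sentinel: no more output

    async def receive(self):
        while (chunk := await self._replies.get()) is not None:
            yield chunk


async def send_loop(mic: asyncio.Queue, model: FakeModel) -> None:
    # Forward each captured chunk immediately; nothing is accumulated.
    while (chunk := await mic.get()) is not None:
        await model.send_audio(chunk)
    await model.close()


async def receive_loop(model: FakeModel, speaker: asyncio.Queue) -> None:
    # Hand model audio to the playback queue as soon as it arrives.
    async for chunk in model.receive():
        await speaker.put(chunk)


def flush_playback(speaker: asyncio.Queue) -> int:
    """On interruption, drop queued output so stale speech is not played."""
    dropped = 0
    while not speaker.empty():
        speaker.get_nowait()
        dropped += 1
    return dropped


async def demo() -> list:
    mic, speaker = asyncio.Queue(), asyncio.Queue()
    model = FakeModel()
    for chunk in (b"\x00\x01", b"\x02\x03", None):  # None ends capture
        mic.put_nowait(chunk)
    await asyncio.gather(send_loop(mic, model), receive_loop(model, speaker))
    return [speaker.get_nowait() for _ in range(speaker.qsize())]
```

Running `asyncio.run(demo())` pushes two chunks through both loops. In a real client, `flush_playback` would be called when the model signals that the user has started speaking over the response.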

Audio Processing

16-bit PCM format for both input and output
Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
Small buffer sizes for minimal latency
Continuous streaming with no start/stop gaps between turns
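
To make "small buffer sizes" concrete, the arithmetic below computes the byte size of one PCM frame at the stated sample rates. The 20 ms frame duration is an assumption chosen for illustration; the source does not state the actual frame size.

```python
def chunk_bytes(sample_rate_hz: int, frame_ms: int,
                sample_width_bytes: int = 2, channels: int = 1) -> int:
    """Bytes in one PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * sample_width_bytes * channels


def split_pcm(pcm: bytes, frame_bytes: int) -> list:
    """Split a PCM buffer into fixed-size frames; the last may be shorter."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


CAPTURE_RATE_HZ = 16_000   # from the text: 16kHz mono capture
PLAYBACK_RATE_HZ = 24_000  # from the text: 24kHz playback
FRAME_MS = 20              # assumption: frame duration is not given in the source

capture_frame = chunk_bytes(CAPTURE_RATE_HZ, FRAME_MS)    # 640 bytes
playback_frame = chunk_bytes(PLAYBACK_RATE_HZ, FRAME_MS)  # 960 bytes
```

With 16-bit mono audio, a 20 ms frame is 320 samples (640 bytes) at 16kHz and 480 samples (960 bytes) at 24kHz, so each chunk adds only a few tens of milliseconds of buffering.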

Function Calling Integration

How It Works

The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:

User speaks a request (e.g., "I missed lunch today")
AI model interprets the audio and identifies the intent
Model determines a function call is needed and sends a structured request
Backend extracts function name, arguments, and call ID
Local function executes the domain calculation
Result sent back to the model as a structured response
Model generates a natural language voice response incorporating the result
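
The backend side of these steps can be sketched as a small dispatcher. The call shape (`id`, `name`, `args`) mirrors the "function name, arguments, and call ID" described above, but the exact field names, the registry, and the `log_missed_meal` function are hypothetical.

```python
from typing import Any, Callable

# Hypothetical registry mapping tool names to local handlers.
HANDLERS: dict = {}


def tool(name: str) -> Callable:
    """Register a local function under the name the model will call."""
    def register(fn: Callable) -> Callable:
        HANDLERS[name] = fn
        return fn
    return register


@tool("log_missed_meal")  # hypothetical function name
def log_missed_meal(meal: str) -> dict:
    return {"status": "recorded", "meal": meal}


def handle_function_call(call: dict) -> dict:
    """Run the requested handler and build a response tied to the call ID."""
    name, args, call_id = call["name"], call.get("args", {}), call["id"]
    handler = HANDLERS.get(name)
    if handler is None:
        result: Any = {"error": f"unknown function: {name}"}
    else:
        try:
            result = handler(**args)
        except Exception as exc:  # report failures instead of raising
            result = {"error": str(exc)}
    return {"id": call_id, "name": name, "response": result}
```

For example, `handle_function_call({"id": "call-1", "name": "log_missed_meal", "args": {"meal": "lunch"}})` returns a response carrying the same `id`, which lets the model match the result to its request.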

Domain Functions

The system supports nutrition-focused function calling for scenarios like:

Missed Meals — Redistributes missed macronutrients across remaining meals
Unplanned Food — Adjusts upcoming meals to compensate for unexpected intake
Meal Substitutions — Swaps ingredients while maintaining macro targets
Activity Tracking — Estimates calorie burn and adjusts nutrition buffer

Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
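
As an illustration of the missed-meal case, the sketch below spreads the missed macros evenly over the remaining meals. The even split, the field names, and the rounding are assumptions; the source does not describe the actual redistribution rule, and the random variation it mentions is left out here so the result is deterministic.

```python
MACROS = ("protein_g", "carbs_g", "fat_g")  # hypothetical field names


def redistribute_missed_meal(missed: dict, remaining: list) -> list:
    """Spread the missed meal's macros evenly over the remaining meals."""
    if not remaining:
        raise ValueError("no remaining meals to absorb the missed macros")
    share = {m: missed.get(m, 0.0) / len(remaining) for m in MACROS}
    return [
        {m: round(meal.get(m, 0.0) + share[m], 1) for m in MACROS}
        for meal in remaining
    ]


lunch = {"protein_g": 40, "carbs_g": 60, "fat_g": 20}
snack = {"protein_g": 10, "carbs_g": 20, "fat_g": 5}
dinner = {"protein_g": 45, "carbs_g": 50, "fat_g": 25}
adjusted = redistribute_missed_meal(lunch, [snack, dinner])
```

Here each remaining meal absorbs half of the missed lunch, so the snack becomes 30 g protein, 50 g carbs, and 15 g fat, and dinner becomes 65 g, 80 g, and 35 g.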

Execution Safety

Microphone input is paused during function execution to prevent overlap
Pending audio frames are dropped to avoid stale context
Error responses are sent back gracefully if function execution fails
Normal streaming resumes immediately after function completion
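
One way to implement these four rules is a small gate around the microphone queue, sketched below. The `MicGate` class and its method names are hypothetical, and running the function in a worker thread via `asyncio.to_thread` is an assumption about how blocking work might be kept off the event loop.

```python
import asyncio
from typing import Any, Callable


class MicGate:
    """Pauses microphone forwarding while a function call is running."""

    def __init__(self) -> None:
        self.frames = asyncio.Queue()
        self._paused = False

    def offer(self, frame: bytes) -> bool:
        """Queue a captured frame unless the gate is paused."""
        if self._paused:
            return False  # frames captured during execution are discarded
        self.frames.put_nowait(frame)
        return True

    def _drop_pending(self) -> int:
        dropped = 0
        while not self.frames.empty():
            self.frames.get_nowait()
            dropped += 1
        return dropped

    async def run_function(self, fn: Callable[..., Any], **args: Any) -> dict:
        self._paused = True             # 1. pause microphone input
        dropped = self._drop_pending()  # 2. drop stale frames
        try:
            result = {"result": await asyncio.to_thread(fn, **args)}
        except Exception as exc:        # 3. report the failure to the model
            result = {"error": str(exc)}
        finally:
            self._paused = False        # 4. resume streaming
        return {**result, "dropped_frames": dropped}
```

The `finally` clause guarantees that streaming resumes even when the function raises, so a failed calculation cannot leave the microphone muted.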

Backend Architecture

FastAPI WebSocket Server

Single WebSocket endpoint for all client communication
Session lifecycle management (start, stop, ping/pong health checks)
One active session at a time with session locking
CORS middleware for development environments
Health check endpoint for monitoring
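
The control side of the single endpoint can be sketched as a pure function that validates one JSON text message and returns the reply. The message shape (`type`, `mode`, `detail`) is a hypothetical protocol for illustration; the source only states that start, stop, and ping/pong messages exist. Binary audio frames would be handled separately and are not shown.

```python
import json

VALID_TYPES = ("start", "stop", "ping")
VALID_MODES = ("audio", "camera", "screen")  # the three session modes the system supports


def handle_control_message(raw: str) -> dict:
    """Validate one JSON control message and return the reply to send."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return {"type": "error", "detail": "invalid JSON"}
    if not isinstance(msg, dict) or msg.get("type") not in VALID_TYPES:
        return {"type": "error", "detail": "unknown message type"}
    if msg["type"] == "ping":
        return {"type": "pong"}
    if msg["type"] == "start":
        mode = msg.get("mode", "audio")
        if mode not in VALID_MODES:
            return {"type": "error", "detail": f"unsupported mode: {mode}"}
        return {"type": "started", "mode": mode}
    return {"type": "stopped"}
```

Keeping this logic free of any web framework makes it easy to unit test; a WebSocket handler would call it for each text frame and send the returned dictionary back as JSON.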

Session Management

Sessions are created on client connect with mode selection (audio-only, camera, or screen)
Background async tasks handle audio capture, processing, and playback concurrently
Graceful disconnection with resource cleanup
API key validation and error propagation
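
A minimal sketch of single-session locking with background tasks and cleanup follows. `SessionManager` and `SessionBusyError` are hypothetical names, and the worker coroutines stand in for the capture, processing, and playback tasks.

```python
import asyncio


class SessionBusyError(RuntimeError):
    """Raised when a second client tries to start a session."""


class SessionManager:
    def __init__(self) -> None:
        self._active_mode = None
        self._tasks = []

    @property
    def active(self) -> bool:
        return self._active_mode is not None

    async def start(self, mode: str, *workers) -> None:
        """Claim the single session slot and launch background workers."""
        if self.active:
            raise SessionBusyError("a session is already active")
        self._active_mode = mode
        self._tasks = [asyncio.create_task(w()) for w in workers]

    async def stop(self) -> None:
        """Cancel workers, wait for them to exit, then release the slot."""
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
        self._tasks = []
        self._active_mode = None
```

Because `start` does not await between checking and claiming the slot, two clients on the same event loop cannot both pass the check. A multi-process deployment would need an external lock instead.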

Multimodal Input (Optional)

Beyond voice, the system supports optional visual context:

Camera Mode — Streams webcam frames (1fps) for visual context in conversations
Screen Mode — Captures screen content for discussing on-screen information
Images are resized and compressed before transmission
Visual context enhances the AI's ability to provide relevant responses
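
The two numeric pieces of this step, downscaling and the 1fps rate limit, can be sketched without any imaging library. The 1024-pixel limit and both helper names are assumptions; the actual resize and compression would be done by an image library that is not shown here.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


class FrameThrottle:
    """Lets a frame through at most once per interval (1 s for 1fps)."""

    def __init__(self, interval_s: float = 1.0) -> None:
        self.interval_s = interval_s
        self._last_sent = None

    def should_send(self, now_s: float) -> bool:
        if self._last_sent is not None and now_s - self._last_sent < self.interval_s:
            return False
        self._last_sent = now_s
        return True
```

A 1920x1080 webcam frame would be reduced to 1024x576 under these assumptions, and frames arriving less than one second after the last transmitted frame would be skipped.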

Frontend Interface

Session Control — Start/stop listening with clear status indicators
Status Display — Real-time connection and session state (idle, connecting, active, error)
Theme Support — Light/dark mode with persistence
Guided Walkthrough — Step-by-step demo for first-time users
WebSocket Management — Automatic reconnection logic

AI Model Configuration

Native audio modality (no separate STT/TTS pipeline)
Configurable voice selection from multiple preset voices
System instructions defining assistant personality, response style, and language handling
Tool definitions for all available functions with parameter schemas
Automatic language detection with same-language response
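
A configuration along these lines might look like the dictionary below. The key names follow the Live API configuration format as commonly shown for Google's Gen AI Python SDK, but they should be checked against the current SDK documentation. The tool name, its parameters, the system instruction text, and the voice name are placeholders, not the project's actual values.

```python
# Hypothetical tool schema; the function name and parameters are illustrative.
MISSED_MEAL_TOOL = {
    "name": "handle_missed_meal",
    "description": "Redistribute the macros of a missed meal across the remaining meals.",
    "parameters": {
        "type": "object",
        "properties": {
            "meal": {"type": "string", "description": "Which meal was missed, e.g. lunch"},
        },
        "required": ["meal"],
    },
}

LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],  # native audio out, no separate TTS step
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Puck"}}  # placeholder voice
    },
    "system_instruction": (
        "You are a concise, friendly nutrition assistant. "
        "Reply in the language the user speaks."
    ),
    "tools": [{"function_declarations": [MISSED_MEAL_TOOL]}],
}


def missing_required(tool: dict, args: dict) -> list:
    """Names of required parameters absent from a call's arguments."""
    return [p for p in tool["parameters"].get("required", []) if p not in args]
```

The `missing_required` helper is an optional guard: checking arguments against the schema before executing a handler gives the model a clear error to recover from if it omits a parameter.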

Key Features

Sub-Second Latency — Native audio model eliminates STT/TTS pipeline overhead
Real-Time Bidirectional Audio — Continuous streaming with < 50ms per-chunk latency
Function Calling — Domain-specific calculations executed mid-conversation
Natural Interruption — Users can interrupt the assistant naturally without special commands
Multi-Language — Automatic language detection with same-language responses
Multimodal Input — Optional camera and screen context for visual understanding
Session Management — Session lifecycle control with locking and resource cleanup
Macro Calculations — Dynamic nutritional adjustments with per-food macro profiles
Error Recovery — Graceful handling of function failures and network interruptions
Extensible — New functions added by defining schema and handler — no architecture changes
