AI Voice Agents

Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming

A fitness and nutrition platform needed a voice-first AI assistant that could converse naturally with users in real time, execute domain-specific calculations (meal adjustments, calorie tracking), and speak its responses back, all with sub-second latency.

Domain: AI Voice Agents
Technologies: 10
Key Results: 5
Status: Delivered

The Challenge

Building a production-grade voice AI assistant presented unique real-time engineering challenges:

  • Latency — Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
  • Function Calling — The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
  • Audio Streaming — Bidirectional audio needed to flow continuously without buffering gaps or echo issues
  • Context Awareness — The assistant needed to maintain conversation context across turns while handling interruptions
  • Multi-Language — Users spoke in different languages and expected responses in the same language
  • Session Isolation — Each voice session needed independent state management without cross-talk

Our Solution

We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.

Architecture

  • AI Model: Gemini with native audio input/output and function calling
  • Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
  • Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
  • Frontend: React with Vite and Tailwind CSS for session control UI
  • Communication: WebSocket for low-latency JSON messaging and binary audio transport
  • Multimodal: Optional camera and screen capture for visual context

Real-Time Audio Pipeline

Bidirectional Streaming

The system maintains continuous audio streams in both directions:

  • Input: Microphone audio captured at 16kHz mono, chunked into small frames, and streamed to the AI model in real-time
  • Output: AI-generated speech received at 24kHz and played through speakers immediately
  • No Batching: Audio chunks are sent as captured — no accumulation delays
  • Interrupt Handling: User can interrupt the assistant mid-response naturally

Audio Processing

  • 16-bit PCM format for both input and output
  • Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
  • Small buffer sizes for minimal latency
  • Continuous streaming with no start/stop gaps between turns
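
To make the buffer-size trade-off concrete, the sketch below computes how many bytes one chunk occupies and how much audio time it covers. The sample rates and the 16-bit format come from the list above; the 512-sample chunk size is an assumption chosen for illustration, not a value stated for this project.

```python
# Rates and sample width are taken from the text; CHUNK_SAMPLES is an assumption.
BYTES_PER_SAMPLE = 2        # 16-bit PCM
CAPTURE_CHANNELS = 1        # mono microphone input
CAPTURE_RATE_HZ = 16_000
PLAYBACK_RATE_HZ = 24_000
CHUNK_SAMPLES = 512         # hypothetical buffer size


def chunk_bytes(samples: int, channels: int = CAPTURE_CHANNELS) -> int:
    """Size in bytes of one PCM chunk."""
    return samples * channels * BYTES_PER_SAMPLE


def chunk_duration_ms(samples: int, rate_hz: int) -> float:
    """Audio time covered by one chunk, in milliseconds."""
    return 1000.0 * samples / rate_hz


if __name__ == "__main__":
    print(chunk_bytes(CHUNK_SAMPLES), "bytes per capture chunk")
    print(chunk_duration_ms(CHUNK_SAMPLES, CAPTURE_RATE_HZ), "ms of audio at 16 kHz")
    print(round(chunk_duration_ms(CHUNK_SAMPLES, PLAYBACK_RATE_HZ), 2), "ms of audio at 24 kHz")
```

Under these assumptions a capture chunk holds 32 ms of audio, so a smaller chunk lowers the minimum delay before audio can be sent at the cost of more messages per second.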

Function Calling Integration

How It Works

The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:

  1. User speaks a request (e.g., "I missed lunch today")
  2. AI model interprets the audio natively and identifies the intent
  3. Model determines a function call is needed and sends a structured request
  4. Backend extracts function name, arguments, and call ID
  5. Local function executes the domain calculation
  6. Result sent back to the model as a structured response
  7. Model generates a natural language voice response incorporating the result
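
Steps 4 through 6 can be sketched as a small dispatcher. This is a minimal illustration rather than the project's actual code: the `HANDLERS` registry, the `register` decorator, the `log_missed_meal` tool name, and the `id`/`name`/`args` message shape are all assumptions made for the example.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to local Python handlers.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def register(name: str):
    """Decorator that adds a handler to the registry under the given tool name."""
    def decorator(fn):
        HANDLERS[name] = fn
        return fn
    return decorator


@register("log_missed_meal")
def log_missed_meal(meal: str) -> Dict[str, Any]:
    # Placeholder for the real domain calculation.
    return {"meal": meal, "status": "redistributed"}


def handle_function_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Extract call ID, name, and arguments; run the handler; build a structured reply."""
    call_id, name = call.get("id"), call.get("name")
    args = call.get("args") or {}
    handler = HANDLERS.get(name)
    if handler is None:
        return {"id": call_id, "name": name,
                "response": {"error": f"unknown function: {name}"}}
    try:
        result = handler(**args)
    except Exception as exc:  # report the failure to the model instead of crashing
        return {"id": call_id, "name": name, "response": {"error": str(exc)}}
    return {"id": call_id, "name": name, "response": {"result": result}}
```

Echoing the call ID in the reply lets the model match each result to the request that produced it.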

Domain Functions

The system supports nutrition-focused function calling for scenarios like:

  • Missed Meals — Redistributes missed macronutrients across remaining meals
  • Unplanned Food — Adjusts upcoming meals to compensate for unexpected intake
  • Meal Substitutions — Swaps ingredients while maintaining macro targets
  • Activity Tracking — Estimates calorie burn and adjusts nutrition buffer

Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
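
As an illustration of the missed-meal case, the sketch below spreads a skipped meal's macros evenly across the meals still ahead. The even split, the function name, and the plan layout are assumptions for the example; the stochastic variation mentioned above is deliberately left out so the output is deterministic.

```python
from typing import Dict

Macros = Dict[str, float]  # grams per macro, e.g. {"protein": 40, "carbs": 60, "fat": 20}


def redistribute_missed_meal(plan: Dict[str, Macros], missed: str) -> Dict[str, Macros]:
    """Return the remaining meals with the missed meal's macros split evenly among them.

    `plan` is assumed to hold the missed meal plus only the meals still to come.
    """
    if missed not in plan:
        raise KeyError(f"meal not in plan: {missed}")
    remaining = {name: dict(macros) for name, macros in plan.items() if name != missed}
    if not remaining:
        raise ValueError("no remaining meals to absorb the missed macros")
    for macro, grams in plan[missed].items():
        share = grams / len(remaining)
        for macros in remaining.values():
            macros[macro] = round(macros.get(macro, 0.0) + share, 1)
    return remaining


if __name__ == "__main__":
    today = {
        "lunch": {"protein": 40, "carbs": 60, "fat": 20},
        "snack": {"protein": 10, "carbs": 20, "fat": 5},
        "dinner": {"protein": 45, "carbs": 50, "fat": 25},
    }
    print(redistribute_missed_meal(today, "lunch"))
```

A production version would likely weight the split by meal size or cap per-meal totals; the even split only shows the shape of the calculation.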

Execution Safety

  • Microphone input is paused during function execution to prevent overlap
  • Pending audio frames are dropped to avoid stale context
  • Error responses are sent back gracefully if function execution fails
  • Normal streaming resumes immediately after function completion
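
The four safeguards above can be sketched with a small gate object placed between the capture loop and the sender. `MicGate` and its method names are hypothetical, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
from collections import deque
from typing import Any, Callable, Deque, Dict


class MicGate:
    """Buffers captured frames and blocks them while a tool call is running."""

    def __init__(self) -> None:
        self.frames: Deque[bytes] = deque()
        self._open = True

    def push(self, frame: bytes) -> bool:
        """Called by the capture loop; returns False if the frame was discarded."""
        if not self._open:
            return False
        self.frames.append(frame)
        return True

    async def run_tool(self, fn: Callable[..., Any], *args: Any) -> Dict[str, Any]:
        """Pause input, drop pending frames, run the function, then resume."""
        self._open = False
        self.frames.clear()  # pending audio would be stale once the result arrives
        try:
            # Run in a worker thread so the event loop keeps servicing playback.
            return {"result": await asyncio.to_thread(fn, *args)}
        except Exception as exc:
            return {"error": str(exc)}  # sent back to the model instead of raising
        finally:
            self._open = True  # resume streaming whether or not the call succeeded
```

Reopening the gate in `finally` guarantees that a failing function cannot leave the microphone permanently muted.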

Backend Architecture

FastAPI WebSocket Server

  • Single WebSocket endpoint for all client communication
  • Session lifecycle management (start, stop, ping/pong health checks)
  • One active session at a time with session locking
  • CORS middleware for development environments
  • Health check endpoint for monitoring
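
The session-locking and lifecycle behaviour can be shown independently of the web framework. In the sketch below, the `SessionController` class and the `start`/`stop`/`ping` message names and reply shapes are assumptions for illustration; in a FastAPI application a handler like this would sit behind the WebSocket endpoint.

```python
import json
from typing import Optional


class SessionController:
    """Routes JSON control messages and allows only one active session at a time."""

    def __init__(self) -> None:
        self.active_client: Optional[str] = None

    def handle(self, client_id: str, raw: str) -> dict:
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            return {"type": "error", "message": "invalid JSON"}
        if not isinstance(msg, dict):
            return {"type": "error", "message": "message must be a JSON object"}

        kind = msg.get("type")
        if kind == "ping":
            return {"type": "pong"}
        if kind == "start":
            if self.active_client is not None:
                return {"type": "error", "message": "session already active"}
            self.active_client = client_id
            return {"type": "started", "mode": msg.get("mode", "audio")}
        if kind == "stop":
            if self.active_client != client_id:
                return {"type": "error", "message": "no active session for this client"}
            self.active_client = None
            return {"type": "stopped"}
        return {"type": "error", "message": f"unknown message type: {kind}"}
```

Keeping the lock in one object means a second client receives a clear error instead of silently sharing audio state with the first.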

Session Management

  • Sessions are created on client connect with mode selection (audio-only, camera, or screen)
  • Background async tasks handle audio capture, processing, and playback concurrently
  • Graceful disconnection with resource cleanup
  • API key validation and error propagation
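
One way to run the background tasks concurrently and still clean up on disconnect is shown below. The `run_session` helper and its arguments are hypothetical; the worker coroutines stand in for the capture, processing, and playback loops.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_session(
    workers: List[Callable[[], Awaitable[None]]],
    stop: asyncio.Event,
) -> list:
    """Start every worker as a task, wait for the stop signal, then cancel them all."""
    tasks = [asyncio.create_task(worker()) for worker in workers]
    try:
        await stop.wait()
    finally:
        for task in tasks:
            task.cancel()
        # return_exceptions=True collects each CancelledError instead of re-raising it,
        # so cleanup finishes even if one worker failed earlier.
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Setting the event from the disconnect handler ends the session; any worker that needs to release a device can do so in its own `try`/`finally` block, which runs when the task is cancelled.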

Multimodal Input (Optional)

Beyond voice, the system supports optional visual context:

  • Camera Mode — Streams webcam frames (1fps) for visual context in conversations
  • Screen Mode — Captures screen content for discussing on-screen information
  • Images are resized and compressed before transmission
  • Visual context enhances the AI's ability to provide relevant responses
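
The resize step can be reduced to a small sizing rule: shrink the frame so its longer side fits a limit while keeping the aspect ratio. The `fit_within` name and the 1024-pixel limit are assumptions; the resulting size would then be passed to an imaging library such as Pillow or OpenCV for the actual resize and compression.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Frames that already fit are returned unchanged; nothing is ever upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(1920, 1080))  # a full-HD screen capture
    print(fit_within(640, 480))    # a small webcam frame
```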

Frontend Interface

  • Session Control — Start/stop listening with clear status indicators
  • Status Display — Real-time connection and session state (idle, connecting, active, error)
  • Theme Support — Light/dark mode with persistence
  • Guided Walkthrough — Step-by-step demo for first-time users
  • WebSocket Management — Automatic reconnection logic

AI Model Configuration

  • Native audio modality (no separate STT/TTS pipeline)
  • Configurable voice selection from multiple preset voices
  • System instructions defining assistant personality, response style, and language handling
  • Tool definitions for all available functions with parameter schemas
  • Automatic language detection with same-language response
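
The settings above could be gathered into a single configuration object. The sketch below builds it as a plain dictionary; the key names are modeled on the live-session configuration of Google's `google-genai` Python SDK and should be checked against the current SDK documentation. The tool schema, the instruction text, and the `build_live_config` helper are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical tool definition with a JSON-schema-style parameter block.
MISSED_MEAL_TOOL: Dict[str, Any] = {
    "name": "log_missed_meal",
    "description": "Redistribute the macros of a missed meal across remaining meals.",
    "parameters": {
        "type": "object",
        "properties": {
            "meal": {"type": "string", "description": "Name of the missed meal"},
        },
        "required": ["meal"],
    },
}


def build_live_config(voice_name: str, tools: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble the session configuration (key names are assumptions, see lead-in)."""
    return {
        "response_modalities": ["AUDIO"],  # native audio out, no separate TTS step
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}},
        },
        "system_instruction": (
            "You are a concise, friendly nutrition assistant. "
            "Reply in the language the user speaks."
        ),
        "tools": [{"function_declarations": tools}],
    }


if __name__ == "__main__":
    config = build_live_config("Puck", [MISSED_MEAL_TOOL])
    print(sorted(config))
```

Keeping the configuration in one builder makes the voice and the tool list easy to change per deployment without touching the streaming code.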

Key Features

  1. Sub-Second Latency — Native audio model eliminates STT/TTS pipeline overhead
  2. Real-Time Bidirectional Audio — Continuous streaming with < 50ms per-chunk latency
  3. Function Calling — Domain-specific calculations executed mid-conversation
  4. Natural Interruption — Users can interrupt the assistant naturally without special commands
  5. Multi-Language — Automatic language detection with same-language responses
  6. Multimodal Input — Optional camera and screen context for visual understanding
  7. Session Management — Session lifecycle control with locking and resource cleanup
  8. Macro Calculations — Dynamic nutritional adjustments with per-food macro profiles
  9. Error Recovery — Graceful handling of function failures and network interruptions
  10. Extensible — New functions added by defining schema and handler — no architecture changes

Results

First Response Latency: 500-1200ms (vs. 3-5s for traditional STT→LLM→TTS pipelines)
Session Start Time: ~200ms
Audio Streaming Latency: < 50ms per chunk (real-time)
Function Execution: Domain calculations completed within the conversation flow
User Experience: Natural conversational feel with interrupt support

Technology Stack

Google Gemini Live API, Python, FastAPI, WebSocket, PyAudio, React, Vite, Tailwind CSS, OpenCV, Pillow

Frequently Asked Questions

MicrocosmWorks engineered a bidirectional WebSocket audio pipeline that streams user speech to the ASR engine in real-time chunks, begins LLM inference before the user finishes speaking using streaming transcription, and starts text-to-speech synthesis on the first tokens of the response. This pipelining approach achieves response latencies under 800ms from end-of-speech to first audio output, which users perceive as natural conversational turn-taking.

MicrocosmWorks integrated structured function calling where the LLM can invoke predefined APIs like booking appointments, querying databases, or triggering workflows based on the conversation context, with the results spoken back to the caller naturally. The system includes confirmation flows for high-stakes actions like payments or cancellations, where the assistant verbally confirms the details and waits for the caller's explicit approval before executing.

Yes, MicrocosmWorks implemented barge-in detection that allows callers to interrupt the assistant mid-response, immediately stopping audio playback and processing the new utterance. The ASR pipeline includes noise cancellation preprocessing and supports models fine-tuned on diverse accents, achieving over 90% transcription accuracy in noisy environments typical of phone calls from cars, offices, or public spaces.

MicrocosmWorks built the voice assistant with SIP trunk integration and Twilio connectivity, supporting deployment on existing business phone numbers, IVR systems, and contact center platforms without requiring callers to install any app or use a special interface. The platform handles call routing, queue management, and warm transfers to human agents when the AI determines a conversation requires human expertise.

MicrocosmWorks develops custom voice AI assistants at rates between $30-$50/hr, and while the upfront build cost exceeds managed platform setup fees, a custom solution avoids the per-minute usage charges that platforms like Dialogflow CX or Amazon Lex impose, which become significant at high call volumes. Custom builds also give you full control over the LLM, voice persona, and function calling logic, which managed platforms constrain with rigid dialog flow paradigms.
