Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming
A fitness and nutrition platform needed a voice-first AI assistant that could respond to users in real time with natural conversation, execute domain-specific calculations (meal adjustments, calorie tracking), and speak responses back, all with sub-second latency for a truly conversational experience.
The Challenge
Building a production-grade voice AI assistant presented unique real-time engineering challenges:
- Latency: Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
- Function Calling: The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
- Audio Streaming: Bidirectional audio needed to flow continuously without buffering gaps or echo issues
- Context Awareness: The assistant needed to maintain conversation context across turns while handling interruptions
- Multi-Language: Users spoke in different languages and expected responses in the same language
- Session Isolation: Each voice session needed independent state management without cross-talk
Our Solution
We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.
Architecture
- AI Model: Gemini with native audio input/output and function calling
- Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
- Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
- Frontend: React with Vite and Tailwind CSS for session control UI
- Communication: WebSocket for low-latency JSON messaging and binary audio transport
- Multimodal: Optional camera and screen capture for visual context
Real-Time Audio Pipeline
Bidirectional Streaming
The system maintains continuous audio streams in both directions:
- Input: Microphone audio captured at 16 kHz mono, chunked into small frames, and streamed to the AI model in real time
- Output: AI-generated speech received at 24kHz and played through speakers immediately
- No Batching: Audio chunks are sent as captured, with no accumulation delays
- Interrupt Handling: User can interrupt the assistant mid-response naturally
Audio Processing
- 16-bit PCM format for both input and output
- Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
- Small buffer sizes for minimal latency
- Continuous streaming with no start/stop gaps between turns
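To make the buffer-size trade-off concrete, here is a minimal sketch of the arithmetic behind 16-bit mono PCM chunking. The chunk size of 512 frames is an assumption chosen for illustration; the source does not state the value actually used.

```python
SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM, as stated above
CHANNELS = 1            # mono capture


def chunk_duration_ms(frames_per_chunk: int, sample_rate_hz: int) -> float:
    """Audio time covered by one chunk, in milliseconds."""
    return 1000.0 * frames_per_chunk / sample_rate_hz


def chunk_size_bytes(frames_per_chunk: int) -> int:
    """Size of one chunk on the wire, in bytes."""
    return frames_per_chunk * SAMPLE_WIDTH_BYTES * CHANNELS


def split_into_chunks(pcm: bytes, frames_per_chunk: int) -> list[bytes]:
    """Slice a raw PCM buffer into fixed-size chunks (the last may be shorter)."""
    step = chunk_size_bytes(frames_per_chunk)
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


# Hypothetical chunk size: 512 frames at 16 kHz covers 32 ms in 1024 bytes.
FRAMES_PER_CHUNK = 512
duration = chunk_duration_ms(FRAMES_PER_CHUNK, 16_000)
size = chunk_size_bytes(FRAMES_PER_CHUNK)
```

Smaller chunks lower the time each frame waits before being sent, at the cost of more messages per second.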
Function Calling Integration
How It Works
The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:
- User speaks a request (e.g., "I missed lunch today")
- AI model transcribes and understands the intent
- Model determines a function call is needed and sends a structured request
- Backend extracts function name, arguments, and call ID
- Local function executes the domain calculation
- Result sent back to the model as a structured response
- Model generates a natural language voice response incorporating the result
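The backend side of the steps above can be sketched as a small dispatcher. This is an illustration under assumptions, not the project's code: the message shape (`id`, `name`, `args`), the `register` decorator, and the `log_missed_meal` function are all hypothetical names.

```python
from typing import Any, Callable

# Registry mapping function names (as the model would request them) to handlers.
HANDLERS: dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that adds a local function to the registry."""
    def decorator(fn):
        HANDLERS[name] = fn
        return fn
    return decorator


@register("log_missed_meal")  # hypothetical domain function
def log_missed_meal(meal: str) -> dict:
    return {"status": "ok", "missed_meal": meal}


def handle_function_call(call: dict) -> dict:
    """Extract name, arguments, and call ID, run the handler, build the reply."""
    call_id = call.get("id")
    name = call.get("name")
    args = call.get("args") or {}
    handler = HANDLERS.get(name)
    if handler is None:
        body = {"error": f"unknown function: {name}"}
    else:
        try:
            body = {"result": handler(**args)}
        except Exception as exc:  # report failures instead of crashing the session
            body = {"error": str(exc)}
    # The call ID is echoed back so the model can match the result to its request.
    return {"id": call_id, "name": name, "response": body}
```

For example, `handle_function_call({"id": "c1", "name": "log_missed_meal", "args": {"meal": "lunch"}})` returns a response carrying the same `id` and the handler's result.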
Domain Functions
The system supports nutrition-focused function calling for scenarios like:
- Missed Meals: Redistributes missed macronutrients across remaining meals
- Unplanned Food: Adjusts upcoming meals to compensate for unexpected intake
- Meal Substitutions: Swaps ingredients while maintaining macro targets
- Activity Tracking: Estimates calorie burn and adjusts nutrition buffer
Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
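As an illustration of the missed-meal case, the sketch below spreads a missed meal's macros evenly across the remaining meals, with an optional small random factor standing in for the stochastic variation. The even split, the field names (`protein_g`, `carbs_g`, `fat_g`), and the `jitter` parameter are assumptions; the source does not describe the actual redistribution rule.

```python
import random


def redistribute_missed_meal(missed, remaining, jitter=0.0, rng=None):
    """Return copies of `remaining` meals with the missed macros added in.

    missed:    macro name -> grams in the missed meal
    remaining: list of meal dicts that contain the same macro keys
    jitter:    maximum relative variation applied to each added share (0 = none)
    """
    if not remaining:
        raise ValueError("no remaining meals to absorb the missed macros")
    rng = rng or random.Random()
    share = {macro: grams / len(remaining) for macro, grams in missed.items()}
    adjusted = []
    for meal in remaining:
        new_meal = dict(meal)  # copy so the caller's data is not mutated
        for macro, extra in share.items():
            factor = 1.0 + rng.uniform(-jitter, jitter)
            new_meal[macro] = round(meal.get(macro, 0.0) + extra * factor, 1)
        adjusted.append(new_meal)
    return adjusted


missed_lunch = {"protein_g": 30, "carbs_g": 60, "fat_g": 20}
rest_of_day = [
    {"name": "snack", "protein_g": 10, "carbs_g": 20, "fat_g": 5},
    {"name": "dinner", "protein_g": 40, "carbs_g": 50, "fat_g": 20},
]
plan = redistribute_missed_meal(missed_lunch, rest_of_day)
```

With `jitter=0.0` the result is deterministic, which keeps the calculation easy to test; a small non-zero value varies each added share slightly.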
Execution Safety
- Microphone input is paused during function execution to prevent overlap
- Pending audio frames are dropped to avoid stale context
- Error responses are sent back gracefully if function execution fails
- Normal streaming resumes immediately after function completion
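One way to express these safety rules with asyncio is a gate around the microphone queue, sketched below. The `MicGate` class and its method names are hypothetical, and running the function in a worker thread via `asyncio.to_thread` is a design assumption rather than something the source confirms.

```python
import asyncio


class MicGate:
    """Pauses microphone frames while a local function is executing."""

    def __init__(self):
        self.queue = asyncio.Queue()   # frames waiting to be sent to the model
        self._open = asyncio.Event()   # set = microphone frames are accepted
        self._open.set()

    def push(self, frame: bytes) -> None:
        """Called by the capture task; frames are discarded while paused."""
        if self._open.is_set():
            self.queue.put_nowait(frame)

    def _drop_pending(self) -> int:
        """Empty the queue so stale audio is never sent after the call."""
        dropped = 0
        while not self.queue.empty():
            self.queue.get_nowait()
            dropped += 1
        return dropped

    async def run_tool(self, fn, *args) -> dict:
        """Pause input, drop pending frames, run fn, then resume streaming."""
        self._open.clear()
        self._drop_pending()
        try:
            return {"result": await asyncio.to_thread(fn, *args)}
        except Exception as exc:
            return {"error": str(exc)}  # sent back to the model instead of raising
        finally:
            self._open.set()            # resume immediately, success or failure
```

The `finally` clause is what guarantees streaming resumes even when the function fails.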
Backend Architecture
FastAPI WebSocket Server
- Single WebSocket endpoint for all client communication
- Session lifecycle management (start, stop, ping/pong health checks)
- One active session at a time with session locking
- CORS middleware for development environments
- Health check endpoint for monitoring
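The single-session rule can be sketched independently of the web framework. In the example below the message types (`start`, `stop`, `ping`), the reply shapes, and the `SessionManager` class are assumptions for illustration; in the real server this logic would sit behind the FastAPI WebSocket endpoint.

```python
import asyncio
import uuid

MODES = {"audio", "camera", "screen"}  # audio-only, camera, or screen


class SessionManager:
    """Allows one active session at a time and answers ping messages."""

    def __init__(self):
        self._lock = asyncio.Lock()  # held for the lifetime of the active session
        self.active_id = None

    async def handle(self, message: dict) -> dict:
        kind = message.get("type")
        if kind == "ping":
            return {"type": "pong"}
        if kind == "start":
            return await self._start(message.get("mode", "audio"))
        if kind == "stop":
            return self._stop()
        return {"type": "error", "message": f"unknown message type: {kind}"}

    async def _start(self, mode: str) -> dict:
        if mode not in MODES:
            return {"type": "error", "message": f"unknown mode: {mode}"}
        if self._lock.locked():
            return {"type": "error", "message": "a session is already active"}
        await self._lock.acquire()
        self.active_id = uuid.uuid4().hex
        return {"type": "started", "session_id": self.active_id, "mode": mode}

    def _stop(self) -> dict:
        if not self._lock.locked():
            return {"type": "error", "message": "no active session"}
        self.active_id = None
        self._lock.release()  # frees the slot for the next client
        return {"type": "stopped"}
```

A second `start` while a session is active is rejected with an error reply, and `stop` releases the lock so a new session can begin.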
Session Management
- Sessions are created on client connect with mode selection (audio-only, camera, or screen)
- Background async tasks handle audio capture, processing, and playback concurrently
- Graceful disconnection with resource cleanup
- API key validation and error propagation
Multimodal Input (Optional)
Beyond voice, the system supports optional visual context:
- Camera Mode: Streams webcam frames (1 fps) for visual context in conversations
- Screen Mode: Captures screen content for discussing on-screen information
- Images are resized and compressed before transmission
- Visual context enhances the AI's ability to provide relevant responses
Frontend Interface
- Session Control: Start/stop listening with clear status indicators
- Status Display: Real-time connection and session state (idle, connecting, active, error)
- Theme Support: Light/dark mode with persistence
- Guided Walkthrough: Step-by-step demo for first-time users
- WebSocket Management: Automatic reconnection logic
AI Model Configuration
- Native audio modality (no separate STT/TTS pipeline)
- Configurable voice selection from multiple preset voices
- System instructions defining assistant personality, response style, and language handling
- Tool definitions for all available functions with parameter schemas
- Automatic language detection with same-language response
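A configuration of this kind might look like the plain dictionaries below. The layout is modeled on Gemini-style function declarations (a name, a description, and a JSON-schema-like `parameters` object) and should be checked against the current SDK documentation; the function name, instruction text, and the `missing_required_args` helper are hypothetical.

```python
# Hypothetical tool declaration: name, description, and parameter schema.
ADJUST_FOR_MISSED_MEAL = {
    "name": "adjust_for_missed_meal",
    "description": "Redistribute the macros of a missed meal across remaining meals.",
    "parameters": {
        "type": "object",
        "properties": {
            "meal": {"type": "string", "description": "Which meal was missed"},
        },
        "required": ["meal"],
    },
}

# Assumed session configuration: audio responses, instructions, and tools.
SESSION_CONFIG = {
    "response_modalities": ["AUDIO"],
    "system_instruction": (
        "You are a friendly nutrition assistant. Keep answers short and "
        "reply in the language the user speaks."
    ),
    "tools": [{"function_declarations": [ADJUST_FOR_MISSED_MEAL]}],
}


def missing_required_args(declaration: dict, args: dict) -> list[str]:
    """List required parameters that the model's call did not supply."""
    required = declaration["parameters"].get("required", [])
    return [name for name in required if name not in args]
```

Checking required arguments before dispatch lets the backend return a clear error to the model instead of failing inside the handler.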
Key Features
- Sub-Second Latency: Native audio model eliminates STT/TTS pipeline overhead
- Real-Time Bidirectional Audio: Continuous streaming with under 50 ms per-chunk latency
- Function Calling: Domain-specific calculations executed mid-conversation
- Natural Interruption: Users can interrupt the assistant without special commands
- Multi-Language: Automatic language detection with same-language responses
- Multimodal Input: Optional camera and screen context for visual understanding
- Session Management: Session lifecycle control with locking and resource cleanup
- Macro Calculations: Dynamic nutritional adjustments with per-food macro profiles
- Error Recovery: Graceful handling of function failures and network interruptions
- Extensible: New functions are added by defining a schema and a handler, with no architecture changes