Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming
A fitness and nutrition platform needed a voice-first AI assistant that could hold a natural conversation in real time, execute domain-specific calculations (meal adjustments, calorie tracking), and speak its responses back, all with sub-second latency so the exchange feels conversational.
The Challenge
Building a production-grade voice AI assistant presented unique real-time engineering challenges:
- Latency — Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
- Function Calling — The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
- Audio Streaming — Bidirectional audio needed to flow continuously without buffering gaps or echo issues
- Context Awareness — The assistant needed to maintain conversation context across turns while handling interruptions
- Multi-Language — Users spoke in different languages and expected responses in the same language
- Session Isolation — Each voice session needed independent state management without cross-talk
Our Solution
We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.
Architecture
- AI Model: Gemini with native audio input/output and function calling
- Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
- Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
- Frontend: React with Vite and Tailwind CSS for session control UI
- Communication: WebSocket for low-latency JSON messaging and binary audio transport
- Multimodal: Optional camera and screen capture for visual context
Real-Time Audio Pipeline
Bidirectional Streaming
The system maintains continuous audio streams in both directions:
- Input: Microphone audio captured at 16 kHz mono, chunked into small frames, and streamed to the AI model in real time
- Output: AI-generated speech received at 24 kHz and played through the speakers immediately
- No Batching: Audio chunks are sent as captured — no accumulation delays
- Interrupt Handling: User can interrupt the assistant mid-response naturally
Audio Processing
- 16-bit PCM format for both input and output
- Separate sample rates optimized for speech (16 kHz capture, 24 kHz playback)
- Small buffer sizes for minimal latency
- Continuous streaming with no start/stop gaps between turns
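To make the buffer arithmetic concrete, the following is a minimal sketch of how frame size relates to frame duration for 16-bit mono PCM. The 20 ms frame duration and the helper names `frame_bytes` and `split_frames` are assumptions for illustration; the project only states that frames and buffers are kept small.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1          # mono


def frame_bytes(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the number of bytes in one PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_frames(pcm: bytes, size: int) -> list:
    """Split a PCM buffer into fixed-size frames; a shorter final frame is kept."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


# Assumed 20 ms frames: 640 bytes at 16 kHz capture, 960 bytes at 24 kHz playback.
CAPTURE_FRAME = frame_bytes(16_000, 20)
PLAYBACK_FRAME = frame_bytes(24_000, 20)
```

Each frame adds at least its own duration to the end-to-end delay, which is why smaller frames reduce latency at the cost of more messages per second.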
Function Calling Integration
How It Works
The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:
- User speaks a request (e.g., "I missed lunch today")
- AI model interprets the spoken audio and infers the intent
- Model determines a function call is needed and sends a structured request
- Backend extracts function name, arguments, and call ID
- Local function executes the domain calculation
- Result sent back to the model as a structured response
- Model generates a natural language voice response incorporating the result
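The backend side of this flow can be sketched as a small dispatcher. This is an illustrative outline, not the project's code: the `FunctionCall` shape, the `register` decorator, the response dictionary layout, and the `estimate_calorie_burn` handler with its placeholder rates are all hypothetical. How the call arrives from the model and how the response is sent back depends on the SDK in use.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class FunctionCall:
    """Hypothetical shape of a tool call extracted from a model message."""
    call_id: str
    name: str
    args: Dict[str, Any]


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def register(name: str):
    """Register a local Python function under the name the model will use."""
    def wrap(fn):
        HANDLERS[name] = fn
        return fn
    return wrap


@register("estimate_calorie_burn")
def estimate_calorie_burn(activity: str, minutes: int) -> Dict[str, Any]:
    # Placeholder rates (kcal per minute) for illustration only, not real data.
    rates = {"walking": 4, "running": 10}
    return {"kcal": rates.get(activity, 5) * minutes}


def dispatch(call: FunctionCall) -> Dict[str, Any]:
    """Run the matching handler and build a structured response keyed by call ID."""
    handler = HANDLERS.get(call.name)
    if handler is None:
        payload = {"error": f"unknown function: {call.name}"}
    else:
        try:
            payload = handler(**call.args)
        except Exception as exc:  # report failures to the model instead of crashing
            payload = {"error": str(exc)}
    return {"id": call.call_id, "name": call.name, "response": payload}
```

Echoing the call ID in the response lets the model match each result to the request that produced it.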
Domain Functions
The system supports nutrition-focused function calling for scenarios like:
- Missed Meals — Redistributes missed macronutrients across remaining meals
- Unplanned Food — Adjusts upcoming meals to compensate for unexpected intake
- Meal Substitutions — Swaps ingredients while maintaining macro targets
- Activity Tracking — Estimates calorie burn and adjusts nutrition buffer
Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
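As one possible reading of the missed-meal case, the sketch below spreads the missed macronutrients evenly across the remaining meals and applies an optional random jitter. The even split, the `jitter` parameter, and the function name `redistribute_missed_meal` are assumptions; the actual redistribution rule and macro database are not described here.

```python
import random
from typing import Dict, List, Optional

Macros = Dict[str, float]  # grams per macronutrient, e.g. {"protein": 40.0}


def redistribute_missed_meal(
    missed: Macros,
    remaining: List[Macros],
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[Macros]:
    """Add an equal share of each missed macro to every remaining meal.

    `jitter` scales each share by a random factor in [1 - jitter, 1 + jitter],
    so with jitter > 0 the totals are only approximately preserved.
    """
    if not remaining:
        raise ValueError("no remaining meals to absorb the missed macros")
    rng = rng or random.Random()
    adjusted = []
    for meal in remaining:
        new_meal = dict(meal)  # leave the caller's data untouched
        for macro, grams in missed.items():
            share = grams / len(remaining)
            factor = 1.0 + rng.uniform(-jitter, jitter)
            new_meal[macro] = round(new_meal.get(macro, 0.0) + share * factor, 1)
        adjusted.append(new_meal)
    return adjusted
```

Passing a seeded `random.Random` makes the variation reproducible in tests, while `jitter=0.0` gives a fully deterministic split.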
Execution Safety
- Microphone input is paused during function execution to prevent overlap
- Pending audio frames are dropped to avoid stale context
- Error responses are sent back gracefully if function execution fails
- Normal streaming resumes immediately after function completion
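These safeguards can be sketched with a small gate around the microphone queue. The `MicGate` class and its method names are hypothetical, and running the handler in a worker thread via `asyncio.to_thread` is an assumption about how a blocking function might be kept off the event loop.

```python
import asyncio


class MicGate:
    """Gates microphone frames while a function call is running (illustrative)."""

    def __init__(self) -> None:
        self.queue = asyncio.Queue()
        self.paused = False

    def push(self, frame: bytes) -> bool:
        """Queue a captured frame unless input is paused; return whether it was kept."""
        if self.paused:
            return False
        self.queue.put_nowait(frame)
        return True

    def drain(self) -> int:
        """Drop all pending frames so stale audio is never sent; return the count."""
        dropped = 0
        while True:
            try:
                self.queue.get_nowait()
            except asyncio.QueueEmpty:
                return dropped
            dropped += 1

    async def run_function(self, fn, *args):
        """Pause input, drop pending frames, run fn, and always resume afterwards."""
        self.paused = True
        self.drain()
        try:
            return {"result": await asyncio.to_thread(fn, *args)}
        except Exception as exc:  # surface the failure as a structured error
            return {"error": str(exc)}
        finally:
            self.paused = False
```

The `finally` clause is the important part: input resumes whether the function succeeds or fails, so a failing handler cannot leave the session muted.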
Backend Architecture
FastAPI WebSocket Server
- Single WebSocket endpoint for all client communication
- Session lifecycle management (start, stop, ping/pong health checks)
- One active session at a time with session locking
- CORS middleware for development environments
- Health check endpoint for monitoring
Session Management
- Sessions are created on client connect with mode selection (audio-only, camera, or screen)
- Background async tasks handle audio capture, processing, and playback concurrently
- Graceful disconnection with resource cleanup
- API key validation and error propagation
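The single-session rule and the message lifecycle can be illustrated without any web framework. In the sketch below, the message shapes (`type`, `mode`, `session_id`), the `SessionManager` class, and `handle_message` are hypothetical; in a FastAPI application this logic would sit inside the WebSocket endpoint's receive loop.

```python
from typing import Optional

MODES = {"audio", "camera", "screen"}


class SessionManager:
    """Allows at most one active session at a time (illustrative)."""

    def __init__(self) -> None:
        self._active: Optional[str] = None

    def start(self, session_id: str, mode: str = "audio") -> dict:
        if mode not in MODES:
            return {"type": "error", "message": f"unknown mode: {mode}"}
        if self._active is not None:
            return {"type": "error", "message": "a session is already active"}
        self._active = session_id
        return {"type": "started", "session_id": session_id, "mode": mode}

    def stop(self, session_id: str) -> dict:
        if self._active != session_id:
            return {"type": "error", "message": "session is not active"}
        self._active = None  # release the lock so another client can start
        return {"type": "stopped", "session_id": session_id}


def handle_message(manager: SessionManager, session_id: str, msg: dict) -> dict:
    """Route one decoded JSON message from a client to the session manager."""
    kind = msg.get("type")
    if kind == "ping":
        return {"type": "pong"}
    if kind == "start":
        return manager.start(session_id, msg.get("mode", "audio"))
    if kind == "stop":
        return manager.stop(session_id)
    return {"type": "error", "message": f"unknown message type: {kind}"}
```

A disconnect handler would call `stop` for the disconnecting client so that an abrupt drop does not leave the lock held.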
Multimodal Input (Optional)
Beyond voice, the system supports optional visual context:
- Camera Mode — Streams webcam frames (1fps) for visual context in conversations
- Screen Mode — Captures screen content for discussing on-screen information
- Images are resized and compressed before transmission
- Visual context enhances the AI's ability to provide relevant responses
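Limiting camera frames to roughly one per second can be done with a simple time-based throttle, sketched below. The `FrameThrottle` class is hypothetical, and the caller is assumed to pass a monotonic timestamp such as `time.monotonic()`; resizing and compression are left out because the image library used is not stated.

```python
class FrameThrottle:
    """Lets a frame through only if enough time has passed since the last one."""

    def __init__(self, fps: float) -> None:
        self.interval = 1.0 / fps
        self._last = None  # timestamp of the last frame that was sent

    def should_send(self, now: float) -> bool:
        """Return True and record `now` if the frame should be transmitted."""
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False
```

Frames that arrive between ticks are simply skipped, which keeps bandwidth bounded regardless of the camera's native frame rate.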
Frontend Interface
- Session Control — Start/stop listening with clear status indicators
- Status Display — Real-time connection and session state (idle, connecting, active, error)
- Theme Support — Light/dark mode with persistence
- Guided Walkthrough — Step-by-step demo for first-time users
- WebSocket Management — Automatic reconnection logic
AI Model Configuration
- Native audio modality (no separate STT/TTS pipeline)
- Configurable voice selection from multiple preset voices
- System instructions defining assistant personality, response style, and language handling
- Tool definitions for all available functions with parameter schemas
- Automatic language detection with same-language response
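A tool definition typically pairs a function name and description with a JSON-Schema-style parameter object. The example below is a hedged sketch: the tool name `handle_missed_meal`, its parameters, and the `missing_required` helper are hypothetical, and the exact declaration format accepted by the model depends on the SDK and API version.

```python
# Hypothetical tool declaration in a JSON-Schema-like layout.
MISSED_MEAL_TOOL = {
    "name": "handle_missed_meal",
    "description": "Redistribute the macros of a missed meal across remaining meals.",
    "parameters": {
        "type": "object",
        "properties": {
            "meal": {
                "type": "string",
                "enum": ["breakfast", "lunch", "dinner", "snack"],
            },
        },
        "required": ["meal"],
    },
}


def missing_required(tool: dict, args: dict) -> list:
    """Return the required parameter names that are absent from args."""
    required = tool["parameters"].get("required", [])
    return [name for name in required if name not in args]
```

Checking arguments against the schema before calling the handler gives the model a clear error to react to when it omits a parameter.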
Key Features
- Sub-Second Latency — Native audio model eliminates STT/TTS pipeline overhead
- Real-Time Bidirectional Audio — Continuous streaming with < 50ms per-chunk latency
- Function Calling — Domain-specific calculations executed mid-conversation
- Natural Interruption — Users can interrupt the assistant naturally without special commands
- Multi-Language — Automatic language detection with same-language responses
- Multimodal Input — Optional camera and screen context for visual understanding
- Session Management — Session lifecycle control with locking and resource cleanup
- Macro Calculations — Dynamic nutritional adjustments with per-food macro profiles
- Error Recovery — Graceful handling of function failures and network interruptions
- Extensible — New functions added by defining schema and handler — no architecture changes