Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming
A fitness and nutrition platform needed a voice-first AI assistant that could hold a natural conversation with users in real time, execute domain-specific calculations (meal adjustments, calorie tracking), and speak its responses back, all with sub-second latency so that exchanges feel conversational rather than turn-based.
The Challenge
Building a production-grade voice AI assistant presented unique real-time engineering challenges:
- Latency — Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
- Function Calling — The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
- Audio Streaming — Bidirectional audio needed to flow continuously without buffering gaps or echo issues
- Context Awareness — The assistant needed to maintain conversation context across turns while handling interruptions
- Multi-Language — Users spoke in different languages and expected responses in the same language
- Session Isolation — Each voice session needed independent state management without cross-talk
Our Solution
We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.
Architecture
- AI Model: Gemini with native audio input/output and function calling
- Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
- Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
- Frontend: React with Vite and Tailwind CSS for session control UI
- Communication: WebSocket for low-latency JSON messaging and binary audio transport
- Multimodal: Optional camera and screen capture for visual context
Real-Time Audio Pipeline
Bidirectional Streaming
The system maintains continuous audio streams in both directions:
- Input: Microphone audio is captured at 16kHz mono, chunked into small frames, and streamed to the AI model in real time
- Output: AI-generated speech received at 24kHz and played through speakers immediately
- No Batching: Audio chunks are sent as captured — no accumulation delays
- Interrupt Handling: User can interrupt the assistant mid-response naturally
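The two directions described above can be sketched as a pair of concurrent asyncio loops. This is a minimal illustration only, not the project's code: the queues stand in for the microphone, the model connection, and the speaker, and the `INTERRUPTED` marker is a hypothetical signal for "the user started speaking over the assistant".

```python
import asyncio

INTERRUPTED = object()  # hypothetical marker: the user interrupted the assistant


async def send_loop(mic_frames: asyncio.Queue, sent: list) -> None:
    """Forward every captured frame as soon as it arrives (no batching)."""
    while True:
        frame = await mic_frames.get()
        if frame is None:  # sentinel: capture stopped
            break
        sent.append(frame)  # stand-in for the real "send to model" call


async def receive_loop(model_events: asyncio.Queue, playback: asyncio.Queue) -> None:
    """Queue model audio for playback; drop queued audio on interruption."""
    while True:
        event = await model_events.get()
        if event is None:  # sentinel: model stream closed
            break
        if event is INTERRUPTED:
            while not playback.empty():  # discard speech that is now stale
                playback.get_nowait()
            continue
        playback.put_nowait(event)


async def demo():
    mic, model, playback = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    sent = []
    for frame in (b"a", b"b", None):
        mic.put_nowait(frame)
    for event in (b"x", b"y", INTERRUPTED, b"z", None):
        model.put_nowait(event)
    # Both directions run concurrently, neither waits for the other.
    await asyncio.gather(send_loop(mic, sent), receive_loop(model, playback))
    queued = []
    while not playback.empty():
        queued.append(playback.get_nowait())
    return sent, queued
```

Running `asyncio.run(demo())` shows that frames are forwarded one by one and that audio queued before the interruption is discarded, so only the audio that arrives afterwards remains to be played.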
Audio Processing
- 16-bit PCM format for both input and output
- Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
- Small buffer sizes for minimal latency
- Continuous streaming with no start/stop gaps between turns
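To make the buffer sizes concrete, the arithmetic for 16-bit PCM frames can be sketched as follows. The 20 ms frame duration is an assumption chosen for illustration; the source only says that frames and buffers are small.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def frame_bytes(sample_rate_hz: int, frame_ms: int, channels: int = 1) -> int:
    """Size in bytes of one PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * channels * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, size: int) -> list[bytes]:
    """Cut a PCM buffer into consecutive frames of at most `size` bytes."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


CAPTURE_FRAME = frame_bytes(16_000, 20)   # 320 samples -> 640 bytes
PLAYBACK_FRAME = frame_bytes(24_000, 20)  # 480 samples -> 960 bytes
```

Under this assumption, one second of captured audio (32,000 bytes) becomes 50 frames of 640 bytes each, which keeps the per-chunk delay small compared with sending whole utterances.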
Function Calling Integration
How It Works
The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:
- User speaks a request (e.g., "I missed lunch today")
- AI model interprets the speech directly (no separate transcription step) and infers the intent
- Model determines a function call is needed and sends a structured request
- Backend extracts function name, arguments, and call ID
- Local function executes the domain calculation
- Result sent back to the model as a structured response
- Model generates a natural language voice response incorporating the result
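The backend side of this flow (extract name, arguments, and call ID, run the local function, return a structured result) might look roughly like the sketch below. The message shape, the `estimate_calorie_burn` handler, and the calorie rates are all hypothetical and exist only to make the example runnable.

```python
from typing import Any, Callable

# Hypothetical rates in kcal per minute, for illustration only.
KCAL_PER_MINUTE = {"walking": 4, "running": 10}


def estimate_calorie_burn(activity: str, minutes: int) -> dict:
    """Example domain function: rough calorie burn for an activity."""
    return {"kcal": KCAL_PER_MINUTE[activity] * minutes}


# Registry mapping the function names exposed to the model to local handlers.
HANDLERS: dict[str, Callable[..., Any]] = {
    "estimate_calorie_burn": estimate_calorie_burn,
}


def handle_function_call(call: dict) -> dict:
    """Run the requested function and wrap the outcome with the call ID."""
    name, call_id = call["name"], call["id"]
    args = call.get("args", {})
    handler = HANDLERS.get(name)
    if handler is None:
        return {"id": call_id, "name": name,
                "response": {"error": f"unknown function: {name}"}}
    try:
        result = handler(**args)
    except Exception as exc:  # report the failure instead of crashing the session
        return {"id": call_id, "name": name, "response": {"error": str(exc)}}
    return {"id": call_id, "name": name, "response": {"result": result}}
```

Echoing the call ID back lets the model match each result to the request that produced it, which matters when several calls are in flight.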
Domain Functions
The system supports nutrition-focused function calling for scenarios like:
- Missed Meals — Redistributes missed macronutrients across remaining meals
- Unplanned Food — Adjusts upcoming meals to compensate for unexpected intake
- Meal Substitutions — Swaps ingredients while maintaining macro targets
- Activity Tracking — Estimates calorie burn and adjusts nutrition buffer
Each function draws on a macro database of per-food nutritional profiles and computes its adjustments dynamically, adding slight random variation so that responses feel natural rather than formulaic.
Execution Safety
- Microphone input is paused during function execution to prevent overlap
- Pending audio frames are dropped to avoid stale context
- Error responses are sent back gracefully if function execution fails
- Normal streaming resumes immediately after function completion
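One way to picture these safeguards is a small gate in front of the microphone queue. The class and function names below are hypothetical, and the sketch is synchronous for brevity, whereas the described system runs its audio tasks asynchronously.

```python
from collections import deque


class MicGate:
    """Holds captured frames and can be paused while a function runs."""

    def __init__(self) -> None:
        self.paused = False
        self.pending: deque[bytes] = deque()

    def offer(self, frame: bytes) -> bool:
        """Accept a captured frame unless the gate is paused."""
        if self.paused:
            return False
        self.pending.append(frame)
        return True

    def pause(self) -> None:
        self.paused = True
        self.pending.clear()  # drop frames that would carry stale context

    def resume(self) -> None:
        self.paused = False


def run_tool(gate: MicGate, func, **kwargs) -> dict:
    """Execute a function with the microphone paused; always resume afterwards."""
    gate.pause()
    try:
        return {"result": func(**kwargs)}
    except Exception as exc:  # surface the failure as a structured error
        return {"error": str(exc)}
    finally:
        gate.resume()
```

Putting `resume()` in the `finally` clause is the important detail: streaming restarts whether the function succeeds or raises.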
Backend Architecture
FastAPI WebSocket Server
- Single WebSocket endpoint for all client communication
- Session lifecycle management (start, stop, ping/pong health checks)
- One active session at a time with session locking
- CORS middleware for development environments
- Health check endpoint for monitoring
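The message handling behind such an endpoint can be sketched independently of the web framework. The JSON message types (`start`, `stop`, `ping`) and reply shapes below are assumptions for illustration; in the described system this logic would sit inside the FastAPI WebSocket handler.

```python
import json


class SessionRegistry:
    """Allows one active session at a time and answers control messages."""

    def __init__(self) -> None:
        self.active_client: str | None = None

    def handle(self, client_id: str, raw: str) -> dict:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "ping":
            return {"type": "pong"}
        if kind == "start":
            # Session lock: refuse if a different client already holds it.
            if self.active_client not in (None, client_id):
                return {"type": "error", "message": "another session is active"}
            self.active_client = client_id
            return {"type": "status", "state": "active",
                    "mode": msg.get("mode", "audio")}
        if kind == "stop":
            if self.active_client == client_id:
                self.active_client = None  # release the lock
            return {"type": "status", "state": "idle"}
        return {"type": "error", "message": f"unknown message type: {kind}"}
```

Keeping this logic free of framework types makes the locking behaviour easy to test without opening a socket.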
Session Management
- Sessions are created on client connect with mode selection (audio-only, camera, or screen)
- Background async tasks handle audio capture, processing, and playback concurrently
- Graceful disconnection with resource cleanup
- API key validation and error propagation
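A session with concurrent background tasks and graceful cleanup could be structured roughly as below. The worker bodies are placeholders, and the class name, task names, and mode strings are assumptions rather than the project's actual identifiers.

```python
import asyncio

MODES = {"audio", "camera", "screen"}


async def worker(name: str, log: list) -> None:
    """Placeholder for a capture, processing, or playback loop."""
    try:
        while True:
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        log.append(f"{name} stopped")  # release devices or buffers here
        raise


class Session:
    def __init__(self, mode: str, api_key: str) -> None:
        if not api_key:
            raise ValueError("missing API key")
        if mode not in MODES:
            raise ValueError(f"unsupported mode: {mode}")
        self.mode = mode
        self.log: list[str] = []
        self.tasks: list[asyncio.Task] = []

    async def start(self) -> None:
        """Launch the background loops so they run concurrently."""
        self.tasks = [asyncio.create_task(worker(name, self.log))
                      for name in ("capture", "process", "playback")]

    async def stop(self) -> None:
        """Cancel every task and wait until each has cleaned up."""
        for task in self.tasks:
            task.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)
        self.tasks.clear()
```

Validating the key and mode in the constructor means a bad request fails before any audio device or task is created, so there is nothing to clean up on that path.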
Multimodal Input (Optional)
Beyond voice, the system supports optional visual context:
- Camera Mode — Streams webcam frames (1fps) for visual context in conversations
- Screen Mode — Captures screen content for discussing on-screen information
- Images are resized and compressed before transmission
- Visual context enhances the AI's ability to provide relevant responses
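The frame-rate limit and the resize step reduce to two small calculations, sketched here without an imaging library. The 1024-pixel maximum side is an assumed value; the source states only the 1fps rate and that images are resized and compressed.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Target size that keeps the aspect ratio and never upscales."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


class FrameThrottle:
    """Lets at most `fps` frames per second through."""

    def __init__(self, fps: float = 1.0) -> None:
        self.interval = 1.0 / fps
        self.last_sent: float | None = None

    def should_send(self, now: float) -> bool:
        """Return True if enough time has passed since the last sent frame."""
        if self.last_sent is None or now - self.last_sent >= self.interval:
            self.last_sent = now
            return True
        return False
```

In a capture loop, `should_send(time.monotonic())` would decide whether the current frame is resized to `fit_within(...)` and transmitted or simply skipped.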
Frontend Interface
- Session Control — Start/stop listening with clear status indicators
- Status Display — Real-time connection and session state (idle, connecting, active, error)
- Theme Support — Light/dark mode with persistence
- Guided Walkthrough — Step-by-step demo for first-time users
- WebSocket Management — Automatic reconnection logic
AI Model Configuration
- Native audio modality (no separate STT/TTS pipeline)
- Configurable voice selection from multiple preset voices
- System instructions defining assistant personality, response style, and language handling
- Tool definitions for all available functions with parameter schemas
- Automatic language detection with same-language response
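A configuration of this kind might be represented as below. These are plain dictionaries written for illustration, not the exact structure the Gemini SDK expects; the voice name, instruction text, and the `adjust_for_missed_meal` tool are placeholders, and the helper that checks arguments against a tool's schema is hypothetical.

```python
SESSION_CONFIG = {
    "response_modalities": ["AUDIO"],       # native audio out, no separate TTS
    "voice_name": "example-voice",          # placeholder, not a real preset name
    "system_instruction": (
        "You are a friendly nutrition assistant. Keep answers short and "
        "reply in the language the user speaks."
    ),
    "tools": [
        {
            "name": "adjust_for_missed_meal",   # hypothetical function
            "description": "Redistribute a missed meal's macros across later meals.",
            "parameters": {
                "type": "object",
                "properties": {"meal": {"type": "string"}},
                "required": ["meal"],
            },
        }
    ],
}


def missing_arguments(tool_name: str, args: dict) -> list[str]:
    """Required parameters of a declared tool that `args` does not supply."""
    for tool in SESSION_CONFIG["tools"]:
        if tool["name"] == tool_name:
            return [p for p in tool["parameters"]["required"] if p not in args]
    raise KeyError(f"tool not declared: {tool_name}")
```

Checking arguments against the declared schema before calling a handler gives a clear error to send back to the model when a required parameter is absent.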
Key Features
- Sub-Second Latency — Native audio model eliminates STT/TTS pipeline overhead
- Real-Time Bidirectional Audio — Continuous streaming with < 50ms per-chunk latency
- Function Calling — Domain-specific calculations executed mid-conversation
- Natural Interruption — Users can interrupt the assistant naturally without special commands
- Multi-Language — Automatic language detection with same-language responses
- Multimodal Input — Optional camera and screen context for visual understanding
- Session Management — Session lifecycle control with locking and resource cleanup
- Macro Calculations — Dynamic nutritional adjustments with per-food macro profiles
- Error Recovery — Graceful handling of function failures and network interruptions
- Extensible — New functions added by defining schema and handler — no architecture changes