Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming
A fitness and nutrition platform needed a voice-first AI assistant that could respond to users in real time with natural conversation, execute domain-specific calculations (meal adjustments, calorie tracking), and speak responses back, all with sub-second latency for a truly conversational experience.
The Challenge
Building a production-grade voice AI assistant presented unique real-time engineering challenges:
- Latency: Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
- Function Calling: The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
- Audio Streaming: Bidirectional audio needed to flow continuously without buffering gaps or echo issues
- Context Awareness: The assistant needed to maintain conversation context across turns while handling interruptions
- Multi-Language: Users spoke in different languages and expected responses in the same language
- Session Isolation: Each voice session needed independent state management without cross-talk
Our Solution
We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.
Architecture
- AI Model: Gemini with native audio input/output and function calling
- Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
- Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
- Frontend: React with Vite and Tailwind CSS for session control UI
- Communication: WebSocket for low-latency JSON messaging and binary audio transport
- Multimodal: Optional camera and screen capture for visual context
Real-Time Audio Pipeline
Bidirectional Streaming
The system maintains continuous audio streams in both directions:
- Input: Microphone audio captured at 16 kHz mono, chunked into small frames, and streamed to the AI model in real time
- Output: AI-generated speech received at 24 kHz and played through speakers immediately
- No Batching: Audio chunks are sent as captured, with no accumulation delays
- Interrupt Handling: User can interrupt the assistant mid-response naturally
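One way to picture interrupt handling is a playback buffer that is flushed the moment the user starts speaking, so stale assistant audio is never played. The following is a minimal sketch under that assumption; `PlaybackBuffer` and its methods are hypothetical names, not taken from the project's code.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Holds assistant audio chunks that have arrived but not yet been played."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def push(self, chunk: bytes) -> None:
        # Called whenever a chunk of model audio arrives.
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        # Called by the speaker loop; returns None when nothing is queued.
        return self._chunks.popleft() if self._chunks else None

    def interrupt(self) -> int:
        # The user started talking: discard queued audio and report how much was dropped.
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped
```

In this sketch the speaker loop calls `next_chunk()` continuously, and whatever signals the interruption calls `interrupt()` once.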
Audio Processing
- 16-bit PCM format for both input and output
- Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
- Small buffer sizes for minimal latency
- Continuous streaming with no start/stop gaps between turns
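The link between buffer size and latency is simple arithmetic: a frame of N milliseconds cannot be sent until N milliseconds of audio have been captured. The sketch below computes frame sizes for 16-bit mono PCM; the 20 ms frame duration is an assumption chosen for illustration, since the actual frame size is not stated here.

```python
from typing import List


def frame_bytes(sample_rate_hz: int, frame_ms: int,
                sample_width: int = 2, channels: int = 1) -> int:
    """Bytes in one PCM frame (16-bit samples are 2 bytes wide)."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * sample_width * channels


def split_frames(pcm: bytes, size: int) -> List[bytes]:
    """Cut a PCM byte string into consecutive frames of `size` bytes."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


capture_frame = frame_bytes(16_000, 20)   # 320 samples * 2 bytes = 640 bytes
playback_frame = frame_bytes(24_000, 20)  # 480 samples * 2 bytes = 960 bytes
one_second = bytes(16_000 * 2)            # one second of silence at 16 kHz
frames = split_frames(one_second, capture_frame)  # 50 frames of 640 bytes
```

Halving the frame duration halves the capture delay per chunk at the cost of twice as many messages.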
Function Calling Integration
How It Works
The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:
- User speaks a request (e.g., "I missed lunch today")
- AI model processes the audio natively and identifies the intent
- Model determines a function call is needed and sends a structured request
- Backend extracts function name, arguments, and call ID
- Local function executes the domain calculation
- Result sent back to the model as a structured response
- Model generates a natural language voice response incorporating the result
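The extract, execute, respond portion of these steps can be sketched as a small dispatcher. The message shape (`id`, `name`, `args`) and the `log_missed_meal` handler are assumptions made for illustration; the real wire format depends on the model SDK in use.

```python
from typing import Any, Callable, Dict


def log_missed_meal(meal: str) -> Dict[str, Any]:
    """Hypothetical domain function standing in for the real calculation."""
    return {"meal": meal, "status": "redistributed"}


# Maps the function name the model asks for to a local Python callable.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "log_missed_meal": log_missed_meal,
}


def handle_function_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Run the requested function and build a structured response for the model."""
    call_id, name = call["id"], call["name"]
    args = call.get("args", {})
    handler = HANDLERS.get(name)
    if handler is None:
        payload = {"error": f"unknown function: {name}"}
    else:
        try:
            payload = {"result": handler(**args)}
        except Exception as exc:  # report failures to the model instead of crashing
            payload = {"error": str(exc)}
    # Echo the call ID so the model can match the response to its request.
    return {"id": call_id, "name": name, "response": payload}
```

Returning an error payload rather than raising keeps the conversation alive, and the model can then explain the failure to the user in speech.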
Domain Functions
The system supports nutrition-focused function calling for scenarios like:
- Missed Meals: Redistributes missed macronutrients across remaining meals
- Unplanned Food: Adjusts upcoming meals to compensate for unexpected intake
- Meal Substitutions: Swaps ingredients while maintaining macro targets
- Activity Tracking: Estimates calorie burn and adjusts nutrition buffer
Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
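As an illustration of the missed-meal case, the sketch below spreads a missed meal's macros across the remaining meals with a small random weighting. The function name, the 5% jitter, and the input shape are all assumptions; the macro database lookup is omitted, so the missed macros are passed in directly.

```python
import random
from typing import Dict, List, Optional


def redistribute(missed: Dict[str, float], remaining_meals: List[str],
                 jitter: float = 0.05,
                 rng: Optional[random.Random] = None) -> Dict[str, Dict[str, float]]:
    """Split missed macros across remaining meals using slightly randomized weights."""
    if not remaining_meals:
        raise ValueError("no remaining meals to absorb the missed macros")
    rng = rng or random.Random()
    # Each meal gets a weight near 1.0; normalizing keeps the overall total intact.
    weights = [1.0 + rng.uniform(-jitter, jitter) for _ in remaining_meals]
    total = sum(weights)
    return {
        meal: {macro: round(amount * w / total, 1) for macro, amount in missed.items()}
        for meal, w in zip(remaining_meals, weights)
    }
```

Because the weights are normalized, the per-meal additions still sum to the missed amount (up to rounding), so the variation changes how the adjustment is distributed, not the daily target.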
Execution Safety
- Microphone input is paused during function execution to prevent overlap
- Pending audio frames are dropped to avoid stale context
- Error responses are sent back gracefully if function execution fails
- Normal streaming resumes immediately after function completion
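These four safeguards can be sketched with a simple gate around the microphone queue. `MicGate` and its method names are hypothetical, and the sketch assumes the domain function is a blocking callable that is run in a worker thread.

```python
import asyncio
from typing import Any, Callable, Dict


class MicGate:
    """Pauses microphone intake while a function call is being executed."""

    def __init__(self) -> None:
        self.paused = False
        self.pending: asyncio.Queue = asyncio.Queue()

    def offer(self, frame: bytes) -> bool:
        # Frames captured during a function call are rejected rather than queued.
        if self.paused:
            return False
        self.pending.put_nowait(frame)
        return True

    async def run_tool(self, fn: Callable[..., Any], *args: Any) -> Dict[str, Any]:
        self.paused = True
        while not self.pending.empty():  # drop stale frames captured before the call
            self.pending.get_nowait()
        try:
            result = await asyncio.to_thread(fn, *args)
            return {"ok": True, "result": result}
        except Exception as exc:  # the failure becomes a response, not a crash
            return {"ok": False, "error": str(exc)}
        finally:
            self.paused = False  # streaming resumes whether or not the call succeeded
```

Putting the resume step in `finally` means a failing function cannot leave the microphone muted.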
Backend Architecture
FastAPI WebSocket Server
- Single WebSocket endpoint for all client communication
- Session lifecycle management (start, stop, ping/pong health checks)
- One active session at a time with session locking
- CORS middleware for development environments
- Health check endpoint for monitoring
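The lifecycle and locking rules can be expressed independently of the web framework. The sketch below is a framework-agnostic message handler; the message types (`start`, `stop`, `ping`) and field names are assumptions, and in a FastAPI deployment a handler like this would be called from inside the WebSocket receive loop.

```python
import json
from typing import Any, Dict, Optional


class SessionGate:
    """Allows one active session at a time and answers lifecycle messages."""

    def __init__(self) -> None:
        self.active_client: Optional[str] = None

    def handle(self, client_id: str, raw: str) -> Dict[str, Any]:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "ping":
            return {"type": "pong"}
        if kind == "start":
            # Refuse a second client while another one holds the session.
            if self.active_client not in (None, client_id):
                return {"type": "error", "message": "session already active"}
            self.active_client = client_id
            return {"type": "status", "state": "active", "mode": msg.get("mode", "audio")}
        if kind == "stop":
            # Only the owner can release the session.
            if self.active_client == client_id:
                self.active_client = None
            return {"type": "status", "state": "idle"}
        return {"type": "error", "message": f"unknown message type: {kind}"}
```

Keeping this logic free of I/O makes the locking behaviour easy to unit test without opening a socket.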
Session Management
- Sessions are created on client connect with mode selection (audio-only, camera, or screen)
- Background async tasks handle audio capture, processing, and playback concurrently
- Graceful disconnection with resource cleanup
- API key validation and error propagation
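Concurrent background tasks with graceful cleanup map naturally onto asyncio. The following is a minimal sketch, assuming each worker (capture, processing, playback) is a long-running coroutine function and that a stop event is set on disconnect; `run_session` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_session(workers: List[Callable[[], Awaitable[None]]],
                      stop: asyncio.Event) -> List[str]:
    """Run workers concurrently until `stop` is set, then cancel them all."""
    tasks = [asyncio.create_task(w(), name=w.__name__) for w in workers]
    try:
        await stop.wait()
    finally:
        for task in tasks:
            task.cancel()
        # Wait for every task to finish unwinding so no resource is left half-open.
        await asyncio.gather(*tasks, return_exceptions=True)
    return [task.get_name() for task in tasks if task.cancelled()]
```

This sketch only reacts to the stop event; a production version would also need to watch for a worker failing on its own and end the session in that case.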
Multimodal Input (Optional)
Beyond voice, the system supports optional visual context:
- Camera Mode: Streams webcam frames (1 fps) for visual context in conversations
- Screen Mode: Captures screen content for discussing on-screen information
- Images are resized and compressed before transmission
- Visual context enhances the AI's ability to provide relevant responses
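Two small pieces of logic sit behind this: limiting how often a frame is sent, and shrinking each frame before transmission. The sketch below shows both without an imaging library; the 1024-pixel limit and the names `FrameThrottle` and `fit_within` are assumptions, and the actual resize and compression would be done by whatever image library the backend uses.

```python
from typing import Optional, Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale dimensions down so the longest side is at most `max_side`, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


class FrameThrottle:
    """Lets a frame through at most `fps` times per second."""

    def __init__(self, fps: float = 1.0) -> None:
        self.interval = 1.0 / fps
        self._last: Optional[float] = None

    def should_send(self, now: float) -> bool:
        # `now` is a timestamp in seconds, e.g. from time.monotonic().
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False
```

For example, a 1920x1080 capture limited to 1024 pixels on its longest side becomes 1024x576.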
Frontend Interface
- Session Control: Start/stop listening with clear status indicators
- Status Display: Real-time connection and session state (idle, connecting, active, error)
- Theme Support: Light/dark mode with persistence
- Guided Walkthrough: Step-by-step demo for first-time users
- WebSocket Management: Automatic reconnection logic
AI Model Configuration
- Native audio modality (no separate STT/TTS pipeline)
- Configurable voice selection from multiple preset voices
- System instructions defining assistant personality, response style, and language handling
- Tool definitions for all available functions with parameter schemas
- Automatic language detection with same-language response
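These settings are typically gathered into a single session configuration. The sketch below builds one as a plain dictionary. The key names follow our understanding of the dict-style Live API config accepted by Google's Python SDK and should be checked against the current documentation; the voice name, instruction text, and `log_missed_meal` declaration are illustrative placeholders.

```python
from typing import Any, Dict, List


def build_live_config(voice_name: str, system_instruction: str,
                      declarations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a session config; key names are assumed, verify against the SDK docs."""
    return {
        "response_modalities": ["AUDIO"],  # native audio out, no separate TTS stage
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}}
        },
        "system_instruction": system_instruction,
        "tools": [{"function_declarations": declarations}],
    }


# A tool definition: name, description, and a JSON-schema-style parameter block.
missed_meal_tool = {
    "name": "log_missed_meal",
    "description": "Redistribute a missed meal's macros across the remaining meals.",
    "parameters": {
        "type": "object",
        "properties": {"meal": {"type": "string"}},
        "required": ["meal"],
    },
}

config = build_live_config(
    voice_name="Kore",  # placeholder preset voice
    system_instruction="Reply briefly, in the language the user speaks.",
    declarations=[missed_meal_tool],
)
```

Language handling in this sketch lives entirely in the system instruction, which matches the description above of same-language responses being driven by instructions rather than by separate detection code.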
Key Features
- Sub-Second Latency: Native audio model eliminates STT/TTS pipeline overhead
- Real-Time Bidirectional Audio: Continuous streaming with under 50 ms per-chunk latency
- Function Calling: Domain-specific calculations executed mid-conversation
- Natural Interruption: Users can interrupt the assistant naturally without special commands
- Multi-Language: Automatic language detection with same-language responses
- Multimodal Input: Optional camera and screen context for visual understanding
- Session Management: Session lifecycle control with locking and resource cleanup
- Macro Calculations: Dynamic nutritional adjustments with per-food macro profiles
- Error Recovery: Graceful handling of function failures and network interruptions
- Extensible: New functions are added by defining a schema and a handler, with no architecture changes
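The extensibility claim can be illustrated with a registry in which each function's schema and handler are declared together. This is a sketch of one possible design, not the project's actual mechanism; the `tool` decorator, the `TOOLS` registry, and the example function with its caller-supplied burn rate are all hypothetical.

```python
from typing import Any, Callable, Dict

TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(name: str, description: str, parameters: Dict[str, Any]) -> Callable:
    """Register a handler together with the schema the model sees."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = {
            "declaration": {"name": name, "description": description,
                            "parameters": parameters},
            "handler": fn,
        }
        return fn
    return register


@tool(
    name="estimate_activity_burn",
    description="Estimate calories burned during an activity.",
    parameters={
        "type": "object",
        "properties": {"minutes": {"type": "integer"},
                       "kcal_per_minute": {"type": "number"}},
        "required": ["minutes", "kcal_per_minute"],
    },
)
def estimate_activity_burn(minutes: int, kcal_per_minute: float) -> Dict[str, float]:
    # The burn rate is supplied by the caller; no physiological constants are assumed here.
    return {"kcal_burned": round(minutes * kcal_per_minute, 1)}


# The declarations go to the model at session start; handlers are looked up by name at call time.
declarations = [entry["declaration"] for entry in TOOLS.values()]
```

Adding another function under this design means writing one more decorated function; the session setup and dispatch code do not change.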