Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming
A fitness and nutrition platform needed a voice-first AI assistant that could respond to users in real-time with natural conversation, execute domain-specific calculations (meal adjustments, calorie tracking), and speak responses back, all with sub-second latency for a truly conversational experience.
The Challenge
Building a production-grade voice AI assistant presented unique real-time engineering challenges:
- Latency: Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
- Function Calling: The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
- Audio Streaming: Bidirectional audio needed to flow continuously without buffering gaps or echo issues
- Context Awareness: The assistant needed to maintain conversation context across turns while handling interruptions
- Multi-Language: Users spoke in different languages and expected responses in the same language
- Session Isolation: Each voice session needed independent state management without cross-talk
Our Solution
We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.
Architecture
- AI Model: Gemini with native audio input/output and function calling
- Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
- Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
- Frontend: React with Vite and Tailwind CSS for session control UI
- Communication: WebSocket for low-latency JSON messaging and binary audio transport
- Multimodal: Optional camera and screen capture for visual context
Real-Time Audio Pipeline
Bidirectional Streaming
The system maintains continuous audio streams in both directions:
- Input: Microphone audio captured at 16kHz mono, chunked into small frames, and streamed to the AI model in real-time
- Output: AI-generated speech received at 24kHz and played through speakers immediately
- No Batching: Audio chunks are sent as soon as they are captured, with no accumulation delays
- Interrupt Handling: User can interrupt the assistant mid-response naturally
Audio Processing
- 16-bit PCM format for both input and output
- Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
- Small buffer sizes for minimal latency
- Continuous streaming with no start/stop gaps between turns
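To make the buffer-size trade-off concrete, the arithmetic behind a PCM frame can be sketched as follows. The sample rates and 16-bit depth come from the description above; the 20 ms frame duration, the mono output assumption, and the helper names are illustrative assumptions, not values taken from the project.

```python
# Frame-size arithmetic for 16-bit mono PCM.
# Assumption: 20 ms frames are used here only as an example duration.
INPUT_RATE_HZ = 16_000   # capture rate stated in the text
OUTPUT_RATE_HZ = 24_000  # playback rate stated in the text
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono (assumed for output as well)


def chunk_bytes(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the size in bytes of one PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_frames(pcm: bytes, frame_size: int) -> list[bytes]:
    """Split a PCM buffer into frames; the last frame may be shorter."""
    return [pcm[i:i + frame_size] for i in range(0, len(pcm), frame_size)]


if __name__ == "__main__":
    print(chunk_bytes(INPUT_RATE_HZ, 20))   # bytes per 20 ms capture frame
    print(chunk_bytes(OUTPUT_RATE_HZ, 20))  # bytes per 20 ms playback frame
```

Shorter frames reduce the time a sample waits before being sent, at the cost of more messages per second.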
Function Calling Integration
How It Works
The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:
- User speaks a request (e.g., "I missed lunch today")
- AI model processes the spoken audio and identifies the intent
- Model determines a function call is needed and sends a structured request
- Backend extracts function name, arguments, and call ID
- Local function executes the domain calculation
- Result sent back to the model as a structured response
- Model generates a natural language voice response incorporating the result
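The extract-execute-respond steps above can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions: the `FunctionCall` shape, the `register` decorator, the `log_missed_meal` handler, and the response dictionary layout are all hypothetical and do not reproduce the project's code or the Gemini SDK's types.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FunctionCall:
    """Hypothetical container for the fields extracted from a model request."""
    id: str
    name: str
    args: dict[str, Any]


# Registry mapping function names to local Python handlers.
HANDLERS: dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a handler to the registry under the given name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        HANDLERS[name] = fn
        return fn
    return decorator


@register("log_missed_meal")  # hypothetical function name
def log_missed_meal(meal: str) -> dict[str, str]:
    return {"status": "ok", "missed": meal}


def dispatch(call: FunctionCall) -> dict[str, Any]:
    """Run the requested handler and build a structured response for the model."""
    handler = HANDLERS.get(call.name)
    if handler is None:
        payload: dict[str, Any] = {"error": f"unknown function: {call.name}"}
    else:
        try:
            payload = {"result": handler(**call.args)}
        except Exception as exc:  # report failures instead of crashing the session
            payload = {"error": str(exc)}
    # The call ID is echoed back so the model can match response to request.
    return {"id": call.id, "name": call.name, "response": payload}
```

Keeping handlers in a registry means a new function only needs a schema and a decorated handler.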
Domain Functions
The system supports nutrition-focused function calling for scenarios like:
- Missed Meals: Redistributes missed macronutrients across remaining meals
- Unplanned Food: Adjusts upcoming meals to compensate for unexpected intake
- Meal Substitutions: Swaps ingredients while maintaining macro targets
- Activity Tracking: Estimates calorie burn and adjusts nutrition buffer
Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
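As an illustration of the missed-meal case, the sketch below spreads a missed meal's macros across the remaining meals and applies a small random factor. The equal split, the `jitter` parameter, the macro names, and the function name are assumptions made for the example; the actual redistribution rule and macro database are not described in detail here.

```python
import random
from typing import Optional

Macros = dict[str, float]  # e.g. {"protein": 30.0, "carbs": 60.0, "fat": 20.0}


def redistribute_missed_meal(
    missed: Macros,
    remaining: dict[str, Macros],
    jitter: float = 0.05,
    rng: Optional[random.Random] = None,
) -> dict[str, Macros]:
    """Add an equal share of the missed macros to each remaining meal.

    Each meal's share is scaled by a factor drawn from [1 - jitter, 1 + jitter]
    to mimic the slight stochastic variation mentioned above (assumed scheme).
    """
    if not remaining:
        raise ValueError("no remaining meals to absorb the missed macros")
    rng = rng or random.Random()
    share = 1.0 / len(remaining)
    adjusted: dict[str, Macros] = {}
    for meal, macros in remaining.items():
        factor = 1.0 + rng.uniform(-jitter, jitter)
        adjusted[meal] = {
            key: round(macros.get(key, 0.0) + missed.get(key, 0.0) * share * factor, 1)
            for key in macros.keys() | missed.keys()
        }
    return adjusted
```

Passing a seeded `random.Random` (or `jitter=0`) makes the output reproducible, which is useful when testing the calculation separately from the voice loop.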
Execution Safety
- Microphone input is paused during function execution to prevent overlap
- Pending audio frames are dropped to avoid stale context
- Error responses are sent back gracefully if function execution fails
- Normal streaming resumes immediately after function completion
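One way to express these safeguards with asyncio is a gate around the outgoing audio queue, sketched below. The `MicGate` class and its method names are hypothetical; the sketch assumes captured frames pass through a queue before being sent and that tool functions are synchronous.

```python
import asyncio
from typing import Any, Callable


class MicGate:
    """Pauses microphone forwarding while a tool function runs (illustrative)."""

    def __init__(self) -> None:
        self._open = asyncio.Event()
        self._open.set()                 # the mic starts unpaused
        self.outgoing = asyncio.Queue()  # frames waiting to be streamed

    def offer(self, frame: bytes) -> bool:
        """Queue a captured frame; return False if the mic is paused."""
        if not self._open.is_set():
            return False
        self.outgoing.put_nowait(frame)
        return True

    def drop_pending(self) -> int:
        """Discard queued frames so stale audio is never sent; return the count."""
        dropped = 0
        while not self.outgoing.empty():
            self.outgoing.get_nowait()
            dropped += 1
        return dropped

    async def run_tool(self, func: Callable[..., Any], *args: Any) -> dict[str, Any]:
        """Pause input, drop stale frames, run the function, then resume."""
        self._open.clear()
        self.drop_pending()
        try:
            # Run in a worker thread so the event loop keeps servicing playback.
            return {"result": await asyncio.to_thread(func, *args)}
        except Exception as exc:
            return {"error": str(exc)}   # sent back to the model as an error response
        finally:
            self._open.set()             # resume streaming immediately
```

The `finally` clause guarantees the microphone is re-enabled even when the function raises.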
Backend Architecture
FastAPI WebSocket Server
- Single WebSocket endpoint for all client communication
- Session lifecycle management (start, stop, ping/pong health checks)
- One active session at a time with session locking
- CORS middleware for development environments
- Health check endpoint for monitoring
Session Management
- Sessions are created on client connect with mode selection (audio-only, camera, or screen)
- Background async tasks handle audio capture, processing, and playback concurrently
- Graceful disconnection with resource cleanup
- API key validation and error propagation
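The single-session rule and the cleanup on disconnect can be sketched with an `asyncio.Lock` and a list of background tasks. This is a framework-independent illustration: `SessionManager`, its methods, and the worker signature are assumed names, and the FastAPI WebSocket wiring is omitted.

```python
import asyncio
from typing import Awaitable, Callable

VALID_MODES = {"audio", "camera", "screen"}  # modes named in the text
Worker = Callable[[str], Awaitable[None]]


class SessionManager:
    """Allows one active session at a time and cleans up its tasks on stop."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._tasks: list[asyncio.Task] = []

    @property
    def active(self) -> bool:
        return self._lock.locked()

    async def start(self, mode: str, workers: list[Worker]) -> bool:
        """Start a session; return False if one is already running."""
        if mode not in VALID_MODES:
            raise ValueError(f"unsupported mode: {mode}")
        if self._lock.locked():
            return False
        await self._lock.acquire()
        # Capture, processing, and playback run as concurrent background tasks.
        self._tasks = [asyncio.create_task(worker(mode)) for worker in workers]
        return True

    async def stop(self) -> None:
        """Cancel background tasks, wait for them to finish, release the lock."""
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
        self._tasks = []
        if self._lock.locked():
            self._lock.release()
```

In a WebSocket handler, `stop()` would typically be called from a `finally` block so that a dropped connection still releases the session.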
Multimodal Input (Optional)
Beyond voice, the system supports optional visual context:
- Camera Mode: Streams webcam frames (1fps) for visual context in conversations
- Screen Mode: Captures screen content for discussing on-screen information
- Images are resized and compressed before transmission
- Visual context enhances the AI's ability to provide relevant responses
Frontend Interface
- Session Control: Start/stop listening with clear status indicators
- Status Display: Real-time connection and session state (idle, connecting, active, error)
- Theme Support: Light/dark mode with persistence
- Guided Walkthrough: Step-by-step demo for first-time users
- WebSocket Management: Automatic reconnection logic
AI Model Configuration
- Native audio modality (no separate STT/TTS pipeline)
- Configurable voice selection from multiple preset voices
- System instructions defining assistant personality, response style, and language handling
- Tool definitions for all available functions with parameter schemas
- Automatic language detection with same-language response
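A configuration along these lines might be assembled as a plain dictionary, as sketched below. The key names follow the general shape of the Gemini Live configuration as commonly documented, but they are an assumption here and should be checked against the current SDK reference. The helper names, the `adjust_for_missed_meal` tool, the voice name, and the instruction text are illustrative only.

```python
from typing import Any


def build_tool_declaration(
    name: str, description: str, properties: dict[str, Any], required: list[str]
) -> dict[str, Any]:
    """Describe one callable function with a JSON-schema-style parameter block."""
    return {
        "name": name,
        "description": description,
        "parameters": {"type": "OBJECT", "properties": properties, "required": required},
    }


def build_live_config(
    voice_name: str, system_instruction: str, declarations: list[dict[str, Any]]
) -> dict[str, Any]:
    """Assemble a session config dict (key names are assumed, not verified)."""
    return {
        "response_modalities": ["AUDIO"],  # native audio output, no separate TTS
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}}
        },
        "system_instruction": system_instruction,
        "tools": [{"function_declarations": declarations}],
    }


# Hypothetical tool and settings, for illustration only.
missed_meal_tool = build_tool_declaration(
    name="adjust_for_missed_meal",
    description="Redistribute a missed meal's macros across the remaining meals.",
    properties={"meal": {"type": "STRING", "description": "Name of the missed meal"}},
    required=["meal"],
)
config = build_live_config(
    voice_name="Puck",
    system_instruction="You are a concise nutrition assistant. Reply in the user's language.",
    declarations=[missed_meal_tool],
)
```

Building the config from small helpers keeps each tool's schema next to its handler.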
Key Features
- Sub-Second Latency: Native audio model eliminates STT/TTS pipeline overhead
- Real-Time Bidirectional Audio: Continuous streaming with under 50 ms per-chunk latency
- Function Calling: Domain-specific calculations executed mid-conversation
- Natural Interruption: Users can interrupt the assistant naturally without special commands
- Multi-Language: Automatic language detection with same-language responses
- Multimodal Input: Optional camera and screen context for visual understanding
- Session Management: Session lifecycle control with locking and resource cleanup
- Macro Calculations: Dynamic nutritional adjustments with per-food macro profiles
- Error Recovery: Graceful handling of function failures and network interruptions
- Extensible: New functions are added by defining a schema and a handler, with no architecture changes