MicrocosmWorks: Innovation and Architecture of the Digital Cosmos

© 2026 MicrocosmWorks. All rights reserved.

AI Voice Agents · Published June 18, 2026 · Updated May 25, 2026

Real-Time Voice AI Assistant with Function Calling & Bidirectional Audio Streaming

A fitness and nutrition platform needed a voice-first AI assistant that could respond to users in real-time with natural conversation, execute domain-specific calculations (meal adjustments, calorie tracking), and speak responses back, all with sub-second latency for a truly conversational experience.

Domain: AI Voice Agents
Technologies: 10
Key Results: 5
Status: Delivered

The Challenge

Building a production-grade voice AI assistant presented unique real-time engineering challenges:

  • Latency: Traditional speech-to-text → LLM → text-to-speech pipelines added 3-5 seconds of delay, breaking conversational flow
  • Function Calling: The assistant needed to execute domain logic (nutrition calculations, meal plan adjustments) mid-conversation, not just chat
  • Audio Streaming: Bidirectional audio needed to flow continuously without buffering gaps or echo issues
  • Context Awareness: The assistant needed to maintain conversation context across turns while handling interruptions
  • Multi-Language: Users spoke in different languages and expected responses in the same language
  • Session Isolation: Each voice session needed independent state management without cross-talk

Our Solution

We built a real-time voice AI assistant powered by Google's Gemini Live API with native audio capabilities, custom function calling for domain-specific calculations, and a React frontend with WebSocket-based audio streaming.

Architecture

  • AI Model: Gemini with native audio input/output and function calling
  • Backend: Python/FastAPI with WebSocket endpoint for bidirectional audio
  • Audio Pipeline: PyAudio for microphone/speaker I/O with real-time streaming
  • Frontend: React with Vite and Tailwind CSS for session control UI
  • Communication: WebSocket for low-latency JSON messaging and binary audio transport
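As a rough illustration of how one WebSocket can carry both kinds of traffic, the sketch below routes an incoming message by its type. It assumes binary messages are raw audio and text messages are JSON objects with a hypothetical `type` field (`start`, `stop`, `ping`). The function `route_message` and the field names are illustrative and are not taken from the project.

```python
import json
from typing import Union


def route_message(message: Union[str, bytes]) -> tuple[str, object]:
    """Classify one incoming WebSocket message.

    Returns a (kind, payload) pair where kind is one of
    "audio", "control", "reply", or "error".
    """
    # Assumption: binary frames always carry raw PCM audio.
    if isinstance(message, (bytes, bytearray)):
        return ("audio", bytes(message))

    # Assumption: text frames are JSON objects with a "type" field.
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return ("error", "invalid JSON")
    if not isinstance(payload, dict) or "type" not in payload:
        return ("error", "missing type")

    # Health check: answer a ping with a pong.
    if payload["type"] == "ping":
        return ("reply", json.dumps({"type": "pong"}))
    return ("control", payload)
```

Keeping audio in binary frames avoids base64 overhead, while JSON stays readable for the low-volume control messages.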
  • Multimodal: Optional camera and screen capture for visual context

Real-Time Audio Pipeline

Bidirectional Streaming

The system maintains continuous audio streams in both directions:

  • Input: Microphone audio captured at 16kHz mono, chunked into small frames, and streamed to the AI model in real-time
  • Output: AI-generated speech received at 24kHz and played through speakers immediately
  • No Batching: Audio chunks are sent as captured, with no accumulation delays
  • Interrupt Handling: User can interrupt the assistant mid-response naturally
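The two directions above can be pictured as two concurrent loops. The following is a minimal sketch using `asyncio` queues as stand-ins for the microphone, the model connection, and the speaker. The event shapes (`"audio"`, `"interrupted"`) and the function names are assumptions made for illustration, not the project's actual code.

```python
import asyncio


async def send_loop(mic_frames: asyncio.Queue, sent: list) -> None:
    """Forward each captured frame as soon as it arrives (no batching)."""
    while True:
        frame = await mic_frames.get()
        if frame is None:  # sentinel: capture stopped
            break
        sent.append(frame)  # stand-in for sending the frame to the model


async def receive_loop(model_events: asyncio.Queue, playback: asyncio.Queue) -> None:
    """Queue model audio for playback and flush it when the user interrupts."""
    while True:
        event = await model_events.get()
        if event is None:  # sentinel: session ended
            break
        kind, payload = event
        if kind == "audio":
            playback.put_nowait(payload)
        elif kind == "interrupted":
            # Drop audio that has not been played yet so the assistant stops talking.
            while not playback.empty():
                playback.get_nowait()


async def demo() -> tuple[list, list]:
    mic, events, playback = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    sent: list = []
    for frame in (b"a", b"b", None):
        mic.put_nowait(frame)
    for event in (("audio", b"x"), ("audio", b"y"), ("interrupted", None),
                  ("audio", b"z"), None):
        events.put_nowait(event)
    # Both directions run concurrently on the same event loop.
    await asyncio.gather(send_loop(mic, sent), receive_loop(events, playback))
    remaining = []
    while not playback.empty():
        remaining.append(playback.get_nowait())
    return sent, remaining
```

In this demo, the audio queued before the interruption is discarded and only the audio that arrives afterwards remains for playback.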

Audio Processing

  • 16-bit PCM format for both input and output
  • Separate sample rates optimized for speech (16kHz capture, 24kHz playback)
  • Small buffer sizes for minimal latency
  • Continuous streaming with no start/stop gaps between turns
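To make the buffer sizes concrete, here is a small worked calculation. The 20 ms frame duration is an assumption chosen for illustration; the source only says frames are small.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def frame_bytes(sample_rate_hz: int, frame_ms: int, channels: int = 1) -> int:
    """Size in bytes of one PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * channels * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, size: int) -> list[bytes]:
    """Split a PCM buffer into fixed-size frames (the last one may be shorter)."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


capture_frame = frame_bytes(16_000, 20)   # 320 samples -> 640 bytes
playback_frame = frame_bytes(24_000, 20)  # 480 samples -> 960 bytes
one_second = bytes(16_000 * BYTES_PER_SAMPLE)  # 1 s of silent 16 kHz mono audio
frames = split_frames(one_second, capture_frame)  # 50 frames of 640 bytes
```

Smaller frames lower the delay before the first byte leaves the device, at the cost of more messages per second.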

Function Calling Integration

How It Works

The AI model can invoke local Python functions mid-conversation when domain-specific calculations are needed:

  1. User speaks a request (e.g., "I missed lunch today")
  2. AI model transcribes and understands the intent
  3. Model determines a function call is needed and sends a structured request
  4. Backend extracts function name, arguments, and call ID
  5. Local function executes the domain calculation
  6. Result sent back to the model as a structured response
  7. Model generates a natural language voice response incorporating the result
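Steps 4 to 6 can be sketched as a small dispatcher. The request and response shapes below (`name`, `args`, `id`, `response`) are assumptions for illustration, and `redistribute_missed_meal` is a hypothetical placeholder handler. The real message format depends on the model SDK in use.

```python
from typing import Any, Callable


def redistribute_missed_meal(meal: str) -> dict[str, Any]:
    """Hypothetical handler; the real domain calculation would go here."""
    return {"meal": meal, "status": "redistributed"}


# Maps function names the model may request to local Python handlers.
HANDLERS: dict[str, Callable[..., dict[str, Any]]] = {
    "redistribute_missed_meal": redistribute_missed_meal,
}


def handle_function_call(call: dict[str, Any]) -> dict[str, Any]:
    """Extract name, arguments, and call ID, run the handler, and wrap the result."""
    name, args, call_id = call["name"], call.get("args", {}), call["id"]
    handler = HANDLERS.get(name)
    if handler is None:
        return {"id": call_id, "name": name,
                "response": {"error": f"unknown function: {name}"}}
    try:
        result = handler(**args)
    except Exception as exc:  # report failures to the model instead of crashing
        return {"id": call_id, "name": name, "response": {"error": str(exc)}}
    return {"id": call_id, "name": name, "response": {"result": result}}
```

Echoing the call ID back lets the model match each response to the request that triggered it.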

Domain Functions

The system supports nutrition-focused function calling for scenarios like:

  • Missed Meals: Redistributes missed macronutrients across remaining meals
  • Unplanned Food: Adjusts upcoming meals to compensate for unexpected intake
  • Meal Substitutions: Swaps ingredients while maintaining macro targets
  • Activity Tracking: Estimates calorie burn and adjusts nutrition buffer

Each function uses a macro database with per-food nutritional profiles and performs dynamic calculations with slight stochastic variation for natural-feeling responses.
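One plausible shape for the missed-meal calculation is shown below. This is a sketch under stated assumptions: the missed macros are split evenly across the remaining meals, and the optional `jitter` parameter stands in for the stochastic variation mentioned above. The function name, the even split, and all numbers are illustrative and do not come from the project.

```python
import random
from typing import Optional

Macros = dict[str, float]


def redistribute_missed_meal(
    missed: Macros,
    remaining: dict[str, Macros],
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> dict[str, Macros]:
    """Spread the macros of a missed meal evenly over the remaining meals.

    With jitter > 0, each share is scaled by a random factor in
    [1 - jitter, 1 + jitter], so totals are only approximately preserved.
    """
    if not remaining:
        raise ValueError("no remaining meals to absorb the missed macros")
    rng = rng or random.Random()
    adjusted: dict[str, Macros] = {}
    for meal, macros in remaining.items():
        adjusted[meal] = {}
        for key, value in macros.items():
            share = missed.get(key, 0.0) / len(remaining)
            factor = 1.0 if jitter == 0 else 1.0 + rng.uniform(-jitter, jitter)
            adjusted[meal][key] = value + share * factor
    return adjusted


# Illustrative values only.
missed_lunch = {"protein_g": 30.0, "carbs_g": 60.0, "fat_g": 15.0}
plan = {
    "snack": {"protein_g": 10.0, "carbs_g": 20.0, "fat_g": 5.0},
    "dinner": {"protein_g": 40.0, "carbs_g": 50.0, "fat_g": 20.0},
}
adjusted = redistribute_missed_meal(missed_lunch, plan)
```

Passing a seeded `random.Random` makes the jittered variant reproducible in tests.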

Execution Safety

  • Microphone input is paused during function execution to prevent overlap
  • Pending audio frames are dropped to avoid stale context
  • Error responses are sent back gracefully if function execution fails
  • Normal streaming resumes immediately after function completion
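The four safeguards above could be combined roughly as follows. `MicGate` and `run_tool_safely` are hypothetical names, and the queue is a stand-in for the real microphone buffer. The sketch assumes Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
from typing import Any, Callable


class MicGate:
    """Holds captured frames and can be paused while a function runs."""

    def __init__(self) -> None:
        self.frames: asyncio.Queue = asyncio.Queue()
        self.paused = False

    def push(self, frame: bytes) -> None:
        """Called by the capture task; frames captured while paused are discarded."""
        if not self.paused:
            self.frames.put_nowait(frame)

    def drop_pending(self) -> int:
        """Discard frames that were queued but not yet sent; return the count."""
        dropped = 0
        while not self.frames.empty():
            self.frames.get_nowait()
            dropped += 1
        return dropped


async def run_tool_safely(gate: MicGate, func: Callable[..., Any], *args: Any) -> dict:
    """Pause the mic, drop stale frames, run the function, then resume."""
    gate.paused = True
    gate.drop_pending()
    try:
        # Run in a worker thread so the event loop keeps servicing other tasks.
        result = await asyncio.to_thread(func, *args)
        return {"result": result}
    except Exception as exc:
        return {"error": str(exc)}  # sent back to the model instead of raising
    finally:
        gate.paused = False  # streaming resumes even if the function failed


async def demo() -> tuple:
    gate = MicGate()  # created inside the running loop
    gate.push(b"stale-1")
    gate.push(b"stale-2")
    ok = await run_tool_safely(gate, lambda x: x * 2, 21)
    failed = await run_tool_safely(gate, lambda: 1 / 0)
    gate.push(b"fresh")
    return ok, failed, gate.frames.qsize(), gate.paused
```

The `finally` clause is the important design choice: the microphone is reopened on every path, so a failing function cannot leave the session muted.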

Backend Architecture

FastAPI WebSocket Server

  • Single WebSocket endpoint for all client communication
  • Session lifecycle management (start, stop, ping/pong health checks)
  • One active session at a time with session locking
  • CORS middleware for development environments
  • Health check endpoint for monitoring

Session Management

  • Sessions are created on client connect with mode selection (audio-only, camera, or screen)
  • Background async tasks handle audio capture, processing, and playback concurrently
  • Graceful disconnection with resource cleanup
  • API key validation and error propagation
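The session rules in the two lists above (one active session, mode selection, background tasks, cleanup on disconnect) might look like the following in plain `asyncio`. This is a simplified sketch: `SessionManager`, the mode names, and the placeholder worker are assumptions, and the FastAPI endpoint that would call it is omitted.

```python
import asyncio
from typing import Optional

VALID_MODES = {"audio", "camera", "screen"}


class SessionManager:
    """Allows a single active session and cleans up its background tasks."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._active_id: Optional[str] = None
        self._tasks: list[asyncio.Task] = []

    async def start(self, session_id: str, mode: str) -> None:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode}")
        async with self._lock:  # guards the check-and-set below
            if self._active_id is not None:
                raise RuntimeError("a session is already active")
            self._active_id = session_id
            self._tasks = [
                asyncio.create_task(self._worker(name))
                for name in ("capture", "processing", "playback")
            ]

    async def _worker(self, name: str) -> None:
        """Placeholder for a capture, processing, or playback loop."""
        while True:
            await asyncio.sleep(0.01)

    async def stop(self) -> None:
        async with self._lock:
            for task in self._tasks:
                task.cancel()
            # Wait for cancellation to finish so no task outlives the session.
            await asyncio.gather(*self._tasks, return_exceptions=True)
            self._tasks = []
            self._active_id = None


async def demo() -> list[str]:
    manager = SessionManager()  # created inside the running loop
    log = []
    await manager.start("s1", "audio")
    log.append("s1 started")
    try:
        await manager.start("s2", "camera")
    except RuntimeError:
        log.append("s2 rejected")
    await manager.stop()
    log.append("s1 stopped")
    await manager.start("s2", "camera")
    log.append("s2 started")
    await manager.stop()
    return log
```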

Multimodal Input (Optional)

Beyond voice, the system supports optional visual context:

  • Camera Mode: Streams webcam frames (1fps) for visual context in conversations
  • Screen Mode: Captures screen content for discussing on-screen information
  • Images are resized and compressed before transmission
  • Visual context enhances the AI's ability to provide relevant responses
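The frame rate limit and the resize step can be sketched without any imaging library. The 1024-pixel limit is an assumed value, and `fit_within` and `FrameThrottle` are hypothetical helpers. The actual resizing and compression would be done with an imaging library such as Pillow or OpenCV, both listed in the technology stack.

```python
from typing import Optional


def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale dimensions down so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return (max(1, round(width * scale)), max(1, round(height * scale)))


class FrameThrottle:
    """Lets at most one frame through per interval (1 second gives 1 fps)."""

    def __init__(self, min_interval_s: float = 1.0) -> None:
        self.min_interval_s = min_interval_s
        self._last_sent: Optional[float] = None

    def should_send(self, now_s: float) -> bool:
        if self._last_sent is None or now_s - self._last_sent >= self.min_interval_s:
            self._last_sent = now_s
            return True
        return False


throttle = FrameThrottle()
decisions = [throttle.should_send(t) for t in (0.0, 0.4, 1.0, 1.5, 2.1)]
# decisions == [True, False, True, False, True]
```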

Frontend Interface

  • Session Control: Start/stop listening with clear status indicators
  • Status Display: Real-time connection and session state (idle, connecting, active, error)
  • Theme Support: Light/dark mode with persistence
  • Guided Walkthrough: Step-by-step demo for first-time users
  • WebSocket Management: Automatic reconnection logic

AI Model Configuration

  • Native audio modality (no separate STT/TTS pipeline)
  • Configurable voice selection from multiple preset voices
  • System instructions defining assistant personality, response style, and language handling
  • Tool definitions for all available functions with parameter schemas
  • Automatic language detection with same-language response

Key Features

  1. Sub-Second Latency: Native audio model eliminates STT/TTS pipeline overhead
  2. Real-Time Bidirectional Audio: Continuous streaming with < 50ms per-chunk latency
  3. Function Calling: Domain-specific calculations executed mid-conversation
  4. Natural Interruption: Users can interrupt the assistant naturally without special commands
  5. Multi-Language: Automatic language detection with same-language responses
  6. Multimodal Input: Optional camera and screen context for visual understanding
  7. Session Management: Session lifecycle control with locking and resource cleanup
  8. Macro Calculations: Dynamic nutritional adjustments with per-food macro profiles
  9. Error Recovery: Graceful handling of function failures and network interruptions
  10. Extensible: New functions are added by defining a schema and handler, with no architecture changes

Results

First Response Latency: 500-1200ms (vs. 3-5s for traditional STT→LLM→TTS pipelines)
Session Start Time: ~200ms
Audio Streaming Latency: < 50ms per chunk (real-time)
Function Execution: Domain calculations completed within the conversation flow
User Experience: Natural conversational feel with interrupt support

Technology Stack

Google Gemini Live API, Python, FastAPI, WebSocket, PyAudio, React, Vite, Tailwind CSS, OpenCV, Pillow

Frequently Asked Questions

MicrocosmWorks engineered a bidirectional WebSocket audio pipeline that streams user speech to the ASR engine in real-time chunks, begins LLM inference before the user finishes speaking using streaming transcription, and starts text-to-speech synthesis on the first tokens of the response. This pipelining approach achieves response latencies under 800ms from end-of-speech to first audio output, which users perceive as natural conversational turn-taking.

MicrocosmWorks integrated structured function calling where the LLM can invoke predefined APIs like booking appointments, querying databases, or triggering workflows based on the conversation context, with the results spoken back to the caller naturally. The system includes confirmation flows for high-stakes actions like payments or cancellations, where the assistant verbally confirms the details and waits for the caller's explicit approval before executing.

Yes, MicrocosmWorks implemented barge-in detection that allows callers to interrupt the assistant mid-response, immediately stopping audio playback and processing the new utterance. The ASR pipeline includes noise cancellation preprocessing and supports models fine-tuned on diverse accents, achieving over 90% transcription accuracy in noisy environments typical of phone calls from cars, offices, or public spaces.

MicrocosmWorks built the voice assistant with SIP trunk integration and Twilio connectivity, supporting deployment on existing business phone numbers, IVR systems, and contact center platforms without requiring callers to install any app or use a special interface. The platform handles call routing, queue management, and warm transfers to human agents when the AI determines a conversation requires human expertise.

MicrocosmWorks develops custom voice AI assistants at rates between $30-$50/hr, and while the upfront build cost exceeds managed platform setup fees, a custom solution avoids the per-minute usage charges that platforms like Dialogflow CX or Amazon Lex impose, which become significant at high call volumes. Custom builds also give you full control over the LLM, voice persona, and function calling logic, which managed platforms constrain with rigid dialog flow paradigms.