MicrocosmWorks: Digital Cosmos Innovation and Design
Video Analysis · Published June 18, 2026 · Updated May 25, 2026

AI-Powered Active Speaker Detection for Multi-Camera Video Production

A media production company handling multi-camera interview and panel discussion shoots needed an automated way to identify who is speaking at any given moment across complex video footage.

Domain: Video Analysis · Technologies: 11 · Key Results: 4 · Status: Delivered

The Challenge

Producing multi-camera content (interviews, podcasts, panel discussions) required editors to manually scrub through hours of footage to identify active speakers and create cuts. This process was:

  • Extremely time-consuming (10-15x real-time for manual review)
  • Prone to human error in speaker attribution
  • A bottleneck preventing rapid content turnaround

Our Solution

We built an AI-powered video analysis platform with a deep learning pipeline that automatically detects active speakers by fusing audio and visual signals.

Architecture

  • Backend: Python/Flask REST API with MongoDB and Redis
  • ML Pipeline: TalkNet audio-visual fusion model, YOLOv8 Nano for face detection, OpenAI Whisper for transcription
  • GPU Optimization: PyTorch with CUDA, frame decimation for 3x speedup, batch processing
  • Infrastructure: Multi-instance deployment with distributed MongoDB-based locking
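The frame decimation mentioned under GPU Optimization can be sketched as follows. This is a minimal illustration, not the production code: the stride of 3 is an assumption suggested by the 3x figure, and the helper names `decimate_indices` and `fill_skipped` are hypothetical. The case study does not say how skipped frames are handled; here they simply reuse the most recent detection.

```python
from typing import Dict, List, Optional, TypeVar

D = TypeVar("D")  # whatever the detector returns for one frame


def decimate_indices(num_frames: int, stride: int = 3) -> List[int]:
    """Indices of the frames that are actually sent to the face detector."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


def fill_skipped(num_frames: int, detected: Dict[int, D]) -> List[Optional[D]]:
    """Give every frame a detection by reusing the most recent one.

    Assumption: carry-forward is one simple option; interpolating boxes
    between detected frames would be another.
    """
    filled: List[Optional[D]] = []
    last: Optional[D] = None
    for i in range(num_frames):
        if i in detected:
            last = detected[i]
        filled.append(last)
    return filled


if __name__ == "__main__":
    print(decimate_indices(10))  # every third frame index
```

With a stride of 3 the detector runs on roughly one third of the frames, which is where a speedup of this order would come from if detection dominates the runtime.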

Processing Pipeline

  1. Media Extraction - Video download and audio/video separation
  2. Scene Detection - Content-based boundary detection via PySceneDetect
  3. Face Detection - YOLOv8 Nano face detection with frame decimation
  4. Face Tracking - IoU-based linking across frames
  5. TalkNet Inference - Audio-visual fusion with multi-duration scoring (1s, 2s, 4s, 6s windows)
  6. Transcription - Whisper-based speech-to-text with word-level timestamps
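Step 4, IoU-based linking, can be illustrated with a small greedy tracker. The sketch below is an assumption about how such linking might look, not the delivered implementation: the 0.5 threshold, the `(x1, y1, x2, y2)` box layout, and the function names are all illustrative.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = List[Tuple[int, Box]]            # (frame index, box) pairs


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames: List[List[Box]], iou_threshold: float = 0.5) -> List[Track]:
    """Greedily attach each detection to the track it overlaps most.

    Only tracks whose last box comes from the previous frame are candidates,
    so a track ends as soon as its face is missed for one frame.
    """
    tracks: List[Track] = []
    for frame_idx, boxes in enumerate(frames):
        for box in boxes:
            best_track, best_iou = None, iou_threshold
            for track in tracks:
                last_frame, last_box = track[-1]
                if last_frame != frame_idx - 1:
                    continue  # ended earlier, or already extended this frame
                overlap = iou(last_box, box)
                if overlap >= best_iou:
                    best_track, best_iou = track, overlap
            if best_track is None:
                tracks.append([(frame_idx, box)])
            else:
                best_track.append((frame_idx, box))
    return tracks
```

Greedy matching is the simplest choice; a production tracker might instead solve the assignment globally per frame or tolerate a few missed frames before closing a track.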

Key Features

  • Active speaker detection with cross-modal attention (lip movements + audio)
  • Multi-duration confidence scoring for robust speaker identification
  • Automatic transcription with word-level timestamps
  • Background job scheduling with cancellation support
  • Performance monitoring and GPU memory management
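Background jobs with cancellation support could be structured along these lines. This is a hedged sketch using only the Python standard library; the `CancellableJob` class and its status strings are hypothetical and the case study does not describe the actual scheduler. Cancellation here is cooperative: the flag is checked between pipeline stages, so a stage that is already running finishes first.

```python
import threading
from typing import Callable, List, Optional


class CancellableJob:
    """Runs pipeline stages in a background thread; can be cancelled between stages."""

    def __init__(self, stages: List[Callable[[], None]]) -> None:
        self._stages = stages
        self._cancel = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.completed = 0       # number of stages that finished
        self.status = "pending"  # pending -> running -> done | cancelled

    def start(self) -> None:
        self.status = "running"
        self._thread.start()

    def cancel(self) -> None:
        """Request cancellation; takes effect before the next stage starts."""
        self._cancel.set()

    def join(self, timeout: Optional[float] = None) -> None:
        self._thread.join(timeout)

    def _run(self) -> None:
        for stage in self._stages:
            if self._cancel.is_set():
                self.status = "cancelled"
                return
            stage()
            self.completed += 1
        self.status = "done"
```

A caller would build the job from the six pipeline stages, call `start()`, and call `cancel()` from an API handler when the user aborts the analysis.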

Results

Processing Speed: A 30-minute video is analyzed in 10-15 minutes (roughly 2-3x faster than real time) on a GPU with 12 GB or more of memory
Accuracy: High-confidence speaker attribution via multi-duration scoring
Scalability: Distributed architecture supporting horizontal scaling across servers
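The multi-duration scoring cited under Accuracy (step 5 of the pipeline, with 1s, 2s, 4s, and 6s windows) might work roughly as sketched below. The idea shown is to score each face track in chunks of several lengths and average the per-frame results, so that no single window size dominates. The `score_window` callback is a hypothetical stand-in for the audio-visual model; how the real system combines the windows is not stated in the case study.

```python
from statistics import mean
from typing import Callable, List, Sequence

WINDOWS_S = (1, 2, 4, 6)  # window lengths in seconds, from the pipeline description


def multi_duration_scores(
    num_frames: int,
    fps: float,
    score_window: Callable[[int, int], List[float]],
    windows_s: Sequence[float] = WINDOWS_S,
) -> List[float]:
    """Average per-frame speaking scores over several chunk lengths.

    score_window(start, end) is assumed to return one score per frame
    in the half-open range [start, end).
    """
    per_duration: List[List[float]] = []
    for seconds in windows_s:
        chunk = max(1, int(seconds * fps))
        scores: List[float] = []
        for start in range(0, num_frames, chunk):
            end = min(start + chunk, num_frames)
            scores.extend(score_window(start, end))
        per_duration.append(scores)
    # zip(*...) groups the scores that different window lengths gave the same frame
    return [mean(frame_scores) for frame_scores in zip(*per_duration)]
```

Short windows react quickly to speaker changes while long windows are steadier, so averaging them is one plausible way to obtain the robust confidence the results describe.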

Tech Stack

Python, Flask, PyTorch, TalkNet, YOLOv8, OpenAI Whisper, MongoDB, Redis, FFmpeg, PySceneDetect, CUDA

Frequently Asked Questions

MicrocosmWorks developed a multimodal fusion model that correlates lip movement visual features extracted from each camera feed with the audio signal using cross-attention layers. The model outputs per-frame speaker probability scores for each visible face, achieving 94% accuracy even when multiple participants speak simultaneously.

MicrocosmWorks optimized the inference pipeline to run on NVIDIA T4 GPUs with TensorRT acceleration, achieving under 150ms end-to-end latency from frame capture to speaker identification. This latency is well within the acceptable range for live production switching, where typical cut delays are 300-500ms.

MicrocosmWorks trained the model on diverse occlusion scenarios and implemented a temporal smoothing algorithm that maintains speaker tracking through brief occlusions using audio-only confidence scores. When visual confidence drops below a threshold, the system falls back to audio source localization using beamforming data from multi-microphone arrays.
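The occlusion handling described above can be approximated in a few lines. This sketch makes several assumptions that the text does not confirm: the visual-confidence threshold of 0.5 is arbitrary, an exponential moving average stands in for the unnamed temporal smoothing algorithm, and beamforming is not modelled (the audio-only score is simply taken as an input).

```python
from typing import List, Optional, Sequence


def fuse_with_fallback(
    visual_conf: Sequence[float],
    av_score: Sequence[float],
    audio_score: Sequence[float],
    visual_threshold: float = 0.5,  # assumed value
) -> List[float]:
    """Per frame: use the audio-visual score if the face is clearly visible,
    otherwise fall back to the audio-only score."""
    return [
        av if vc >= visual_threshold else au
        for vc, av, au in zip(visual_conf, av_score, audio_score)
    ]


def smooth(scores: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average, so brief dips do not flip the active speaker."""
    out: List[float] = []
    prev: Optional[float] = None
    for s in scores:
        prev = s if prev is None else alpha * s + (1 - alpha) * prev
        out.append(prev)
    return out
```

Applying `smooth` to the output of `fuse_with_fallback` gives a score sequence that keeps following a speaker through a short occlusion instead of dropping to zero.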

MicrocosmWorks built a companion control module that translates speaker detection outputs into standard tally/control signals compatible with Blackmagic ATEM via the ATEM SDK and NewTek NDI for TriCaster systems. Production directors can set the system to auto-switch or advisory mode where it suggests cuts without executing them.

MicrocosmWorks builds custom AI video analysis systems at rates of $30-$50/hr, with a multi-camera active speaker detection system including model training, TensorRT optimization, and switcher integration typically requiring 500-750 development hours. The model training phase requires GPU compute resources that usually add $2,000-$5,000 to the project cost.
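Taken together, these figures imply an overall budget range. The short calculation below only combines the numbers quoted above (hours, hourly rate, GPU compute); it is an illustrative estimate and not a quote, and `cost_range` is a hypothetical helper.

```python
from typing import Tuple

Range = Tuple[int, int]  # (low, high)


def cost_range(hours: Range, rate_usd: Range, gpu_usd: Range) -> Range:
    """Lowest and highest total: development hours times rate, plus GPU compute."""
    low = hours[0] * rate_usd[0] + gpu_usd[0]
    high = hours[1] * rate_usd[1] + gpu_usd[1]
    return low, high


low, high = cost_range(hours=(500, 750), rate_usd=(30, 50), gpu_usd=(2000, 5000))
print(f"${low:,} - ${high:,}")  # prints "$17,000 - $42,500"
```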

Efficiency: 3x speedup through frame decimation optimization