AI-Powered Active Speaker Detection for Multi-Camera Video Production
A video production company that shoots multi-camera interviews and panel discussions needed a way to automatically identify who is speaking at any given moment in complex footage.
The Challenge
When producing multi-camera content (interviews, podcasts, panel discussions), editors had to scrub through hours of footage by hand to identify the speaker and create cuts. This process had the following problems:
- Extremely time-consuming (manual review takes 10 to 15 times real time)
- Prone to human error in speaker identification
- A bottleneck that slows down fast content production
Our Solution
We built an AI-powered video analysis platform with a deep learning pipeline that automatically detects the active speaker by fusing audio and visual signals.
Architecture
- Backend: Python/Flask REST API with MongoDB and Redis
- ML Pipeline: TalkNet audio-visual fusion model, YOLOv8 Nano for face detection, OpenAI Whisper for transcription
- GPU Optimization: PyTorch with CUDA, frame skipping for a 3x speedup, batch processing
- Infrastructure: Multi-instance deployment using distributed MongoDB-based locks
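As a rough illustration of the frame skipping and batch processing listed under GPU Optimization, the sketch below keeps every third frame, groups the kept frames into batches, and carries each result forward to the frames that were skipped. The stride of 3 is an assumption chosen to match the quoted 3x figure, and the function names and carry-forward strategy are hypothetical rather than taken from the production code.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def sample_frames(frames: Sequence[T], stride: int = 3) -> List[Tuple[int, T]]:
    """Keep every `stride`-th frame together with its original index."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [(i, frames[i]) for i in range(0, len(frames), stride)]


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches so a detector can process several frames per call."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def fill_skipped(sampled: Sequence[Tuple[int, R]], total_frames: int) -> List[R]:
    """Reuse the most recent sampled result for every skipped frame.

    Assumes `sampled` is sorted by index and starts at frame 0.
    """
    results: List[R] = []
    position = 0
    for frame_index in range(total_frames):
        # Advance to the latest sampled result at or before this frame.
        while position + 1 < len(sampled) and sampled[position + 1][0] <= frame_index:
            position += 1
        results.append(sampled[position][1])
    return results
```

With a stride of 3 the detector runs on roughly one third of the frames, which is where a speedup of about 3x on the detection stage would come from under this assumption.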
Processing Pipeline
- Media Extraction - Video download and audio/video separation
- Scene Detection - Content-based boundary detection with PySceneDetect
- Face Detection - Face detection with YOLOv8 Nano using frame skipping
- Face Tracking - IoU-based linking across frames
- TalkNet Inference - Audio-visual fusion with multi-duration scoring (1 s, 2 s, 4 s, and 6 s windows)
- Transcription - Whisper-based speech recognition with word-level timestamps
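The Face Tracking step links detections across frames by intersection over union (IoU). A minimal sketch of that idea follows, assuming boxes in `(x1, y1, x2, y2)` form, greedy matching against the previous frame only, and a threshold of 0.5. The threshold, the greedy strategy, and the names `iou` and `link_tracks` are illustrative assumptions, not the project's actual implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = List[Tuple[int, Box]]            # list of (frame_index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames: List[List[Box]], iou_threshold: float = 0.5) -> List[Track]:
    """Greedily extend each track with the best-overlapping box in the next frame."""
    tracks: List[Track] = []
    for frame_index, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_index, last_box = track[-1]
            # Only tracks that were alive in the previous frame can be extended.
            if last_index != frame_index - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda box: iou(last_box, box))
            if iou(last_box, best) >= iou_threshold:
                track.append((frame_index, best))
                unmatched.remove(best)
        # Any detection that matched no track starts a new one.
        tracks.extend([[(frame_index, box)] for box in unmatched])
    return tracks
```

Greedy matching is the simplest choice; a production tracker might instead use an optimal assignment or tolerate short gaps, which this sketch does not attempt.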
Key Features
- Active speaker detection via cross-modal attention (lip movement + audio)
- Multi-duration confidence scoring for robust speaker identification
- Automatic transcription with word-level timestamps
- Background job scheduling with cancellation support
- Performance monitoring and GPU memory management
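One way to read "multi-duration confidence scoring" is that each frame's speaking score is averaged over windows of several lengths, so that a brief spike or dropout does not flip the decision. The sketch below follows that reading with the 1 s, 2 s, 4 s, and 6 s windows mentioned in the pipeline. The centered windows, the equal weighting, the 25 fps default, and the function name are assumptions; the actual scoring scheme is not described in detail here.

```python
from typing import List, Sequence


def multi_duration_scores(
    frame_scores: Sequence[float],
    fps: float = 25.0,
    windows_s: Sequence[float] = (1.0, 2.0, 4.0, 6.0),
) -> List[float]:
    """Average each frame's score over several centered windows, then average the results.

    Assumes a higher score means the face is more likely to be speaking.
    """
    n = len(frame_scores)
    # Prefix sums make every window mean an O(1) lookup.
    prefix = [0.0]
    for score in frame_scores:
        prefix.append(prefix[-1] + score)

    combined: List[float] = []
    for i in range(n):
        window_means = []
        for window in windows_s:
            half = max(1, int(round(window * fps))) // 2
            lo = max(0, i - half)
            hi = min(n, i + half + 1)  # exclusive upper bound, clipped at the clip end
            window_means.append((prefix[hi] - prefix[lo]) / (hi - lo))
        combined.append(sum(window_means) / len(window_means))
    return combined
```

Short windows keep the result responsive to quick turn-taking, while the longer ones add stability; other weightings across windows would be equally plausible.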
Results
Tech Stack
More Case Studies
Explore our other technical implementations
Real-Time Video Object Tracking with Auto-Centering and Recovery
A video production team needed a tool that could track a selected object in video footage and automatically keep it centered in the frame as it moved. The tool had to provide smooth transitions, a choice of tracking algorithms, and automatic recovery when the tracker lost its target.
Cross-Platform Mobile Video Editing with AI-Powered Analysis
Content creators and media professionals needed a mobile-first video editing solution that could use the results of AI-driven analysis for smarter editing workflows on the go.
Frequently Asked Questions
MicrocosmWorks developed a multimodal fusion model that correlates lip movement visual features extracted from each camera feed with the audio signal using cross-attention layers. The model outputs per-frame speaker probability scores for each visible face, achieving 94% accuracy even when multiple participants speak simultaneously.
MicrocosmWorks optimized the inference pipeline to run on NVIDIA T4 GPUs with TensorRT acceleration, achieving under 150ms end-to-end latency from frame capture to speaker identification. This latency is well within the acceptable range for live production switching, where typical cut delays are 300-500ms.
MicrocosmWorks trained the model on diverse occlusion scenarios and implemented a temporal smoothing algorithm that maintains speaker tracking through brief occlusions using audio-only confidence scores. When visual confidence drops below a threshold, the system falls back to audio source localization using beamforming data from multi-microphone arrays.
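The fallback behaviour described above can be sketched as follows: when visual confidence for a frame drops below a threshold, the audio-only score is used in place of the visual score, and the resulting sequence is smoothed over time. Exponential smoothing, the threshold of 0.5, the `FrameEvidence` type, and the single `audio_score` field (standing in for the beamforming-based localization) are all assumptions made for illustration; the answer does not specify the actual algorithm.

```python
from typing import List, NamedTuple, Optional, Sequence


class FrameEvidence(NamedTuple):
    visual_score: float       # audio-visual model score for this face
    visual_confidence: float  # how reliable the visual evidence is (low when occluded)
    audio_score: float        # hypothetical audio-only score for this speaker


def smooth_with_audio_fallback(
    frames: Sequence[FrameEvidence],
    confidence_threshold: float = 0.5,
    alpha: float = 0.3,
) -> List[float]:
    """Pick visual or audio evidence per frame, then apply exponential smoothing."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    previous: Optional[float] = None
    for frame in frames:
        # Fall back to audio when the face is occluded or poorly detected.
        if frame.visual_confidence < confidence_threshold:
            raw = frame.audio_score
        else:
            raw = frame.visual_score
        previous = raw if previous is None else alpha * raw + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```

A smaller `alpha` holds the speaker label more firmly through a brief occlusion, at the cost of reacting more slowly to a genuine change of speaker.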
MicrocosmWorks built a companion control module that translates speaker detection outputs into standard tally/control signals compatible with Blackmagic ATEM via the ATEM SDK and NewTek NDI for TriCaster systems. Production directors can set the system to auto-switch or advisory mode where it suggests cuts without executing them.
MicrocosmWorks builds custom AI video analysis systems at rates of $30-$50/hr, with a multi-camera active speaker detection system including model training, TensorRT optimization, and switcher integration typically requiring 500-750 development hours. The model training phase requires GPU compute resources that usually add $2,000-$5,000 to the project cost.
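Taken together, the quoted figures imply an overall budget range, which the short calculation below works out. It assumes the low end pairs the fewest hours with the lowest rate and the smallest compute cost, the high end pairs the largest of each, and no other costs apply; the function name is hypothetical.

```python
from typing import Tuple


def estimate_project_cost(
    hours: Tuple[int, int] = (500, 750),
    hourly_rate_usd: Tuple[int, int] = (30, 50),
    gpu_compute_usd: Tuple[int, int] = (2_000, 5_000),
) -> Tuple[int, int]:
    """Return the (low, high) total in USD from the quoted ranges."""
    low = hours[0] * hourly_rate_usd[0] + gpu_compute_usd[0]
    high = hours[1] * hourly_rate_usd[1] + gpu_compute_usd[1]
    return low, high


low, high = estimate_project_cost()
print(f"Estimated total: ${low:,} to ${high:,}")
```

Under these assumptions, development alone spans $15,000 to $37,500, and GPU compute brings the total to between $17,000 and $42,500.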
Ready to transform your business?
Let's talk about how a similar solution could be applied to your challenges.