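Challenge
Producing multi-camera content (interviews, podcasts, panel discussions) requires editors to sift through hours of footage by hand to identify the active speaker and make cuts. This process has several problems:
- Extremely time-consuming (manual review takes 10-15x the actual running time)
- Prone to human error in speaker attribution
- A bottleneck that blocks fast content delivery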
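Our Solution
We built an AI-powered video analysis platform with a deep learning pipeline that automatically detects the active speaker by fusing audio and visual signals.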
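Architecture
- Backend: Python/Flask REST API with MongoDB and Redis
- ML pipeline: TalkNet audio-visual fusion model, YOLOv8 Nano for face detection, OpenAI Whisper for transcription
- GPU optimization: PyTorch with CUDA, a 3x speedup from frame decimation, and batch processing
- Infrastructure: multi-instance deployment with MongoDB-based distributed locking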
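Frame decimation and batching can be illustrated with a minimal sketch. It assumes a stride of 3 (the text reports a 3x speedup but does not state the stride) and that skipped frames reuse the result from the most recent sampled frame. The helper names `decimate`, `batched`, and `last_sampled_index` are hypothetical and not taken from the platform's code.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def decimate(frames: Sequence[T], stride: int = 3) -> List[Tuple[int, T]]:
    """Keep every `stride`-th frame, paired with its original index."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [(i, frames[i]) for i in range(0, len(frames), stride)]


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches so a detector can process several frames per call."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def last_sampled_index(frame_idx: int, stride: int = 3) -> int:
    """Map any frame to the most recent frame that was actually sampled (assumed policy)."""
    return frame_idx - (frame_idx % stride)
```

With a stride of 3, only about one third of the frames reach the detector, which is where a speedup of this kind would come from; the actual figure depends on how much of the runtime the detector accounts for.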
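Processing Pipeline
- Media extraction - video download and audio/video separation
- Scene detection - content-based boundary detection via PySceneDetect
- Face detection - YOLOv8 Nano face detection with frame decimation
- Face tracking - IoU-based linking across frames
- TalkNet inference - audio-visual fusion with multi-duration scoring (1 s, 2 s, 4 s, and 6 s windows)
- Transcription - Whisper-based speech-to-text with word-level timestamps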
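The IoU-based face tracking step can be sketched as a greedy linker. This is an illustration under stated assumptions, not the platform's implementation: boxes are `(x1, y1, x2, y2)` tuples, a detection may only join a track that was updated in the immediately preceding frame, and the 0.5 threshold is a placeholder value.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = List[Tuple[int, Box]]            # list of (frame index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames: List[List[Box]], iou_threshold: float = 0.5) -> List[Track]:
    """Greedily attach each detection to the best-overlapping track from the previous frame."""
    tracks: List[Track] = []
    for frame_idx, boxes in enumerate(frames):
        for box in boxes:
            best_track, best_iou = None, iou_threshold
            for track_id, track in enumerate(tracks):
                last_idx, last_box = track[-1]
                # Tracks already extended (or created) in this frame are skipped too,
                # because their last index equals frame_idx.
                if last_idx != frame_idx - 1:
                    continue
                score = iou(last_box, box)
                if score >= best_iou:
                    best_track, best_iou = track_id, score
            if best_track is None:
                tracks.append([(frame_idx, box)])  # start a new track
            else:
                tracks[best_track].append((frame_idx, box))
    return tracks
```

Greedy matching is the simplest choice; a production tracker might instead solve the assignment globally per frame or tolerate gaps of a few frames.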
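Key Features
- Active speaker detection with cross-modal attention (lip movement + audio)
- Multi-duration confidence scoring for reliable speaker identification
- Automatic transcription with word-level timestamps
- Background task scheduling with cancellation support
- Performance monitoring and GPU memory management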
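Multi-duration confidence scoring can be sketched as follows. The sketch assumes the track is split into non-overlapping windows of each duration, every window is scored independently, and the per-frame scores are averaged across durations. `score_window` is a hypothetical stand-in for the model call; it must return one score per frame in the half-open range `[start, end)`.

```python
from typing import Callable, List, Sequence


def multi_duration_scores(
    n_frames: int,
    fps: float,
    score_window: Callable[[int, int], List[float]],
    durations_s: Sequence[float] = (1, 2, 4, 6),
) -> List[float]:
    """Average per-frame scores obtained from windows of several durations."""
    totals = [0.0] * n_frames
    for duration in durations_s:
        window = max(1, int(round(duration * fps)))  # window length in frames
        for start in range(0, n_frames, window):
            end = min(start + window, n_frames)
            scores = score_window(start, end)
            if len(scores) != end - start:
                raise ValueError("score_window must return one score per frame")
            for offset, score in enumerate(scores):
                totals[start + offset] += score
    return [total / len(durations_s) for total in totals]
```

Short windows react quickly to speaker changes while long windows give the model more context, so averaging across them trades some responsiveness for stability.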
Results
Tech Stack
FAQ
MicrocosmWorks developed a multimodal fusion model that correlates lip movement visual features extracted from each camera feed with the audio signal using cross-attention layers. The model outputs per-frame speaker probability scores for each visible face, achieving 94% accuracy even when multiple participants speak simultaneously.
MicrocosmWorks optimized the inference pipeline to run on NVIDIA T4 GPUs with TensorRT acceleration, achieving under 150ms end-to-end latency from frame capture to speaker identification. This latency is well within the acceptable range for live production switching, where typical cut delays are 300-500ms.
MicrocosmWorks trained the model on diverse occlusion scenarios and implemented a temporal smoothing algorithm that maintains speaker tracking through brief occlusions using audio-only confidence scores. When visual confidence drops below a threshold, the system falls back to audio source localization using beamforming data from multi-microphone arrays.
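The fallback and smoothing behavior described above can be sketched in a few lines. The smoothing algorithm is not named in the text, so an exponential moving average is assumed here purely for illustration; the 0.5 threshold, the `alpha` value, and the function names are likewise hypothetical.

```python
from typing import List, Sequence


def select_scores(
    av_scores: Sequence[float],
    audio_scores: Sequence[float],
    visual_conf: Sequence[float],
    threshold: float = 0.5,
) -> List[float]:
    """Per frame, use the audio-visual score unless visual confidence is below the threshold."""
    if not (len(av_scores) == len(audio_scores) == len(visual_conf)):
        raise ValueError("all inputs must have the same length")
    return [
        av if conf >= threshold else audio
        for av, audio, conf in zip(av_scores, audio_scores, visual_conf)
    ]


def ema(scores: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Exponential moving average; higher alpha follows new frames more closely."""
    smoothed: List[float] = []
    previous = None
    for score in scores:
        previous = score if previous is None else alpha * score + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```

For example, `ema(select_scores(av, audio, conf))` keeps a speaker's score from collapsing during a brief occlusion, because occluded frames contribute their audio-only score and the average carries over recent history.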
MicrocosmWorks built a companion control module that translates speaker detection outputs into standard tally/control signals compatible with Blackmagic ATEM via the ATEM SDK and NewTek NDI for TriCaster systems. Production directors can set the system to auto-switch or advisory mode where it suggests cuts without executing them.
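The difference between auto-switch and advisory mode can be sketched as a small decision function. This is a hypothetical illustration only: it makes no ATEM SDK or NDI calls, and `CutDecision`, `decide_cut`, the camera names, and the 0.7 confidence floor are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CutDecision:
    camera: str
    execute: bool  # True in auto mode, False when the cut is only suggested


def decide_cut(
    speaker_probs: Dict[str, float],
    current_camera: str,
    mode: str = "advisory",
    min_confidence: float = 0.7,
) -> Optional[CutDecision]:
    """Return a cut to the most likely speaker's camera, or None if no cut is warranted."""
    if mode not in ("auto", "advisory"):
        raise ValueError("mode must be 'auto' or 'advisory'")
    if not speaker_probs:
        return None
    best = max(speaker_probs, key=speaker_probs.get)
    if best == current_camera or speaker_probs[best] < min_confidence:
        return None
    return CutDecision(camera=best, execute=(mode == "auto"))
```

A caller would pass an executed decision to whatever switcher integration is in use and display a non-executed one to the director as a suggestion.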
MicrocosmWorks builds custom AI video analysis systems at rates of $30-$50/hr. A multi-camera active speaker detection system, including model training, TensorRT optimization, and switcher integration, typically requires 500-750 development hours. The model training phase requires GPU compute resources that usually add $2,000-$5,000 to the project cost.
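Taken together, the quoted figures imply a total project range. The short calculation below combines them, assuming the low end pairs the fewest hours with the lowest rate and the smallest compute cost, and the high end pairs the largest values; the function name is hypothetical.

```python
from typing import Tuple


def project_cost_range(
    hours: Tuple[int, int] = (500, 750),
    rate_usd: Tuple[int, int] = (30, 50),
    compute_usd: Tuple[int, int] = (2_000, 5_000),
) -> Tuple[int, int]:
    """Combine development hours, hourly rate, and GPU compute into a low/high total."""
    low = hours[0] * rate_usd[0] + compute_usd[0]
    high = hours[1] * rate_usd[1] + compute_usd[1]
    return low, high


print(project_cost_range())  # (17000, 42500)
```

That is $15,000-$37,500 in development time plus $2,000-$5,000 in compute, or roughly $17,000-$42,500 overall.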
