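The Challenge
Converting horizontal video to vertical format is one of the most tedious steps in short-form content production:
- Manually cropping and repositioning the frame for every clip is slow and labor-intensive
- Multi-person conversations need the frame to be recomposed dynamically as the speaker changes
- A static center crop cuts off speakers who move or sit off-center
- Traditional face detection is too slow to make real-time reframing decisions across thousands of clips
- Different content types (interviews, solo vlogs, presentations) call for different framing strategies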
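Our Solution
We built an AI-driven face tracking and smart reframing engine that detects faces in video frames, tracks their movement, and dynamically adjusts the vertical crop region to keep the active subject centered.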
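Architecture
- Face detection: a YOLO-based face detection model optimized for speed
- Face tracking: IoU-based frame-to-frame tracking with persistent subject IDs
- Reframing engine: dynamic crop-region calculation based on face position and motion
- Active speaker association: integrates with speaker detection to prioritize the person who is talking
- Rendering: an FFmpeg crop filter chain with smooth pan transitions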
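As an illustration of IoU-based frame-to-frame tracking with persistent subject IDs, here is a minimal sketch. The box format `(x1, y1, x2, y2)`, the 0.3 matching threshold, the greedy matching strategy, and the class name `IoUTracker` are all assumptions made for this example, not details of the system described here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels (assumed format)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IoUTracker:
    """Hypothetical greedy tracker: highest-IoU pairs are matched first."""

    def __init__(self, iou_threshold: float = 0.3) -> None:
        self.iou_threshold = iou_threshold  # assumed value
        self.tracks: Dict[int, Box] = {}
        self._next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        # Score every (track, detection) pair and visit them best-first.
        pairs = sorted(
            (
                (iou(box, det), tid, j)
                for tid, box in self.tracks.items()
                for j, det in enumerate(detections)
            ),
            reverse=True,
        )
        assigned: Dict[int, Box] = {}
        used = set()
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to match
            if tid in assigned or j in used:
                continue
            assigned[tid] = detections[j]
            used.add(j)
        # Unmatched detections start new tracks with fresh IDs.
        for j, det in enumerate(detections):
            if j not in used:
                assigned[self._next_id] = det
                self._next_id += 1
        # Tracks with no match this frame are dropped in this simple sketch.
        self.tracks = assigned
        return dict(assigned)
```

A production tracker would usually keep unmatched tracks alive for a few frames instead of dropping them immediately; that is omitted here to keep the example short.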
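Reframing Pipeline
- Face detection - run YOLO face detection on sampled frames
- Subject tracking - associate face detections across frames using IoU-based tracking
- Speaker priority - when combined with active speaker detection, prioritize the subject who is speaking
- Crop calculation - determine the best 9:16 crop region from the primary subject's position
- Smoothing - apply easing to crop movement to avoid abrupt jumps
- Rendering - FFmpeg applies the dynamic crop with smooth pan transitions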
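The crop calculation and smoothing steps can be sketched as follows. This is a simplified example under stated assumptions: the crop always uses the full frame height, smoothing is an exponential moving average (one possible easing, chosen here for brevity), and the helper names and the `alpha` value are hypothetical. The only FFmpeg detail relied on is the `crop=w:h:x:y` filter syntax.

```python
from typing import List


def crop_width_9_16(frame_h: int) -> int:
    """Width of a full-height 9:16 crop, rounded down to an even number."""
    w = int(frame_h * 9 / 16)
    return w - (w % 2)


def crop_x(face_cx: float, frame_w: int, crop_w: int) -> float:
    """Left edge that centers the crop on the face, clamped to the frame."""
    x = face_cx - crop_w / 2
    return min(max(x, 0.0), float(frame_w - crop_w))


def smooth(xs: List[float], alpha: float = 0.2) -> List[float]:
    """Exponential moving average; a smaller alpha gives slower pans."""
    out: List[float] = []
    prev = None
    for x in xs:
        prev = float(x) if prev is None else prev + alpha * (x - prev)
        out.append(prev)
    return out


def crop_filter(crop_w: int, crop_h: int, x: int, y: int = 0) -> str:
    """FFmpeg crop filter string for a single, fixed crop position."""
    return f"crop={crop_w}:{crop_h}:{x}:{y}"
```

For a 1920x1080 source, `crop_width_9_16(1080)` gives 606, and a face centered at x = 960 yields `crop=606:1080:657:0`. Driving the crop position over time (the pan itself) is outside the scope of this sketch.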
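Key Features
- Multi-subject handling - tracks multiple faces and determines the primary subject for each clip
- Speaker-aware framing - prioritizes the active speaker when integrated with speaker detection
- Smooth transitions - eased pans between subjects eliminate abrupt cuts
- Content-type adaptation - different framing strategies for solo, interview, and group content
- Batch processing - reframe hundreds of clips from a single long video
- No manual intervention - fully automated from detection to final render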
Results
Tech Stack
FAQ
MicrocosmWorks implemented a hybrid tracking approach that combines a lightweight face detector running every 5th frame with a KCF (kernelized correlation filter) tracker for inter-frame predictions. When occlusion is detected via a drop in confidence score, the system maintains the last known trajectory with Kalman filtering and re-acquires the face within 200ms of it becoming visible again.
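The scheduling and occlusion logic in this answer can be illustrated with a small sketch. It is deliberately simplified: a constant-velocity extrapolation stands in for the Kalman filter's predict step, only the horizontal face center is tracked, and the 0.5 confidence threshold and all names are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional

DETECT_EVERY = 5       # from the text: the detector runs every 5th frame
MIN_CONFIDENCE = 0.5   # assumed threshold for treating a frame as occluded


@dataclass
class TrackState:
    x: float          # face center x, in pixels
    vx: float = 0.0   # velocity, in pixels per frame


def should_detect(frame_index: int) -> bool:
    """True on frames where the full detector runs instead of the tracker."""
    return frame_index % DETECT_EVERY == 0


def step(state: TrackState, measured_x: Optional[float], confidence: float) -> TrackState:
    """Advance one frame, coasting on the last velocity when occluded."""
    predicted = state.x + state.vx
    if measured_x is None or confidence < MIN_CONFIDENCE:
        # Occlusion: keep following the last known trajectory.
        return TrackState(x=predicted, vx=state.vx)
    # Trusted measurement: accept it and refresh the velocity estimate.
    return TrackState(x=measured_x, vx=measured_x - state.x)
```

A real Kalman filter would also blend prediction and measurement according to their uncertainties rather than switching between them outright.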
MicrocosmWorks built a saliency-weighted cropping algorithm that prioritizes detected faces, then text regions, then motion areas when determining the 9:16 crop window position. For multi-person scenes, the system uses a configurable priority ranking, defaulting to the active speaker or the largest face, with smooth interpolation between crop positions to avoid jarring shifts.
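One way to read the priority ranking described here is as a strict ordering: faces first, then text, then motion, with the active speaker preferred among faces and the largest face as the fallback. The sketch below implements that reading as a simplification of saliency weighting; the `Region` type, its fields, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

PRIORITY = ("face", "text", "motion")  # order taken from the text


@dataclass
class Region:
    kind: str               # one of PRIORITY
    cx: float               # horizontal center, in pixels
    area: float             # region area, in square pixels
    speaking: bool = False  # only meaningful for faces


def pick_target(regions: List[Region]) -> Optional[Region]:
    """Return the region the crop should follow, or None if there is none."""
    for kind in PRIORITY:
        candidates = [r for r in regions if r.kind == kind]
        if not candidates:
            continue
        if kind == "face":
            speakers = [r for r in candidates if r.speaking]
            if speakers:
                candidates = speakers  # prefer the active speaker
        return max(candidates, key=lambda r: r.area)
    return None


def lerp(current: float, target: float, t: float) -> float:
    """Move a fraction t of the way from the current crop center to the target."""
    return current + (target - current) * t
```

Calling `lerp` once per frame with a small `t` moves the crop gradually toward the chosen region instead of jumping to it.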
Yes, MicrocosmWorks implemented a fallback saliency detection mode that activates when no faces are present, using a combination of motion detection, visual attention modeling, and mouse cursor tracking for screen recordings. The system intelligently follows the most relevant content region even in purely visual or text-based footage.
MicrocosmWorks optimized the pipeline for batch workflows, achieving 8x real-time processing speed on a single NVIDIA T4 GPU, meaning a 10-minute video is reframed in approximately 75 seconds. The system supports parallel processing across multiple GPUs, scaling linearly for high-volume content operations.
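The 75-second figure follows directly from the stated speedup: 600 seconds of video divided by 8 is 75 seconds. The helper below reproduces that arithmetic and extends it to several GPUs, taking the linear-scaling claim at face value; the function name is hypothetical.

```python
def processing_seconds(video_seconds: float, speedup: float = 8.0, gpus: int = 1) -> float:
    """Estimated wall-clock time, assuming throughput scales linearly with GPU count."""
    return video_seconds / (speedup * gpus)


ten_minutes = 10 * 60
single_gpu = processing_seconds(ten_minutes)        # 75.0 seconds
four_gpus = processing_seconds(ten_minutes, gpus=4)  # 18.75 seconds
```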
MicrocosmWorks develops AI video reframing systems at rates of $25-$45/hr, with a full face tracking and smart reframing solution including model optimization, batch processing support, and API integration typically requiring 350-550 development hours. This investment eliminates the need for manual reframing editors, which typically cost $5-$15 per video.