Video Creation · Published June 18, 2026 · Updated May 25, 2026

AI Face Tracking & Smart Reframing for Vertical Video Conversion

A content repurposing platform needed to automatically convert horizontal (16:9) long-form videos into vertical (9:16) short-form clips while keeping speakers and subjects centered, without any manual cropping or keyframing.

Domain: Video Creation

Status: Delivered

The Challenge

Converting horizontal video to vertical format was one of the most tedious steps in short-form content production:

Manually cropping and repositioning the frame for every clip was time-consuming
Multi-person conversations required dynamic reframing as speakers changed
Static center-crop cut off speakers who moved or sat off-center
Traditional face detection was too slow for real-time reframing decisions across thousands of clips
Different content types (interviews, solo vlogs, presentations) required different framing strategies

Our Solution

We built an AI-powered face tracking and smart reframing engine that detects faces in video frames, tracks their movement, and dynamically adjusts the vertical crop region to keep the active subject centered.

Architecture

Face Detection: YOLO-based face detection model optimized for speed
Face Tracking: IoU-based frame-to-frame tracking with persistent subject IDs
Reframing Engine: Dynamic crop region calculation based on face positions and movement
Active Speaker Coupling: Integration with speaker detection to prioritize the person talking
Rendering: FFmpeg crop filter chain with smooth pan transitions
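
To illustrate the IoU-based tracking component listed above, here is a minimal sketch of frame-to-frame association with persistent subject IDs. It is not the platform's implementation: the greedy matching strategy, the `iou_threshold` and `max_missed` values, and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class IoUTracker:
    iou_threshold: float = 0.3   # assumed: minimum overlap to count as the same face
    max_missed: int = 10         # assumed: frames a track survives without a match
    tracks: dict[int, Box] = field(default_factory=dict)
    missed: dict[int, int] = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def update(self, detections: list[Box]) -> dict[int, Box]:
        """Match detections to existing tracks; return the last known box per subject ID."""
        # Score every (track, detection) pair and match greedily, best overlap first.
        pairs = sorted(
            ((iou(box, det), tid, j)
             for tid, box in self.tracks.items()
             for j, det in enumerate(detections)),
            reverse=True,
        )
        matched_tracks, matched_dets = set(), set()
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break
            if tid in matched_tracks or j in matched_dets:
                continue
            self.tracks[tid] = detections[j]
            self.missed[tid] = 0
            matched_tracks.add(tid)
            matched_dets.add(j)
        # Age unmatched tracks and drop those missing for too long.
        for tid in list(self.tracks):
            if tid not in matched_tracks:
                self.missed[tid] += 1
                if self.missed[tid] > self.max_missed:
                    del self.tracks[tid], self.missed[tid]
        # Unmatched detections start new tracks with fresh IDs.
        for j, det in enumerate(detections):
            if j not in matched_dets:
                tid = next(self._ids)
                self.tracks[tid] = det
                self.missed[tid] = 0
        return dict(self.tracks)
```

Calling `update` once per sampled frame with that frame's face boxes keeps each subject's ID stable as long as consecutive boxes overlap enough. Greedy matching is the simplest option; an optimal assignment method could be substituted if faces frequently cross paths.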

Reframing Pipeline

Face Detection - Run YOLO face detection across sampled frames
Subject Tracking - Link face detections across frames using IoU-based tracking
Speaker Priority - When coupled with active speaker detection, prioritize the talking subject
Crop Calculation - Determine optimal 9:16 crop region based on primary subject position
Smoothing - Apply easing to crop movement to avoid jarring jumps
Rendering - FFmpeg applies the dynamic crop with smooth pan transitions
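
The crop calculation and smoothing steps above can be sketched as follows. This is an illustration under stated assumptions rather than the engine's actual code: the full-height crop window, the even-width rounding, exponential smoothing as the easing method, the `alpha` value, and every function name are hypothetical.

```python
def crop_width(src_h: int) -> int:
    """Width of a full-height 9:16 window, rounded down to an even pixel count."""
    w = src_h * 9 // 16
    return w - (w % 2)


def crop_x(face_cx: float, src_w: int, crop_w: int) -> float:
    """Left edge that centers the window on the face, clamped to the frame."""
    return min(max(face_cx - crop_w / 2, 0.0), float(src_w - crop_w))


def smooth(xs: list[float], alpha: float = 0.2) -> list[float]:
    """Exponential smoothing: each step moves a fraction alpha toward the target."""
    if not xs:
        return []
    out, prev = [], xs[0]
    for x in xs:
        prev += alpha * (x - prev)
        out.append(prev)
    return out


def ffmpeg_crop_filter(x: float, crop_w: int, src_h: int) -> str:
    """FFmpeg crop filter string (crop=w:h:x:y) for one fixed window position."""
    return f"crop={crop_w}:{src_h}:{round(x)}:0"


# Example: a 1920x1080 source with the primary face centered at x = 960.
w = crop_width(1080)
print(ffmpeg_crop_filter(crop_x(960, 1920, w), w, 1080))
```

The printed string can be passed to FFmpeg's `-vf` option. It describes a single static window; a moving pan would need the smoothed positions applied per frame or per segment, which this sketch leaves out.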

Key Features

Multi-Subject Handling - Tracks multiple faces and determines the primary subject per segment
Speaker-Aware Framing - Prioritizes the active speaker when integrated with speaker detection
Smooth Transitions - Eased panning between subjects eliminates jarring cuts
Content-Type Adaptation - Different framing strategies for solo, interview, and group content
Batch Processing - Reframe hundreds of clips from a single long-form video
No Manual Intervention - Fully automated from detection to final render

Results

Time Savings: Eliminated 2-5 minutes of manual cropping per clip

Quality: Subjects stayed centered 95%+ of the time across tested content

Scale: Processed thousands of clips daily without human intervention

Technology Stack

YOLO, Python, FFmpeg, OpenCV, IoU Tracking, Node.js, GPU-Accelerated Inference

Frequently Asked Questions

MicrocosmWorks implemented a hybrid tracking approach that runs a lightweight face detector on every 5th frame and uses a KCF (kernelized correlation filter) tracker to predict face positions on the frames in between. When a drop in confidence score signals occlusion, the system extends the last known trajectory with Kalman filtering and re-acquires the face within 200 ms of it becoming visible again.
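
The idea of detecting on every 5th frame and coasting through occlusion with a Kalman filter can be sketched as below. This is a simplified illustration, not the production tracker: it follows only the horizontal face center, uses a constant-velocity model, and omits the KCF step entirely. The noise values `q` and `r`, the `measurements` mapping, and the function names are assumptions.

```python
from typing import Optional


class Kalman1D:
    """Constant-velocity Kalman filter over a face's horizontal center (pixels)."""

    def __init__(self, x0: float, q: float = 0.01, r: float = 4.0):
        self.x, self.v = float(x0), 0.0        # position (px) and velocity (px/frame)
        self.p = [[r, 0.0], [0.0, 100.0]]      # covariance; velocity starts uncertain
        self.q, self.r = q, r                  # assumed process / measurement noise

    def predict(self) -> None:
        """Advance one frame: x += v, and grow the covariance (P = F P F^T + Q)."""
        (p00, p01), (p10, p11) = self.p
        self.x += self.v
        self.p = [
            [p00 + p01 + p10 + p11 + self.q, p01 + p11],
            [p10 + p11, p11 + self.q],
        ]

    def update(self, z: float) -> None:
        """Correct position and velocity with a measured center z."""
        (p00, p01), (p10, p11) = self.p
        s = p00 + self.r                       # innovation variance
        k0, k1 = p00 / s, p10 / s              # Kalman gain
        y = z - self.x                         # innovation
        self.x += k0 * y
        self.v += k1 * y
        self.p = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]


def track_center(
    measurements: dict[int, Optional[float]],
    n_frames: int,
    detect_every: int = 5,
) -> list[float]:
    """Estimate the face center per frame.

    `measurements` maps a frame index to the detected center, or None when the
    face is occluded. Frame 0 is assumed to contain a valid detection.
    """
    first = measurements[0]
    if first is None:
        raise ValueError("frame 0 must contain a detection")
    kf = Kalman1D(first)
    centers = [kf.x]
    for i in range(1, n_frames):
        kf.predict()
        # Only detector frames can supply a measurement; other frames coast.
        z = measurements.get(i) if i % detect_every == 0 else None
        if z is not None:
            kf.update(z)
        centers.append(kf.x)
    return centers
```

When a detector frame returns nothing, the filter keeps predicting along the last estimated velocity, so the crop window continues to move plausibly until the face is detected again.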

MicrocosmWorks built a saliency-weighted cropping algorithm that prioritizes detected faces, then text regions, then motion areas when determining the 9:16 crop window position. For multi-person scenes, the system uses a configurable priority ranking, defaulting to the active speaker or the largest face, with smooth interpolation between crop positions to avoid jarring shifts.
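
A minimal sketch of the priority ranking described above follows. It uses strict tiers (faces, then text, then motion) instead of continuous saliency weights, so it is a simplification. The `Region` type, the `PRIORITY` table, and `pick_target` are hypothetical names, and resolving multiple active speakers by face size is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Lower number means higher priority, following the order stated above.
PRIORITY = {"face": 0, "text": 1, "motion": 2}


@dataclass
class Region:
    kind: str                 # "face", "text", or "motion"
    box: Box
    is_speaking: bool = False  # only meaningful for faces


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def pick_target(regions: list[Region]) -> Optional[Region]:
    """Choose the region the 9:16 crop window should follow."""
    if not regions:
        return None
    # Keep only regions from the highest-priority kind that is present.
    best = min(PRIORITY[r.kind] for r in regions)
    candidates = [r for r in regions if PRIORITY[r.kind] == best]
    # Prefer active speakers; otherwise fall back to the largest region.
    speakers = [r for r in candidates if r.is_speaking]
    return max(speakers or candidates, key=lambda r: area(r.box))
```

In an interview with two faces, the speaking face wins even if it is smaller; when nobody is speaking, the largest face is chosen. The selected region's center would then feed the crop position, with interpolation handled separately.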

Yes, MicrocosmWorks implemented a fallback saliency detection mode that activates when no faces are present, using a combination of motion detection, visual attention modeling, and mouse cursor tracking for screen recordings. The system intelligently follows the most relevant content region even in purely visual or text-based footage.

MicrocosmWorks optimized the pipeline for batch workflows, achieving 8x real-time processing speed on a single NVIDIA T4 GPU, meaning a 10-minute video is reframed in approximately 75 seconds. The system supports parallel processing across multiple GPUs, scaling linearly for high-volume content operations.
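
The arithmetic behind these figures can be checked with a short helper. It assumes the stated 8x real-time speed per GPU and ideal linear scaling across GPUs for a batch; the function name and parameters are hypothetical, and real throughput would also depend on I/O and encoding overhead.

```python
def batch_seconds(total_video_seconds: float, speedup: float = 8.0, gpus: int = 1) -> float:
    """Wall-clock time to reframe a batch, assuming ideal linear multi-GPU scaling."""
    return total_video_seconds / (speedup * gpus)


# A single 10-minute video on one GPU: 600 s / 8 = 75 s.
print(batch_seconds(10 * 60))

# Ten hours of footage split across 4 GPUs.
print(batch_seconds(10 * 3600, gpus=4))
```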

MicrocosmWorks develops AI video reframing systems at rates of $25-$45/hr, with a full face tracking and smart reframing solution including model optimization, batch processing support, and API integration typically requiring 350-550 development hours. This investment eliminates the need for manual reframing editors, which typically cost $5-$15 per video.
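
As a rough worked example of these numbers, the helper below multiplies the quoted hour and rate ranges and estimates a break-even video count against manual reframing. The function names are hypothetical, and the estimate ignores ongoing costs such as GPU time and maintenance.

```python
import math


def cost_range(hours=(350, 550), rate=(25, 45)) -> tuple[int, int]:
    """Lowest and highest build cost in dollars from the quoted ranges."""
    return hours[0] * rate[0], hours[1] * rate[1]


def break_even_videos(build_cost: float, manual_cost_per_video: float) -> int:
    """Number of videos after which the build cost matches manual reframing spend."""
    return math.ceil(build_cost / manual_cost_per_video)


low, high = cost_range()
print(low, high)                      # 8750 24750
print(break_even_videos(low, 15))     # 584
print(break_even_videos(high, 5))     # 4950
```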
