AI Face Tracking & Smart Reframing for Vertical Video Conversion
A content repurposing platform needed to automatically convert horizontal (16:9) long-form videos into vertical (9:16) short-form clips while keeping speakers and subjects perfectly centered, without any manual cropping or keyframing.
The Challenge
Converting horizontal video to vertical format was one of the most tedious steps in short-form content production:
- Manually cropping and repositioning the frame for every clip was time-consuming
- Multi-person conversations required dynamic reframing as speakers changed
- Static center-crop cut off speakers who moved or sat off-center
- Traditional face detection was too slow for real-time reframing decisions across thousands of clips
- Different content types (interviews, solo vlogs, presentations) required different framing strategies
Our Solution
We built an AI-powered face tracking and smart reframing engine that detects faces in video frames, tracks their movement, and dynamically adjusts the vertical crop region to keep the active subject centered.
Architecture
- Face Detection: YOLO-based face detection model optimized for speed
- Face Tracking: IoU-based frame-to-frame tracking with persistent subject IDs
- Reframing Engine: Dynamic crop region calculation based on face positions and movement
- Active Speaker Coupling: Integration with speaker detection to prioritize the person talking
- Rendering: FFmpeg crop filter chain with smooth pan transitions
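The IoU-based tracking component listed above can be illustrated with a minimal sketch. This is not the platform's actual tracker: the `IouTracker` class, the greedy matching strategy, and the 0.3 threshold are all assumptions chosen for illustration. Boxes are `(x1, y1, x2, y2)` tuples in pixels, and a track that finds no match in a frame is simply dropped.

```python
from dataclasses import dataclass, field
from itertools import count

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class IouTracker:
    iou_threshold: float = 0.3  # assumed value, tune per footage
    tracks: dict[int, Box] = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def update(self, detections: list[Box]) -> dict[int, Box]:
        """Greedily match detections to existing tracks; return id -> box."""
        # Score every (track, detection) pair, best overlap first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        assigned: dict[int, Box] = {}
        used: set[int] = set()
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to be the same face
            if tid in assigned or i in used:
                continue
            assigned[tid] = detections[i]
            used.add(i)
        # Unmatched detections start new tracks with fresh IDs.
        for i, det in enumerate(detections):
            if i not in used:
                assigned[next(self._ids)] = det
        self.tracks = assigned
        return dict(assigned)
```

Calling `update` once per sampled frame keeps the same ID attached to a face as long as consecutive boxes overlap enough. A production tracker would usually also keep unmatched tracks alive for a few frames before discarding them.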
Reframing Pipeline
- Face Detection - Run YOLO face detection across sampled frames
- Subject Tracking - Link face detections across frames using IoU-based tracking
- Speaker Priority - When coupled with active speaker detection, prioritize the talking subject
- Crop Calculation - Determine optimal 9:16 crop region based on primary subject position
- Smoothing - Apply easing to crop movement to avoid jarring jumps
- Rendering - FFmpeg applies the dynamic crop with smooth pan transitions
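The crop calculation and smoothing steps above can be sketched as follows. The function names are hypothetical, the crop is assumed to use the full source height, and exponential smoothing stands in for whatever easing curve the real engine applies.

```python
def crop_x_for_face(face_cx: float, src_w: int, src_h: int) -> tuple[int, int]:
    """Return (crop_x, crop_w) for a full-height 9:16 window centered on face_cx."""
    crop_w = int(src_h * 9 / 16)
    crop_w -= crop_w % 2  # keep the width even, which most encoders expect
    x = face_cx - crop_w / 2
    # Clamp so the window never leaves the source frame.
    x = max(0, min(x, src_w - crop_w))
    return int(round(x)), crop_w


def smooth(positions: list[float], alpha: float = 0.2) -> list[float]:
    """Exponentially smooth crop positions; a smaller alpha gives a slower pan."""
    out: list[float] = []
    current = None
    for p in positions:
        current = p if current is None else current + alpha * (p - current)
        out.append(current)
    return out
```

For a 1920x1080 source this gives a 606-pixel-wide window. A face near the left or right edge pins the window to that edge instead of centering, which is the case a static center crop handles badly.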
Key Features
- Multi-Subject Handling - Tracks multiple faces and determines the primary subject per segment
- Speaker-Aware Framing - Prioritizes the active speaker when integrated with speaker detection
- Smooth Transitions - Eased panning between subjects eliminates jarring cuts
- Content-Type Adaptation - Different framing strategies for solo, interview, and group content
- Batch Processing - Reframe hundreds of clips from a single long-form video
- No Manual Intervention - Fully automated from detection to final render
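One way the final render could be driven is by handing FFmpeg's `crop` filter a time-dependent `x` expression. The sketch below only builds the command and does not run it. The keyframe format, both function names, and the choice of a nested `if(lt(t,...))` expression are assumptions; it relies on the `crop` filter evaluating `x` per frame with `t` as the timestamp in seconds. Keyframe times are assumed to be non-negative and strictly increasing.

```python
def pan_expression(keyframes: list[tuple[float, float]]) -> str:
    """Build a piecewise-linear FFmpeg expression for crop x from (time, x) keyframes."""
    pts = sorted(keyframes)
    if not pts:
        raise ValueError("at least one keyframe is required")
    # Start from the last position and wrap earlier segments around it.
    expr = f"{pts[-1][1]:.3f}"
    for (t0, x0), (t1, x1) in reversed(list(zip(pts, pts[1:]))):
        segment = f"{x0:.3f}+({x1:.3f}-{x0:.3f})*(t-{t0:.3f})/({t1:.3f}-{t0:.3f})"
        expr = f"if(lt(t,{t1:.3f}),{segment},{expr})"
    # Hold the first position before the first keyframe.
    return f"if(lt(t,{pts[0][0]:.3f}),{pts[0][1]:.3f},{expr})"


def build_ffmpeg_args(src: str, dst: str, crop_w: int, crop_h: int,
                      keyframes: list[tuple[float, float]]) -> list[str]:
    """Return an argument list suitable for subprocess.run (not executed here)."""
    # Single quotes protect the commas inside the expression from the filter parser.
    vf = f"crop=w={crop_w}:h={crop_h}:x='{pan_expression(keyframes)}':y=0"
    return ["ffmpeg", "-i", src, "-vf", vf, "-c:a", "copy", dst]
```

A nested expression like this grows with every keyframe, so it suits short clips with a handful of pans. Longer timelines would likely need a different mechanism for feeding positions to the filter.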
Frequently Asked Questions
MicrocosmWorks implemented a hybrid tracking approach that combines a lightweight face detector running every 5th frame with a KCF (kernelized correlation filter) tracker for inter-frame predictions. When a drop in confidence score signals occlusion, the system maintains the last known trajectory with Kalman filtering and re-acquires the face within 200 ms of it becoming visible again.
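The idea of carrying a trajectory through frames without a detection can be illustrated with a one-dimensional constant-velocity Kalman filter. This is a simplified sketch, not the production tracker: it follows only the face center's x coordinate, it uses the Kalman prediction for every gap (both skipped frames and occlusion) instead of a KCF tracker, and all noise variances are assumed values.

```python
DETECT_EVERY = 5  # from the text: the detector runs on every 5th frame


class ConstantVelocityKalman1D:
    """Tracks one coordinate (e.g. face center x) with state [position, velocity]."""

    def __init__(self, pos, pos_var=25.0, vel_var=100.0, vel_noise=1.0, meas_var=9.0):
        self.pos, self.vel = float(pos), 0.0
        # Covariance P = [[p00, p01], [p01, p11]]; it stays symmetric.
        self.p00, self.p01, self.p11 = pos_var, 0.0, vel_var
        self.vel_noise, self.meas_var = vel_noise, meas_var

    def predict(self):
        """Advance one frame: x <- F x, P <- F P F^T + Q with F = [[1, 1], [0, 1]]."""
        self.pos += self.vel
        self.p00 += 2.0 * self.p01 + self.p11
        self.p01 += self.p11
        self.p11 += self.vel_noise

    def update(self, z):
        """Fold in a measured position z (measurement matrix H = [1, 0])."""
        residual = z - self.pos
        s = self.p00 + self.meas_var
        k0, k1 = self.p00 / s, self.p01 / s
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P <- (I - K H) P, written out element by element.
        self.p11 -= k1 * self.p01
        self.p01 *= 1.0 - k0
        self.p00 *= 1.0 - k0


def track(measurements):
    """measurements[i] is the detected center on frame i, or None if there is none."""
    kf, estimates = None, []
    for z in measurements:
        if kf is None:
            if z is not None:
                kf = ConstantVelocityKalman1D(z)
            estimates.append(z)
            continue
        kf.predict()
        if z is not None:
            kf.update(z)
        estimates.append(kf.pos)
    return estimates
```

Feeding `track` a list with a measurement on every `DETECT_EVERY`-th frame and `None` elsewhere yields a position for every frame. A longer run of `None` values, as during occlusion, extrapolates along the last estimated velocity until a detection returns.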
MicrocosmWorks built a saliency-weighted cropping algorithm that prioritizes detected faces, then text regions, then motion areas when determining the 9:16 crop window position. For multi-person scenes, the system uses a configurable priority ranking, defaulting to the active speaker or the largest face, with smooth interpolation between crop positions to avoid jarring shifts.
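The priority ranking described here can be sketched as a single sort key. The `Region` type, the `kind` labels, and the tie-breaking order are assumptions for illustration: faces outrank text, text outranks motion, and among regions of the same kind the active speaker wins, then the largest box.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Region:
    kind: str  # "face", "text", or "motion"
    box: Box
    speaking: bool = False  # set by a separate active-speaker detector


# Lower number means higher priority; pass a different mapping to reconfigure.
DEFAULT_PRIORITY = {"face": 0, "text": 1, "motion": 2}


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def pick_target(regions: list[Region], priority: dict[str, int] = DEFAULT_PRIORITY):
    """Return the region the crop window should follow, or None if there is none."""
    if not regions:
        return None
    return min(
        regions,
        # Sort by kind first, then speakers before non-speakers, then largest area.
        key=lambda r: (priority[r.kind], not r.speaking, -area(r.box)),
    )
```

The center of the returned region's box would then become the target position for the crop window, with interpolation applied between successive targets.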
Yes, MicrocosmWorks implemented a fallback saliency detection mode that activates when no faces are present, using a combination of motion detection, visual attention modeling, and mouse cursor tracking for screen recordings. The system intelligently follows the most relevant content region even in purely visual or text-based footage.
MicrocosmWorks optimized the pipeline for batch workflows, achieving 8x real-time processing speed on a single NVIDIA T4 GPU, meaning a 10-minute video is reframed in approximately 75 seconds. The system supports parallel processing across multiple GPUs, scaling linearly for high-volume content operations.
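The 75-second figure follows directly from the stated speedup: 600 seconds of video divided by 8. A small sketch of that arithmetic, with hypothetical function names and an idealized assumption that work divides evenly across GPUs:

```python
def reframe_seconds(video_seconds: float, speedup: float = 8.0) -> float:
    """Processing time for one video on one GPU at the stated real-time multiple."""
    return video_seconds / speedup


def batch_seconds(video_lengths: list[float], gpus: int = 1, speedup: float = 8.0) -> float:
    """Idealized batch time, assuming perfectly linear scaling across GPUs."""
    return sum(video_lengths) / (speedup * gpus)


print(reframe_seconds(10 * 60))               # 75.0
print(batch_seconds([600.0] * 100, gpus=4))   # 1875.0
```

Under these assumptions, one hundred 10-minute videos on four GPUs would take about 31 minutes. Real batches will deviate from the ideal because of uneven video lengths, decoding, and I/O overhead.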
MicrocosmWorks develops AI video reframing systems at rates of $25-$45/hr, with a full face tracking and smart reframing solution including model optimization, batch processing support, and API integration typically requiring 350-550 development hours. This investment eliminates the need for manual reframing editors, which typically cost $5-$15 per video.