Auto-Scaling RTSP Streaming Architecture with Dual Orchestrators & Zero Packet Drop
A surveillance platform needed to scale its video streaming infrastructure dynamically, handling anywhere from 10 to 200+ IP cameras with hundreds of concurrent viewers and AI processing workers, while guaranteeing zero packet loss during scaling operations and maintaining stable stream URLs that never change.
The Challenge
Fixed streaming infrastructure couldn't handle the variable demands of a growing surveillance platform:
- Scale Variability: Camera count and viewer demand fluctuated dramatically throughout the day (10x peak-to-trough ratio)
- Over-Provisioning Cost: Provisioning for peak load meant 70%+ idle resources during off-peak hours
- Packet Loss During Scaling: Adding or removing streaming servers caused stream interruptions, dropping frames for AI processing workers
- URL Instability: Cameras and viewers configured with specific server IPs needed reconfiguration when infrastructure changed
- Different Scaling Needs: Camera ingestion and viewer distribution had fundamentally different load patterns requiring independent scaling
- AI Worker Disruption: AI processing pipelines crashed when their source stream server was scaled down
Our Solution
We designed a dual-orchestrator auto-scaling streaming architecture with separate ingestion and distribution clusters, a 5-phase graceful shutdown for zero packet drop, stable DNS-based URLs, and automated AI worker reconnection.
Architecture
- Streaming Server: MediaMTX for RTSP/WebRTC/HLS protocol support
- Ingestion Cluster: 1-10 servers receiving camera RTSP streams
- Distribution Cluster: 2-20 servers serving viewers (WebRTC/HLS) and AI workers (RTSP)
- Dual Orchestrators: Independent scaling controllers for ingestion and distribution
- Load Balancers: Separate load balancers per cluster with protocol-appropriate algorithms
- Service Registry: Redis for server status, stream mappings, and coordination
- Health Monitoring: Active health checks with automated recovery
- DNS Layer: Stable domain names pointing to load balancers (URLs never change)
Dual Orchestrator Design
Why Two Orchestrators
Ingestion and distribution have fundamentally different scaling characteristics:
- Ingestion scales with camera count and inbound bandwidth (predictable, grows steadily)
- Distribution scales with viewer count and AI worker demand (bursty, unpredictable)
Separate orchestrators allow each to scale independently with specialized policies, metrics, and thresholds, without one cluster's scaling decisions affecting the other.
Ingestion Orchestrator
- Primary Metric: Camera connections per server
- Secondary Metric: Inbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or the camera count per server exceeds capacity
- Scale Down: When utilization drops below threshold for a sustained stabilization period
- Server Range: 1 to 10 servers
Distribution Orchestrator
- Primary Metric: Viewer + AI worker connections per server
- Secondary Metric: Outbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or connections per server exceed capacity
- Scale Down: When utilization drops below threshold for a sustained period (longer stabilization than ingestion)
- Server Range: 2 to 20 servers (minimum 2 for high availability)
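The two policies above can be sketched as one decision function driven by different parameters. This is a minimal illustration, not the platform's actual controller: the server ranges (1 to 10 and 2 to 20) come from the text, while the CPU thresholds, per-server capacities, stabilization windows, and all names (`ScalingPolicy`, `ClusterState`, `decide`) are assumptions chosen for the example. "Utilization" is approximated here by average CPU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingPolicy:
    min_servers: int
    max_servers: int
    max_conns_per_server: int  # assumed capacity, not from the source
    cpu_high: float            # assumed scale-up threshold
    cpu_low: float             # assumed scale-down threshold
    stabilization_s: float     # how long utilization must stay low


@dataclass
class ClusterState:
    servers: int
    connections: int
    avg_cpu: float
    low_since: Optional[float] = None  # when utilization first dropped low


# Server ranges are from the case study; every other number is illustrative.
INGESTION = ScalingPolicy(1, 10, 50, 0.75, 0.30, 300.0)
DISTRIBUTION = ScalingPolicy(2, 20, 200, 0.70, 0.25, 600.0)  # longer window


def decide(policy: ScalingPolicy, state: ClusterState, now: float) -> int:
    """Return the desired server count for one evaluation tick."""
    per_server = state.connections / state.servers
    if state.avg_cpu > policy.cpu_high or per_server > policy.max_conns_per_server:
        state.low_since = None
        return min(state.servers + 1, policy.max_servers)
    if state.avg_cpu < policy.cpu_low:
        if state.low_since is None:
            state.low_since = now  # start the stabilization timer
        elif now - state.low_since >= policy.stabilization_s:
            state.low_since = None
            return max(state.servers - 1, policy.min_servers)
    else:
        state.low_since = None  # utilization recovered, reset the timer
    return state.servers
```

Because each orchestrator holds its own `ClusterState`, a burst of viewers on the distribution side never resets the ingestion cluster's stabilization timer, which is the isolation the dual design is after.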
Zero Packet Drop: 5-Phase Graceful Shutdown
When a distribution server is scheduled for removal, a 5-phase process ensures no frames are lost:
Phase 1: Pre-Notification
Server marked as "DRAINING" in the service registry. Load balancer weight reduced so new connections route elsewhere. Redis pub/sub notifications and webhooks alert AI workers to prepare for migration.
Phase 2: Load Balancer Update
Server removed from the load balancer backend pool. No new connections can reach the draining server. Existing connections continue uninterrupted.
Phase 3: AI Worker Migration
AI workers disconnect from the draining server and reconnect to healthy distribution servers. Checkpoint-based state preservation ensures processing resumes from the exact frame where it left off. Total gap: approximately 3 seconds with zero frames lost.
Phase 4: Viewer Draining
Remaining viewer connections drain naturally over a configurable window. Modern video players auto-reconnect to the same stable URL, which routes to healthy servers. Most viewers experience no interruption.
Phase 5: Cleanup
Verify all connections have closed. Remove server from the service registry. Destroy the cloud instance. Record scaling metrics.
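The ordering of the five phases matters more than any single step, so it can help to see them as one sequence. The sketch below models the cluster with plain in-memory structures; `Cluster`, `graceful_shutdown`, and the event names are hypothetical, and the real system would talk to Redis, the load balancer, and the cloud provider instead. Viewer draining is collapsed into an immediate reconnect, where the real process waits out a configurable window.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    PRE_NOTIFICATION = 1
    LOAD_BALANCER_UPDATE = 2
    AI_WORKER_MIGRATION = 3
    VIEWER_DRAINING = 4
    CLEANUP = 5


@dataclass
class Cluster:
    status: dict      # server -> "ACTIVE" | "DRAINING" (stand-in for the registry)
    lb_pool: set      # servers currently receiving new connections
    ai_workers: dict  # worker id -> server
    viewers: dict     # viewer id -> server
    events: list = field(default_factory=list)  # stand-in for pub/sub


def graceful_shutdown(cluster: Cluster, server: str) -> list:
    """Run the five phases in order and return the phases completed."""
    done = []

    # Phase 1: mark DRAINING and notify AI workers before anything breaks.
    cluster.status[server] = "DRAINING"
    cluster.events.append(("server_draining", server))
    done.append(Phase.PRE_NOTIFICATION)

    # Phase 2: stop new connections; existing ones are left untouched.
    cluster.lb_pool.discard(server)
    if not cluster.lb_pool:
        raise RuntimeError("no healthy server left to take over")
    target = sorted(cluster.lb_pool)[0]  # simplistic choice for the sketch
    done.append(Phase.LOAD_BALANCER_UPDATE)

    # Phase 3: AI workers move first, since they cannot tolerate lost frames.
    for worker, host in list(cluster.ai_workers.items()):
        if host == server:
            cluster.ai_workers[worker] = target
    done.append(Phase.AI_WORKER_MIGRATION)

    # Phase 4: viewers reconnect through the stable URL (instant here).
    for viewer, host in list(cluster.viewers.items()):
        if host == server:
            cluster.viewers[viewer] = target
    done.append(Phase.VIEWER_DRAINING)

    # Phase 5: verify nothing is still attached, then deregister.
    attached = [c for c, h in {**cluster.ai_workers, **cluster.viewers}.items()
                if h == server]
    if attached:
        raise RuntimeError(f"connections still open: {attached}")
    del cluster.status[server]
    done.append(Phase.CLEANUP)
    return done
```

Raising in Phase 2 when the pool would become empty mirrors the distribution cluster's minimum of two servers: a shutdown should never proceed without somewhere to migrate to.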
Stable URLs
The URL architecture ensures cameras and clients never need reconfiguration:
- Camera publish target: A stable ingestion domain name
- Viewer/AI access target: A stable distribution domain name
- DNS records point to load balancer IPs (which are permanent)
- Load balancers handle routing to backend servers transparently
- Backend servers can be added, removed, or replaced without URL changes
Service Registry (Redis)
A centralized Redis instance coordinates the entire system:
- Server status tracking (active, draining, offline)
- Stream-to-server mapping (which camera is on which ingestion server)
- AI worker state and checkpoint data
- Load metrics per server for scaling decisions
- Pub/sub channels for real-time coordination events
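One way to picture the registry's contents is as a small key-naming scheme plus a JSON event format. The case study does not publish its actual schema, so every key prefix, the channel name, and the payload fields below are assumptions made for illustration. The helpers are pure functions; the comments note which standard Redis command each would typically pair with.

```python
import json

EVENTS_CHANNEL = "events:scaling"  # hypothetical pub/sub channel name


def server_key(cluster: str, server_id: str) -> str:
    """Hash holding status and load metrics (would pair with HSET/HGETALL)."""
    return f"servers:{cluster}:{server_id}"


def stream_key(camera_id: str) -> str:
    """String naming the ingestion server for a camera (SET/GET)."""
    return f"streams:{camera_id}"


def checkpoint_key(worker_id: str) -> str:
    """Hash holding an AI worker's last checkpoint (HSET/HGETALL)."""
    return f"checkpoints:{worker_id}"


def draining_event(server_id: str) -> str:
    """JSON payload announcing a server removal (PUBLISH on EVENTS_CHANNEL)."""
    return json.dumps({"type": "server_draining", "server": server_id},
                      sort_keys=True)
```

Keeping status, mappings, and checkpoints under separate prefixes lets each consumer scan only the keys it cares about.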
AI Client Reconnection
An AI client library provides seamless reconnection:
- Listens for server removal notifications via Redis pub/sub
- Automatic frame checkpointing at regular intervals
- Reconnection to a healthy distribution server on notification
- Resume processing from checkpoint with minimal gap
- Metrics reporting for reconnection events
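The client behaviour listed above might look roughly like the following. This is a hedged sketch rather than the library itself: `AIStreamClient`, its method names, and the checkpoint interval are invented for the example, and the Redis subscription is replaced by a direct call to `on_server_draining`. The sketch only tracks which frame index to resume from; how frames in the reconnection gap are supplied to the worker is not described in the source.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    stream: str
    last_frame: int  # index of the last fully processed frame


class AIStreamClient:
    def __init__(self, stream: str, server: str, checkpoint_every: int = 30):
        self.stream = stream
        self.server = server
        self.checkpoint_every = checkpoint_every  # assumed interval, in frames
        self.checkpoint = Checkpoint(stream, -1)
        self.reconnects = 0  # reported as a metric
        self._last_done = -1

    def process(self, frame_no: int) -> bool:
        """Process one frame; return False for frames already handled."""
        if frame_no <= self._last_done:
            return False  # duplicate after a reconnect, skip it
        self._last_done = frame_no
        if frame_no % self.checkpoint_every == 0:
            self.checkpoint = Checkpoint(self.stream, frame_no)
        return True

    def on_server_draining(self, draining: str, healthy: str) -> None:
        """Handle a removal notification for the server we are reading from."""
        if draining != self.server:
            return
        # Flush a final checkpoint so no processed frame is repeated or lost.
        self.checkpoint = Checkpoint(self.stream, self._last_done)
        self.server = healthy
        self.reconnects += 1

    def resume_from(self) -> int:
        """First frame index to process after reconnecting."""
        return self.checkpoint.last_frame + 1
```

Flushing the checkpoint at notification time, rather than relying on the last periodic one, is what keeps the resume point at the exact frame where processing stopped.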
Health Monitoring
- Active health checks on every server at regular intervals
- Automatic load balancer updates on server failures
- Auto-recovery triggers for unresponsive servers
- Uptime tracking and availability reporting
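A common way to implement checks like these is to count consecutive failures per server and act only once a threshold is crossed, so a single dropped probe does not eject a healthy server. The sketch below assumes that approach; the threshold of three, the `HealthTracker` name, and the in-memory `pool` and `recovering` structures are illustrative and not taken from the case study.

```python
from dataclasses import dataclass, field


@dataclass
class HealthTracker:
    failure_threshold: int = 3                      # assumed value
    failures: dict = field(default_factory=dict)    # server -> consecutive failures
    pool: set = field(default_factory=set)          # load balancer backends
    recovering: list = field(default_factory=list)  # servers queued for recovery

    def record(self, server: str, healthy: bool) -> None:
        """Record one health-check result and update the backend pool."""
        if healthy:
            self.failures[server] = 0
            self.pool.add(server)  # (re)admit a server that answers again
            return
        self.failures[server] = self.failures.get(server, 0) + 1
        if self.failures[server] >= self.failure_threshold and server in self.pool:
            self.pool.discard(server)       # stop routing traffic to it
            self.recovering.append(server)  # trigger auto-recovery once
```

Checking `server in self.pool` before queuing recovery ensures a server that keeps failing is queued only once rather than on every subsequent probe.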
Key Features
- Dual Orchestrators: Independent scaling for ingestion and distribution clusters
- Zero Packet Drop: 5-phase graceful shutdown with AI worker migration
- Stable URLs: DNS-based routing ensures URLs never change during scaling
- AI Worker Reconnection: Checkpoint-based migration with ~3 second gap and zero frame loss
- Independent Scaling: Ingestion and distribution scale based on their own metrics
- Service Registry: Redis-based coordination for server status and stream mappings
- Health Monitoring: Active checks with automatic recovery
- Cost Optimization: Automatic scale-down during low-demand periods
Results
Technology Stack