Auto-Scaling RTSP Streaming Architecture with Dual Orchestrators & Zero Packet Drop
A surveillance platform needed to scale its video streaming infrastructure dynamically, handling anywhere from 10 to 200+ IP cameras with hundreds of concurrent viewers and AI processing workers, while guaranteeing zero packet loss during scaling operations and maintaining stable stream URLs that never change.
The Challenge
Fixed streaming infrastructure couldn't handle the variable demands of a growing surveillance platform:
- Scale Variability: Camera count and viewer demand fluctuated dramatically throughout the day (10x peak-to-trough ratio)
- Over-Provisioning Cost: Provisioning for peak load meant 70%+ idle resources during off-peak hours
- Packet Loss During Scaling: Adding or removing streaming servers caused stream interruptions, dropping frames for AI processing workers
- URL Instability: Cameras and viewers configured with specific server IPs needed reconfiguration when infrastructure changed
- Different Scaling Needs: Camera ingestion and viewer distribution had fundamentally different load patterns requiring independent scaling
- AI Worker Disruption: AI processing pipelines crashed when their source stream server was scaled down
Our Solution
We designed a dual-orchestrator auto-scaling streaming architecture with separate ingestion and distribution clusters, a 5-phase graceful shutdown for zero packet drop, stable DNS-based URLs, and automated AI worker reconnection.
Architecture
- Streaming Server: MediaMTX for RTSP/WebRTC/HLS protocol support
- Ingestion Cluster: 1-10 servers receiving camera RTSP streams
- Distribution Cluster: 2-20 servers serving viewers (WebRTC/HLS) and AI workers (RTSP)
- Dual Orchestrators: Independent scaling controllers for ingestion and distribution
- Load Balancers: Separate load balancers per cluster with protocol-appropriate algorithms
- Service Registry: Redis for server status, stream mappings, and coordination
- Health Monitoring: Active health checks with automated recovery
- DNS Layer: Stable domain names pointing to load balancers (URLs never change)
Dual Orchestrator Design
Why Two Orchestrators
Ingestion and distribution have fundamentally different scaling characteristics:
- Ingestion scales with camera count and inbound bandwidth (predictable, grows steadily)
- Distribution scales with viewer count and AI worker demand (bursty, unpredictable)
Separate orchestrators allow each to scale independently with specialized policies, metrics, and thresholds, without one cluster's scaling decisions affecting the other.
Ingestion Orchestrator
- Primary Metric: Camera connections per server
- Secondary Metric: Inbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or cameras per server exceed capacity
- Scale Down: When utilization drops below threshold for a sustained stabilization period
- Server Range: 1 to 10 servers
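As a rough illustration of the ingestion policy above, the sketch below derives a target server count from the number of connected cameras. The per-server capacity of 25 cameras and the function name `target_ingestion_servers` are assumptions made for this example; the case study states only the 1 to 10 server range.

```python
import math

MIN_INGESTION_SERVERS = 1   # lower bound of the stated server range
MAX_INGESTION_SERVERS = 10  # upper bound of the stated server range
CAMERAS_PER_SERVER = 25     # assumed capacity, not taken from the case study


def target_ingestion_servers(camera_count: int,
                             cameras_per_server: int = CAMERAS_PER_SERVER) -> int:
    """Return how many ingestion servers the given camera count calls for."""
    if camera_count < 0:
        raise ValueError("camera_count must be non-negative")
    needed = math.ceil(camera_count / cameras_per_server)
    # Clamp to the configured range so the cluster never drops to zero
    # servers and never grows past its ceiling.
    return max(MIN_INGESTION_SERVERS, min(MAX_INGESTION_SERVERS, needed))
```

Under these assumed numbers, 200 cameras would call for 8 servers, and any count above 250 would be held at the 10-server ceiling.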
Distribution Orchestrator
- Primary Metric: Viewer + AI worker connections per server
- Secondary Metric: Outbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or connections per server exceed capacity
- Scale Down: When utilization drops below threshold for a sustained period (longer stabilization than ingestion)
- Server Range: 2 to 20 servers (minimum 2 for high availability)
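The two policies above share the same shape: scale up immediately on high load, scale down only after load has stayed low for a stabilization period. A minimal sketch of that decision loop follows. The class names, the 0.75 and 0.30 utilization thresholds, and the 300 and 600 second windows are all assumed values chosen for illustration; the source gives only the server ranges and notes that distribution uses a longer stabilization period than ingestion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingPolicy:
    min_servers: int
    max_servers: int
    scale_up_threshold: float     # utilization fraction that triggers scale-up
    scale_down_threshold: float   # utilization fraction that allows scale-down
    stabilization_seconds: float  # how long low load must persist


class Orchestrator:
    def __init__(self, policy: ScalingPolicy, servers: int):
        self.policy = policy
        self.servers = servers
        self._low_since: Optional[float] = None  # when utilization first went low

    def decide(self, utilization: float, now: float) -> int:
        """Return the new server count for one observation taken at time `now`."""
        p = self.policy
        if utilization > p.scale_up_threshold:
            # High load: add a server right away and forget any low-load streak.
            self._low_since = None
            self.servers = min(p.max_servers, self.servers + 1)
        elif utilization < p.scale_down_threshold:
            if self._low_since is None:
                self._low_since = now
            elif now - self._low_since >= p.stabilization_seconds:
                self.servers = max(p.min_servers, self.servers - 1)
                self._low_since = now  # restart the window after each removal
        else:
            # Load is in the normal band: cancel any pending scale-down.
            self._low_since = None
        return self.servers


# Two independent controllers; thresholds and windows are assumed values.
ingestion = Orchestrator(ScalingPolicy(1, 10, 0.75, 0.30, 300), servers=1)
distribution = Orchestrator(ScalingPolicy(2, 20, 0.75, 0.30, 600), servers=2)
```

Because each controller holds its own policy and its own low-load timer, a burst of viewers can grow the distribution cluster without touching the ingestion cluster.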
Zero Packet Drop: 5-Phase Graceful Shutdown
When a distribution server is scheduled for removal, a 5-phase process ensures no frames are lost:
Phase 1: Pre-Notification
Server marked as "DRAINING" in the service registry. Load balancer weight reduced so new connections route elsewhere. Redis pub/sub notifications and webhooks alert AI workers to prepare for migration.
Phase 2: Load Balancer Update
Server removed from the load balancer backend pool. No new connections can reach the draining server. Existing connections continue uninterrupted.
Phase 3: AI Worker Migration
AI workers disconnect from the draining server and reconnect to healthy distribution servers. Checkpoint-based state preservation ensures processing resumes from the exact frame where it left off. Total gap: approximately 3 seconds with zero frames lost.
Phase 4: Viewer Draining
Remaining viewer connections drain naturally over a configurable window. Modern video players auto-reconnect to the same stable URL, which routes to healthy servers. Most viewers experience no interruption.
Phase 5: Cleanup
Verify all connections have closed. Remove server from the service registry. Destroy the cloud instance. Record scaling metrics.
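The ordering of the five phases can be sketched with an in-memory model. Everything here is a stand-in: the `Cluster` class replaces the real registry, load balancer, and connections, and viewer draining is simulated as an immediate reconnect rather than a timed window. All names are hypothetical and chosen for this example only.

```python
from enum import Enum
from typing import Dict, List, Set


class ServerState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    OFFLINE = "offline"


class Cluster:
    """In-memory stand-in for the registry, load balancer, and connections."""

    def __init__(self, servers: List[str]):
        self.state: Dict[str, ServerState] = {s: ServerState.ACTIVE for s in servers}
        self.lb_pool: Set[str] = set(servers)
        self.ai_workers: Dict[str, str] = {}  # worker id -> server
        self.viewers: Dict[str, str] = {}     # viewer id -> server
        self.events: List[str] = []           # stands in for pub/sub and webhooks

    def healthy_target(self) -> str:
        if not self.lb_pool:
            raise RuntimeError("no healthy server left to migrate to")
        return sorted(self.lb_pool)[0]


def graceful_shutdown(cluster: Cluster, server: str) -> List[str]:
    """Run the five phases in order and return a log of completed phases."""
    log = []
    # Phase 1: mark as draining and notify AI workers.
    cluster.state[server] = ServerState.DRAINING
    cluster.events.append(f"draining:{server}")
    log.append("pre-notification")
    # Phase 2: remove from the backend pool; existing connections stay up.
    cluster.lb_pool.discard(server)
    log.append("load-balancer-update")
    # Phase 3: move AI workers to a server that is still in the pool.
    for worker, host in cluster.ai_workers.items():
        if host == server:
            cluster.ai_workers[worker] = cluster.healthy_target()
    log.append("ai-worker-migration")
    # Phase 4: viewers reconnect via the stable URL (simulated as immediate).
    for viewer, host in cluster.viewers.items():
        if host == server:
            cluster.viewers[viewer] = cluster.healthy_target()
    log.append("viewer-draining")
    # Phase 5: verify nothing is left on the server, then deregister it.
    connections = list(cluster.ai_workers.items()) + list(cluster.viewers.items())
    remaining = [client for client, host in connections if host == server]
    if remaining:
        raise RuntimeError(f"connections still open: {remaining}")
    del cluster.state[server]
    log.append("cleanup")
    return log
```

The key property the sketch preserves is ordering: the server leaves the load balancer pool before any client is moved, so a migrating client can never be routed back to the server it is leaving.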
Stable URLs
The URL architecture ensures cameras and clients never need reconfiguration:
- Camera publish target: A stable ingestion domain name
- Viewer/AI access target: A stable distribution domain name
- DNS records point to load balancer IPs (which are permanent)
- Load balancers handle routing to backend servers transparently
- Backend servers can be added, removed, or replaced without URL changes
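The separation between a fixed public URL and a changing backend pool can be illustrated with a small model. The `StableEndpoint` class is hypothetical, and round-robin is used only as a placeholder since the case study does not name the balancing algorithms.

```python
import itertools
from typing import List


class StableEndpoint:
    """A fixed public URL in front of a backend pool that can change freely."""

    def __init__(self, url: str, backends: List[str]):
        self.url = url  # what cameras and viewers are configured with
        self._backends = list(backends)
        self._counter = itertools.count()

    def add_backend(self, host: str) -> None:
        if host not in self._backends:
            self._backends.append(host)

    def remove_backend(self, host: str) -> None:
        self._backends.remove(host)

    def route(self) -> str:
        """Pick a backend for a new connection (round-robin as a placeholder)."""
        if not self._backends:
            raise RuntimeError("no backends available")
        return self._backends[next(self._counter) % len(self._backends)]
```

For instance, an endpoint created with a made-up URL such as `rtsp://distribution.example.com:8554` keeps that `url` value no matter how many times `add_backend` and `remove_backend` are called; only the result of `route()` changes.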
Service Registry (Redis)
A centralized Redis instance coordinates the entire system:
- Server status tracking (active, draining, offline)
- Stream-to-server mapping (which camera is on which ingestion server)
- AI worker state and checkpoint data
- Load metrics per server for scaling decisions
- Pub/sub channels for real-time coordination events
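One way to picture the registry's contents is as a set of hashes plus pub/sub channels. The sketch below uses a dict-backed stand-in so it runs without a Redis server; the key names (`server:<id>`, `streams`, `worker:<id>`) and the method names are hypothetical, as the case study does not describe its actual key schema.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryRegistry:
    """Dict-backed stand-in for the Redis registry; key names are hypothetical."""

    def __init__(self):
        self.hashes: Dict[str, Dict[str, str]] = defaultdict(dict)
        self.subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def set_server(self, server_id: str, status: str, connections: int) -> None:
        """Record server status and a load metric, one hash per server."""
        self.hashes[f"server:{server_id}"].update(
            {"status": status, "connections": str(connections)})

    def map_stream(self, camera_id: str, server_id: str) -> None:
        """Record which ingestion server holds which camera stream."""
        self.hashes["streams"][camera_id] = server_id

    def save_checkpoint(self, worker_id: str, frame: int) -> None:
        """Store the last checkpointed frame for an AI worker."""
        self.hashes[f"worker:{worker_id}"]["checkpoint_frame"] = str(frame)

    def subscribe(self, channel: str, callback: Callable[[str], None]) -> None:
        self.subscribers[channel].append(callback)

    def publish(self, channel: str, message: str) -> int:
        """Deliver a message to every subscriber; return how many received it."""
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])
```

Values are stored as strings to mirror how a Redis hash would hold them, so a caller reading `connections` back would need to convert it to an integer.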
AI Client Reconnection
An AI client library provides seamless reconnection:
- Listens for server removal notifications via Redis pub/sub
- Automatic frame checkpointing at regular intervals
- Reconnection to a healthy distribution server on notification
- Resume processing from checkpoint with minimal gap
- Metrics reporting for reconnection events
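The reconnection behaviour listed above can be sketched as a worker that checkpoints periodically, takes one final checkpoint when a removal notice arrives, and then reopens the stream on the new server from that position. This is an illustration under strong assumptions: `open_stream` is a hypothetical callable that can serve frames from a given frame number, and how the real system makes earlier frames available after a reconnect is not described in the source.

```python
from typing import Callable, Iterator, List, Optional


class CheckpointingWorker:
    """Processes frames and resumes from the last checkpoint after a migration."""

    def __init__(self, open_stream: Callable[[str, int], Iterator[int]],
                 server: str, checkpoint_every: int = 5):
        self.open_stream = open_stream  # hypothetical: (server, start_frame) -> frames
        self.server = server
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0             # next frame to process after a resume
        self.processed: List[int] = []
        self.reconnects = 0             # simple metric for reconnection events
        self._pending_target: Optional[str] = None

    def on_server_removal(self, new_server: str) -> None:
        """Notification handler; would be wired to the pub/sub channel."""
        self._pending_target = new_server

    def run(self) -> None:
        """Consume the stream until it ends, migrating whenever notified."""
        while True:
            migrated = False
            for frame in self.open_stream(self.server, self.checkpoint):
                self.processed.append(frame)
                position = frame + 1
                if position % self.checkpoint_every == 0:
                    self.checkpoint = position  # regular checkpoint
                if self._pending_target is not None:
                    self.checkpoint = position  # final checkpoint before leaving
                    self.server, self._pending_target = self._pending_target, None
                    self.reconnects += 1
                    migrated = True
                    break
            if not migrated:
                return  # the stream ended without a migration request
```

Taking a final checkpoint at the moment of migration is what keeps the sketch free of both gaps and duplicates; with periodic checkpoints alone, frames processed since the last checkpoint would be replayed after the reconnect.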
Health Monitoring
- Active health checks on every server at regular intervals
- Automatic load balancer updates on server failures
- Auto-recovery triggers for unresponsive servers
- Uptime tracking and availability reporting
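A minimal sketch of the health-check loop above might look like the following. The `probe` callable, the threshold of three consecutive failures, and the class name are assumptions for illustration; the source does not state check intervals or failure thresholds.

```python
from typing import Callable, Dict, List, Set


class HealthMonitor:
    def __init__(self, servers: List[str], probe: Callable[[str], bool],
                 failure_threshold: int = 3):
        self.probe = probe  # hypothetical: returns True if the server responds
        self.failure_threshold = failure_threshold  # assumed value
        self.failures: Dict[str, int] = {s: 0 for s in servers}
        self.checks: Dict[str, int] = {s: 0 for s in servers}
        self.successes: Dict[str, int] = {s: 0 for s in servers}
        self.in_rotation: Set[str] = set(servers)  # stands in for the LB pool
        self.recovery_requests: List[str] = []     # servers flagged for recovery

    def check_all(self) -> None:
        """Probe every server once and update rotation and recovery state."""
        for server in self.failures:
            self.checks[server] += 1
            if self.probe(server):
                self.successes[server] += 1
                self.failures[server] = 0
                self.in_rotation.add(server)
            else:
                self.failures[server] += 1
                # Act only when the threshold is first reached, so a server
                # that stays down is not flagged for recovery repeatedly.
                if self.failures[server] == self.failure_threshold:
                    self.in_rotation.discard(server)
                    self.recovery_requests.append(server)

    def availability(self, server: str) -> float:
        """Fraction of checks that succeeded (1.0 before any check has run)."""
        total = self.checks[server]
        return self.successes[server] / total if total else 1.0
```

Requiring several consecutive failures before acting is a common way to avoid pulling a server out of rotation over a single dropped probe.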
Key Features
- Dual Orchestrators: Independent scaling for ingestion and distribution clusters
- Zero Packet Drop: 5-phase graceful shutdown with AI worker migration
- Stable URLs: DNS-based routing ensures URLs never change during scaling
- AI Worker Reconnection: Checkpoint-based migration with ~3 second gap and zero frame loss
- Independent Scaling: Ingestion and distribution scale based on their own metrics
- Service Registry: Redis-based coordination for server status and stream mappings
- Health Monitoring: Active checks with automatic recovery
- Cost Optimization: Automatic scale-down during low-demand periods
Results
Technology Stack