Auto-Scaling RTSP Streaming Architecture with Dual Orchestrators & Zero Packet Drop
A surveillance platform needed to scale its video streaming infrastructure dynamically — handling anywhere from 10 to 200+ IP cameras with hundreds of concurrent viewers and AI processing workers — while guaranteeing zero packet loss during scaling operations and providing stream URLs that never change.
The Challenge
Fixed streaming infrastructure couldn't handle the variable demands of a growing surveillance platform:
- Scale Variability — Camera count and viewer demand fluctuated dramatically throughout the day (10x peak-to-trough ratio)
- Over-Provisioning Cost — Provisioning for peak load meant 70%+ idle resources during off-peak hours
- Packet Loss During Scaling — Adding or removing streaming servers caused stream interruptions, dropping frames for AI processing workers
- URL Instability — Cameras and viewers configured with specific server IPs needed reconfiguration when infrastructure changed
- Different Scaling Needs — Camera ingestion and viewer distribution had fundamentally different load patterns requiring independent scaling
- AI Worker Disruption — AI processing pipelines crashed when their source stream server was scaled down
Our Solution
We designed a dual-orchestrator auto-scaling streaming architecture with separate ingestion and distribution clusters, a 5-phase graceful shutdown for zero packet drop, stable DNS-based URLs, and automated AI worker reconnection.
Architecture
- Streaming Server: MediaMTX for RTSP/WebRTC/HLS protocol support
- Ingestion Cluster: 1-10 servers receiving camera RTSP streams
- Distribution Cluster: 2-20 servers serving viewers (WebRTC/HLS) and AI workers (RTSP)
- Dual Orchestrators: Independent scaling controllers for ingestion and distribution
- Load Balancers: Separate load balancers per cluster with protocol-appropriate algorithms
- Service Registry: Redis for server status, stream mappings, and coordination
- Health Monitoring: Active health checks with automated recovery
- DNS Layer: Stable domain names pointing to load balancers (URLs never change)
Dual Orchestrator Design
Why Two Orchestrators
Ingestion and distribution have fundamentally different scaling characteristics:
- Ingestion scales with camera count and inbound bandwidth (predictable, grows steadily)
- Distribution scales with viewer count and AI worker demand (bursty, unpredictable)
Separate orchestrators allow each to scale independently with specialized policies, metrics, and thresholds — without one cluster's scaling decisions affecting the other.
Ingestion Orchestrator
- Primary Metric: Camera connections per server
- Secondary Metric: Inbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or cameras per server exceeds capacity
- Scale Down: When utilization drops below threshold for a sustained stabilization period
- Server Range: 1 to 10 servers
Distribution Orchestrator
- Primary Metric: Viewer + AI worker connections per server
- Secondary Metric: Outbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or connections per server exceed capacity
- Scale Down: When utilization drops below threshold for a sustained period (longer stabilization than ingestion)
- Server Range: 2 to 20 servers (minimum 2 for high availability)
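The two policies above can be captured as a single pure decision function parameterized per cluster. This is a minimal sketch, not the production controller: the threshold values, the `ScalePolicy` fields, and the `decide` interface are all illustrative assumptions, chosen only to mirror the ranges and metrics listed above.

```python
from dataclasses import dataclass

@dataclass
class ScalePolicy:
    """Scaling policy for one cluster. All threshold values below are
    illustrative placeholders, not the real production settings."""
    min_servers: int
    max_servers: int
    scale_up_cpu: float      # scale up above this CPU fraction
    scale_up_load: float     # ...or above this much load per server
    scale_down_load: float   # scale down below this load per server
    stabilization_s: int     # sustained low load required before scale-down

# Hypothetical policies mirroring the two orchestrators described above.
# Ingestion load is measured in cameras; distribution load in connections.
INGESTION = ScalePolicy(1, 10, 0.75, 8.0, 3.0, 300)
DISTRIBUTION = ScalePolicy(2, 20, 0.75, 50.0, 15.0, 600)  # longer stabilization

def decide(policy, servers, cpu, load_per_server, low_load_for_s):
    """Return +1 (scale up), -1 (scale down), or 0 (hold)."""
    if servers < policy.max_servers and (
            cpu > policy.scale_up_cpu or load_per_server > policy.scale_up_load):
        return +1
    if (servers > policy.min_servers
            and load_per_server < policy.scale_down_load
            and low_load_for_s >= policy.stabilization_s):
        return -1
    return 0
```

Because `min_servers` is 2 for distribution, the function never drains the cluster below its high-availability floor, no matter how low the load falls.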
Zero Packet Drop: 5-Phase Graceful Shutdown
When a distribution server is scheduled for removal, a 5-phase process ensures no frames are lost:
Phase 1: Pre-Notification
Server marked as "DRAINING" in the service registry. Load balancer weight reduced so new connections route elsewhere. Redis pub/sub notifications and webhooks alert AI workers to prepare for migration.
Phase 2: Load Balancer Update
Server removed from the load balancer backend pool. No new connections can reach the draining server. Existing connections continue uninterrupted.
Phase 3: AI Worker Migration
AI workers disconnect from the draining server and reconnect to healthy distribution servers. Checkpoint-based state preservation ensures processing resumes from the exact frame where it left off. Total gap: approximately 3 seconds with zero frames lost.
Phase 4: Viewer Draining
Remaining viewer connections drain naturally over a configurable window. Modern video players auto-reconnect to the same stable URL, which routes to healthy servers. Most viewers experience no interruption.
Phase 5: Cleanup
Verify all connections have closed. Remove server from the service registry. Destroy the cloud instance. Record scaling metrics.
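The five phases above can be sketched as a single drain routine. This is a simplified outline under assumed interfaces: `registry`, `lb`, and `notify` are hypothetical stand-ins for the Redis registry, the load balancer API, and the pub/sub notifier, not the real system's calls.

```python
import time

def graceful_shutdown(server_id, registry, lb, notify, drain_window_s=30):
    """Sketch of the 5-phase zero-packet-drop drain. All interfaces here
    (registry, lb, notify) are illustrative assumptions."""
    # Phase 1: mark DRAINING and warn AI workers before any routing change.
    registry.set_status(server_id, "DRAINING")
    notify("server.draining", server_id)

    # Phase 2: remove from the backend pool; existing sessions keep flowing.
    lb.remove_backend(server_id)

    # Phase 3: wait until AI workers have migrated to healthy servers.
    while registry.ai_workers_on(server_id) > 0:
        time.sleep(0.01)

    # Phase 4: let remaining viewers drain within the configured window.
    deadline = time.monotonic() + drain_window_s
    while registry.viewers_on(server_id) > 0 and time.monotonic() < deadline:
        time.sleep(0.01)

    # Phase 5: deregister the server and destroy the cloud instance.
    registry.remove(server_id)
    return "terminated"
```

The key ordering property is that routing changes (Phases 1–2) complete before any existing connection is touched, so in-flight frames are never dropped.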
Stable URLs
The URL architecture ensures cameras and clients never need reconfiguration:
- Camera publish target: A stable ingestion domain name
- Viewer/AI access target: A stable distribution domain name
- DNS records point to load balancer IPs (which are permanent)
- Load balancers handle routing to backend servers transparently
- Backend servers can be added, removed, or replaced without URL changes
Service Registry (Redis)
A centralized Redis instance coordinates the entire system:
- Server status tracking (active, draining, offline)
- Stream-to-server mapping (which camera is on which ingestion server)
- AI worker state and checkpoint data
- Load metrics per server for scaling decisions
- Pub/sub channels for real-time coordination events
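A plausible shape for the registry's Redis data is sketched below. The key names and field layout are assumptions for illustration; `r` is any redis-py-compatible client, and only standard redis-py calls (`hset`, `set`, `publish`) are used.

```python
import json

# Hypothetical key layout for the service registry.
def server_key(server_id):
    return f"servers:{server_id}"

def stream_key(camera_id):
    return f"streams:{camera_id}"

def register_server(r, server_id, role, status="active"):
    """Store server state (role, status) as a Redis hash."""
    r.hset(server_key(server_id), mapping={"role": role, "status": status})

def map_stream(r, camera_id, server_id):
    """Record which ingestion server carries a camera's stream."""
    r.set(stream_key(camera_id), server_id)

def publish_event(r, channel, payload):
    """Broadcast a coordination event (e.g. server.draining) via pub/sub."""
    r.publish(channel, json.dumps(payload))
```

With this layout, the orchestrators read load metrics and statuses from the hashes, while AI workers subscribe to the event channel for migration notices.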
AI Client Reconnection
An AI client library provides seamless reconnection:
- Listens for server removal notifications via Redis pub/sub
- Automatic frame checkpointing at regular intervals
- Reconnection to a healthy distribution server on notification
- Resume processing from checkpoint with minimal gap
- Metrics reporting for reconnection events
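The reconnection flow can be sketched as follows. This is not the real client library's API: `connect` is a hypothetical stand-in for opening an RTSP session, and the event format is an assumption modeled on the pub/sub notifications described above.

```python
import json

class AIStreamClient:
    """Sketch of checkpoint-based reconnection. The `connect` callable and
    the event schema are illustrative assumptions, not the real API."""

    def __init__(self, stream_url, connect):
        self.stream_url = stream_url      # stable distribution URL
        self.connect = connect
        self.checkpoint = 0               # last fully processed frame number
        self.session = connect(stream_url, start_frame=0)

    def process(self, frames):
        for frame_no in frames:
            # ...run inference on the frame here...
            self.checkpoint = frame_no    # checkpoint after each frame

    def on_event(self, message):
        """Handle a pub/sub notification that a server is draining."""
        event = json.loads(message)
        if event.get("type") == "server.draining":
            # Reconnect through the same stable URL; the load balancer now
            # routes to a healthy server. Resume just past the checkpoint.
            self.session = self.connect(
                self.stream_url, start_frame=self.checkpoint + 1)
```

Because the checkpoint advances only after a frame is fully processed, resuming at `checkpoint + 1` skips nothing and repeats nothing, which is what makes the migration gap lossless.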
Health Monitoring
- Active health checks on every server at regular intervals
- Automatic load balancer updates on server failures
- Auto-recovery triggers for unresponsive servers
- Uptime tracking and availability reporting
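One monitoring pass might look like the sketch below. The `check`, `lb`, and `recover` interfaces and the failure threshold are assumptions for illustration; the real system's probes and recovery hooks may differ.

```python
def run_health_sweep(servers, check, lb, recover, max_failures=3):
    """One health-monitoring pass over all servers.

    `check(server_id)` probes a server, `lb.remove_backend` pulls it from
    rotation, and `recover(server_id)` triggers auto-recovery. All three
    are hypothetical interfaces used only for this sketch.
    """
    for srv in servers:
        if check(srv["id"]):
            srv["failures"] = 0          # healthy: reset the failure count
            continue
        srv["failures"] = srv.get("failures", 0) + 1
        # Pull a failing server out of rotation immediately...
        lb.remove_backend(srv["id"])
        # ...and trigger recovery once failures persist past the threshold.
        if srv["failures"] >= max_failures:
            recover(srv["id"])
```

Removing a failing server from the load balancer on the first missed check, while deferring recovery until failures persist, keeps traffic off flaky servers without restarting them for a single transient blip.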
Key Features
- Dual Orchestrators — Independent scaling for ingestion and distribution clusters
- Zero Packet Drop — 5-phase graceful shutdown with AI worker migration
- Stable URLs — DNS-based routing ensures URLs never change during scaling
- AI Worker Reconnection — Checkpoint-based migration with ~3 second gap and zero frame loss
- Independent Scaling — Ingestion and distribution scale based on their own metrics
- Service Registry — Redis-based coordination for server status and stream mappings
- Health Monitoring — Active checks with automatic recovery
- Cost Optimization — Automatic scale-down during low-demand periods