Auto-Scaling RTSP Streaming Architecture with Dual Orchestrators & Zero Packet Drop
A surveillance platform needed to scale its video streaming infrastructure dynamically — handling anywhere from 10 to 200+ IP cameras with hundreds of concurrent viewers and AI processing workers — while guaranteeing zero packet loss during scaling operations and maintaining stable stream URLs that never change.
The Challenge
Fixed streaming infrastructure couldn't handle the variable demands of a growing surveillance platform:
- Scale Variability — Camera count and viewer demand fluctuated dramatically throughout the day (10x peak-to-trough ratio)
- Over-Provisioning Cost — Provisioning for peak load meant 70%+ idle resources during off-peak hours
- Packet Loss During Scaling — Adding or removing streaming servers caused stream interruptions, dropping frames for AI processing workers
- URL Instability — Cameras and viewers configured with specific server IPs needed reconfiguration when infrastructure changed
- Different Scaling Needs — Camera ingestion and viewer distribution had fundamentally different load patterns requiring independent scaling
- AI Worker Disruption — AI processing pipelines crashed when their source stream server was scaled down
Our Solution
We designed a dual-orchestrator auto-scaling streaming architecture with separate ingestion and distribution clusters, a 5-phase graceful shutdown for zero packet drop, stable DNS-based URLs, and automated AI worker reconnection.
Architecture
- Streaming Server: MediaMTX for RTSP/WebRTC/HLS protocol support
- Ingestion Cluster: 1-10 servers receiving camera RTSP streams
- Distribution Cluster: 2-20 servers serving viewers (WebRTC/HLS) and AI workers (RTSP)
- Dual Orchestrators: Independent scaling controllers for ingestion and distribution
- Load Balancers: Separate load balancers per cluster with protocol-appropriate algorithms
- Service Registry: Redis for server status, stream mappings, and coordination
- Health Monitoring: Active health checks with automated recovery
- DNS Layer: Stable domain names pointing to load balancers (URLs never change)
Dual Orchestrator Design
Why Two Orchestrators
Ingestion and distribution have fundamentally different scaling characteristics:
- Ingestion scales with camera count and inbound bandwidth (predictable, grows steadily)
- Distribution scales with viewer count and AI worker demand (bursty, unpredictable)
Separate orchestrators allow each to scale independently with specialized policies, metrics, and thresholds — without one cluster's scaling decisions affecting the other.
Ingestion Orchestrator
- Primary Metric: Camera connections per server
- Secondary Metric: Inbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or camera connections per server exceed capacity
- Scale Down: When utilization drops below threshold for a sustained stabilization period
- Server Range: 1 to 10 servers
Distribution Orchestrator
- Primary Metric: Viewer + AI worker connections per server
- Secondary Metric: Outbound bandwidth utilization
- Scale Up: When CPU exceeds threshold or connections per server exceed capacity
- Scale Down: When utilization drops below threshold for a sustained period (longer stabilization than ingestion)
- Server Range: 2 to 20 servers (minimum 2 for high availability)
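The two policies above can be sketched as one decision function driven by two parameter sets. This is a minimal illustration, not the platform's actual controller: the class names, the one-server-per-step behaviour, and every numeric threshold (capacities, CPU limits, stabilization windows) are assumptions chosen only to show the shape of the logic. Only the server ranges and the longer stabilization window for distribution come from the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScalingPolicy:
    """Per-cluster thresholds. All numeric values used below are illustrative assumptions."""
    min_servers: int
    max_servers: int
    max_conns_per_server: int   # cameras, or viewers + AI workers, per server
    cpu_high: float             # scale up above this CPU fraction
    util_low: float             # scale down below this utilization fraction
    stabilization_s: float      # how long utilization must stay low before scaling down


@dataclass
class Orchestrator:
    policy: ScalingPolicy
    servers: int
    _low_since: Optional[float] = field(default=None, repr=False)

    def decide(self, now: float, connections: int, cpu: float) -> int:
        """Return the desired server count for one metrics sample."""
        p = self.policy
        per_server = connections / self.servers
        if cpu > p.cpu_high or per_server > p.max_conns_per_server:
            # Overloaded: add one server immediately and reset the low-load timer.
            self._low_since = None
            self.servers = min(p.max_servers, self.servers + 1)
        elif per_server / p.max_conns_per_server < p.util_low and cpu < p.util_low:
            if self._low_since is None:
                self._low_since = now          # start the stabilization window
            elif now - self._low_since >= p.stabilization_s:
                self._low_since = now          # restart the window after each step down
                self.servers = max(p.min_servers, self.servers - 1)
        else:
            self._low_since = None             # load is normal: cancel any pending scale-down
        return self.servers


# Same logic, different parameters per cluster (values are hypothetical).
INGESTION = ScalingPolicy(1, 10, max_conns_per_server=25, cpu_high=0.75,
                          util_low=0.30, stabilization_s=300)
DISTRIBUTION = ScalingPolicy(2, 20, max_conns_per_server=100, cpu_high=0.75,
                             util_low=0.30, stabilization_s=600)
```

Keeping the policy as data means each orchestrator can be tuned without touching the other, which is the point of running two of them.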
Zero Packet Drop: 5-Phase Graceful Shutdown
When a distribution server is scheduled for removal, a 5-phase process ensures no frames are lost:
Phase 1: Pre-Notification
Server marked as "DRAINING" in the service registry. Load balancer weight reduced so new connections route elsewhere. Redis pub/sub notifications and webhooks alert AI workers to prepare for migration.
Phase 2: Load Balancer Update
Server removed from the load balancer backend pool. No new connections can reach the draining server. Existing connections continue uninterrupted.
Phase 3: AI Worker Migration
AI workers disconnect from the draining server and reconnect to healthy distribution servers. Checkpoint-based state preservation ensures processing resumes from the exact frame where it left off. Total gap: approximately 3 seconds with zero frames lost.
Phase 4: Viewer Draining
Remaining viewer connections drain naturally over a configurable window. Modern video players auto-reconnect to the same stable URL, which routes to healthy servers. Most viewers experience no interruption.
Phase 5: Cleanup
Verify all connections have closed. Remove server from the service registry. Destroy the cloud instance. Record scaling metrics.
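The ordering of the five phases can be sketched as a small sequential runner. This is an illustration under stated assumptions: the `Phase` enum, the `graceful_shutdown` function, and the in-memory dictionaries standing in for the registry, the load balancer pool, and connection counts are all hypothetical. A real implementation would call the load balancer and cloud provider instead.

```python
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    PRE_NOTIFICATION = 1
    LOAD_BALANCER_UPDATE = 2
    AI_WORKER_MIGRATION = 3
    VIEWER_DRAINING = 4
    CLEANUP = 5


def graceful_shutdown(server_id: str,
                      steps: Dict[Phase, Callable[[str], bool]]) -> List[Phase]:
    """Run the phases in order and stop at the first one that reports failure.

    Returns the completed phases so the caller can see where the process stopped.
    """
    completed: List[Phase] = []
    for phase in Phase:  # Enum iteration follows definition order
        if not steps[phase](server_id):
            break
        completed.append(phase)
    return completed


# Hypothetical in-memory stand-ins for the registry, LB pool, and connection counts.
registry = {"dist-3": "ACTIVE"}
lb_pool = {"dist-3", "dist-4"}
open_connections = {"dist-3": 0}


def mark_draining(sid: str) -> bool:
    registry[sid] = "DRAINING"
    return True


def remove_from_pool(sid: str) -> bool:
    lb_pool.discard(sid)
    return True


def cleanup(sid: str) -> bool:
    if open_connections[sid] != 0:   # refuse to destroy a server that still has clients
        return False
    del registry[sid]
    return True


steps = {
    Phase.PRE_NOTIFICATION: mark_draining,
    Phase.LOAD_BALANCER_UPDATE: remove_from_pool,
    Phase.AI_WORKER_MIGRATION: lambda sid: True,   # placeholder: wait for workers to move
    Phase.VIEWER_DRAINING: lambda sid: True,       # placeholder: wait out the drain window
    Phase.CLEANUP: cleanup,
}
done = graceful_shutdown("dist-3", steps)
```

Stopping at the first failed phase keeps the server alive if, for example, connections are still open at cleanup time, so the process can be retried rather than dropping clients.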
Stable URLs
The URL architecture ensures cameras and clients never need reconfiguration:
- Camera publish target: A stable ingestion domain name
- Viewer/AI access target: A stable distribution domain name
- DNS records point to load balancer IPs (which are permanent)
- Load balancers handle routing to backend servers transparently
- Backend servers can be added, removed, or replaced without URL changes
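As a rough illustration of why the public URL can stay fixed, the sketch below models one stable endpoint in front of a changeable backend pool. The `StableEndpoint` class, the example hostname, the backend addresses, and the least-connections selection rule are all assumptions. The description only says the load balancers use protocol-appropriate algorithms.

```python
from typing import Dict


class StableEndpoint:
    """One fixed public URL in front of a backend pool that can change at any time."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.backends: Dict[str, int] = {}   # backend address -> open connections

    def add_backend(self, addr: str) -> None:
        self.backends.setdefault(addr, 0)

    def remove_backend(self, addr: str) -> None:
        self.backends.pop(addr, None)

    def connect(self) -> str:
        """Route a new connection to the backend with the fewest open connections."""
        if not self.backends:
            raise RuntimeError("no healthy backends")
        # Ties are broken by address so the choice is deterministic.
        addr = min(self.backends, key=lambda a: (self.backends[a], a))
        self.backends[addr] += 1
        return addr


# Hypothetical hostname and private addresses, for illustration only.
endpoint = StableEndpoint("rtsp://view.example.com/cam1")
endpoint.add_backend("10.0.0.1")
endpoint.add_backend("10.0.0.2")
first = endpoint.connect()
second = endpoint.connect()
endpoint.remove_backend("10.0.0.1")   # scale-down: the pool changes, the URL does not
third = endpoint.connect()
```

Clients only ever see `endpoint.url`. Which backend serves them is decided per connection.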
Service Registry (Redis)
A centralized Redis instance coordinates the entire system:
- Server status tracking (active, draining, offline)
- Stream-to-server mapping (which camera is on which ingestion server)
- AI worker state and checkpoint data
- Load metrics per server for scaling decisions
- Pub/sub channels for real-time coordination events
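The registry's responsibilities can be sketched with a dict-backed stand-in, so the example runs without a Redis server. The class name, key names, and channel name are hypothetical. The comments note one possible Redis command for each operation, not the layout the platform actually uses.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

VALID_STATUSES = {"active", "draining", "offline"}


class InMemoryRegistry:
    """Dict-backed stand-in for the Redis registry (illustrative only)."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}             # e.g. HSET server:<id> status <value>
        self.stream_to_server: Dict[str, str] = {}   # e.g. HSET streams <camera> <server>
        self.subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self.subscribers[channel].append(handler)    # e.g. SUBSCRIBE <channel>

    def publish(self, channel: str, message: str) -> None:
        for handler in self.subscribers[channel]:    # e.g. PUBLISH <channel> <message>
            handler(message)

    def set_status(self, server_id: str, status: str) -> None:
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.status[server_id] = status
        # Every status change is announced so workers can react in real time.
        self.publish("server-events", f"{server_id}:{status}")

    def assign_stream(self, camera_id: str, server_id: str) -> None:
        self.stream_to_server[camera_id] = server_id

    def active_servers(self) -> List[str]:
        return sorted(s for s, st in self.status.items() if st == "active")
```

A worker would subscribe to the events channel and treat a `draining` message for its current server as the signal to migrate.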
AI Client Reconnection
An AI client library provides seamless reconnection:
- Listens for server removal notifications via Redis pub/sub
- Automatic frame checkpointing at regular intervals
- Reconnection to a healthy distribution server on notification
- Resume processing from checkpoint with minimal gap
- Metrics reporting for reconnection events
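The checkpoint-and-resume behaviour listed above might look roughly like the sketch below. It rests on a strong assumption that the description does not confirm: frames carry increasing sequence numbers and the new server can still serve frames after a given sequence number (for example from a short buffer). The class, method, and parameter names are hypothetical, and the servers are modelled as plain lists of sequence numbers.

```python
from typing import Dict, List


class CheckpointingWorker:
    def __init__(self, servers: Dict[str, List[int]], server_id: str,
                 checkpoint_every: int = 5) -> None:
        self.servers = servers          # hypothetical: server id -> buffered frame sequence numbers
        self.server_id = server_id
        self.checkpoint_every = checkpoint_every
        self.checkpoint = -1            # last sequence number recorded as a checkpoint
        self.last_seq = -1              # last sequence number processed
        self.processed: List[int] = []
        self.reconnects = 0             # simple metric for reconnection events

    def process_available(self, limit: int) -> None:
        """Process up to `limit` frames newer than last_seq from the current server."""
        pending = [s for s in self.servers[self.server_id] if s > self.last_seq][:limit]
        for seq in pending:
            self.processed.append(seq)
            self.last_seq = seq
            if (seq + 1) % self.checkpoint_every == 0:
                self.checkpoint = seq   # periodic checkpoint

    def on_server_draining(self, draining_id: str, healthy_id: str) -> None:
        """Handle a removal notification: checkpoint, switch servers, resume."""
        if draining_id != self.server_id:
            return                       # notification is about a different server
        self.checkpoint = self.last_seq  # flush a final checkpoint before leaving
        self.server_id = healthy_id
        self.last_seq = self.checkpoint  # resume with the frame right after the checkpoint
        self.reconnects += 1
```

For example, a worker that has processed frames 0 to 6 on one server and is then told that server is draining continues with frame 7 on the healthy server, so no sequence number is skipped or repeated.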
Health Monitoring
- Active health checks on every server at regular intervals
- Automatic load balancer updates on server failures
- Auto-recovery triggers for unresponsive servers
- Uptime tracking and availability reporting
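One common way to implement active checks with automatic load balancer updates is to remove a server only after several consecutive failed probes and to re-add it once a probe succeeds. The sketch below assumes that approach. The `HealthMonitor` class, the injected `probe` callable, and the failure threshold are hypothetical and not taken from the platform.

```python
from typing import Callable, Dict, Set


class HealthMonitor:
    def __init__(self, probe: Callable[[str], bool], failure_threshold: int = 3) -> None:
        self.probe = probe                      # returns True when the server responds
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}      # consecutive failed probes per server
        self.in_pool: Set[str] = set()          # servers currently receiving traffic

    def add(self, server_id: str) -> None:
        self.failures[server_id] = 0
        self.in_pool.add(server_id)

    def check_all(self) -> Set[str]:
        """Probe every known server once and return those removed on this pass."""
        removed: Set[str] = set()
        for sid in list(self.failures):
            if self.probe(sid):
                self.failures[sid] = 0
                self.in_pool.add(sid)           # a recovered server rejoins the pool
            else:
                self.failures[sid] += 1
                if self.failures[sid] >= self.failure_threshold and sid in self.in_pool:
                    self.in_pool.discard(sid)
                    removed.add(sid)
        return removed
```

Requiring consecutive failures avoids pulling a server out of rotation because of a single slow probe.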
Key Features
- Dual Orchestrators — Independent scaling for ingestion and distribution clusters
- Zero Packet Drop — 5-phase graceful shutdown with AI worker migration
- Stable URLs — DNS-based routing ensures URLs never change during scaling
- AI Worker Reconnection — Checkpoint-based migration with ~3 second gap and zero frame loss
- Independent Scaling — Ingestion and distribution scale based on their own metrics
- Service Registry — Redis-based coordination for server status and stream mappings
- Health Monitoring — Active checks with automatic recovery
- Cost Optimization — Automatic scale-down during low-demand periods