Frequently Asked Questions

How does the dual orchestrator pattern prevent single points of failure in RTSP streaming infrastructure?

MicrocosmWorks implemented an active-active dual orchestrator design where both orchestrators maintain synchronized state about stream assignments and worker health, with automatic failover that transfers stream management to the surviving orchestrator within seconds if one fails. This eliminates the single point of failure that traditional single-orchestrator designs suffer from, ensuring zero packet drop during orchestrator maintenance or unexpected crashes.

How does the auto-scaling architecture achieve zero packet drop when adding or removing streaming workers?

MicrocosmWorks engineered a graceful drain mechanism where retiring workers continue serving their assigned streams until all connections are cleanly migrated to new workers via RTSP TEARDOWN and re-SETUP sequences. New workers are fully initialized and health-checked before receiving stream assignments, and the transition uses overlapping windows where both old and new workers briefly serve the same stream to prevent any interruption.

Why use MediaMTX as the core streaming server instead of alternatives like Wowza or Ant Media?

MicrocosmWorks selected MediaMTX for this project because it is lightweight, open-source, and designed specifically for RTSP re-streaming with minimal resource overhead per stream compared to full-featured media servers. It supports dynamic stream creation via API, runs efficiently in containers for Kubernetes-based auto-scaling, and avoids the per-stream licensing costs of commercial alternatives like Wowza that can become prohibitive at scale.

What monitoring and alerting is needed to maintain reliability in an auto-scaling RTSP streaming platform?

MicrocosmWorks deployed a comprehensive observability stack that tracks per-stream metrics including packet loss rate, jitter, reconnection count, and end-to-end latency, with alerts that fire before degradation becomes visible to end users. The monitoring system also tracks orchestrator decision-making metrics like scaling events, stream migration durations, and worker utilization trends to enable proactive capacity planning.
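
As a rough illustration of two of the per-stream metrics named above, the sketch below computes a packet loss rate from sequence numbers and a running jitter estimate using the smoothing formula from RFC 3550. The function names and the input shapes are assumptions made for this example; the source does not describe how the platform actually collects these values.

```python
def loss_rate(seqs):
    """Fraction of packets missing between the lowest and highest sequence
    number seen. Ignores 16-bit sequence wraparound for brevity."""
    if not seqs:
        return 0.0
    expected = max(seqs) - min(seqs) + 1
    return (expected - len(set(seqs))) / expected


def interarrival_jitter(packets):
    """Running jitter estimate in the style of RFC 3550.

    packets is a list of (send_time, recv_time) pairs in arrival order,
    both expressed in the same unit (for example milliseconds).
    """
    jitter = 0.0
    prev_transit = None
    for sent, received in packets:
        transit = received - sent
        if prev_transit is not None:
            # Difference in transit time between consecutive packets.
            d = abs(transit - prev_transit)
            # Exponential smoothing with gain 1/16, as in RFC 3550.
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter
```

An alerting rule would then compare these values against thresholds chosen per deployment; no threshold values are given in the source.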

Can this auto-scaling RTSP architecture handle both live viewing and simultaneous recording workloads?

Yes, MicrocosmWorks designed the worker nodes to support concurrent RTSP output for live viewers and segmented recording to object storage, with independent resource allocation for each workload. Recording uses a separate write path that buffers segments locally before uploading, so storage I/O spikes never impact live stream delivery, and the auto-scaler accounts for the combined resource demand of both workloads when making scaling decisions.

Auto-Scaling RTSP Streaming Architecture with Dual Orchestrators & Zero Packet Drop

A surveillance platform needed to scale its video streaming infrastructure dynamically — handling anywhere from 10 to 200+ IP cameras with hundreds of concurrent viewers and AI processing workers — while guaranteeing zero packet loss during scaling operations and maintaining stable stream URLs that never change.

The Challenge

Fixed streaming infrastructure couldn't handle the variable demands of a growing surveillance platform:

Scale Variability — Camera count and viewer demand fluctuated dramatically throughout the day (10x peak-to-trough ratio)
Over-Provisioning Cost — Provisioning for peak load meant 70%+ idle resources during off-peak hours
Packet Loss During Scaling — Adding or removing streaming servers caused stream interruptions, dropping frames for AI processing workers
URL Instability — Cameras and viewers configured with specific server IPs needed reconfiguration when infrastructure changed
Different Scaling Needs — Camera ingestion and viewer distribution had fundamentally different load patterns requiring independent scaling
AI Worker Disruption — AI processing pipelines crashed when their source stream server was scaled down

Our Solution

We designed a dual-orchestrator auto-scaling streaming architecture with separate ingestion and distribution clusters, a 5-phase graceful shutdown for zero packet drop, stable DNS-based URLs, and automated AI worker reconnection.

Architecture

Streaming Server: MediaMTX for RTSP/WebRTC/HLS protocol support
Ingestion Cluster: 1-10 servers receiving camera RTSP streams
Distribution Cluster: 2-20 servers serving viewers (WebRTC/HLS) and AI workers (RTSP)
Dual Orchestrators: Independent scaling controllers for ingestion and distribution
Load Balancers: Separate load balancers per cluster with protocol-appropriate algorithms
Service Registry: Redis for server status, stream mappings, and coordination
Health Monitoring: Active health checks with automated recovery
DNS Layer: Stable domain names pointing to load balancers (URLs never change)

Dual Orchestrator Design

Why Two Orchestrators

Ingestion and distribution have fundamentally different scaling characteristics:

Ingestion scales with camera count and inbound bandwidth (predictable, grows steadily)
Distribution scales with viewer count and AI worker demand (bursty, unpredictable)

Separate orchestrators allow each to scale independently with specialized policies, metrics, and thresholds — without one cluster's scaling decisions affecting the other.

Ingestion Orchestrator

Primary Metric: Camera connections per server
Secondary Metric: Inbound bandwidth utilization
Scale Up: When CPU exceeds threshold or cameras per server exceeds capacity
Scale Down: When utilization drops below threshold for a sustained stabilization period
Server Range: 1 to 10 servers

Distribution Orchestrator

Primary Metric: Viewer + AI worker connections per server
Secondary Metric: Outbound bandwidth utilization
Scale Up: When CPU exceeds threshold or connections per server exceed capacity
Scale Down: When utilization drops below threshold for a sustained period (longer stabilization than ingestion)
Server Range: 2 to 20 servers (minimum 2 for high availability)
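
The two policies above share the same shape and differ only in their parameters, which a minimal sketch can make concrete. Everything numeric below (capacities, CPU thresholds, stabilization times) is an illustrative assumption, since the source does not state the actual values; only the server ranges and the longer stabilization period for distribution come from the text.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy record; threshold values are assumptions."""
    min_servers: int
    max_servers: int
    connections_per_server: int   # capacity of one server
    cpu_scale_up: float           # 0.75 means 75% CPU
    cpu_scale_down: float
    stabilization_s: float        # how long utilization must stay low


def decide(policy, servers, connections, cpu, low_since_s):
    """Return the desired server count (servers must be at least 1).

    low_since_s is how long CPU has been continuously below
    cpu_scale_down, or 0 if it is not currently below it.
    """
    per_server = connections / servers
    if cpu > policy.cpu_scale_up or per_server > policy.connections_per_server:
        return min(servers + 1, policy.max_servers)
    if cpu < policy.cpu_scale_down and low_since_s >= policy.stabilization_s:
        # Shrink only if the remaining servers can hold current connections.
        if connections <= (servers - 1) * policy.connections_per_server:
            return max(servers - 1, policy.min_servers)
    return servers


# Server ranges follow the text; the other numbers are made up.
INGESTION = ScalingPolicy(1, 10, 20, 0.75, 0.30, 300)
DISTRIBUTION = ScalingPolicy(2, 20, 100, 0.75, 0.30, 600)
```

Because each orchestrator evaluates its own policy against its own metrics, a burst of viewers can grow the distribution cluster without touching ingestion.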

Zero Packet Drop: 5-Phase Graceful Shutdown

When a distribution server is scheduled for removal, a 5-phase process ensures no frames are lost:

Phase 1: Pre-Notification

Server marked as "DRAINING" in the service registry. Load balancer weight reduced so new connections route elsewhere. Redis pub/sub notifications and webhooks alert AI workers to prepare for migration.

Phase 2: Load Balancer Update

Server removed from the load balancer backend pool. No new connections can reach the draining server. Existing connections continue uninterrupted.

Phase 3: AI Worker Migration

AI workers disconnect from the draining server and reconnect to healthy distribution servers. Checkpoint-based state preservation ensures processing resumes from the exact frame where it left off. Total gap: approximately 3 seconds with zero frames lost.

Phase 4: Viewer Draining

Remaining viewer connections drain naturally over a configurable window. Modern video players auto-reconnect to the same stable URL, which routes to healthy servers. Most viewers experience no interruption.

Phase 5: Cleanup

Verify all connections have closed. Remove server from the service registry. Destroy the cloud instance. Record scaling metrics.
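
The ordering of the five phases can be sketched with an in-memory model. The `Server` and `Cluster` classes below are hypothetical stand-ins for the service registry, load balancer, and notification channel. The real system drains viewers over a time window, whereas this sketch moves them immediately to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    status: str = "ACTIVE"
    ai_workers: set = field(default_factory=set)
    viewers: set = field(default_factory=set)


class Cluster:
    """In-memory stand-in for registry, load balancer, and notifier."""

    def __init__(self, servers):
        self.registry = {s.name: s for s in servers}
        self.lb_pool = {s.name: 100 for s in servers}   # name -> weight
        self.events = []                                # published notifications

    def _least_loaded(self, candidates):
        return min(candidates, key=lambda s: len(s.ai_workers) + len(s.viewers))

    def drain(self, name):
        server = self.registry[name]
        healthy = [s for s in self.registry.values()
                   if s.name != name and s.status == "ACTIVE"]
        if not healthy:
            raise RuntimeError("no healthy server to migrate to")

        # Phase 1: mark as draining, lower weight, notify AI workers.
        server.status = "DRAINING"
        self.lb_pool[name] = 0
        self.events.append(("server_draining", name))

        # Phase 2: remove from the load balancer backend pool.
        del self.lb_pool[name]

        # Phase 3: AI workers reconnect to healthy servers.
        for worker in sorted(server.ai_workers):
            self._least_loaded(healthy).ai_workers.add(worker)
        server.ai_workers.clear()

        # Phase 4: viewers reconnect via the stable URL (instant here).
        for viewer in sorted(server.viewers):
            self._least_loaded(healthy).viewers.add(viewer)
        server.viewers.clear()

        # Phase 5: verify, deregister, record the event.
        if server.ai_workers or server.viewers:
            raise RuntimeError("connections still open")
        del self.registry[name]
        self.events.append(("server_removed", name))
```

The key property is the order: the server stops receiving new connections (phases 1 and 2) before any existing connection is moved, and it is only destroyed after it is verifiably empty.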

Stable URLs

The URL architecture ensures cameras and clients never need reconfiguration:

Camera publish target: A stable ingestion domain name
Viewer/AI access target: A stable distribution domain name
DNS records point to load balancer IPs (which are permanent)
Load balancers handle routing to backend servers transparently
Backend servers can be added, removed, or replaced without URL changes
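
To illustrate why the URL can stay fixed while backends change, the sketch below models the indirection as a hostname that maps to a load balancer, which in turn picks a backend. The hostname `view.example.com`, the IP addresses, and the round-robin choice are all assumptions for the example; the source says only that each cluster uses a protocol-appropriate algorithm.

```python
import itertools


class LoadBalancer:
    """Toy round-robin balancer; the real algorithm is not specified."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._counter = itertools.count()

    def set_backends(self, backends):
        # Called by the orchestrator when servers are added or removed.
        self._backends = list(backends)

    def route(self):
        if not self._backends:
            raise RuntimeError("no backends available")
        return self._backends[next(self._counter) % len(self._backends)]


# Hypothetical stable hostname -> load balancer (stands in for DNS).
DNS = {"view.example.com": LoadBalancer(["10.0.2.11", "10.0.2.12"])}


def connect(host):
    """Resolve the stable name and return the backend that would serve it."""
    return DNS[host].route()
```

A client that always calls `connect("view.example.com")` keeps working after `set_backends` swaps the pool, which is the behavior the list above describes.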

Service Registry (Redis)

A centralized Redis instance coordinates the entire system:

Server status tracking (active, draining, offline)
Stream-to-server mapping (which camera is on which ingestion server)
AI worker state and checkpoint data
Load metrics per server for scaling decisions
Pub/sub channels for real-time coordination events
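
The kinds of data listed above can be pictured with a small in-memory model. This is not a Redis client and does not reflect the project's actual key layout, which the source does not give. The class, the channel name `server-status`, and the field names are assumptions chosen to mirror the list.

```python
from collections import defaultdict

VALID_STATUSES = {"active", "draining", "offline"}


class Registry:
    """In-memory stand-in for the Redis registry (illustrative only)."""

    def __init__(self):
        self.servers = {}        # server name -> {"status": str, "load": float}
        self.streams = {}        # camera id -> ingestion server name
        self.checkpoints = {}    # AI worker id -> last processed frame number
        self._subscribers = defaultdict(list)   # channel -> callbacks

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers[channel]:
            callback(message)

    def set_status(self, server, status):
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.servers.setdefault(server, {"load": 0.0})["status"] = status
        # Every status change is broadcast for real-time coordination.
        self.publish("server-status", {"server": server, "status": status})

    def assign_stream(self, camera_id, server):
        self.streams[camera_id] = server

    def active_servers(self):
        return sorted(name for name, info in self.servers.items()
                      if info.get("status") == "active")
```

In a Redis-backed version, the dictionaries would typically become hashes and the callbacks would become pub/sub subscribers, but that mapping is a guess rather than something the source states.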

AI Client Reconnection

An AI client library provides seamless reconnection:

Subscription to server removal notifications via Redis pub/sub
Automatic frame checkpointing at regular intervals
Reconnection to a healthy distribution server on notification
Resume processing from checkpoint with minimal gap
Metrics reporting for reconnection events
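
The client behavior listed above might look roughly like the following sketch. The class name, the frame-number checkpoint, and the `resume_from` helper are assumptions. In particular, the source does not explain how a distribution server lets a worker resume at a specific frame, so the sketch simply assumes frames are addressable by number.

```python
class CheckpointingClient:
    """Hypothetical AI client that checkpoints and reconnects on notice."""

    def __init__(self, servers, checkpoint_every=30):
        self.servers = list(servers)
        self.current = self.servers[0]
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0       # last frame number saved
        self.last_frame = 0       # last frame number processed
        self.reconnects = 0       # metric reported for reconnection events

    def process(self, frame_no):
        self.last_frame = frame_no
        # Periodic checkpoint at a regular frame interval.
        if frame_no % self.checkpoint_every == 0:
            self.checkpoint = frame_no

    def on_server_draining(self, server):
        """Callback for a server removal notification."""
        if server != self.current:
            return
        # Save a final checkpoint so nothing processed so far is repeated.
        self.checkpoint = self.last_frame
        candidates = [s for s in self.servers if s != server]
        if not candidates:
            raise RuntimeError("no healthy distribution server available")
        self.servers = candidates
        self.current = candidates[0]
        self.reconnects += 1

    def resume_from(self):
        """First frame to request after reconnecting."""
        return self.checkpoint + 1
```

Taking a final checkpoint at notification time, rather than relying only on the periodic one, is what lets the worker resume at the exact next frame in this sketch.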

Health Monitoring

Active health checks on every server at regular intervals
Automatic load balancer updates on server failures
Auto-recovery triggers for unresponsive servers
Uptime tracking and availability reporting
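
One common way to implement active checks with automatic pool updates is a consecutive-failure threshold, sketched below. The threshold of three failures, the `probe` callable, and the `recoveries` list are assumptions; the source does not say how failures are counted or how recovery is triggered.

```python
class HealthMonitor:
    """Hypothetical monitor using a consecutive-failure threshold."""

    def __init__(self, probe, failure_threshold=3):
        self.probe = probe                    # callable(server) -> bool
        self.failure_threshold = failure_threshold
        self.failures = {}                    # server -> consecutive failures
        self.in_pool = set()                  # servers the balancer may use
        self.recoveries = []                  # servers sent for auto-recovery

    def add(self, server):
        self.failures[server] = 0
        self.in_pool.add(server)

    def check_all(self):
        """Run one round of health checks over every known server."""
        for server in sorted(self.failures):
            if self.probe(server):
                self.failures[server] = 0
                self.in_pool.add(server)      # healthy again: back in the pool
            else:
                self.failures[server] += 1
                if self.failures[server] == self.failure_threshold:
                    self.in_pool.discard(server)
                    self.recoveries.append(server)
```

Requiring several consecutive failures avoids removing a server because of a single slow response, at the cost of a slightly longer detection time.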

Key Features

Dual Orchestrators — Independent scaling for ingestion and distribution clusters
Zero Packet Drop — 5-phase graceful shutdown with AI worker migration
Stable URLs — DNS-based routing ensures URLs never change during scaling
AI Worker Reconnection — Checkpoint-based migration with ~3 second gap and zero frame loss
Independent Scaling — Ingestion and distribution scale based on their own metrics
Service Registry — Redis-based coordination for server status and stream mappings
Health Monitoring — Active checks with automatic recovery
Cost Optimization — Automatic scale-down during low-demand periods
