How We Built One-Click Program Scheduling by Overcoming AWS Lambda's 15-Minute Limit

AWS Lambda's 15-minute execution cap seems like a hard wall for long-running scheduled jobs. Here's how we redesigned our architecture to deliver seamless one-click program scheduling anyway.

Pankaj • July 1, 2026 • Updated July 1, 2026 • 5 minutes


A FAST channel operator schedules a month of programming and hits Deploy once. Behind that single click, hundreds of programs become thousands of MediaLive schedule actions that have to land in AWS in order, all inside Lambda's fifteen-minute execution ceiling, and any failure in the middle leaves a live channel with holes in it. Here is how we made one-click deployment of large playlists reliable, and why the next step out is a different runtime, not a bigger Lambda.



 

Quick overview

| Aspect | Detail |
| --- | --- |
| Runtime | AWS Lambda (single invocation), 615s backend timeout |
| Output target | MediaLive BatchUpdateScheduleCommand |
| Batching | Up to 200 MediaLive actions per request, ~25 programs in practice |
| Per-program actions | ~7–8 (input switch + 4 per-rendition watermarks + 2 SCTE-35), more with ad breaks |
| Fallback | Per-program retry on any batch rejection |
| Observed | 195 programs deployed in ~44 seconds (local measurement) |
| Documented target | 360 programs in ≤90 seconds |
| Scale-out path | Step Functions chunking for >1-month playlists (planned, not shipped) |



 

The Challenge

Deploying a playlist isn't a write; it's an orchestration. For every program the operator scheduled, MediaLive needs to know exactly when to switch inputs, when to turn the watermark on for each rendition, when to insert SCTE-35 cue points, and when to splice ads. The structural pressures are:



 

  • The action count is multiplicative, not additive. Each program emits ~7–8 actions before ad breaks: one input switch, four watermark activations (one per rendition, because StaticImageOutputActivate uses output-pixel coordinates), and two SCTE-35 markers. Ad breaks add three more actions each. A 360-program month is roughly 2,800 actions on the wire.
  • MediaLive enforces ordering per channel. Schedule actions are time-anchored and reference each other; you cannot parallelise writes to the same channel without the API rejecting them as conflicting.
  • AWS Lambda has a hard 15-minute ceiling. Not a soft limit, not a configuration. The orchestrator must finish inside that wall or the channel ends up half-deployed.
  • Partial failure is operationally unacceptable. If program 174 of 360 fails and aborts the run, the operator has no diff view of what landed and what didn't. The channel goes live with gaps; viewers see slate where they expected content.
  • The first version was too slow by construction. It sent one BatchUpdateSchedule per program with a 200ms sleep between programs, which comes to ~2.5 seconds of wall time per program. At 360 programs that is about 900 seconds, the entire 15-minute Lambda budget, with no headroom for retries.
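The arithmetic behind these pressures is easy to check. The sketch below uses only the figures quoted above (one input switch, four watermarks, two SCTE-35 markers, three actions per ad break, ~2.5 seconds per program in the first version). The constant and function names are illustrative and are not taken from the production code.

```typescript
// Per-program action counts as quoted in the article.
const INPUT_SWITCH = 1;
const WATERMARKS = 4; // one per rendition
const SCTE35_MARKERS = 2;
const ACTIONS_PER_AD_BREAK = 3;

// Hypothetical helper: number of actions emitted for one program.
function actionsForProgram(adBreaks: number): number {
  return INPUT_SWITCH + WATERMARKS + SCTE35_MARKERS + adBreaks * ACTIONS_PER_AD_BREAK;
}

// Hypothetical helper: wall time of the first version (one request per program).
function baselineSeconds(programs: number, secondsPerProgram: number = 2.5): number {
  return programs * secondsPerProgram;
}

const LAMBDA_CEILING_SECONDS = 15 * 60;

console.log(actionsForProgram(0));       // 7
console.log(360 * actionsForProgram(0)); // 2520 before any ad breaks
console.log(baselineSeconds(360));       // 900
console.log(baselineSeconds(360) >= LAMBDA_CEILING_SECONDS); // true
```

At seven actions per program a 360-program month is 2,520 actions; at eight it is 2,880, which is where the "roughly 2,800" figure sits.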



 

The job, then, is not "write faster". It is "write fewer times, survive partial failure, and stay inside one invocation", without losing the ordering guarantees MediaLive demands.



 

Why Existing Approaches Fail

The obvious escapes all fail for reasons that are structural, not tuning problems.



 

  • "Just raise the Lambda timeout." You can't. Fifteen minutes is an AWS-imposed hard ceiling on Lambda execution; it isn't a knob in the console. Even if it were, the per-program cost grows with the catalog, so buying more wall time only delays the next wall.
  • "Move to ECS Fargate or EC2." We considered it and rejected it. Long-running containers mean we own the runtime: health checks, autoscaling, cold-start vs. warm-pool tradeoffs, IAM scoping, and on-call rotation for a service that runs in bursts. Lambda gives us per-invocation isolation and zero idle cost for an inherently bursty workload. We weren't ready to give that up to fix one bottleneck.
  • "Parallelise the MediaLive writes." Schedule updates are serialised by MediaLive for each channel. Concurrent BatchUpdateSchedule calls against the same channel race on the action timeline and get rejected. The only legitimate parallelism is within a batch, not across batches.
  • "Just let it crash and let the operator retry." This is the worst option. When a deploy aborts at program N, the channel is in a state nobody can describe from the UI. Operators don't get a diff; they get a black box and a live channel with holes in it. The system has to either land everything or land partial results with per-program status the operator can act on.



 

The lever we had left was the shape of the work itself: fewer, larger writes, executed serially, with a fallback that degrades to row-level granularity only when a batch rejects.



 

Our Solution

Move the cost out of Lambda before the loop starts, coalesce the MediaLive writes inside the loop, and degrade to per-program submission only when a batch fails. Three ideas, in that order, plus a deliberate decision that the 15-minute ceiling is fine for the catalog sizes we actually deploy today. When playlists outgrow a single invocation, the answer isn't a bigger Lambda; it's a different runtime.



 

Backend (NestJS)                          Lambda (Node 18)                      MediaLive
─────────────────                         ────────────────                      ─────────
deployToLambda()                          fastChannel-lambda-fun
  ├─ bulk fetch videos + adMarkers  ─┐
  ├─ parallel ffprobe (unique URLs)  ├─► invoke (615s timeout)
  └─ enriched payload ───────────────┘     │
                                           ├─ for each program:
                                           │   buildProgramActions()  ~7–8 actions
                                           │   batchActions.push(...)
                                           │   if >=200 actions queued:
                                           │     submitProgramBatch() ──────────► BatchUpdateSchedule
                                           │     on failure → retryBatchAsIndividuals()
                                           │     sleep(200ms)
                                           └─ sweepIncompleteProgramGroups()



 

Architecture

  • NestJS backend (schedule.service.ts): pre-warms the deploy with one Mongo round trip for all unique videos, one for all ad markers, then parallel ffprobe across unique video URLs with the results cached for the enrichment loop.
  • Lambda orchestrator (fastChannel-lambda-fun/index.js): owns the batching loop, the per-batch throttle, the per-program fallback, and the orphan sweep.
  • MediaLive BatchUpdateScheduleCommand: the only write surface. Every action (input switch, watermark, SCTE-35, ad splice) flows through it.
  • MongoDB: the source of truth for programs, videos, and ad markers; never queried inside the inner loop.
  • scheduleResults map: per-program scheduled / error status returned to the backend so the operator gets a diff, not a stack trace.



 

Key Engineering Decisions

1. Pre-warm the slow stuff before the loop runs. The original code did a per-schedule Mongo lookup and a per-schedule ffprobe call inside the enrichment loop: a classic N+1, paid twice. The current backend bulk-fetches every unique video and ad marker in one round trip each, then runs ffprobe in parallel across unique URLs and stashes the result in a resolution cache. The inner loop becomes a cache hit. ffprobe latency is paid once per unique URL, not serially in the loop.
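A minimal sketch of the pre-warm idea, assuming each schedule entry carries a video URL and that probing is done by an injected async function. `ScheduleEntry`, `Resolution`, `Probe`, and `buildResolutionCache` are hypothetical names chosen for illustration; the real code in schedule.service.ts may be shaped differently.

```typescript
// Hypothetical shapes; the real schedule documents may differ.
interface ScheduleEntry { programId: string; videoUrl: string; }
interface Resolution { width: number; height: number; }
type Probe = (url: string) => Promise<Resolution>; // stands in for an ffprobe call

// Probe each unique URL exactly once, in parallel, and return a lookup table.
async function buildResolutionCache(
  entries: ScheduleEntry[],
  probe: Probe,
): Promise<Record<string, Resolution>> {
  // Deduplicate first so repeated videos cost a single probe.
  const uniqueUrls = entries
    .map((entry) => entry.videoUrl)
    .filter((url, index, all) => all.indexOf(url) === index);

  // All probes start together; total latency is roughly the slowest probe.
  const probed = await Promise.all(
    uniqueUrls.map(async (url) => ({ url, resolution: await probe(url) })),
  );

  const cache: Record<string, Resolution> = {};
  for (const item of probed) cache[item.url] = item.resolution;
  return cache;
}
```

The enrichment loop can then read `cache[entry.videoUrl]` for every entry without touching the network.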



 

2. Coalesce MediaLive writes into batches of ~25. Inside the orchestrator, actions accumulate into a single BatchUpdateScheduleCommand until 200 actions are queued or the last program is reached. Since each program produces ~7–8 actions, batches naturally land around 25 programs each. The 200-action cap was chosen conservatively to stay well below MediaLive's per-request payload limits and reduce the chance of a batch being rejected for size: large enough to amortise the network cost, small enough to make the per-program fallback (next decision) cheap when it has to run. One network round trip replaces twenty-five. The 200ms throttle that used to sit between every program now sits between every batch.
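The accumulate-and-flush rule can be sketched as a pure function. This is an illustration under stated assumptions, not the orchestrator itself: `ProgramActions`, `Batch`, and `planBatches` are hypothetical names, actions are reduced to plain strings, and the flush condition follows the "if >=200 actions queued" step described above.

```typescript
// Hypothetical shapes: a program plus the MediaLive actions built for it.
interface ProgramActions { programId: string; actions: string[]; }
interface Batch { programIds: string[]; actions: string[]; }

const ACTION_CAP = 200;

// Queue whole programs; flush once the queue reaches the cap,
// then flush whatever is left after the last program.
function planBatches(programs: ProgramActions[], cap: number = ACTION_CAP): Batch[] {
  const batches: Batch[] = [];
  let current: Batch = { programIds: [], actions: [] };
  for (const program of programs) {
    current.programIds.push(program.programId);
    current.actions.push(...program.actions);
    if (current.actions.length >= cap) {
      batches.push(current);
      current = { programIds: [], actions: [] };
    }
  }
  if (current.actions.length > 0) batches.push(current);
  return batches;
}
```

For 360 programs at eight actions each this yields 15 batches (14 of 25 programs and one of 10), so the 200ms throttle costs about 3 seconds in total rather than 72. Note that with a flush-after-push rule a batch can overshoot the cap slightly: at seven actions per program, the 29th program brings the queue to 203. If the cap must never be exceeded, a stricter variant would flush before adding a program that would cross it.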



 

3. Batching for speed, fallback for correctness. Coalescing is only safe if a single bad program doesn't poison the other twenty-four in its batch. When submitProgramBatch throws, the catch invokes retryBatchAsIndividuals, which re-submits each program in the failed batch as its own BatchUpdateScheduleCommand, records status: 'scheduled' or status: 'error' per program, sleeps 200ms between attempts, and re-anchors the action timeline after each partial success. The fast path is batched. The recovery path is granular. The operator gets a per-program diff either way.
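A simplified sketch of the batch-then-individual fallback. The `submit` parameter stands in for a BatchUpdateSchedule call, and `submitWithFallback`, `Status`, and the injected `sleep` are illustrative names; the bodies of the real submitProgramBatch and retryBatchAsIndividuals are not shown in this article, and the timeline re-anchoring step is deliberately omitted here.

```typescript
interface ProgramActions { programId: string; actions: string[]; }
type Status = { status: "scheduled" } | { status: "error"; message: string };
type Submit = (actions: string[]) => Promise<void>; // stand-in for BatchUpdateSchedule
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function submitWithFallback(
  batch: ProgramActions[],
  submit: Submit,
  results: Record<string, Status>,
  sleep: Sleep = defaultSleep,
): Promise<void> {
  try {
    // Fast path: one request carrying every action in the batch.
    const all: string[] = [];
    for (const program of batch) all.push(...program.actions);
    await submit(all);
    for (const program of batch) results[program.programId] = { status: "scheduled" };
  } catch {
    // Recovery path: one request per program, serially, recording each outcome.
    for (const program of batch) {
      try {
        await submit(program.actions);
        results[program.programId] = { status: "scheduled" };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        results[program.programId] = { status: "error", message };
      }
      await sleep(200); // same throttle as between batches
    }
  }
}
```

With a batch of three programs where one is rejected, the sketch makes four requests (one failed batch, then three individual submissions) and leaves two programs marked scheduled and one marked error.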



 

4. Sweep orphans at the end, don't prevent them mid-flight. sweepIncompleteProgramGroups runs once at the end of the deploy and removes any action group that didn't make it to a clean terminal state. We deliberately don't try to keep the channel internally consistent during the loop, because that would mean a rollback path that itself has to fit in the 15-minute budget. Cleanup is a single sweep, not a transaction.
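One way such a sweep could be expressed is sketched below. Everything in it is an assumption made for illustration: the article does not define what a "clean terminal state" is, so the sketch treats a program group as complete only when every expected action name is present on the channel, and `ExpectedGroup` and `findOrphanedActions` are hypothetical names.

```typescript
// Assumption: each program has a known list of action names it should have created.
interface ExpectedGroup { programId: string; actionNames: string[]; }

// Return the names of actions belonging to groups that only partly landed.
// Groups that landed fully, or not at all, contribute nothing.
function findOrphanedActions(expected: ExpectedGroup[], landed: string[]): string[] {
  const orphans: string[] = [];
  for (const group of expected) {
    const present = group.actionNames.filter((name) => landed.indexOf(name) !== -1);
    const complete = present.length === group.actionNames.length;
    if (!complete) orphans.push(...present);
  }
  return orphans;
}

const orphans = findOrphanedActions(
  [
    { programId: "p1", actionNames: ["p1:input", "p1:scte-start"] },
    { programId: "p2", actionNames: ["p2:input", "p2:scte-start"] },
  ],
  ["p1:input", "p1:scte-start", "p2:input"],
);
console.log(orphans); // ["p2:input"]
```

The resulting names could then be removed in a single request, since BatchUpdateSchedule accepts a list of action names to delete alongside the actions to create.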



 

5. Don't grow the Lambda; replace it when the catalog does. For everything we deploy today the single-invocation path finishes well inside the budget. The honest scale-out is Step Functions chunking: partition the playlist, run chunks as parallel state machines, reassemble. That's the path for >1-month and 6-month playlists. It's designed, not deployed. Calling it the next step is more useful than pretending it's already running.
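The partition step of that plan is the simplest part to illustrate. The sketch below is hypothetical, since the chunking design is not shipped: `chunkPlaylist` and the chunk size of 360 (one month, using the figure quoted earlier) are assumptions, and how chunks are handed to Step Functions is left out.

```typescript
// Split a playlist into fixed-size, order-preserving chunks.
function chunkPlaylist<T>(programs: T[], chunkSize: number): T[][] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chunks: T[][] = [];
  for (let start = 0; start < programs.length; start += chunkSize) {
    chunks.push(programs.slice(start, start + chunkSize));
  }
  return chunks;
}

// A six-month playlist at ~360 programs per month.
const sixMonths: number[] = [];
for (let i = 0; i < 2160; i++) sixMonths.push(i);

const chunks = chunkPlaylist(sixMonths, 360);
console.log(chunks.length);    // 6
console.log(chunks[1][0]);     // 360, the first program of the second month
```

Each chunk is then small enough to fit the single-invocation path described above.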



 

Results

  • In internal testing, a 195-program deployment completed in ~44 seconds, down from the multi-minute, single-program-at-a-time baseline.
  • The documented target for a 360-program (~1 month) playlist is ≤90 seconds, comfortably inside the 615-second backend timeout and the 15-minute Lambda ceiling.
  • One bad program in a batch no longer aborts the other 24. The operator gets a per-program status map back from every deploy.
  • The slow parts of the request, Mongo lookups and ffprobe, are paid once per unique resource, not once per schedule entry.
  • The 15-minute ceiling stopped being the limiting factor for the catalog sizes we actually ship. When it becomes one again, Step Functions chunking is the answer, not a bigger Lambda.
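As a rough consistency check between the observed run and the documented target, the sketch below extrapolates linearly from the 195-program measurement. Linear scaling is an assumption (it ignores fixed start-up cost and any fallback retries), and `projectSeconds` is an illustrative helper rather than anything in the codebase.

```typescript
// Scale an observed run to a different playlist size, assuming constant per-program cost.
function projectSeconds(
  observedPrograms: number,
  observedSeconds: number,
  targetPrograms: number,
): number {
  return (observedSeconds / observedPrograms) * targetPrograms;
}

const projected = projectSeconds(195, 44, 360);
console.log(projected.toFixed(1)); // "81.2"
console.log(projected <= 90);      // true
```

Under that assumption a 360-program deploy would take about 81 seconds, which is consistent with the ≤90-second target.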



 



 

Technology Stack: AWS Lambda · AWS MediaLive · NestJS · MongoDB · Node 18 · ffprobe · TypeScript · AWS SDK v3




 


About the author

Pankaj

AI & Cloud Solutions Expert at MicrocosmWorks

Building innovative AI-powered solutions and helping businesses transform through cutting-edge technology.
