Cloud Solutions

How We Built One-Click Program Scheduling by Overcoming AWS Lambda's 15-Minute Limit

AWS Lambda's 15-minute execution cap seems like a hard wall for long-running scheduled jobs. Here's how we redesigned our architecture to deliver seamless one-click program scheduling anyway.

Pankaj • July 1, 2026 • Updated July 1, 2026 • 5 minutes


A FAST channel operator schedules a month of programming and hits Deploy once. Behind that single click, hundreds of programs become thousands of MediaLive schedule actions that have to land in AWS in order, all inside Lambda's fifteen-minute execution ceiling — and any failure in the middle leaves a live channel with holes in it. Here is how we made one-click deployment of large playlists reliable, and why the next step out is a different runtime, not a bigger Lambda.



 

Quick overview

| Aspect | Detail |
| --- | --- |
| Runtime | AWS Lambda (single invocation), 615s backend timeout |
| Output target | MediaLive BatchUpdateScheduleCommand |
| Batching | Up to 200 MediaLive actions per request (~25 programs in practice) |
| Per-program actions | ~7–8 (input switch + 4 per-rendition watermarks + 2 SCTE-35), more with ad breaks |
| Fallback | Per-program retry on any batch rejection |
| Observed | 195 programs deployed in ~44 seconds (local measurement) |
| Documented target | 360 programs in ≤90 seconds |
| Scale-out path | Step Functions chunking for >1-month playlists (planned, not shipped) |



 

The Challenge

Deploying a playlist isn't a write — it's an orchestration. For every program the operator scheduled, MediaLive needs to know exactly when to switch inputs, when to turn the watermark on for each rendition, when to insert SCTE-35 cue points, and when to splice ads. The structural pressures are:



 

  • The action count is multiplicative, not additive. Each program emits ~7–8 actions before ad breaks — one input switch, four watermark activations (one per rendition because StaticImageOutputActivate uses output-pixel coordinates), and two SCTE-35 markers. Ad breaks add three more actions each. A 360-program month is roughly 2,800 actions on the wire.
  • MediaLive enforces ordering per channel. Schedule actions are time-anchored and reference each other; you cannot parallelize writes to the same channel without the API rejecting them as conflicting.
  • AWS Lambda has a hard 15-minute ceiling. Not a soft limit, not a configuration. The orchestrator must finish inside that wall or the channel ends up half-deployed.
  • Partial failure is operationally unacceptable. If program 174 of 360 fails and aborts the run, the operator has no diff view of what landed and what didn't. The channel goes live with gaps; viewers see slate where they expected content.
  • The first version sent one BatchUpdateSchedule per program with a 200ms sleep between programs. That's ~2.5 seconds of wall time per program. At 360 programs that adds up to roughly 900 seconds, which consumes the entire 15-minute Lambda budget and leaves no room for enrichment, retries, or cleanup.
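The numbers in the list above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the constants come from the figures quoted in this article, the function names are illustrative, and nothing here calls AWS.

```typescript
// Back-of-envelope figures taken from the article; names are illustrative.
const PROGRAMS_PER_MONTH = 360;
const ACTIONS_PER_PROGRAM = { min: 7, max: 8 }; // before ad breaks
const ACTIONS_PER_AD_BREAK = 3;
const BASELINE_SECONDS_PER_PROGRAM = 2.5; // one request per program, 200 ms sleep included
const LAMBDA_CEILING_SECONDS = 15 * 60;

// Total schedule actions for a playlist, as a min/max range.
function actionRange(programs: number, adBreaksPerProgram = 0): { min: number; max: number } {
  const extra = adBreaksPerProgram * ACTIONS_PER_AD_BREAK;
  return {
    min: programs * (ACTIONS_PER_PROGRAM.min + extra),
    max: programs * (ACTIONS_PER_PROGRAM.max + extra),
  };
}

// Wall time of the original one-request-per-program loop.
function baselineSeconds(programs: number): number {
  return programs * BASELINE_SECONDS_PER_PROGRAM;
}

const monthly = actionRange(PROGRAMS_PER_MONTH);
console.log(monthly); // { min: 2520, max: 2880 }
console.log(baselineSeconds(PROGRAMS_PER_MONTH), LAMBDA_CEILING_SECONDS); // 900 900
```

The "roughly 2,800 actions" figure sits near the top of that range, and the per-program baseline lands exactly on the 900-second ceiling before a single ad break is counted.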



 

The job, then, is not "write faster." It is "write fewer times, survive partial failure, and stay inside one invocation," without losing the ordering guarantees MediaLive demands.



 

Why Existing Approaches Fail

The obvious escapes all fail for reasons that are structural, not tuning problems.



 

  • "Just raise the Lambda timeout." You can't. Fifteen minutes is an AWS-imposed hard ceiling on Lambda execution; it isn't a knob in the console. Even if it were, the per-program cost grows with the catalog — buying more wall time only delays the next wall.
  • "Move to ECS Fargate or EC2." We considered it and rejected it. Long-running containers mean we own the runtime: health checks, autoscaling, cold-start vs. warm-pool tradeoffs, IAM scoping, and on-call rotation for a service that runs in bursts. Lambda gives us per-invocation isolation and zero idle cost for an inherently bursty workload. We weren't ready to give that up to fix one bottleneck.
  • "Parallelise the MediaLive writes." Schedule updates are serialised by MediaLive for each channel. Concurrent BatchUpdateSchedule calls against the same channel race on the action timeline and get rejected. The only legitimate parallelism is within a batch, not across batches.
  • "Just let it crash and let the operator retry." This is the worst option. When a deploy aborts at program N, the channel is in a state nobody can describe from the UI. Operators don't get a diff; they get a black box and a live channel with holes in it. The system has to either land everything or land partial results with per-program status the operator can act on.



 

The lever we had left was the shape of the work itself: fewer, larger writes, executed serially, with a fallback that degrades to row-level granularity only when a batch rejects.



 

Our Solution

Move the cost out of Lambda before the loop starts, coalesce the MediaLive writes inside the loop, and degrade to per-program submission only when a batch fails. Three ideas, in that order — and a deliberate decision that the 15-minute ceiling is fine for the catalog sizes we actually deploy today. When playlists outgrow a single invocation, the answer isn't a bigger Lambda; it's a different runtime.



 

Backend (NestJS)                          Lambda (Node 18)                      MediaLive
─────────────────                         ────────────────                      ─────────
deployToLambda()                          fastChannel-lambda-fun
  ├─ bulk fetch videos + adMarkers  ─┐
  ├─ parallel ffprobe (unique URLs)  ├─► invoke (615s timeout)
  └─ enriched payload ───────────────┘     │
                                           ├─ for each program:
                                           │   buildProgramActions()  ~7–8 actions
                                           │   batchActions.push(...)
                                           │   if >=200 actions queued:
                                           │     submitProgramBatch() ──────────► BatchUpdateSchedule
                                           │     on failure → retryBatchAsIndividuals()
                                           │     sleep(200ms)
                                           └─ sweepIncompleteProgramGroups()



 

Architecture

  • NestJS backend (schedule.service.ts) — pre-warms the deploy: one Mongo round trip for all unique videos, one for all ad markers, then parallel ffprobe across unique video URLs with the results cached for the enrichment loop.
  • Lambda orchestrator (fastChannel-lambda-fun/index.js) — owns the batching loop, the per-batch throttle, the per-program fallback, and the orphan sweep.
  • MediaLive BatchUpdateScheduleCommand — the only write surface. Every action — input switch, watermark, SCTE-35, ad splice — flows through it.
  • MongoDB — the source of truth for programs, videos, and ad markers; never queried inside the inner loop.
  • scheduleResults map — per-program scheduled / error status returned to the backend so the operator gets a diff, not a stack trace.



 

Key Engineering Decisions

1. Pre-warm the slow stuff before the loop runs. The original code did a per-schedule Mongo lookup and a per-schedule ffprobe call inside the enrichment loop — a classic N+1 paid twice. The current backend bulk-fetches every unique video and ad marker in one round trip each, then runs ffprobe in parallel across unique URLs and stashes the result in a resolution cache. The inner loop becomes a cache hit. ffprobe latency is paid once per unique URL, not serially in the loop.
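A minimal sketch of the resolution cache described above, assuming the probe is injected as a function. `ScheduleEntry`, `Resolution`, `Probe`, and `buildResolutionCache` are hypothetical names for illustration, not the actual code in schedule.service.ts, and the real ffprobe call is deliberately left out.

```typescript
// Hypothetical shapes; the real schedule and probe types are not shown in the article.
interface ScheduleEntry { programId: string; videoUrl: string; }
interface Resolution { width: number; height: number; }
type Probe = (url: string) => Promise<Resolution>;

// Probe each unique URL exactly once, in parallel, and return a lookup map.
async function buildResolutionCache(
  entries: ScheduleEntry[],
  probe: Probe,
): Promise<Map<string, Resolution>> {
  const uniqueUrls = Array.from(new Set(entries.map((e) => e.videoUrl)));
  const pairs = await Promise.all(
    uniqueUrls.map(async (url) => [url, await probe(url)] as const),
  );
  return new Map(pairs);
}
```

Inside the enrichment loop, each entry then does `cache.get(entry.videoUrl)` instead of spawning a probe. With a very large number of unique URLs an unbounded `Promise.all` may need a concurrency limit; the article does not say whether the production code applies one.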



 

2. Coalesce MediaLive writes into batches of ~25. Inside the orchestrator, actions accumulate into a single BatchUpdateScheduleCommand until 200 actions are queued or the last program is reached. Since each program produces ~7–8 actions, batches naturally land around 25 programs each. The 200-action cap was chosen conservatively to stay well below MediaLive's per-request payload limits and reduce the chance of a batch being rejected for size — large enough to amortise the network cost, small enough to make the per-program fallback (next decision) cheap when it has to run. One network round trip replaces twenty-five. The 200ms throttle that used to sit between every program now sits between every batch.
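The coalescing step can be sketched as a pure function. This is an illustration under assumptions, not the orchestrator's code: action payloads are reduced to strings, `ProgramActions`, `Batch`, and `coalesce` are hypothetical names, and this version flushes before a batch would exceed the cap so that a program's actions are never split across two requests.

```typescript
// Hypothetical shapes; real MediaLive actions are objects, simplified here to strings.
interface ProgramActions { programId: string; actions: string[]; }
interface Batch { programIds: string[]; actions: string[]; }

const MAX_ACTIONS_PER_BATCH = 200;

// Group whole programs into batches of at most `cap` actions, preserving order.
function coalesce(programs: ProgramActions[], cap = MAX_ACTIONS_PER_BATCH): Batch[] {
  const batches: Batch[] = [];
  let current: Batch = { programIds: [], actions: [] };
  for (const program of programs) {
    const wouldOverflow = current.actions.length + program.actions.length > cap;
    // Flush first so one program's actions always travel in a single request.
    if (wouldOverflow && current.actions.length > 0) {
      batches.push(current);
      current = { programIds: [], actions: [] };
    }
    // Note: a single program with more than `cap` actions still forms its own batch.
    current.programIds.push(program.programId);
    current.actions.push(...program.actions);
  }
  if (current.actions.length > 0) batches.push(current);
  return batches;
}
```

With 8 actions per program, 25 programs fill a batch exactly, so a 360-program month becomes 15 requests instead of 360.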



 

3. Batching for speed, fallback for correctness. Coalescing is only safe if a single bad program doesn't poison the other twenty-four in its batch. When submitProgramBatch throws, the catch invokes retryBatchAsIndividuals, which re-submits each program in the failed batch as its own BatchUpdateScheduleCommand, records status: 'scheduled' or status: 'error' per program, sleeps 200ms between attempts, and re-anchors the action timeline after each partial success. The fast path is batched. The recovery path is granular. The operator gets a per-program diff either way.
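The fast path and the recovery path can be sketched together as below. The `send` and `sleep` functions are injected and all names are hypothetical; in the real orchestrator `send` would presumably wrap a `BatchUpdateScheduleCommand` with the actions under `Creates.ScheduleActions`. The timeline re-anchoring step is omitted because the article does not describe how it works.

```typescript
// Hypothetical shapes; action payloads are simplified to strings.
type Status = { status: "scheduled" } | { status: "error"; message: string };
interface ProgramActions { programId: string; actions: string[]; }
type Send = (actions: string[]) => Promise<void>;
type Sleep = (ms: number) => Promise<void>;

// Try the whole batch in one request; on rejection, retry program by program.
async function submitWithFallback(
  batch: ProgramActions[],
  send: Send,
  sleep: Sleep,
  results: Map<string, Status>,
): Promise<void> {
  try {
    const all = batch.reduce<string[]>((acc, p) => acc.concat(p.actions), []);
    await send(all);
    for (const p of batch) results.set(p.programId, { status: "scheduled" });
    return;
  } catch {
    // Batch rejected: fall through to the granular path below.
  }
  for (let i = 0; i < batch.length; i++) {
    const p = batch[i];
    try {
      await send(p.actions);
      results.set(p.programId, { status: "scheduled" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.set(p.programId, { status: "error", message });
    }
    if (i < batch.length - 1) await sleep(200); // throttle between individual attempts
  }
}
```

The `results` map plays the role of the scheduleResults map: every program ends with either `scheduled` or `error`, whichever path ran.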



 

4. Sweep orphans at the end, don't prevent them mid-flight. sweepIncompleteProgramGroups runs once at the end of the deploy and removes any action group that didn't make it to a clean terminal state. We deliberately don't try to keep the channel internally consistent during the loop — that would mean a rollback path that itself has to fit in the 15-minute budget. Cleanup is a single sweep, not a transaction.
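One way to picture the sweep is as a pure function that decides which landed actions belong to incomplete groups. This is a sketch under loud assumptions: the article does not say how sweepIncompleteProgramGroups identifies a group or what "clean terminal state" means in code, so the shapes below (`LandedAction`, an expected-count map, `findOrphanActionNames`) are hypothetical, and "incomplete" is taken to mean "fewer actions landed than the program should have produced".

```typescript
// Hypothetical shape: an action that exists on the channel schedule, tagged with its program.
interface LandedAction { actionName: string; programId: string; }

// Return the names of landed actions that belong to partially deployed programs.
// Only programs from the current deploy (keys of `expected`) are considered.
function findOrphanActionNames(
  landed: LandedAction[],
  expected: Map<string, number>,
): string[] {
  const byProgram = new Map<string, string[]>();
  for (const action of landed) {
    const names = byProgram.get(action.programId) ?? [];
    names.push(action.actionName);
    byProgram.set(action.programId, names);
  }
  const orphans: string[] = [];
  expected.forEach((wanted, programId) => {
    const names = byProgram.get(programId) ?? [];
    // Partially landed group: some actions exist, but not the full set.
    if (names.length > 0 && names.length < wanted) orphans.push(...names);
  });
  return orphans;
}
```

In a real sweep the returned names would presumably be passed to MediaLive in a single delete request (the `Deletes.ActionNames` field of `BatchUpdateScheduleCommand`), which keeps the cleanup to one pass rather than a transaction.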



 

5. Don't grow the Lambda; replace it when the catalog does. For everything we deploy today the single-invocation path finishes well inside the budget. The honest scale-out is Step Functions chunking — partition the playlist, run chunks as parallel state machines, reassemble. That's the path for >1-month and 6-month playlists. It's designed, not deployed. Calling it the next step is more useful than pretending it's already running.
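Since the chunked design is not deployed, the following is only a sketch of its two bookend steps: partitioning a playlist and merging per-chunk results. The helper names and the chunk size of 360 programs (one month, the size the single-invocation path is documented to handle) are assumptions, and the Step Functions state machine itself is not shown.

```typescript
// Split a playlist into consecutive chunks of at most `chunkSize` items, preserving order.
function partition<T>(items: T[], chunkSize: number): T[][] {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

// Merge per-chunk result maps (programId -> status) into one map for the operator.
function reassemble<V>(chunkResults: Map<string, V>[]): Map<string, V> {
  const merged = new Map<string, V>();
  for (const chunk of chunkResults) {
    chunk.forEach((value, programId) => merged.set(programId, value));
  }
  return merged;
}

// Assumed chunk size: one month of programming per chunk.
const sixMonths = Array.from({ length: 2160 }, (_, i) => "program-" + i);
console.log(partition(sixMonths, 360).length); // 6
```

Each chunk would then be handed to its own invocation of the existing orchestrator. How chunks that target the same channel are sequenced against MediaLive's per-channel ordering is a design question this sketch does not answer.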



 

Results

  • In internal testing, a 195-program deployment completed in ~44 seconds — down from the multi-minute, single-program-at-a-time baseline.
  • The documented target for a 360-program (~1 month) playlist is ≤90 seconds, comfortably inside the 615-second backend timeout and the 15-minute Lambda ceiling.
  • One bad program in a batch no longer aborts the other 24. The operator gets a per-program status map back from every deploy.
  • The slow parts of the request — Mongo lookups and ffprobe — are paid once per unique resource, not once per schedule entry.
  • The 15-minute ceiling stopped being the limiting factor for the catalog sizes we actually ship. When it becomes one again, Step Functions chunking is the answer, not a bigger Lambda.



 



 

Technology Stack: AWS Lambda · AWS MediaLive · NestJS · MongoDB · Node 18 · ffprobe · TypeScript · AWS SDK v3




 

Tags: AWS Lambda · Serverless · Scheduling · Backend Engineering · AWS

About the Author

Pankaj

AI & Cloud Solutions Expert at MicrocosmWorks

Building innovative AI-powered solutions and helping businesses transform through cutting-edge technology.
