id: wiki-2026-0508-넷플릭스-코스모스-플랫폼-netflix-cosmos
title: Netflix Cosmos Platform
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Netflix Cosmos, Cosmos Platform, Netflix Media Cloud
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: netflix, distributed-systems, media-processing, workflow-engine, microservices
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Java/Kotlin
  framework: Cosmos/SpringBoot/Kafka

Netflix Cosmos Platform (넷플릭스 코스모스 플랫폼)

One-line summary

"A media-aware microservice platform: a trinity of workflow, service, and resource." Netflix designed Cosmos between 2018 and 2020 to replace Reloaded, its platform dedicated to transcoding and encoding. Cosmos standardizes every media operation on a three-part pattern: a stateless service, a persistent workflow, and a resource manager. As of 2026, all of Netflix's non-streaming video pipelines run on Cosmos.

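Core

Definition

  • Platform-as-a-product: application teams deploy their own services on top of Cosmos.
  • Trinity = Optimus (API/service) + Plato (workflow) + Stratum (compute pool).
  • Event-driven: the layers communicate asynchronously through a message queue. Netflix's published description names Timestone, an internal priority queuing system, as that backbone; Kafka appears in this note's examples as an illustrative stand-in. Services are written in Java with Spring Boot.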

Trinity components

  • Optimus: the external-facing API layer. Stateless; handles request validation and result aggregation.
  • Plato: the long-running workflow engine. Rule-based; expresses retries, sagas, and fork-join.
  • Stratum: the elastic compute pool. Runs ffmpeg, encoders, and ML inference on GPU or CPU.
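
The division of responsibilities among the three layers can be sketched with plain Java types. This is a minimal in-memory illustration, assuming hypothetical names (`TrinitySketch`, `ApiLayer`, `WorkflowLayer`, `ComputeLayer`); none of them are real Cosmos APIs, and the compute layer is faked with a string-building lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-ins for the three layers; not real Cosmos APIs.
final class TrinitySketch {
    /** Stratum role: run one stateless, compute-heavy function. */
    interface ComputeLayer {
        String run(String function, String input);
    }

    /** Plato role: own the multi-step sequence and call the compute layer per step. */
    static final class WorkflowLayer {
        private final ComputeLayer compute;
        WorkflowLayer(ComputeLayer compute) { this.compute = compute; }

        List<String> encodeAll(String sourceUri, List<String> profiles) {
            String probed = compute.run("probe", sourceUri);
            List<String> outputs = new ArrayList<>();
            for (String profile : profiles) {
                outputs.add(compute.run("encode:" + profile, probed));
            }
            return outputs;
        }
    }

    /** Optimus role: validate the request, delegate, and shape the response. */
    static final class ApiLayer {
        private final WorkflowLayer workflow;
        ApiLayer(WorkflowLayer workflow) { this.workflow = workflow; }

        Map<String, Object> submit(String sourceUri, List<String> profiles) {
            if (sourceUri == null || sourceUri.isBlank() || profiles.isEmpty()) {
                throw new IllegalArgumentException("sourceUri and profiles are required");
            }
            return Map.of("outputs", workflow.encodeAll(sourceUri, profiles));
        }
    }

    /** Wires the layers together with a fake compute function that only builds strings. */
    static ApiLayer wire() {
        ComputeLayer fake = (function, input) -> function + "(" + input + ")";
        return new ApiLayer(new WorkflowLayer(fake));
    }
}
```

Only the API layer is visible to callers; sequencing lives in the workflow layer and the heavy work in the compute layer, so each layer can be replaced or scaled independently.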

Applications

  1. Encoding pipeline (4K HDR, AV1).
  2. Studio post-production (color, VFX).
  3. Subtitle/dubbing automation.
  4. Trailer generation, content safety scan.

💻 Patterns

Optimus service skeleton (Spring Boot)

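import java.util.Map;
import org.springframework.web.bind.annotation.*;

// PlatoClient, EncodeRequest, and EncodeResponse are hypothetical types used for illustration.
@RestController
@RequestMapping("/v1/encode")
public class EncodeOptimus {
    private final PlatoClient plato;

    // Constructor injection; the final field must be assigned here.
    public EncodeOptimus(PlatoClient plato) { this.plato = plato; }

    @PostMapping
    public EncodeResponse submit(@RequestBody EncodeRequest req) {
        validate(req);
        // Start the workflow and return its id at once; the API layer holds no state.
        var workflowId = plato.start("encode-workflow-v3", Map.of(
            "sourceUri", req.sourceUri(),
            "profiles", req.profiles()
        ));
        return new EncodeResponse(workflowId);
    }

    // Map.of rejects null values, so reject incomplete requests before building the input map.
    private static void validate(EncodeRequest req) {
        if (req.sourceUri() == null || req.profiles() == null) {
            throw new IllegalArgumentException("sourceUri and profiles are required");
        }
    }
}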

Plato workflow definition (rule-based DSL)

workflow:
  id: encode-workflow-v3
  rules:
    - when: workflow.started
      do:
        - probe-source

    - when: probe-source.completed
      do:
        - fanout:
            for: each profile in input.profiles
            run: encode-segment

    - when: all encode-segment.completed
      do:
        - mux-final
        - publish-manifest

    - when: any.failed
      retry:
        max: 3
        backoff: exponential
      onExhausted:
        - notify-oncall
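
How "when → do" rules drive the workflow forward can be sketched as a toy forward-chaining loop. The code below is a minimal sketch, assuming a hypothetical `RuleSketch` class; it hard-codes the fan-out and fan-in from the definition above, omits the retry rule, and is not Plato's real engine or DSL.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Toy forward-chaining evaluator: a sketch of the "when -> do" idea only.
final class RuleSketch {
    /** Runs the encode rules for the given profiles and returns the steps in execution order. */
    static List<String> run(List<String> profiles) {
        // Rule table: event name -> steps to start when that event is observed.
        Map<String, List<String>> rules = Map.of(
            "workflow.started", List.of("probe-source"),
            "all-encode-segment.completed", List.of("mux-final", "publish-manifest"));

        List<String> executed = new ArrayList<>();
        Deque<String> events = new ArrayDeque<>(List.of("workflow.started"));
        int segmentsDone = 0;

        while (!events.isEmpty()) {
            String event = events.poll();
            if (event.equals("probe-source.completed")) {
                // Fan-out: one encode-segment step per requested profile.
                for (String profile : profiles) {
                    executed.add("encode-segment:" + profile);
                    events.add("encode-segment.completed");
                }
            } else if (event.equals("encode-segment.completed")) {
                // Fan-in: emit the join event only after every segment has reported.
                if (++segmentsDone == profiles.size()) {
                    events.add("all-encode-segment.completed");
                }
            }
            // Every step here completes immediately; a real engine would wait for workers.
            for (String step : rules.getOrDefault(event, List.of())) {
                executed.add(step);
                events.add(step + ".completed");
            }
        }
        return executed;
    }
}
```

For two profiles the returned order is probe, both segment encodes, then mux and publish. The engine never calls a step directly; each step runs only because an earlier event matched a rule.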

Stratum job submission

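import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// StratumJob, S3Uri, and the stratum client are hypothetical; the image name is a placeholder.
// libaom-av1 encodes on the CPU, so this job requests no GPU.
StratumJob job = StratumJob.builder()
    .image("netflixoss/ffmpeg-encoder:av1-v12")
    .cpu(8)
    .memory("32Gi")
    .input(new S3Uri("s3://prod-mezz/" + sourceKey))
    .output(new S3Uri("s3://prod-encoded/" + outKey))
    // -b:v 0 pairs with -crf to request constant-quality mode in libaom-av1.
    .args(List.of("-c:v", "libaom-av1", "-crf", "30", "-b:v", "0"))
    .timeout(Duration.ofMinutes(45))
    .build();

// Submission is asynchronous; the future completes when the job finishes or fails.
CompletableFuture<JobResult> result = stratum.submit(job);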

Event-driven step coordination

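import java.util.Map;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// WorkflowEvent, EventType, StratumClient, PlatoClient, and buildJob(...) are hypothetical.
@Component
public class EncodeSegmentHandler {
    private final StratumClient stratum;
    private final PlatoClient plato;

    public EncodeSegmentHandler(StratumClient stratum, PlatoClient plato) {
        this.stratum = stratum;
        this.plato = plato;
    }

    @KafkaListener(topics = "cosmos.workflow.events")
    public void onEvent(WorkflowEvent ev) {
        // React only to the start of an encode-segment step.
        if (ev.type() != EventType.STEP_STARTED) return;
        if (!"encode-segment".equals(ev.stepName())) return;
        var profile = ev.payload().get("profile");
        // Report the outcome back to the workflow engine once the job finishes.
        stratum.submit(buildJob(profile)).whenComplete((r, err) -> {
            if (err != null) plato.failStep(ev.stepId(), err);
            else plato.completeStep(ev.stepId(), Map.of("output", r.outputUri()));
        });
    }
}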

Fanout-fanin (parallel encode)

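import java.util.List;
import java.util.concurrent.CompletableFuture;

// WorkflowContext, startEncodeSegment(...), and the "segment-failed" signal name are hypothetical.
public class FanoutFaninCoordinator {
    public void onProbeComplete(WorkflowContext ctx) {
        List<String> profiles = ctx.input("profiles");
        // Fan out: one asynchronous encode per profile.
        List<CompletableFuture<Void>> tasks = profiles.stream()
            .map(p -> startEncodeSegment(ctx, p))
            .toList();

        // Fan in: signal success only if every segment succeeded, so a failure is not silently dropped.
        CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0]))
            .whenComplete((ignored, err) -> {
                if (err != null) ctx.signal("segment-failed");
                else ctx.signal("all-segments-done");
            });
    }
}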

Idempotency + dedup

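import java.util.concurrent.ConcurrentMap;
import org.springframework.stereotype.Service;
// PlatoClient, WorkflowId, EncodeRequest, and sha256(...) are hypothetical helpers.
@Service
public class IdempotentSubmit {
    private final ConcurrentMap<String, WorkflowId> idempotencyStore;
    private final PlatoClient plato;
    public IdempotentSubmit(ConcurrentMap<String, WorkflowId> store, PlatoClient plato) {
        this.idempotencyStore = store;
        this.plato = plato;
    }
    public WorkflowId submitOnce(EncodeRequest req) {
        // The "|" delimiter keeps different (uri, hash) pairs from producing the same input string.
        String key = sha256(req.sourceUri() + "|" + req.profilesHash());
        // computeIfAbsent takes a Function of the key, so the workflow starts only for an unseen key.
        return idempotencyStore.computeIfAbsent(key, k -> plato.start("encode-workflow-v3", req.toMap()));
    }
}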
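
The key derivation and the submit-once behavior can be exercised without any platform dependencies. The following is a minimal sketch, assuming Java 17+ (for `HexFormat`) and a hypothetical `IdempotencySketch` class; the in-memory map stands in for what would need to be a shared, durable store in a multi-instance service.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// In-memory illustration only; workflow ids are fake strings of the form "wf-N".
final class IdempotencySketch {
    private final ConcurrentHashMap<String, String> started = new ConcurrentHashMap<>();
    final AtomicInteger startCount = new AtomicInteger();

    /** Derives a stable SHA-256 key from the source URI and the set of profiles. */
    static String key(String sourceUri, List<String> profiles) {
        // Sorting makes the key independent of profile order; the newline
        // delimiter keeps ("ab", "c") distinct from ("a", "bc").
        String canonical = sourceUri + "\n" + String.join("\n", profiles.stream().sorted().toList());
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
        }
    }

    /** Starts a (fake) workflow at most once per key and returns the same id on repeats. */
    String submitOnce(String sourceUri, List<String> profiles) {
        return started.computeIfAbsent(key(sourceUri, profiles),
            k -> "wf-" + startCount.incrementAndGet());
    }
}
```

Submitting the same source with the profiles in a different order returns the original workflow id and leaves `startCount` at 1.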

Decision criteria

| Situation | Approach |
| --- | --- |
| Simple stateless API | Optimus only |
| Long-running, multi-step | Optimus + Plato |
| GPU-heavy (encoding/ML) | Optimus + Plato + Stratum |
| Synchronous sub-second response | Optimus only (no Plato) |
| Triggered by an external system | Optimus webhook + Plato saga |

Default: use the full trinity for multi-step media workflows, and Optimus alone for simple CRUD.
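
The decision criteria can be written down as a small function, which makes the precedence between rows explicit. This is a sketch under stated assumptions: the `ApproachSketch` class, its enum, and the three boolean inputs are hypothetical, and the ordering (a sub-second synchronous requirement wins over everything else) is one reading of the criteria rather than a documented rule.

```java
// Hypothetical encoding of the decision criteria; not part of any Cosmos API.
final class ApproachSketch {
    enum Approach { OPTIMUS_ONLY, OPTIMUS_PLATO, FULL_TRINITY }

    static Approach choose(boolean multiStep, boolean computeHeavy, boolean needsSubSecondSync) {
        // A synchronous sub-second response rules out an asynchronous workflow.
        if (needsSubSecondSync) return Approach.OPTIMUS_ONLY;
        // Heavy encoding or ML work goes through the compute layer, orchestrated by a workflow.
        if (computeHeavy) return Approach.FULL_TRINITY;
        // Long-running multi-step work needs the workflow engine but not the compute pool.
        if (multiStep) return Approach.OPTIMUS_PLATO;
        return Approach.OPTIMUS_ONLY;
    }
}
```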

🔗 Graph

  • Parent: Netflix Engineering · Workflow Orchestration
  • Variants: Conductor (Netflix) · Temporal · AWS Step Functions
  • Applications: AV1 Encoding Pipeline · Studio Post-Production Automation
  • Adjacent: Titus · Kafka · Saga Pattern · Event-Driven Architecture

🤖 LLM usage

When to use: designing a large-scale media platform, multi-step workflows combined with GPU compute, or an internal PaaS.
When not to use: small teams, simple CRUD apps, or systems with fewer than 10 services.

Anti-patterns

  • Cosmos for everything: forcing the full trinity onto a small CRUD service is over-engineering.
  • Stateful Optimus: long-lived state held in the API layer belongs in Plato.
  • Scheduling GPUs directly, bypassing Stratum: this fragments the cluster.
  • Unbounded workflow rule loops: without a max-iteration guard, costs run away.
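
A max-iteration guard for the last anti-pattern can be sketched as a budget on rule firings. The code below is a minimal illustration, assuming a hypothetical `LoopGuardSketch` class and a simplified rule table in which each event triggers at most one follow-up event; a real engine would more likely fail the workflow and alert than throw.

```java
import java.util.Map;

// Hypothetical guard: stops a cyclic rule set before it can run without bound.
final class LoopGuardSketch {
    /** Follows event -> next-event rules and returns the number of rule firings. */
    static int run(Map<String, String> rules, String firstEvent, int maxSteps) {
        int fired = 0;
        String event = firstEvent;
        while (rules.containsKey(event)) {
            if (fired == maxSteps) {
                // Fail fast instead of letting a cyclic rule set burn compute indefinitely.
                throw new IllegalStateException(
                    "rule budget of " + maxSteps + " exhausted at event " + event);
            }
            event = rules.get(event);
            fired++;
        }
        return fired;
    }
}
```

An acyclic chain finishes within its budget, while a cycle such as a → b → a hits the limit and fails instead of looping.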

🧪 Verification / Duplicates

  • Verified (Netflix Tech Blog "Cosmos Trinity" 2020, "Reloaded → Cosmos Migration" 2022, QCon talks 2023-2024).
  • Trust level A: official Netflix engineering material.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: trinity architecture + 6 patterns |