id: wiki-2026-0508-data-pipeline-orchestration
title: Data Pipeline Orchestration
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: data orchestration, Airflow, Prefect, Dagster, Kubeflow, DAG, ETL, ELT
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: data-engineering, orchestration, dag, airflow, prefect, dagster, mlops, etl, elt
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Airflow / Prefect / Dagster / Kubeflow / Argo

Data Pipeline Orchestration

In one line

"The conductor of the task dependency graph": an orchestrator combines a DAG, a schedule, retries, and observability. Airflow is the most common choice; Prefect and Dagster are the modern alternatives. Kubeflow and Argo cover ML workloads. The modern direction is asset-based (Dagster) plus declarative (dbt).

Core

Core components

  • DAG: the task graph.
  • Scheduler: triggers runs.
  • Executor: runs the tasks.
  • Metadata DB: stores run and task state.
  • Web UI: monitoring.
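
To make the roles above concrete, here is a minimal sketch in plain Python (standard library only, not how any of the tools below are implemented). The function `run_dag` and the state strings are hypothetical names chosen for illustration: the topological sort plays the scheduler, the `try` block plays the executor, and the `state` dict stands in for the metadata DB.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run callables in dependency order and record per-task state.

    tasks: dict of task name -> zero-argument callable
    deps:  dict of task name -> set of upstream task names
    """
    state = {}  # stands in for the metadata DB
    # Scheduler role: derive a valid execution order from the graph.
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, set())
        if any(state.get(u) != "success" for u in upstream):
            # Skip tasks whose upstream did not succeed.
            state[name] = "upstream_failed"
            continue
        # Executor role: actually run the task.
        try:
            tasks[name]()
            state[name] = "success"
        except Exception:
            state[name] = "failed"
    return state


log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
state = run_dag(tasks, deps)
```

A real orchestrator adds persistence, parallel execution, and retries on top of this loop, but the division of labor is the same.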

Tool comparison

| Tool | Strength | Weakness |
|---|---|---|
| Airflow | Mature, Python, large ecosystem | Older API, dynamic DAGs are tricky |
| Prefect | Modern, Pythonic | Smaller ecosystem |
| Dagster | Asset-aware, type-checked | Steep learning curve |
| Argo Workflows | K8s-native | Verbose YAML |
| Kubeflow Pipelines | ML-specific, K8s | Heavy |
| Temporal | Long-running workflows | Aimed more at application workflows |
| dbt | SQL transformations | SQL only |

Modern paradigms

Asset-based (Dagster)

  • The unit of work is an asset (a data product), not a task.
  • Lineage is explicit.
  • Partitions and backfills are first-class.

Declarative scheduling

  • Freshness SLAs instead of cron expressions.
  • Sensor-driven.

Data + ML unified

  • One orchestrator for ETL, training, and serving.
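
The asset idea can be sketched without Dagster. The toy registry below is an illustration only, not Dagster's implementation: `ASSETS`, `asset`, `materialize`, and `lineage` are hypothetical names. Each function declares its upstream assets through its parameter names, so lineage falls out of the function signatures.

```python
import inspect

ASSETS = {}  # asset name -> (function, list of upstream asset names)


def asset(fn):
    """Register fn as an asset; its parameter names are its upstream assets."""
    upstream = list(inspect.signature(fn).parameters)
    ASSETS[fn.__name__] = (fn, upstream)
    return fn


def materialize(name, cache=None):
    """Build an asset after recursively building everything upstream of it."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, upstream = ASSETS[name]
        cache[name] = fn(*(materialize(u, cache) for u in upstream))
    return cache[name]


def lineage(name):
    """Return every upstream asset of `name`, nearest first."""
    result = []
    for u in ASSETS[name][1]:
        result.append(u)
        result.extend(x for x in lineage(u) if x not in result)
    return result


@asset
def raw_users():
    return ["a@x.com", "A@X.com", "b@x.com"]


@asset
def cleaned_users(raw_users):
    # Normalize case and drop duplicates.
    return sorted({email.lower() for email in raw_users})


@asset
def user_count(cleaned_users):
    return len(cleaned_users)
```

Asking for `user_count` builds `raw_users` and `cleaned_users` first, and `lineage("user_count")` lists them without any separately maintained dependency declaration.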
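
A freshness SLA replaces "run at 02:00" with "this data must never be older than N hours". A minimal sketch of that check, assuming the orchestrator tracks a last-materialized timestamp per asset (`needs_refresh` and its parameters are hypothetical names):

```python
from datetime import datetime, timedelta


def needs_refresh(last_materialized, max_lag, now):
    """Return True when the asset is older than its freshness SLA allows."""
    if last_materialized is None:  # never built yet
        return True
    return now - last_materialized > max_lag


now = datetime(2026, 1, 1, 12, 0)
sla = timedelta(hours=6)

stale = needs_refresh(datetime(2026, 1, 1, 3, 0), sla, now)   # 9 hours old
fresh = needs_refresh(datetime(2026, 1, 1, 9, 0), sla, now)   # 3 hours old
```

A scheduler built on this idea evaluates the check periodically or on upstream events and triggers a run only when the result is True.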

Best practices

  1. Idempotent tasks: safe to retry.
  2. Atomic outputs: never leave partial results behind.
  3. Backfill design: historical reruns must be possible.
  4. Resource isolation: pools / queues.
  5. Observability: metrics, logs, lineage.
  6. Schema checks: Great Expectations (GE) / Pandera.
  7. Cost-aware: spot instances, right-sizing.
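
Atomic outputs (item 2) have no example elsewhere on this page, so here is one common way to get them for file outputs: write to a temporary file in the same directory, then rename it over the target. This is a sketch under the assumption that the output is a local JSON file; `write_json_atomic` is a hypothetical helper name.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write payload to path so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Clean up the partial temp file and leave any existing output untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the task crashes mid-write, the previous output is still intact and a retry starts from a clean state. Object stores and warehouses need their own equivalents, such as writing to a staging prefix or table and then swapping.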

Applications

  1. ETL: daily / hourly.
  2. ML training: scheduled retraining.
  3. Feature engineering.
  4. Reporting.
  5. Backup.
  6. Multi-step ML pipeline (data → train → eval → deploy).

💻 Patterns

Airflow DAG

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_data, transform_data, load_data, and validate_with_ge are assumed to be
# plain Python callables defined elsewhere in the project.

# Defaults applied to every task in the DAG.
default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_etl',
    default_args=default_args,
    schedule='0 2 * * *',  # daily at 02:00; Airflow < 2.4 uses schedule_interval instead
    start_date=datetime(2026, 1, 1),
    catchup=False,  # do not run the intervals missed since start_date
) as dag:

    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_data)
    validate = PythonOperator(task_id='validate', python_callable=validate_with_ge)

    # Linear dependency chain.
    extract >> transform >> load >> validate
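
What `retries` and `retry_delay` ask the orchestrator to do can be sketched in plain Python. This illustrates the semantics only and is not Airflow's implementation; `run_with_retries` is a hypothetical helper, and `sleep` is injectable so the example runs without waiting.

```python
import time


def run_with_retries(fn, retries=3, retry_delay=0.0, sleep=time.sleep):
    """Call fn; on an exception, wait and retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(retry_delay)


calls = []


def flaky():
    """Fail on the first two calls, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient error")
    return "ok"


result = run_with_retries(flaky, retries=3, sleep=lambda seconds: None)
```

With `retries=3` a task gets up to four attempts in total, which is why the task body has to be safe to run more than once.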

Prefect (modern Pythonic)

from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

@task(retries=3, retry_delay_seconds=60)
def extract():
    return fetch_from_source()

@task
def transform(data):
    return clean_and_aggregate(data)

@task
def load(transformed):
    db.bulk_insert(transformed)

@flow(task_runner=ConcurrentTaskRunner())
def daily_etl():
    raw = extract()
    clean = transform(raw)
    load(clean)

# Serve the flow on a daily 02:00 cron schedule (this call blocks while serving).
if __name__ == '__main__':
    daily_etl.serve(name='daily-etl', cron='0 2 * * *')

Dagster (asset-based)

from dagster import asset, AssetIn, MetadataValue

@asset
def raw_users():
    return fetch_users()

@asset
def cleaned_users(raw_users):
    df = remove_duplicates(raw_users)
    df = validate_schema(df)
    return df

@asset(ins={'users': AssetIn('cleaned_users')})
def user_metrics(users):
    metrics = compute_metrics(users)
    return metrics

# Each asset's lineage is explicit (taken from parameter names or AssetIn).
# Backfills and partitions are first-class.

dbt (SQL-driven)

-- models/staging/stg_users.sql
{{ config(materialized='view') }}

SELECT
    id,
    LOWER(email) AS email,
    created_at
FROM {{ source('raw', 'users') }}
WHERE deleted_at IS NULL

-- models/marts/dim_users.sql
{{ config(materialized='table') }}

SELECT *
FROM {{ ref('stg_users') }}

# Shell
dbt run    # build all models
dbt test   # run the tests defined for the models (e.g. schema tests)

Idempotent task design

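@task
def load_data(date: str, data: list):
    # ❌ Bad: plain append; a rerun inserts duplicate rows.
    # db.insert(data)

    # ✅ Good: replace the partition (or upsert), so a rerun yields the same rows.
    # `db` is a placeholder client. Bind `date` as a query parameter instead of
    # formatting it into the SQL string (placeholder syntax depends on the driver),
    # and run both statements in one transaction where the database supports it.
    db.execute("DELETE FROM events WHERE date = %s", (date,))
    db.bulk_insert(data)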
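
The same delete-then-insert pattern can be run end to end with the standard-library `sqlite3` module. This is a sketch: the `events` table layout and the `load_partition` name are assumptions made for the example.

```python
import sqlite3


def load_partition(conn, date, rows):
    """Replace all rows for one date inside a single transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("DELETE FROM events WHERE date = ?", (date,))
        conn.executemany(
            "INSERT INTO events (date, value) VALUES (?, ?)",
            [(date, value) for value in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT, value INTEGER)")

load_partition(conn, "2026-01-01", [1, 2, 3])
load_partition(conn, "2026-01-01", [1, 2, 3])  # simulated retry of the same run

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After the simulated retry the table still holds three rows for that date, which is the property that makes automatic retries and backfills safe.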

Backfill (historical)

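# Airflow 2.x CLI (shell): rerun daily_etl for every interval in 2024
airflow dags backfill -s 2024-01-01 -e 2024-12-31 daily_etl

# Dagster (partitioned asset, Python)
from dagster import DailyPartitionsDefinition, asset

@asset(partitions_def=DailyPartitionsDefinition(start_date='2024-01-01'))
def daily_data(context):
    date = context.partition_key  # e.g. '2024-01-01'
    return fetch_for_date(date)   # fetch_for_date: project-specific helper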
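
Conceptually, a backfill is a loop over historical partitions that calls the same idempotent task once per date. A minimal sketch of that loop, with hypothetical helper names (`daily_partitions`, `backfill`) and no orchestrator involved:

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield ISO date strings from start to end, inclusive."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)


def backfill(run_partition, start, end, already_done=()):
    """Run run_partition for each date not yet done; return the dates that ran."""
    done = set(already_done)
    ran = []
    for key in daily_partitions(start, end):
        if key in done:
            continue  # skip partitions that are already materialized
        run_partition(key)
        ran.append(key)
    return ran


processed = []
ran = backfill(
    processed.append,
    date(2024, 1, 30),
    date(2024, 2, 2),
    already_done={"2024-01-31"},
)
```

Orchestrators add parallelism, state tracking, and a UI on top, but the task only backfills cleanly if it takes the partition date as its input rather than reading "today".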

Schema validation in pipeline

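import great_expectations as ge

@task
def validate_schema(df):
    # load_suite is a project-specific helper that returns an expectation suite.
    suite = load_suite('users_quality')
    # ge.validate comes from older Great Expectations releases; newer releases
    # expose a different validation API, so check the installed version.
    result = ge.validate(df, expectation_suite=suite)
    if not result.success:
        # Fail the task so downstream tasks never see invalid data.
        raise ValueError(f'Schema validation failed: {result.results}')
    return df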
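
For readers without Great Expectations or Pandera installed, the idea of a schema gate can be shown with a plain-Python stand-in. This is not the API of either library; `check_rows` and the dict-of-types schema format are assumptions made for the example.

```python
def check_rows(rows, schema):
    """Return a list of human-readable violations; an empty list means valid.

    rows:   list of dicts, one per record
    schema: dict of column name -> expected Python type
    """
    errors = []
    for i, row in enumerate(rows):
        # Report required columns that are absent from this row.
        for col in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column {col!r}")
        # Report present columns whose value has the wrong type.
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                errors.append(
                    f"row {i}: {col!r} expected {expected.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return errors


schema = {"id": int, "email": str}
good = [{"id": 1, "email": "a@x.com"}]
bad = [{"id": "1", "email": "a@x.com"}, {"id": 2}]

good_errors = check_rows(good, schema)
bad_errors = check_rows(bad, schema)
```

In a pipeline the task would raise when the list is non-empty, so a schema drift stops the run at the boundary instead of breaking downstream consumers.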

Sensor (event-driven)

from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id='wait_for_input',
    filepath='/data/input.parquet',
    poke_interval=30,
    timeout=3600,
)

wait_for_file >> process_task

Resource isolation

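# Airflow pool: caps how many tasks run at once against a shared resource.
# The pool 'gpu_pool' must already exist (created in the UI or with the CLI).
process_task = PythonOperator(
    task_id='process',
    python_callable=heavy_task,
    pool='gpu_pool',  # dedicated to GPU work
    pool_slots=1,     # number of slots this task occupies
)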
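
A pool behaves like a counting semaphore shared by every task assigned to it. The sketch below models that with threads; it is an illustration of the concept, not Airflow code, and the `Pool` class is a hypothetical name.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pool:
    """A fixed set of slots; at most `slots` tasks hold one at a time."""

    def __init__(self, slots):
        self._sem = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest concurrency observed

    def run(self, fn, *args):
        with self._sem:  # blocks until a slot is free
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self.running -= 1


def heavy_task(n):
    time.sleep(0.01)  # stand-in for GPU work
    return n * n


gpu_pool = Pool(slots=2)
with ThreadPoolExecutor(max_workers=8) as workers:
    results = list(workers.map(lambda n: gpu_pool.run(heavy_task, n), range(6)))
```

Eight workers are available, but `gpu_pool.peak` never exceeds 2: the pool, not the worker count, bounds how many heavy tasks touch the shared resource at once.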

Observability (metrics + alert)

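from prometheus_client import Counter, Histogram

# prometheus_client signature: (name, documentation, labelnames).
task_runs = Counter('pipeline_runs_total', 'Pipeline runs by status', ['pipeline', 'status'])
task_duration = Histogram('pipeline_duration_seconds', 'Pipeline run duration in seconds', ['pipeline'])

@task
def my_task():
    # Record the duration whether the task succeeds or fails.
    with task_duration.labels('my_pipeline').time():
        try:
            do_work()
            task_runs.labels('my_pipeline', 'success').inc()
        except Exception:
            task_runs.labels('my_pipeline', 'failed').inc()
            raise  # re-raise so the orchestrator still sees the failure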

Cost-aware (spot for batch)

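# Illustrative only: what executor_config accepts depends on the executor in use.
# 'instance_type' and 'use_spot' are placeholder keys, not a built-in API.
@task(executor_config={'instance_type': 'g4dn.xlarge', 'use_spot': True})
def expensive_train():
    return train_model()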

Decision criteria

| Situation | Tool |
|---|---|
| Mature, large organization | Airflow |
| Modern, Pythonic | Prefect |
| Asset-driven + ML | Dagster |
| K8s-native | Argo / Kubeflow |
| SQL-only | dbt |
| Long-running application workflows | Temporal |
| Event-driven | Prefect / Dagster sensors |

Defaults: small to mid-size teams = Prefect. Large and asset-centric = Dagster. Ecosystem as the priority = Airflow.

🔗 Graph

🤖 LLM usage

When to use: ETL design, ML pipelines, batch job orchestration.
When not to use: a single one-shot script; stream-only workloads (use Kafka / Flink).

Anti-patterns

  • Crontab + bash for a complex DAG: fragile.
  • Non-idempotent tasks: retries corrupt data.
  • No backfill design: historical reruns are impossible.
  • No observability: failures go unnoticed.
  • No schema checks: downstream consumers break.
  • Heavy DAG (1000+ tasks): split it.

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tool comparison, plus Airflow / Prefect / Dagster / dbt / sensor / observability code |