---
id: wiki-2026-0508-data-pipeline-orchestration
title: Data Pipeline Orchestration
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [data orchestration, Airflow, Prefect, Dagster, Kubeflow, DAG, ETL, ELT]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [data-engineering, orchestration, dag, airflow, prefect, dagster, mlops, etl, elt]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Airflow / Prefect / Dagster / Kubeflow / Argo
---

# Data Pipeline Orchestration

## In one line

> **"The conductor of the task dependency graph."** DAG + schedule + retry + observability. Airflow (most common) → Prefect / Dagster (modern). Kubeflow / Argo for ML. Modern trends: asset-based (Dagster) and declarative (dbt).

## Core

### Core components

- **DAG**: the task graph.
- **Scheduler**: triggers runs.
- **Executor**: runs the tasks.
- **Metadata DB**: stores state.
- **Web UI**: monitoring.

### Tool comparison

| Tool | Strength | Weakness |
|---|---|---|
| Airflow | Mature, Python, large ecosystem | Older API, tricky dynamic DAGs |
| Prefect | Modern, Pythonic | Smaller ecosystem |
| Dagster | Asset-aware, type-checked | Steep learning curve |
| Argo Workflows | K8s-native | Verbose YAML |
| Kubeflow Pipelines | ML-specific, K8s | Heavy |
| Temporal | Long-running workflows | Aimed more at application workflows |
| dbt | SQL transformation | SQL only |

### Modern paradigms

#### Asset-based (Dagster)

- The unit of definition is the asset (data product), not the task.
- Lineage is explicit.
- Partitions and backfills are first-class.

#### Declarative scheduling

- Freshness SLAs instead of cron.
- Sensor-driven.

#### Data + ML unified

- One orchestrator for ETL + train + serve.

### Best practices

1. **Idempotent tasks**: safe to retry.
2. **Atomic outputs**: no partial writes.
3. **Backfill design**: historical runs can be rerun.
4. **Resource isolation**: pools / queues.
5. **Observability**: metrics + logs + lineage.
6. **Schema checks**: Great Expectations (GE) / Pandera.
7. **Cost-aware**: spot instances, right-sizing.

### Applications

1. **ETL**: daily / hourly.
2. **ML training**: retraining.
3. **Feature engineering**.
4. **Reporting**.
5. **Backup**.
6. **Multi-step ML pipeline** (data → train → eval → deploy).

## 💻 Patterns

### Airflow DAG

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_data, transform_data, load_data, and validate_with_ge are
# placeholders for callables defined elsewhere.

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_etl',
    default_args=default_args,
    schedule='0 2 * * *',  # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_data)
    validate = PythonOperator(task_id='validate', python_callable=validate_with_ge)

    extract >> transform >> load >> validate
```

### Prefect (modern Pythonic)

```python
from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

# fetch_from_source, clean_and_aggregate, and db are placeholders defined elsewhere.

@task(retries=3, retry_delay_seconds=60)
def extract():
    return fetch_from_source()

@task
def transform(data):
    return clean_and_aggregate(data)

@task
def load(transformed):
    db.bulk_insert(transformed)

@flow(task_runner=ConcurrentTaskRunner())
def daily_etl():
    raw = extract()
    clean = transform(raw)
    load(clean)

# Deploy: serve the flow on a daily 02:00 cron schedule
if __name__ == '__main__':
    daily_etl.serve(name='daily-etl', cron='0 2 * * *')
```

### Dagster (asset-based)

```python
from dagster import AssetIn, asset

# fetch_users, remove_duplicates, validate_schema, and compute_metrics are
# placeholders defined elsewhere.

@asset
def raw_users():
    return fetch_users()

@asset
def cleaned_users(raw_users):
    df = remove_duplicates(raw_users)
    df = validate_schema(df)
    return df

@asset(ins={'users': AssetIn('cleaned_users')})
def user_metrics(users):
    metrics = compute_metrics(users)
    return metrics

# Lineage between assets is explicit (parameter names / AssetIn).
# Backfills and partitions are first-class.
```

### dbt (SQL-driven)

```sql
-- models/staging/stg_users.sql
{{ config(materialized='view') }}

SELECT
    id,
    LOWER(email) AS email,
    created_at
FROM {{ source('raw', 'users') }}
WHERE deleted_at IS NULL

-- models/marts/dim_users.sql
{{ config(materialized='table') }}

SELECT * FROM {{ ref('stg_users') }}
```

```bash
dbt run   # build all models
dbt test  # run schema tests
```

### Idempotent task design

```python
# `task` is the Prefect decorator from the example above; `db` is a placeholder client.
@task
def load_data(date: str, data: list):
    # ❌ Bad: plain append (a rerun duplicates rows)
    # db.insert(data)

    # ✅ Good: replace the partition (delete + insert), ideally in one transaction.
    # Bind `date` as a parameter instead of formatting it into the SQL string;
    # the placeholder syntax (%s, ?, :name) depends on the driver.
    db.execute("DELETE FROM events WHERE date = %s", (date,))
    db.bulk_insert(data)
```

### Backfill (historical)

```python
# Airflow (shell command, Airflow 2.x CLI):
#   airflow dags backfill -s 2024-01-01 -e 2024-12-31 daily_etl

# Dagster (partitioned asset); fetch_for_date is a placeholder.
from dagster import DailyPartitionsDefinition, asset

@asset(partitions_def=DailyPartitionsDefinition(start_date='2024-01-01'))
def daily_data(context):
    date = context.partition_key
    return fetch_for_date(date)
```

### Schema validation in pipeline

```python
import great_expectations as ge

# Uses the legacy top-level ge.validate API; newer Great Expectations releases
# use a context-based API instead. load_suite is a placeholder helper.
@task
def validate_schema(df):
    suite = load_suite('users_quality')
    result = ge.validate(df, expectation_suite=suite)
    if not result.success:
        raise ValueError(f'Schema validation failed: {result.results}')
    return df
```

### Sensor (event-driven)

```python
from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id='wait_for_input',
    filepath='/data/input.parquet',
    poke_interval=30,  # seconds between checks
    timeout=3600,      # give up after one hour
)

# process_task is a downstream task defined elsewhere in the DAG
wait_for_file >> process_task
```

### Resource isolation

```python
# Airflow pool (the pool must already exist in Airflow)
process_task = PythonOperator(
    task_id='process',
    python_callable=heavy_task,
    pool='gpu_pool',  # dedicated to GPU work
    pool_slots=1,
)
```

### Observability (metrics + alert)

```python
from prometheus_client import Counter, Histogram

# Signature is (name, documentation, labelnames); the help string is required.
task_runs = Counter('pipeline_runs_total', 'Pipeline runs by status', ['pipeline', 'status'])
task_duration = Histogram('pipeline_duration_seconds', 'Pipeline run duration', ['pipeline'])

@task
def my_task():
    with task_duration.labels('my_pipeline').time():
        try:
            do_work()
            task_runs.labels('my_pipeline', 'success').inc()
        except Exception:
            task_runs.labels('my_pipeline', 'failed').inc()
            raise
```

### Cost-aware (spot for batch)

```python
# Pseudocode: the executor_config keys below are illustrative only. Real spot and
# instance-type settings are specific to the executor or infrastructure in use.
@task(executor_config={'instance_type': 'g4dn.xlarge', 'use_spot': True})
def expensive_train():
    return train_model()
```

## Decision criteria

| Situation | Tool |
|---|---|
| Mature large org | Airflow |
| Modern Pythonic | Prefect |
| Asset-driven + ML | Dagster |
| K8s-native | Argo / Kubeflow |
| SQL-only | dbt |
| Long-running app workflow | Temporal |
| Event-driven | Prefect / Dagster sensor |

**Default**: small to mid-size = Prefect. Large + asset-centric = Dagster. Ecosystem priority = Airflow.

## 🔗 Graph

- Parent: [[Data-Engineering]] · [[MLOps]] · [[DevOps]]
- Variants: [[Airflow]] · [[Prefect]] · [[Dagster]] · [[dbt]] · [[Temporal]]
- Applications: [[ETL]] · [[ELT]]
- Adjacent: [[Concept-Drift]] · [[Data Cleaning Algorithms]] · [[Data-Flywheel-Effect]] · [[Bottlenecks]]

## 🤖 LLM usage

**When**: ETL design. ML pipelines. Batch job orchestration.

**When not**: a single one-shot script. Stream-only workloads (use Kafka / Flink).

## ❌ Anti-patterns

- **Crontab + bash for a complex DAG**: fragile.
- **Non-idempotent tasks**: retries corrupt data.
- **No backfill design**: historical runs cannot be rerun.
- **No observability**: silent failures.
- **No schema checks**: downstream breakage.
- **Heavy DAG (1000+ tasks)**: split it.

## 🧪 Verification / duplicates

- Verified (Airflow / Prefect / Dagster docs, Designing Data-Intensive Applications).
- Trust level: A.
- Related: [[Data Cleaning Algorithms]] · [[Concept-Drift]] · [[Data-Flywheel-Effect]] · [[Bottlenecks]] · [[CI_CD 파이프라인 및 IDE 통합 보안]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tool comparison + Airflow / Prefect / Dagster / dbt / sensor / observability code |
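
## Sketch: idempotent, atomic load

Best practices 1 and 2 (idempotent tasks, atomic outputs) are easiest to see in a runnable form. The following is a minimal sketch, not a reference implementation: it assumes an in-memory SQLite database stands in for the warehouse, and the `events` table plus the helpers `load_partition` and `count_rows` are hypothetical names introduced only for illustration.

```python
import sqlite3


def load_partition(conn, date, rows):
    """Replace every row for `date` inside one transaction.

    Rerunning the same load cannot create duplicates (idempotent), and a
    failure midway rolls the delete back (atomic).
    """
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM events WHERE date = ?", (date,))
        conn.executemany(
            "INSERT INTO events (date, value) VALUES (?, ?)",
            [(date, value) for value in rows],
        )


def count_rows(conn, date):
    """Return how many rows exist for `date`."""
    cursor = conn.execute("SELECT COUNT(*) FROM events WHERE date = ?", (date,))
    return cursor.fetchone()[0]


# Hypothetical schema: one row per (date, value); value may not be NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT NOT NULL, value INTEGER NOT NULL)")

load_partition(conn, "2026-01-01", [1, 2, 3])
load_partition(conn, "2026-01-01", [1, 2, 3])  # simulated retry or backfill
rows_after_retry = count_rows(conn, "2026-01-01")
```

Because the delete and the inserts share one transaction, a load that fails partway (for example on a `NOT NULL` violation) leaves the previous partition untouched instead of an empty or half-written one. The same delete-then-insert-per-partition shape is what makes a date-range backfill safe to repeat.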