---
id: wiki-2026-0508-principles-of-data-connect
title: Principles of Data Connect
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Data Integration Principles, ETL Design]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [data-engineering, etl, integration]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: dbt
---

# Principles of Data Connect

## In one line

> **"Design rules for reliable source-to-warehouse pipelines."** These principles are distilled from the evolution from Inmon (1990s warehouse) to Kimball (star schema) to the modern data stack (Fivetran/Airbyte → Snowflake/BigQuery → dbt).

## Core

### The principles

1. **Idempotent loads**: re-running a load produces the same result.
2. **Schema-on-read tolerance**: tolerate source schema drift by landing data untyped and typing it downstream.
3. **Replayability**: store raw data, transform downstream.
4. **Incremental + full-refresh**: both modes are supported.
5. **Observability**: row counts, freshness, anomaly alerts.
6. **Lineage**: every column traces back to its source.
7. **Privacy / PII**: sensitive fields are masked or never pulled.

### Modern stack (2026)

- Extract-Load: Fivetran, Airbyte, Stitch.
- Warehouse: Snowflake, BigQuery, Databricks.
- Transform: dbt (most widely used), Coalesce, SQLMesh.
- Orchestrate: Airflow, Dagster, Prefect.
- Observability: Monte Carlo, Datafold, Elementary.

### Applications

1. Analytics (BI dashboards).
2. ML feature stores.
3. Reverse-ETL to operational tools (Hightouch, Census).

## 💻 Patterns

### Idempotent upsert (MERGE)

```sql
MERGE INTO dim_customer t
USING staging_customer s
  ON t.customer_id = s.customer_id
-- Only overwrite when the staged row is newer, so replaying a batch is a no-op.
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT (...) VALUES (...);
```

### dbt incremental model

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns'
) }}

select *
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
  -- incremental run: only rows ingested after the latest row already loaded
  where _ingested_at > (select max(_ingested_at) from {{ this }})
{% endif %}
```

### Schema-on-read (raw landing)

```sql
-- raw zone: VARIANT / JSON column, no schema enforcement
CREATE TABLE raw.events (
  _ingested_at TIMESTAMP,
  _source      STRING,
  payload      VARIANT
);

-- bronze: typed extraction
CREATE VIEW bronze.events AS
SELECT
  _ingested_at,
  payload:event_type::STRING AS event_type,
  ...
FROM raw.events;
```

### Data quality test (dbt)

```yaml
# models/marts/orders.yml
version: 2
models:
  - name: dim_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: total_amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000
```

### Lineage (dbt-generated graph)

```bash
dbt docs generate
dbt docs serve   # model-level lineage graph in the browser;
                 # column-level lineage needs dbt Cloud or a separate lineage tool
```

### PII masking (applied at query time)

```sql
-- Snowflake dynamic masking: data is stored unmasked and masked per role on read.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYTICS_ADMIN') THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '***@')
  END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```

### Freshness SLA (dbt)

```yaml
sources:
  - name: stripe
    freshness:
      warn_after: { count: 1, period: hour }
      error_after: { count: 6, period: hour }
    loaded_at_field: _ingested_at
```

## Decision criteria

| Need | Tool |
|---|---|
| SaaS source ingestion | Fivetran / Airbyte |
| Transform | dbt |
| Orchestration | Dagster (modern) / Airflow (mature) |
| Observability | Monte Carlo / Elementary |
| Reverse ETL | Hightouch / Census |

**Default**: Fivetran → Snowflake → dbt → Hightouch, with dbt tests and Elementary.

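To make the idempotent-load principle concrete, the following minimal Python sketch replays the same batch twice and checks that the table does not change. It uses an in-memory SQLite database as a stand-in for the warehouse, and SQLite's `INSERT ... ON CONFLICT` in place of `MERGE`. The table name follows the MERGE pattern in this note; the `email` column and the helpers `load_batch` and `snapshot` are hypothetical names chosen for illustration.

```python
import sqlite3

# Upsert keyed on customer_id. The WHERE clause plays the role of
# "WHEN MATCHED AND s.updated_at > t.updated_at" in the MERGE pattern:
# an existing row is only overwritten by a strictly newer record.
UPSERT_SQL = """
INSERT INTO dim_customer (customer_id, email, updated_at)
VALUES (?, ?, ?)
ON CONFLICT(customer_id) DO UPDATE SET
    email = excluded.email,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > dim_customer.updated_at
"""


def load_batch(conn, rows):
    """Upsert a batch in one transaction and return the table's row count."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)
    return conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]


def snapshot(conn):
    """Return the full table contents in a stable order, for comparison."""
    return conn.execute(
        "SELECT customer_id, email, updated_at "
        "FROM dim_customer ORDER BY customer_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    "customer_id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)"
)

# ISO-8601 timestamps stored as text compare correctly as strings.
batch = [
    (1, "a@example.com", "2026-05-01T00:00:00"),
    (2, "b@example.com", "2026-05-02T00:00:00"),
]

count_first = load_batch(conn, batch)
state_first = snapshot(conn)

count_second = load_batch(conn, batch)  # replay the identical batch
state_second = snapshot(conn)

# A late-arriving, older record must not overwrite the current row.
load_batch(conn, [(1, "old@example.com", "2026-04-01T00:00:00")])
state_after_stale = snapshot(conn)

print("row counts:", count_first, count_second)
print("replay changed table:", state_first != state_second)
```

Comparing row counts and a full snapshot before and after the replay also doubles as a small observability check: the same comparison can run after each load in a real pipeline, under the assumption that the table is small enough to snapshot or that a checksum is used instead.
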
## 🔗 Graph

- Parent: [[Data-Engineering]]
- Variants: [[ETL]] · [[ELT]]
- Applications: [[Feature-Store]]
- Adjacent: [[dbt]] · [[Snowflake-Data-Warehousing]] · [[Airflow]]

## 🤖 LLM usage

**When**: data-pipeline design, ETL architecture review, warehouse migration.

**When not**: streaming-only or event-driven systems (use Kafka patterns instead).

## ❌ Anti-patterns

- **Transform-on-extract**: loses replay capability, because the raw data is never stored.
- **No idempotency**: re-runs corrupt the warehouse.
- **Untested models**: breakage goes unnoticed.
- **Unmasked PII in the raw zone**: compliance risk.

## 🧪 Verification / duplicates

- Verified against Kimball, *The Data Warehouse Toolkit*; Modern Data Stack docs; dbt best practices.
- Confidence: A-.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Data Connect FULL with modern data stack patterns |