Files
2nd/10_Wiki/Topics/Backend/Principles-of-Data-Connect.md
T
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
10_Wiki/Topics 대규모 정리:
- 오류 캡처/미완성 stub 문서 227개 제거
- 교차폴더 중복 43클러스터 병합 (63파일 → redirect)
- 링크명 정규화: 깨진 링크 수정·redirect 직결·개념 매핑 ~2,400건
- 카테고리 MOC 6개 신규 생성
- Graph 섹션 미해결 related-keyword 링크 10,058건 제거

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

4.5 KiB

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack
id title category status canonical_id aliases duplicate_of source_trust_level confidence_score verification_status tags raw_sources last_reinforced github_commit tech_stack
wiki-2026-0508-principles-of-data-connect Principles of Data Connect 10_Wiki/Topics verified self
Data Integration Principles
ETL Design
none A 0.85 applied
data-engineering
etl
integration
2026-05-10 pending
language framework
Python dbt

Principles of Data Connect

매 한 줄

"매 source-to-warehouse 의 reliable pipe 의 design rules". 매 Inmon (1990s warehouse) → 매 Kimball (star schema) → 매 modern data stack (Fivetran/Airbyte → Snowflake/BigQuery → dbt) 의 evolution 의 distilled principles.

매 핵심

매 the principles

  1. Idempotent loads — re-run produces same result.
  2. Schema-on-read tolerance — handle source schema drift.
  3. Replayability — store raw, transform downstream.
  4. Incremental + full-refresh — both modes supported.
  5. Observability — row counts, freshness, anomaly alerts.
  6. Lineage — every column traces to source.
  7. Privacy / PII — masked or never-pulled.

매 modern stack (2026)

  • Extract-Load: Fivetran, Airbyte, Stitch.
  • Warehouse: Snowflake, BigQuery, Databricks.
  • Transform: dbt (most-prevalent), Coalesce, SQLMesh.
  • Orchestrate: Airflow, Dagster, Prefect.
  • Observability: Monte Carlo, Datafold, Elementary.

매 응용

  1. Analytics (BI dashboards).
  2. ML feature stores.
  3. Reverse-ETL to operational tools (Hightouch, Census).

💻 패턴

Idempotent upsert (MERGE)

MERGE INTO dim_customer t
USING staging_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

dbt incremental model

{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}

select *
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
where _ingested_at > (select max(_ingested_at) from {{ this }})
{% endif %}

Schema-on-read (raw landing)

-- raw zone: VARIANT / JSON column, no schema enforcement
CREATE TABLE raw.events (
  _ingested_at TIMESTAMP,
  _source      STRING,
  payload      VARIANT
);

-- bronze: typed extraction
CREATE VIEW bronze.events AS
SELECT _ingested_at, payload:event_type::STRING AS event_type, ...
FROM raw.events;

Data quality test (dbt)

# models/marts/orders.yml
version: 2
models:
  - name: dim_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: total_amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000

Lineage (dbt-generated graph)

dbt docs generate
dbt docs serve   # column-level lineage in browser

PII masking on load

CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYTICS_ADMIN') THEN val
       ELSE REGEXP_REPLACE(val, '.+@', '***@') END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

Freshness SLA (dbt)

sources:
  - name: stripe
    freshness:
      warn_after: { count: 1, period: hour }
      error_after: { count: 6, period: hour }
    loaded_at_field: _ingested_at

매 결정 기준

Need Tool
SaaS source ingestion Fivetran / Airbyte
Transform dbt
Orchestration Dagster (modern) / Airflow (mature)
Observability Monte Carlo / Elementary
Reverse ETL Hightouch / Census

기본값: Fivetran → Snowflake → dbt → Hightouch + dbt-tests + Elementary.

🔗 Graph

🤖 LLM 활용

언제: data-pipeline design, ETL architecture review, warehouse migration. 언제 X: streaming-only / event-driven systems (use Kafka patterns instead).

안티패턴

  • Transform-on-extract: 매 lose replay capability.
  • No idempotency: re-runs corrupt warehouse.
  • Untested models: 매 silent breakage.
  • PII in raw zone unmasked: compliance risk.

🧪 검증 / 중복

  • Verified (Kimball — Data Warehouse Toolkit; Modern Data Stack docs; dbt best practices).
  • 신뢰도 A-.

🕓 Changelog

날짜 변경
2026-05-08 Phase 1
2026-05-10 Manual cleanup — Data Connect FULL with modern data stack patterns