id: wiki-2026-0508-principles-of-data-connect
title: Principles of Data Connect
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Data Integration Principles, ETL Design
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: data-engineering, etl, integration
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: dbt
Principles of Data Connect
In one line
"Design rules for a reliable source-to-warehouse pipe." Principles distilled from the evolution Inmon (1990s warehouse) → Kimball (star schema) → modern data stack (Fivetran/Airbyte → Snowflake/BigQuery → dbt).
Core
The principles
Idempotent loads — re-run produces same result.
Schema-on-read tolerance — handle source schema drift.
Replayability — store raw, transform downstream.
Incremental + full-refresh — both modes supported.
Observability — row counts, freshness, anomaly alerts.
Lineage — every column traces to source.
Privacy / PII — masked or never-pulled.
Modern stack (2026)
Extract-Load: Fivetran, Airbyte, Stitch.
Warehouse: Snowflake, BigQuery, Databricks.
Transform: dbt (most widely used), Coalesce, SQLMesh.
Orchestrate: Airflow, Dagster, Prefect.
Observability: Monte Carlo, Datafold, Elementary.
Applications
Analytics (BI dashboards).
ML feature stores.
Reverse-ETL to operational tools (Hightouch, Census).
💻 Patterns
Idempotent upsert (MERGE)
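A minimal sketch of an idempotent load, assuming a hypothetical `customers` table keyed on `id`. SQLite has no `MERGE` statement, so `INSERT ... ON CONFLICT DO UPDATE` stands in for it here; on Snowflake or BigQuery the same idea would be written as `MERGE`. The table, columns, and `upsert_customers` helper are illustrative only.

```python
import sqlite3


def upsert_customers(conn, rows):
    """Insert new rows and overwrite existing ones, matching on the primary key."""
    conn.executemany(
        """
        INSERT INTO customers (id, email, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            email = excluded.email,
            updated_at = excluded.updated_at
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)"
)

batch = [(1, "a@example.com", "2026-05-01"), (2, "b@example.com", "2026-05-02")]
upsert_customers(conn, batch)
upsert_customers(conn, batch)  # replaying the same batch must not duplicate rows

rows_after_replay = conn.execute(
    "SELECT id, email, updated_at FROM customers ORDER BY id"
).fetchall()
```

Because the write is keyed on `id`, a retried or replayed batch leaves the table in the same state as a single run.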
dbt incremental model
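In dbt an incremental model is SQL that filters on `is_incremental()` and is rebuilt from scratch with `--full-refresh`. The plain-Python sketch below only imitates that logic with a high-water mark, so the two modes can be seen side by side; `run_model`, the row shape, and the `updated_at` column are assumptions, not dbt APIs.

```python
def run_model(source_rows, target, full_refresh=False):
    """Load rows newer than the target's high-water mark; return rows processed.

    target is a dict keyed by row id, standing in for the warehouse table.
    """
    if full_refresh:
        target.clear()  # rebuild from scratch
    # ISO-8601 date strings compare correctly as plain strings.
    high_water = max((row["updated_at"] for row in target.values()), default="")
    new_rows = [row for row in source_rows if row["updated_at"] > high_water]
    for row in new_rows:
        target[row["id"]] = row  # keyed write keeps the load idempotent
    return len(new_rows)


source = [
    {"id": 1, "updated_at": "2026-05-01"},
    {"id": 2, "updated_at": "2026-05-02"},
]
target = {}

first = run_model(source, target)   # initial load
second = run_model(source, target)  # nothing new since the last run
source.append({"id": 3, "updated_at": "2026-05-03"})
third = run_model(source, target)   # only the new row
rebuilt = run_model(source, target, full_refresh=True)  # everything again
```

A strict `>` comparison skips late-arriving rows that share the current high-water timestamp, which is one reason to keep the full-refresh path available.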
Schema-on-read (raw landing)
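One way to sketch a raw landing zone: store each source record untouched as JSON and apply a schema only when reading, so new or missing source fields do not break the load. The `raw_events` table and the `land` / `read_orders` helpers are hypothetical names chosen for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (loaded_at TEXT, payload TEXT)")


def land(conn, record, loaded_at):
    """Store the source record as-is; no columns are dropped or renamed."""
    conn.execute(
        "INSERT INTO raw_events VALUES (?, ?)", (loaded_at, json.dumps(record))
    )


def read_orders(conn):
    """Apply the schema at read time, tolerating missing and extra fields."""
    orders = []
    query = "SELECT payload FROM raw_events ORDER BY loaded_at"
    for (payload,) in conn.execute(query):
        doc = json.loads(payload)
        orders.append(
            {
                "order_id": doc.get("order_id"),
                "amount": doc.get("amount"),
                "currency": doc.get("currency"),  # None if the source omits it
            }
        )
    return orders


# The second record has drifted: "currency" is gone and "coupon" is new.
land(conn, {"order_id": 1, "amount": 10.0, "currency": "EUR"}, "2026-05-01T00:00:00")
land(conn, {"order_id": 2, "amount": 5.0, "coupon": "X"}, "2026-05-02T00:00:00")
orders = read_orders(conn)
```

The unknown `coupon` field stays in the raw payload, so a later transform can pick it up by replaying the landing table without re-extracting from the source.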
Data quality test (dbt)
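dbt ships generic tests such as `not_null` and `unique`, and a test passes when its query returns zero rows. The sketch below mirrors that convention in plain Python; the functions are illustrative stand-ins, not dbt's implementation, and the sample rows are made up.

```python
from collections import Counter


def not_null(rows, column):
    """Return the rows that violate the check; an empty list means pass."""
    return [row for row in rows if row.get(column) is None]


def unique(rows, column):
    """Return the values that appear more than once; an empty list means pass."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]

null_failures = not_null(rows, "email")
duplicate_ids = unique(rows, "id")
```

Returning the failing rows rather than a boolean makes it straightforward to log or quarantine exactly what broke.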
Lineage (dbt-generated graph)
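dbt derives a model-level dependency graph from `ref()` and `source()` calls and renders it through `dbt docs generate`. The sketch below is a hand-written, column-level version of the same idea, to show how a column can be traced back to its sources; the `LINEAGE` map and all column names are hypothetical, and the graph is assumed to be acyclic.

```python
# Each column maps to the upstream columns it is computed from.
LINEAGE = {
    "mart.revenue.amount_usd": ["stg.orders.amount", "stg.fx.rate"],
    "stg.orders.amount": ["raw.orders.amount"],
    "stg.fx.rate": ["raw.fx.rate"],
}


def trace_to_sources(column, lineage):
    """Walk upstream until reaching columns with no recorded parents."""
    parents = lineage.get(column)
    if not parents:
        return {column}  # no parents recorded: treat as a source column
    sources = set()
    for parent in parents:
        sources |= trace_to_sources(parent, lineage)
    return sources


revenue_sources = trace_to_sources("mart.revenue.amount_usd", LINEAGE)
```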
PII masking on load
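A minimal sketch of masking PII before a record reaches the raw zone, assuming a keyed hash (HMAC-SHA256) is an acceptable pseudonymization for the use case. The `PII_COLUMNS` set, the `mask_record` helper, and the demo key are illustrative; a real key would come from a secrets manager.

```python
import hashlib
import hmac

PII_COLUMNS = {"email", "phone"}  # hypothetical list of sensitive columns


def mask_record(record, key):
    """Replace PII values with a keyed hash; leave other columns untouched."""
    masked = {}
    for column, value in record.items():
        if column in PII_COLUMNS and value is not None:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            masked[column] = digest.hexdigest()
        else:
            masked[column] = value
    return masked


key = b"demo-key-not-for-production"
record = {"id": 7, "email": "a@example.com", "plan": "pro"}
masked = mask_record(record, key)
```

The hash is deterministic for a given key, so masked values still join across tables while the original value never lands in the warehouse.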
Freshness SLA (dbt)
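dbt expresses source freshness with `warn_after` and `error_after` thresholds on a `loaded_at_field`, checked by `dbt source freshness`. The Python sketch below reproduces only the threshold comparison; the function name and the 12-hour and 24-hour limits are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_loaded_at, now, warn_after, error_after):
    """Classify data age against the two SLA thresholds."""
    age = now - last_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


now = datetime(2026, 5, 10, 12, 0, tzinfo=timezone.utc)
warn_after = timedelta(hours=12)
error_after = timedelta(hours=24)

fresh = freshness_status(now - timedelta(hours=1), now, warn_after, error_after)
stale = freshness_status(now - timedelta(hours=13), now, warn_after, error_after)
broken = freshness_status(now - timedelta(hours=30), now, warn_after, error_after)
```

Passing `now` in explicitly keeps the check deterministic and easy to test.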
Decision criteria
Need | Tool
SaaS source ingestion | Fivetran / Airbyte
Transform | dbt
Orchestration | Dagster (modern) / Airflow (mature)
Observability | Monte Carlo / Elementary
Reverse ETL | Hightouch / Census
Default: Fivetran → Snowflake → dbt → Hightouch, with dbt tests and Elementary.
🔗 Graph
🤖 LLM usage
When to use: data-pipeline design, ETL architecture review, warehouse migration.
When not to use: streaming-only / event-driven systems (use Kafka patterns instead).
❌ Anti-patterns
Transform-on-extract: the raw data is never stored, so replay capability is lost.
No idempotency: re-runs corrupt the warehouse.
Untested models: breakage goes unnoticed.
Unmasked PII in the raw zone: compliance risk.
🧪 Verification / Duplicates
Verified against Kimball, The Data Warehouse Toolkit; Modern Data Stack docs; dbt best practices.
Trust level: A-.
🕓 Changelog
Date | Change
2026-05-08 | Phase 1
2026-05-10 | Manual cleanup: Data Connect FULL with modern data stack patterns