2nd/10_Wiki/Topics/Backend/Principles-of-Data-Connect.md

---
id: wiki-2026-0508-principles-of-data-connect
title: Principles of Data Connect
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Data Integration Principles, ETL Design]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [data-engineering, etl, integration]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: dbt
---

# Principles of Data Connect

## 매 한 줄
> **"매 source-to-warehouse 의 reliable pipe 의 design rules"**. 매 Inmon (1990s warehouse) → 매 Kimball (star schema) → 매 modern data stack (Fivetran/Airbyte → Snowflake/BigQuery → dbt) 의 evolution 의 distilled principles.

## 매 핵심

### 매 the principles
1. **Idempotent loads** — re-run produces same result.
2. **Schema-on-read tolerance** — handle source schema drift.
3. **Replayability** — store raw, transform downstream.
4. **Incremental + full-refresh** — both modes supported.
5. **Observability** — row counts, freshness, anomaly alerts.
6. **Lineage** — every column traces to source.
7. **Privacy / PII** — masked or never-pulled.

### 매 modern stack (2026)
- Extract-Load: Fivetran, Airbyte, Stitch.
- Warehouse: Snowflake, BigQuery, Databricks.
- Transform: dbt (most-prevalent), Coalesce, SQLMesh.
- Orchestrate: Airflow, Dagster, Prefect.
- Observability: Monte Carlo, Datafold, Elementary.

### 매 응용
1. Analytics (BI dashboards).
2. ML feature stores.
3. Reverse-ETL to operational tools (Hightouch, Census).

## 💻 패턴

### Idempotent upsert (MERGE)
```sql
MERGE INTO dim_customer t
USING staging_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);
```

### dbt incremental model
```sql
{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}

select *
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
where _ingested_at > (select max(_ingested_at) from {{ this }})
{% endif %}
```

### Schema-on-read (raw landing)
```sql
-- raw zone: VARIANT / JSON column, no schema enforcement
CREATE TABLE raw.events (
  _ingested_at TIMESTAMP,
  _source      STRING,
  payload      VARIANT
);

-- bronze: typed extraction
CREATE VIEW bronze.events AS
SELECT _ingested_at, payload:event_type::STRING AS event_type, ...
FROM raw.events;
```

### Data quality test (dbt)
```yaml
# models/marts/orders.yml
version: 2
models:
  - name: dim_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: total_amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000
```

### Lineage (dbt-generated graph)
```bash
dbt docs generate
dbt docs serve   # column-level lineage in browser
```

### PII masking on load
```sql
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYTICS_ADMIN') THEN val
       ELSE REGEXP_REPLACE(val, '.+@', '***@') END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```

### Freshness SLA (dbt)
```yaml
sources:
  - name: stripe
    freshness:
      warn_after: { count: 1, period: hour }
      error_after: { count: 6, period: hour }
    loaded_at_field: _ingested_at
```

## 매 결정 기준
| Need | Tool |
|---|---|
| SaaS source ingestion | Fivetran / Airbyte |
| Transform | dbt |
| Orchestration | Dagster (modern) / Airflow (mature) |
| Observability | Monte Carlo / Elementary |
| Reverse ETL | Hightouch / Census |

**기본값**: Fivetran → Snowflake → dbt → Hightouch + dbt-tests + Elementary.

## 🔗 Graph
- 부모: [[Data-Engineering]]
- 변형: [[ETL]] · [[ELT]]
- 응용: [[Feature-Store]]
- Adjacent: [[dbt]] · [[Snowflake-Data-Warehousing]] · [[Airflow]]

## 🤖 LLM 활용
**언제**: data-pipeline design, ETL architecture review, warehouse migration.
**언제 X**: streaming-only / event-driven systems (use Kafka patterns instead).

## ❌ 안티패턴
- **Transform-on-extract**: 매 lose replay capability.
- **No idempotency**: re-runs corrupt warehouse.
- **Untested models**: 매 silent breakage.
- **PII in raw zone unmasked**: compliance risk.

## 🧪 검증 / 중복
- Verified (Kimball — Data Warehouse Toolkit; Modern Data Stack docs; dbt best practices).
- 신뢰도 A-.

## 🕓 Changelog
| 날짜 | 변경 |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Data Connect FULL with modern data stack patterns |