[G1-Sync] Manual knowledge update
This commit is contained in:
@@ -2,24 +2,162 @@
|
||||
id: wiki-2026-0508-principles-of-data-connect
|
||||
title: Principles of Data Connect
|
||||
category: 10_Wiki/Topics
|
||||
status: merged
|
||||
redirect_to: 데이터_엔지니어링_및_가상_인프라_표준
|
||||
canonical_id: wiki-2026-0508-001
|
||||
aliases: []
|
||||
status: verified
|
||||
canonical_id: self
|
||||
aliases: [Data Integration Principles, ETL Design]
|
||||
duplicate_of: none
|
||||
source_trust_level: A
|
||||
confidence_score: 0.92
|
||||
tags: [uncategorized]
|
||||
confidence_score: 0.85
|
||||
verification_status: applied
|
||||
tags: [data-engineering, etl, integration]
|
||||
raw_sources: []
|
||||
last_reinforced: 2026-05-08
|
||||
last_reinforced: 2026-05-10
|
||||
github_commit: pending
|
||||
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
|
||||
tech_stack:
|
||||
language: unspecified
|
||||
framework: unspecified
|
||||
language: Python
|
||||
framework: dbt
|
||||
---
|
||||
|
||||
# Redirect
|
||||
# Principles of Data Connect
|
||||
|
||||
이 문서는 Canonical 문서인 [[데이터_엔지니어링_및_가상_인프라_표준]]으로 통합되었습니다.
|
||||
모든 최신 지식과 세부 내용은 위 링크를 참조하십시오.
|
||||
## 매 한 줄
|
||||
> **"매 source-to-warehouse 의 reliable pipe 의 design rules"**. 매 Inmon (1990s warehouse) → 매 Kimball (star schema) → 매 modern data stack (Fivetran/Airbyte → Snowflake/BigQuery → dbt) 의 evolution 의 distilled principles.
|
||||
|
||||
## 매 핵심
|
||||
|
||||
### 매 the principles
|
||||
1. **Idempotent loads** — re-run produces same result.
|
||||
2. **Schema-on-read tolerance** — handle source schema drift.
|
||||
3. **Replayability** — store raw, transform downstream.
|
||||
4. **Incremental + full-refresh** — both modes supported.
|
||||
5. **Observability** — row counts, freshness, anomaly alerts.
|
||||
6. **Lineage** — every column traces to source.
|
||||
7. **Privacy / PII** — masked or never-pulled.
|
||||
|
||||
### 매 modern stack (2026)
|
||||
- Extract-Load: Fivetran, Airbyte, Stitch.
|
||||
- Warehouse: Snowflake, BigQuery, Databricks.
|
||||
- Transform: dbt (most-prevalent), Coalesce, SQLMesh.
|
||||
- Orchestrate: Airflow, Dagster, Prefect.
|
||||
- Observability: Monte Carlo, Datafold, Elementary.
|
||||
|
||||
### 매 응용
|
||||
1. Analytics (BI dashboards).
|
||||
2. ML feature stores.
|
||||
3. Reverse-ETL to operational tools (Hightouch, Census).
|
||||
|
||||
## 💻 패턴
|
||||
|
||||
### Idempotent upsert (MERGE)
|
||||
```sql
|
||||
MERGE INTO dim_customer t
|
||||
USING staging_customer s
|
||||
ON t.customer_id = s.customer_id
|
||||
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ...
|
||||
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);
|
||||
```
|
||||
|
||||
### dbt incremental model
|
||||
```sql
|
||||
{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}
|
||||
|
||||
select *
|
||||
from {{ source('raw', 'orders') }}
|
||||
{% if is_incremental() %}
|
||||
where _ingested_at > (select max(_ingested_at) from {{ this }})
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
### Schema-on-read (raw landing)
|
||||
```sql
|
||||
-- raw zone: VARIANT / JSON column, no schema enforcement
|
||||
CREATE TABLE raw.events (
|
||||
_ingested_at TIMESTAMP,
|
||||
_source STRING,
|
||||
payload VARIANT
|
||||
);
|
||||
|
||||
-- bronze: typed extraction
|
||||
CREATE VIEW bronze.events AS
|
||||
SELECT _ingested_at, payload:event_type::STRING AS event_type, ...
|
||||
FROM raw.events;
|
||||
```
|
||||
|
||||
### Data quality test (dbt)
|
||||
```yaml
|
||||
# models/marts/orders.yml
|
||||
version: 2
|
||||
models:
|
||||
- name: dim_orders
|
||||
columns:
|
||||
- name: order_id
|
||||
tests: [not_null, unique]
|
||||
- name: total_amount
|
||||
tests:
|
||||
- not_null
|
||||
- dbt_expectations.expect_column_values_to_be_between:
|
||||
min_value: 0
|
||||
max_value: 1000000
|
||||
```
|
||||
|
||||
### Lineage (dbt-generated graph)
|
||||
```bash
|
||||
dbt docs generate
|
||||
dbt docs serve # column-level lineage in browser
|
||||
```
|
||||
|
||||
### PII masking on load
|
||||
```sql
|
||||
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
|
||||
CASE WHEN CURRENT_ROLE() IN ('ANALYTICS_ADMIN') THEN val
|
||||
ELSE REGEXP_REPLACE(val, '.+@', '***@') END;
|
||||
|
||||
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
|
||||
```
|
||||
|
||||
### Freshness SLA (dbt)
|
||||
```yaml
|
||||
sources:
|
||||
- name: stripe
|
||||
freshness:
|
||||
warn_after: { count: 1, period: hour }
|
||||
error_after: { count: 6, period: hour }
|
||||
loaded_at_field: _ingested_at
|
||||
```
|
||||
|
||||
## 매 결정 기준
|
||||
| Need | Tool |
|
||||
|---|---|
|
||||
| SaaS source ingestion | Fivetran / Airbyte |
|
||||
| Transform | dbt |
|
||||
| Orchestration | Dagster (modern) / Airflow (mature) |
|
||||
| Observability | Monte Carlo / Elementary |
|
||||
| Reverse ETL | Hightouch / Census |
|
||||
|
||||
**기본값**: Fivetran → Snowflake → dbt → Hightouch + dbt-tests + Elementary.
|
||||
|
||||
## 🔗 Graph
|
||||
- 부모: [[Data-Engineering]]
|
||||
- 변형: [[ETL]] · [[ELT]] · [[Reverse-ETL]]
|
||||
- 응용: [[Data-Warehousing]] · [[Feature-Store]]
|
||||
- Adjacent: [[dbt]] · [[Snowflake-Data-Warehousing]] · [[Airflow]]
|
||||
|
||||
## 🤖 LLM 활용
|
||||
**언제**: data-pipeline design, ETL architecture review, warehouse migration.
|
||||
**언제 X**: streaming-only / event-driven systems (use Kafka patterns instead).
|
||||
|
||||
## ❌ 안티패턴
|
||||
- **Transform-on-extract**: 매 lose replay capability.
|
||||
- **No idempotency**: re-runs corrupt warehouse.
|
||||
- **Untested models**: 매 silent breakage.
|
||||
- **PII in raw zone unmasked**: compliance risk.
|
||||
|
||||
## 🧪 검증 / 중복
|
||||
- Verified (Kimball — Data Warehouse Toolkit; Modern Data Stack docs; dbt best practices).
|
||||
- 신뢰도 A-.
|
||||
|
||||
## 🕓 Changelog
|
||||
| 날짜 | 변경 |
|
||||
|---|---|
|
||||
| 2026-05-08 | Phase 1 |
|
||||
| 2026-05-10 | Manual cleanup — Data Connect FULL with modern data stack patterns |
|
||||
|
||||
Reference in New Issue
Block a user