Files

T

Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

10_Wiki/Topics 대규모 정리:
- 오류 캡처/미완성 stub 문서 227개 제거
- 교차폴더 중복 43클러스터 병합 (63파일 → redirect)
- 링크명 정규화: 깨진 링크 수정·redirect 직결·개념 매핑 ~2,400건
- 카테고리 MOC 6개 신규 생성
- Graph 섹션 미해결 related-keyword 링크 10,058건 제거

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

4.5 KiB

Raw Blame History

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Principles of Data Connect

매 한 줄

"매 source-to-warehouse 의 reliable pipe 의 design rules". 매 Inmon (1990s warehouse) → 매 Kimball (star schema) → 매 modern data stack (Fivetran/Airbyte → Snowflake/BigQuery → dbt) 의 evolution 의 distilled principles.

매 핵심

매 the principles

Idempotent loads — re-run produces same result.
Schema-on-read tolerance — handle source schema drift.
Replayability — store raw, transform downstream.
Incremental + full-refresh — both modes supported.
Observability — row counts, freshness, anomaly alerts.
Lineage — every column traces to source.
Privacy / PII — masked or never-pulled.

매 modern stack (2026)

Extract-Load: Fivetran, Airbyte, Stitch.
Warehouse: Snowflake, BigQuery, Databricks.
Transform: dbt (most-prevalent), Coalesce, SQLMesh.
Orchestrate: Airflow, Dagster, Prefect.
Observability: Monte Carlo, Datafold, Elementary.

매 응용

Analytics (BI dashboards).
ML feature stores.
Reverse-ETL to operational tools (Hightouch, Census).

💻 패턴

Idempotent upsert (MERGE)

MERGE INTO dim_customer t
USING staging_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

dbt incremental model

{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}

select *
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
where _ingested_at > (select max(_ingested_at) from {{ this }})
{% endif %}

Schema-on-read (raw landing)

-- raw zone: VARIANT / JSON column, no schema enforcement
CREATE TABLE raw.events (
  _ingested_at TIMESTAMP,
  _source      STRING,
  payload      VARIANT
);

-- bronze: typed extraction
CREATE VIEW bronze.events AS
SELECT _ingested_at, payload:event_type::STRING AS event_type, ...
FROM raw.events;

Data quality test (dbt)

# models/marts/orders.yml
version: 2
models:
  - name: dim_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: total_amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000

Lineage (dbt-generated graph)

dbt docs generate
dbt docs serve   # column-level lineage in browser

PII masking on load

CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ANALYTICS_ADMIN') THEN val
       ELSE REGEXP_REPLACE(val, '.+@', '***@') END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

Freshness SLA (dbt)

sources:
  - name: stripe
    freshness:
      warn_after: { count: 1, period: hour }
      error_after: { count: 6, period: hour }
    loaded_at_field: _ingested_at

매 결정 기준

Need	Tool
SaaS source ingestion	Fivetran / Airbyte
Transform	dbt
Orchestration	Dagster (modern) / Airflow (mature)
Observability	Monte Carlo / Elementary
Reverse ETL	Hightouch / Census

기본값: Fivetran → Snowflake → dbt → Hightouch + dbt-tests + Elementary.

🔗 Graph

부모: Data-Engineering
변형: ETL · ELT
응용: Feature-Store
Adjacent: dbt · Snowflake-Data-Warehousing · Airflow

🤖 LLM 활용

언제: data-pipeline design, ETL architecture review, warehouse migration. 언제 X: streaming-only / event-driven systems (use Kafka patterns instead).

❌ 안티패턴

Transform-on-extract: 매 lose replay capability.
No idempotency: re-runs corrupt warehouse.
Untested models: 매 silent breakage.
PII in raw zone unmasked: compliance risk.

🧪 검증 / 중복

Verified (Kimball — Data Warehouse Toolkit; Modern Data Stack docs; dbt best practices).
신뢰도 A-.

🕓 Changelog

날짜	변경
2026-05-08	Phase 1
2026-05-10	Manual cleanup — Data Connect FULL with modern data stack patterns

4.5 KiB Raw Blame History