2nd/10_Wiki/Topics/Coding/Data_Eng_dbt.md
2026-05-09 21:08:02 +09:00

id: data-eng-dbt
title: dbt - SQL Transform / Test / Doc
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: data-engineering, dbt, sql, vibe-coding
tech_stack: language = SQL / Jinja; applicable_to = Data Engineering
applied_in:
aliases: dbt, dbt-core, model, source, seed, snapshot, macro

dbt (data build tool)

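dbt is the de facto modern standard for SQL transformation: the T in ELT (Extract, Load, Transform). A model is a SELECT statement, ref / source calls define lineage, tests check data quality, and docs are generated automatically.

📖 Core concepts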

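  • Model: a .sql file containing one SELECT, materialized as a table or view.
  • Source: an external raw table that dbt reads but does not build.
  • Ref: a reference to another model; dbt derives the dependency graph from these.
  • Test: a data quality check.
  • Snapshot: SCD Type 2 history of a changing table.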
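
To make the lineage idea concrete, here is a minimal Python sketch of how a build order can be derived from ref() calls. It is not dbt's parser: the MODELS dictionary is a hypothetical in-memory stand-in for model files, and the regular expression only handles the simple single-quoted ref('name') form.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical stand-ins for model files; names follow this note's examples.
MODELS = {
    "stg_orders": "select * from {{ source('raw', 'orders') }}",
    "stg_users": "select * from {{ source('raw', 'users') }}",
    "dim_users": "select * from {{ ref('stg_users') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "left join {{ ref('dim_users') }} u using (user_id)"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def parents(sql: str) -> set[str]:
    """Return the model names referenced through ref() in one model's SQL."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Order models so that each one comes after every model it refs."""
    graph = {name: parents(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(build_order(MODELS))
```

Sources do not appear in the order because dbt reads them but never builds them; only ref() edges create build dependencies.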

💻 Code patterns

Folder structure

dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── marts/
│   │   ├── core/
│   │   │   ├── dim_users.sql
│   │   │   └── fct_orders.sql
│   │   └── finance/
│   │       └── revenue_daily.sql
│   └── schema.yml
├── macros/
├── tests/
├── seeds/
└── snapshots/

Model

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

select
    id as order_id,
    user_id,
    amount,
    status,
    created_at
from {{ source('raw', 'orders') }}
where status != 'cancelled'

-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
) }}

select
    o.order_id,
    o.user_id,
    u.email,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }} o
left join {{ ref('dim_users') }} u using (user_id)

{% if is_incremental() %}
  where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
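
The incremental behaviour can be imitated outside dbt. The following is a rough sketch using Python's sqlite3 module, assuming a simplified stg_orders table with three columns; the delete-then-insert step is one possible way to honour unique_key and is not necessarily what a given adapter does.

```python
import sqlite3


def run_incremental(conn: sqlite3.Connection, full_refresh: bool = False) -> int:
    """Build fct_orders from stg_orders and return its row count."""
    exists = conn.execute(
        "select 1 from sqlite_master where type = 'table' and name = 'fct_orders'"
    ).fetchone()
    if full_refresh or exists is None:
        # First run (or a full refresh): rebuild the whole table.
        conn.execute("drop table if exists fct_orders")
        conn.execute("create table fct_orders as select * from stg_orders")
    else:
        # Later runs: the same filter as the is_incremental() branch.
        new_rows = conn.execute(
            "select order_id, amount, created_at from stg_orders "
            "where created_at > (select max(created_at) from fct_orders)"
        ).fetchall()
        for order_id, amount, created_at in new_rows:
            # unique_key='order_id': replace any existing row with the same key.
            conn.execute("delete from fct_orders where order_id = ?", (order_id,))
            conn.execute(
                "insert into fct_orders values (?, ?, ?)",
                (order_id, amount, created_at),
            )
    conn.commit()
    return conn.execute("select count(*) from fct_orders").fetchone()[0]


def demo() -> tuple[int, int]:
    """Run a full build, add one newer order, then run incrementally."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table stg_orders (order_id integer, amount real, created_at text)"
    )
    conn.executemany(
        "insert into stg_orders values (?, ?, ?)",
        [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02")],
    )
    first = run_incremental(conn)
    conn.execute("insert into stg_orders values (3, 30.0, '2024-01-03')")
    second = run_incremental(conn)
    return first, second
```

One consequence of filtering on max(created_at): a row that arrives late with an older created_at is never picked up until a full refresh, so the filter column should reflect load or update time when late data is possible.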

sources / schema

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: production
    schema: public
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }
      - name: users

Tests

# models/marts/core/schema.yml
version: 2

models:
  - name: dim_users
    description: User dimension
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']

  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id

Custom test

-- tests/positive_amount.sql
select * from {{ ref('fct_orders') }}
where amount <= 0
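
Both generic and custom tests follow the same rule: the test is a query that selects offending rows, and it passes when the query returns nothing. The sketch below imitates that rule with sqlite3. The query texts are simplified assumptions written for this example and are not the SQL that dbt actually compiles.

```python
import sqlite3

# Each query selects rows that violate a rule; zero rows means the test passes.
TEST_QUERIES = {
    "unique_order_id": (
        "select order_id from fct_orders group by order_id having count(*) > 1"
    ),
    "not_null_order_id": "select * from fct_orders where order_id is null",
    "relationships_user_id": (
        "select o.user_id from fct_orders o "
        "left join dim_users u on o.user_id = u.user_id "
        "where o.user_id is not null and u.user_id is null"
    ),
    "positive_amount": "select * from fct_orders where amount <= 0",
}


def run_tests(conn: sqlite3.Connection) -> dict[str, str]:
    """Run every test query and report pass or fail per test."""
    results = {}
    for name, sql in TEST_QUERIES.items():
        failures = conn.execute(sql).fetchall()
        results[name] = "fail" if failures else "pass"
    return results


def demo() -> dict[str, str]:
    """Load deliberately dirty sample data and run the tests against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table dim_users (user_id integer)")
    conn.execute(
        "create table fct_orders (order_id integer, user_id integer, amount real)"
    )
    conn.executemany("insert into dim_users values (?)", [(1,), (2,)])
    conn.executemany(
        "insert into fct_orders values (?, ?, ?)",
        # Duplicate order_id 101, unknown user 9, and a negative amount.
        [(100, 1, 10.0), (101, 2, -5.0), (101, 9, 7.0)],
    )
    return run_tests(conn)
```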

Macro (reuse)

-- macros/clean_string.sql
{% macro clean_string(col) %}
    trim(lower({{ col }}))
{% endmacro %}

-- Usage (in a model file)
select {{ clean_string('email') }} as email_clean
from {{ ref('stg_users') }}
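
A macro produces SQL text at compile time; it does not run against data. As a loose analogy, the Python function below plays the role of clean_string and returns a string. The relation name analytics.stg_users is a hypothetical resolved form of ref('stg_users'), chosen only for this sketch.

```python
def clean_string(col: str) -> str:
    """Python stand-in for the clean_string macro: returns SQL text, not a value."""
    return f"trim(lower({col}))"


def compile_model() -> str:
    """Render the usage example roughly as it would look after compilation."""
    # analytics.stg_users is an assumed relation name for ref('stg_users').
    return f"select {clean_string('email')} as email_clean\nfrom analytics.stg_users"


if __name__ == "__main__":
    print(compile_model())
```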

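Materialization types

view:        re-runs the SELECT on every query; for small or frequently changing models
table:       full rebuild on every run; for small to medium models
incremental: adds or updates only new or changed rows; for large fact tables
ephemeral:   inlined as a CTE into downstream models; for intermediate transformations
snapshot:    SCD Type 2 history (built with dbt snapshot)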

Snapshot (SCD Type 2 — history)

-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
    {{ config(
        target_schema='snapshots',
        unique_key='user_id',
        strategy='check',
        check_cols=['email', 'plan'],
    ) }}
    select user_id, email, plan, updated_at from {{ source('raw', 'users') }}
{% endsnapshot %}

dbt snapshot

→ Changes are tracked through the dbt_valid_from and dbt_valid_to columns.
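
The check strategy can be sketched in plain Python: compare the tracked columns of each source row with the currently open history row, close the old version when something differs, and append a new one. This is a simplified illustration under stated assumptions: rows are dictionaries, timestamps are plain strings, and deleted source rows are ignored.

```python
def apply_snapshot(history, source_rows, check_cols, now):
    """Update an SCD Type 2 history list in place and return it."""
    # Index the currently open version (dbt_valid_to is None) of each key.
    open_rows = {
        row["user_id"]: row for row in history if row["dbt_valid_to"] is None
    }
    for src in source_rows:
        current = open_rows.get(src["user_id"])
        if current is not None:
            if all(current[col] == src[col] for col in check_cols):
                continue  # No tracked column changed; keep the open row.
            current["dbt_valid_to"] = now  # Close the previous version.
        # New key or changed row: open a new version.
        history.append({**src, "dbt_valid_from": now, "dbt_valid_to": None})
    return history


def demo():
    """Snapshot one user three times: insert, plan change, no change."""
    cols = ["email", "plan"]
    history = []
    apply_snapshot(
        history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}],
        cols, "2024-01-01",
    )
    apply_snapshot(
        history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}],
        cols, "2024-02-01",
    )
    apply_snapshot(
        history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}],
        cols, "2024-03-01",
    )
    return history
```

After the three runs the history holds two rows: the free version closed at the second run, and the pro version still open.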

Seed (small lookup table)

seeds/country_codes.csv (the file itself starts at the header row; CSV has no comment syntax):

country_code,country_name
US,United States
KR,Korea
JP,Japan

dbt seed

select * from {{ ref('country_codes') }}

Run / test

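dbt run                          # build all models
dbt run --select stg_orders+    # stg_orders and everything downstream
dbt run --select +dim_users     # dim_users and everything upstream

dbt test
dbt test --select dim_users

# incremental models: the first build is a full load, later builds are incremental
dbt build  # seeds, models, snapshots, and tests in DAG order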
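
The "+" graph operators can be pictured as a walk over the dependency graph. Here is a minimal Python sketch that covers only a leading or trailing "+" on a single model name. The PARENTS mapping is an assumed graph built from this note's model names; in particular, revenue_daily depending on fct_orders is a guess made for the example.

```python
# Assumed dependency graph: model name -> models it refs.
PARENTS = {
    "stg_orders": set(),
    "stg_users": set(),
    "dim_users": {"stg_users"},
    "fct_orders": {"stg_orders", "dim_users"},
    "revenue_daily": {"fct_orders"},
}


def ancestors(node: str) -> set[str]:
    """All models upstream of node, at any depth."""
    found = set()
    stack = list(PARENTS[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


def descendants(node: str) -> set[str]:
    """All models downstream of node, at any depth."""
    return {name for name in PARENTS if node in ancestors(name)}


def select(selector: str) -> set[str]:
    """Resolve 'name', '+name', 'name+', or '+name+' to a set of models."""
    name = selector.strip("+")
    selected = {name}
    if selector.startswith("+"):
        selected |= ancestors(name)
    if selector.endswith("+"):
        selected |= descendants(name)
    return selected


if __name__ == "__main__":
    print(sorted(select("stg_orders+")))
    print(sorted(select("+dim_users")))
```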

Docs

dbt docs generate
dbt docs serve  # web UI

# Lineage graph + column descriptions

CI

- run: dbt deps
- run: dbt seed --target ci
- run: dbt run --target ci
- run: dbt test --target ci
- run: dbt source freshness --target ci  # check source freshness

# .github/workflows/dbt.yml
- run: dbt build --select state:modified+ --defer --state ./prod-manifest

→ Builds only the modified models and their downstream dependents.
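
The idea behind state:modified+ can be approximated as "diff two manifests, then follow the graph downstream". The Python sketch below uses heavily simplified node dictionaries with a single checksum string and a depends_on list. A real manifest.json is far richer, and dbt's own comparison considers more than a file checksum, so treat this as an illustration only.

```python
def modified_plus(prod_nodes: dict, current_nodes: dict) -> set[str]:
    """Return new or changed nodes plus everything downstream of them."""
    modified = {
        name
        for name, node in current_nodes.items()
        if name not in prod_nodes
        or prod_nodes[name]["checksum"] != node["checksum"]
    }
    selected = set(modified)
    grew = True
    while grew:
        # Keep adding nodes that depend on anything already selected.
        grew = False
        for name, node in current_nodes.items():
            if name not in selected and selected.intersection(node["depends_on"]):
                selected.add(name)
                grew = True
    return selected


# Hypothetical simplified manifests: only dim_users changed since production.
PROD = {
    "stg_orders": {"checksum": "a1", "depends_on": []},
    "stg_users": {"checksum": "b1", "depends_on": []},
    "dim_users": {"checksum": "c1", "depends_on": ["stg_users"]},
    "fct_orders": {"checksum": "d1", "depends_on": ["stg_orders", "dim_users"]},
}
CURRENT = {**PROD, "dim_users": {"checksum": "c2", "depends_on": ["stg_users"]}}

if __name__ == "__main__":
    print(sorted(modified_plus(PROD, CURRENT)))
```

Unchanged upstream models such as stg_users are not rebuilt; with --defer, references to them resolve to the objects recorded in the production manifest.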

Adapter (warehouse)

dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, dbt-duckdb

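→ The same project code runs on different warehouses; only the adapter and profile change, apart from warehouse-specific SQL and configs.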

profiles.yml (connection)

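my_project:
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev
      password: "{{ env_var('DB_PW') }}"
      dbname: dev
      schema: dbt_dev
    prod:
      type: postgres
      host: prod.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DB_PW_PROD') }}"  # hypothetical variable name
      dbname: prod                               # assumed database name
      schema: analytics
  target: dev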

Performance

-- Incremental + clustering / partition
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'date', 'data_type': 'date'},
    cluster_by=['user_id'],
) }}

→ cluster_by applies on BigQuery and Snowflake; partition_by in this dict form is BigQuery-specific.

dbt Cloud / Mesh (large organizations)

  • Managed dbt runs.
  • Cross-project references (dbt Mesh).
  • Browser-based IDE.
  • Scheduling.

→ Alternatively, Airflow or Dagster can invoke dbt.

Dagster + dbt

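from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# target/manifest.json must already exist (for example from a prior dbt parse).
@dbt_assets(manifest=Path('target/manifest.json'))
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(['build'], context=context).stream()

→ Each dbt model is loaded as a Dagster asset automatically.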

🤔 Decision criteria

Task                          Recommendation
SQL transform                 dbt
Python ML pipeline            Dagster / Airflow
Streaming                     Flink / Spark Structured Streaming
Change history (SCD Type 2)   dbt snapshot or dbt-history
Small / single query          plain SQL, run directly
Large organization            dbt Cloud / Dagster

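Anti-patterns

  • Every model materialized='table': full rebuilds get expensive; use incremental for large tables.
  • Models without tests: data quality is unknown.
  • No source freshness checks: stale data goes unnoticed.
  • No schema.yml: no column descriptions.
  • Macro overuse: hurts readability.
  • Running dbt run by hand against production, without a scheduler.
  • Large seeds (>10MB): load that data through the ETL pipeline instead.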

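🤖 LLM usage hints

  • Separate staging / marts as the standard layout.
  • Always use ref / source instead of hard-coded table names.
  • Tests (unique, not_null, relationships) are mandatory.
  • Use incremental for large tables.

🔗 Related documents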