---
id: data-eng-dbt
title: dbt — SQL Transform / Test / Doc
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [data-engineering, dbt, sql, vibe-coding]
tech_stack: { language: "SQL / Jinja", applicable_to: ["Data Engineering"] }
applied_in: []
aliases: [dbt, dbt-core, model, source, seed, snapshot, macro]
---

# dbt (data build tool)

> The modern standard for SQL transformation. **Model = SELECT, ref / source = lineage, test = data quality, docs = auto-generated**. dbt is the T in ELT (Extract, Load, Transform).

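## 📖 Core Concepts
- Model: a `.sql` file containing one SELECT, materialized as a table or view.
- Source: an external raw table, declared in YAML and referenced with `source()`.
- Ref: a reference to another model via `ref()`; this is what builds the lineage graph.
- Test: a data quality check.
- Snapshot: SCD Type 2 history of a changing table.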
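
The `ref` / `source` calls are what give dbt its lineage: each call is an edge in a dependency graph, and models are built in an order that respects those edges. As a rough illustration only (plain Python, not dbt's own code), the sketch below topologically sorts a hand-written dependency map. The model names come from the examples in this note; the edges for `dim_users` and `revenue_daily` are assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: node -> set of nodes it selects from.
deps = {
    "stg_orders": {"source:raw.orders"},
    "stg_users": {"source:raw.users"},
    "dim_users": {"stg_users"},                 # assumed edge
    "fct_orders": {"stg_orders", "dim_users"},
    "revenue_daily": {"fct_orders"},            # assumed edge
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

The exact order among independent nodes may vary; what matters is that every model appears after everything it depends on.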

## 💻 Code Patterns

### Folder structure
```
dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── marts/
│   │   ├── core/
│   │   │   ├── dim_users.sql
│   │   │   └── fct_orders.sql
│   │   └── finance/
│   │       └── revenue_daily.sql
│   └── schema.yml
├── macros/
├── tests/
├── seeds/
└── snapshots/
```

### Model
```sql
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

select
    id as order_id,
    user_id,
    amount,
    status,
    created_at
from {{ source('raw', 'orders') }}
where status != 'cancelled'
```

```sql
-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
) }}

select
    o.order_id,
    o.user_id,
    u.email,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }} o
left join {{ ref('dim_users') }} u using (user_id)

{% if is_incremental() %}
  where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
```
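
To make the incremental filter concrete, here is a minimal simulation with the standard-library `sqlite3` module. It is a sketch of the idea, not how dbt executes the model: the tables are hand-created, `insert or replace` stands in for the `unique_key` upsert, and "target has rows" stands in for `is_incremental()`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table stg_orders (order_id integer, amount real, created_at text);
    create table fct_orders (order_id integer primary key, amount real, created_at text);
    insert into stg_orders values (1, 10.0, '2026-05-01'), (2, 20.0, '2026-05-02');
""")

def run_incremental(con):
    """Return how many rows one simulated run writes into fct_orders."""
    target_has_rows = con.execute("select count(*) from fct_orders").fetchone()[0] > 0
    # Stand-in for `{% if is_incremental() %}`; real dbt checks that the
    # target table already exists, not that it has rows.
    where = "where created_at > (select max(created_at) from fct_orders)" if target_has_rows else ""
    before = con.total_changes
    con.execute(
        "insert or replace into fct_orders "
        f"select order_id, amount, created_at from stg_orders {where}"
    )
    return con.total_changes - before

first = run_incremental(con)    # empty target: loads every staged row
con.execute("insert into stg_orders values (3, 30.0, '2026-05-03')")
second = run_incremental(con)   # later run: loads only the newer row
print(first, second)
```

One consequence of filtering on `max(created_at)`: rows that arrive late with an older timestamp are skipped until a full refresh.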

### sources / schema
```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: production
    schema: public
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }
      - name: users
```
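
The freshness thresholds above can be read as a simple rule on the age of the newest `_ingested_at` value. The sketch below spells out that rule in plain Python; the function name and the exact boundary handling (strictly greater than) are assumptions for illustration.

```python
from datetime import datetime, timedelta

WARN_AFTER = timedelta(hours=1)    # warn_after: { count: 1, period: hour }
ERROR_AFTER = timedelta(hours=4)   # error_after: { count: 4, period: hour }

def freshness_status(max_loaded_at, now):
    """Classify a source by the age of its newest loaded_at_field value."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"

now = datetime(2026, 5, 9, 12, 0)
print(freshness_status(datetime(2026, 5, 9, 11, 30), now))  # 30 minutes old -> pass
print(freshness_status(datetime(2026, 5, 9, 10, 0), now))   # 2 hours old -> warn
print(freshness_status(datetime(2026, 5, 9, 6, 0), now))    # 6 hours old -> error
```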

### Tests
```yaml
# models/marts/core/schema.yml
version: 2

models:
  - name: dim_users
    description: User dimension
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']

  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
```
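
A useful mental model for these tests: each one is a query that selects the rows violating the rule, and the test passes when that query returns nothing. The sketch below hand-writes such queries against SQLite. The SQL dbt actually compiles differs per adapter and version, and the sample rows are invented to trigger two failures.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table dim_users (user_id integer, email text, plan text);
    create table fct_orders (order_id integer, user_id integer);
    insert into dim_users values (1, 'a@example.com', 'free'), (2, 'b@example.com', 'gold');
    insert into fct_orders values (10, 1), (11, 99);
""")

# Each query returns the rows that violate the rule; zero rows means "pass".
checks = {
    "unique(user_id)": "select user_id from dim_users group by user_id having count(*) > 1",
    "not_null(email)": "select * from dim_users where email is null",
    "accepted_values(plan)": "select * from dim_users where plan not in ('free', 'pro', 'enterprise')",
    "relationships(user_id)": """
        select o.user_id from fct_orders o
        left join dim_users u on o.user_id = u.user_id
        where o.user_id is not null and u.user_id is null
    """,
}

failures = {name: len(con.execute(sql).fetchall()) for name, sql in checks.items()}
print(failures)
```

Here the `gold` plan fails `accepted_values` and the order pointing at user 99 fails `relationships`; the other two checks return no rows.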

### Custom test
```sql
-- tests/positive_amount.sql
select * from {{ ref('fct_orders') }}
where amount <= 0
```

### Macro (재사용)
```sql
-- macros/clean_string.sql
{% macro clean_string(col) %}
    trim(lower({{ col }}))
{% endmacro %}

-- Usage (inside a model file)
select {{ clean_string('email') }} as email_clean
from {{ ref('stg_users') }}
```

### Materialization types
```
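view:        query re-runs on every SELECT; small or frequently changing models
table:       full rebuild on every run; small to medium models
incremental: only new or changed rows are processed; large fact tables
ephemeral:   inlined as a CTE, nothing is built; intermediate transformations
snapshot:    SCD Type 2 history (a separate resource type, run with `dbt snapshot`)
```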
```
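
The practical difference between `view` and `table` is when the SELECT is evaluated. A small SQLite demonstration, with hypothetical table names, shows a view picking up a newly loaded row while a table built earlier does not:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table raw_orders (id integer, status text);
    insert into raw_orders values (1, 'paid');
    -- "view": stores the query, re-evaluated on every read
    create view orders_view as select * from raw_orders;
    -- "table": stores the result as of build time
    create table orders_table as select * from raw_orders;
    -- a new row lands after both were built
    insert into raw_orders values (2, 'paid');
""")

view_rows = con.execute("select count(*) from orders_view").fetchone()[0]
table_rows = con.execute("select count(*) from orders_table").fetchone()[0]
print(view_rows, table_rows)  # 2 1
```

The table stays at one row until it is rebuilt, which is the cost that `incremental` is meant to reduce for large models.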

### Snapshot (SCD Type 2 — history)
```sql
-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
    {{ config(
        target_schema='snapshots',
        unique_key='user_id',
        strategy='check',
        check_cols=['email', 'plan'],
    ) }}
    select user_id, email, plan, updated_at from {{ source('raw', 'users') }}
{% endsnapshot %}
```

```bash
dbt snapshot
```

→ Change tracking: each row version carries `dbt_valid_from` and `dbt_valid_to`.
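
The bookkeeping behind a `check` strategy snapshot can be sketched in a few lines of Python. This is a simplified illustration, not dbt's implementation: the `apply_snapshot` helper is hypothetical, and it ignores deleted rows and dbt's other metadata columns.

```python
from datetime import datetime

def apply_snapshot(history, current_rows, check_cols, now):
    """Close changed rows and append new versions (simplified 'check' strategy)."""
    open_rows = {r["user_id"]: r for r in history if r["dbt_valid_to"] is None}
    for row in current_rows:
        old = open_rows.get(row["user_id"])
        if old is not None and all(old[c] == row[c] for c in check_cols):
            continue                       # nothing changed, keep the open row
        if old is not None:
            old["dbt_valid_to"] = now      # close the previous version
        history.append({**row, "dbt_valid_from": now, "dbt_valid_to": None})
    return history

history = []
cols = ["email", "plan"]
t1, t2 = datetime(2026, 5, 1), datetime(2026, 5, 2)
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}], cols, t1)
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}], cols, t2)
for version in history:
    print(version["plan"], version["dbt_valid_from"], version["dbt_valid_to"])
```

After the second run the `free` version is closed at `t2` and the `pro` version is the open row, which is the shape SCD Type 2 queries rely on.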

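### Seed (small lookup table)
File: `seeds/country_codes.csv`. CSV has no comment syntax, so the path must not appear inside the file or it becomes the header row.
```csv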
country_code,country_name
US,United States
KR,Korea
JP,Japan
```

```bash
dbt seed
```

```sql
select * from {{ ref('country_codes') }}
```

### Run / test
```bash
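dbt run                          # build every model
dbt run --select stg_orders+     # stg_orders and everything downstream
dbt run --select +dim_users      # dim_users and everything upstream

dbt test
dbt test --select dim_users

# dbt build runs seeds, models, snapshots, and tests together in DAG order.
# An incremental model is fully loaded on its first build and incrementally afterwards.
dbt build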
```
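
The `+` graph operators in `--select` amount to walking the lineage graph in one direction. A minimal sketch of that traversal, assuming a hypothetical parent map for the models in this note (sources left out for brevity):

```python
# Hypothetical parent map: model -> models it refs.
parents = {
    "stg_orders": [],
    "stg_users": [],
    "dim_users": ["stg_users"],
    "fct_orders": ["stg_orders", "dim_users"],
    "revenue_daily": ["fct_orders"],
}

def upstream(model):
    """`+model`: the model plus all of its ancestors."""
    selected = {model}
    for parent in parents[model]:
        selected |= upstream(parent)
    return selected

def downstream(model):
    """`model+`: the model plus everything that depends on it."""
    selected = {model}
    for child, child_parents in parents.items():
        if model in child_parents:
            selected |= downstream(child)
    return selected

print(sorted(downstream("stg_orders")))  # stg_orders+
print(sorted(upstream("dim_users")))     # +dim_users
```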

### Docs
```bash
dbt docs generate
dbt docs serve  # web UI

# Lineage graph + column descriptions
```

### CI
```yaml
- run: dbt deps
- run: dbt seed --target ci
- run: dbt run --target ci
- run: dbt test --target ci
- run: dbt source freshness  # check source freshness
```

```yaml
# .github/workflows/dbt.yml
- run: dbt build --select state:modified+ --defer --state ./prod-manifest
```

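→ Builds only the modified models and their downstream dependents; `--defer` resolves unchanged upstream refs against the production manifest.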

### Adapter (warehouse)
```
dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, dbt-duckdb
```

→ The same project code runs on a different warehouse by swapping the adapter (adapter-specific configs aside).

### profiles.yml (connection)
```yaml
my_project:
  outputs:
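    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev
      password: "{{ env_var('DB_PW') }}"
      dbname: dev
      schema: dbt_dev
    prod:
      type: postgres
      host: prod.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DB_PW') }}"   # assumed: same env var, set per environment
      dbname: analytics                     # assumed database name
      schema: analytics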
  target: dev
```

### Performance
```sql
-- Incremental + clustering / partition
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'date', 'data_type': 'date'},
    cluster_by=['user_id'],
) }}
```

→ `partition_by` in this dict form is BigQuery syntax; `cluster_by` applies to both BigQuery and Snowflake.

### dbt Cloud / Mesh (large organizations)
- Managed dbt runs.
- Cross-project references (dbt Mesh).
- IDE in the browser.
- Scheduling.

→ Alternatively, have Airflow or Dagster invoke dbt.

### Dagster + dbt
```python
from pathlib import Path

from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest=Path('target/manifest.json'))
def my_dbt_assets(context, dbt: DbtCliResource):
    yield from dbt.cli(['build'], context=context).stream()
```

→ Each dbt model is loaded as a Dagster asset automatically.

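## 🤔 Decision Criteria
| Task | Recommendation |
|---|---|
| SQL transform | dbt |
| Python ML pipeline | Dagster / Airflow |
| Streaming | Flink / Spark Structured Streaming |
| Schema migration | Not a dbt feature; `on_schema_change` covers incremental models, snapshots cover row history only |
| Small / one-off SQL | Run it directly |
| Large organization | dbt Cloud / Dagster |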

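## ❌ Anti-patterns
- **Every model `materialized='table'`**: full rebuilds get expensive. Use `incremental` for large tables.
- **Models without tests**: data quality is unknown.
- **No source freshness checks**: stale data goes unnoticed.
- **No `schema.yml`**: no column descriptions in the docs.
- **Macro overuse**: hurts readability.
- **Running `dbt run` by hand against production**: go through a scheduler or CI instead.
- **Large seeds (>10MB)**: load them through the ingestion pipeline instead.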

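## 🤖 Hints for LLM Use
- Follow the standard staging / marts split.
- Always use `ref` / `source`; never hard-code table names.
- Tests (`unique`, `not_null`, `relationships`) are mandatory.
- Use `incremental` for large tables.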

## 🔗 Related Documents
- [[Data_Eng_Airflow_Dagster]]
- [[Data_Eng_Lakehouse]]
- [[DB_ClickHouse_OLAP]]