---
id: data-eng-dbt
title: dbt — SQL Transform / Test / Doc
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [data-engineering, dbt, sql, vibe-coding]
tech_stack: { language: "SQL / Jinja", applicable_to: ["Data Engineering"] }
applied_in: []
aliases: [dbt, dbt-core, model, source, seed, snapshot, macro]
---
# dbt (data build tool)
> The de facto modern standard for SQL transformation. **Model = SELECT, ref / source = lineage, test = data quality, docs = auto-generated**. The T in ELT (Extract, Load, Transform).
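## 📖 Core concepts
- Model: a `.sql` file containing one SELECT, materialized as a table or view.
- Source: an external raw table, declared in YAML and referenced with `source()`.
- Ref: a reference to another model via `ref()`; dbt derives the dependency graph from these calls.
- Test: a data quality check.
- Snapshot: SCD Type 2 history tracking.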
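
To make the "ref = lineage" idea concrete, here is a minimal Python sketch of how a build order can be derived from `ref()` calls. It is not dbt's implementation: the `MODELS` dictionary, the regular expression, and the function names are hypothetical, and the model bodies are simplified stand-ins for the files shown later in this note.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical, simplified model bodies keyed by model name (not real project files).
MODELS = {
    "stg_orders": "select * from {{ source('raw', 'orders') }}",
    "stg_users": "select * from {{ source('raw', 'users') }}",
    "dim_users": "select * from {{ ref('stg_users') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "left join {{ ref('dim_users') }} u using (user_id)"
    ),
}

# Matches {{ ref('model_name') }}; source() calls are deliberately ignored.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def parents(sql: str) -> set[str]:
    """Return the model names a SQL body depends on via ref()."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Order models so every parent is built before its children."""
    graph = {name: parents(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

The exact order among independent models may vary, but parents always come before children, which is the property dbt relies on when it schedules a run.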
## 💻 Code patterns
### Folder structure
```
dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── marts/
│   │   ├── core/
│   │   │   ├── dim_users.sql
│   │   │   └── fct_orders.sql
│   │   └── finance/
│   │       └── revenue_daily.sql
│   └── schema.yml
├── macros/
├── tests/
├── seeds/
└── snapshots/
```
### Model
```sql
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

select
    id as order_id,
    user_id,
    amount,
    status,
    created_at
from {{ source('raw', 'orders') }}
where status != 'cancelled'
```
```sql
-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
) }}

select
    o.order_id,
    o.user_id,
    u.email,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }} o
left join {{ ref('dim_users') }} u using (user_id)
{% if is_incremental() %}
where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
```
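
The incremental pattern above can be tried out locally. The following is a rough sketch using Python's built-in `sqlite3`, under the assumption that "the target table already exists" stands in for `is_incremental()`. The table contents and the `run_incremental` helper are hypothetical, and the `unique_key` merge behaviour is not simulated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table stg_orders (order_id integer, amount real, created_at text);
    insert into stg_orders values (1, 10.0, '2026-05-01'), (2, 20.0, '2026-05-02');
    """
)

SELECT_SQL = "select order_id, amount, created_at from stg_orders"


def target_exists(conn: sqlite3.Connection) -> bool:
    """Stand-in for is_incremental(): true once fct_orders has been built."""
    row = conn.execute(
        "select 1 from sqlite_master where type = 'table' and name = 'fct_orders'"
    ).fetchone()
    return row is not None


def run_incremental(conn: sqlite3.Connection) -> int:
    """Build or extend fct_orders and return its total row count."""
    if not target_exists(conn):
        # First run: full build from the complete SELECT.
        conn.execute(f"create table fct_orders as {SELECT_SQL}")
    else:
        # Later runs: only rows newer than the target's high-water mark.
        conn.execute(
            f"insert into fct_orders {SELECT_SQL} "
            "where created_at > (select max(created_at) from fct_orders)"
        )
    return conn.execute("select count(*) from fct_orders").fetchone()[0]


first = run_incremental(conn)   # full load of the two existing orders
conn.execute("insert into stg_orders values (3, 30.0, '2026-05-03')")
second = run_incremental(conn)  # picks up only the new order
third = run_incremental(conn)   # nothing new, row count unchanged
print(first, second, third)
```

Because the filter compares against `max(created_at)` in the target, re-running without new data adds nothing, which is what keeps repeated runs cheap.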
### sources / schema
```yaml
# models/staging/sources.yml
version: 2
sources:
  - name: raw
    database: production
    schema: public
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }
      - name: users
```
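
The `warn_after` / `error_after` thresholds boil down to comparing the age of the newest `loaded_at_field` value against two limits. A minimal sketch of that comparison follows; the function name and the fixed `now` are assumptions for illustration, and this is not dbt's own code.

```python
from datetime import datetime, timedelta, timezone

WARN_AFTER = timedelta(hours=1)   # warn_after: { count: 1, period: hour }
ERROR_AFTER = timedelta(hours=4)  # error_after: { count: 4, period: hour }


def freshness_status(max_loaded_at: datetime, now: datetime) -> str:
    """Classify a source by how long ago its newest row was loaded."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# Fixed clock so the example is reproducible.
now = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
statuses = [freshness_status(now - timedelta(minutes=m), now) for m in (30, 90, 300)]
print(statuses)
```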
### Tests
```yaml
# models/marts/core/schema.yml
version: 2
models:
  - name: dim_users
    description: User dimension
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
```
### Custom test
```sql
-- tests/positive_amount.sql
select * from {{ ref('fct_orders') }}
where amount <= 0
```
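
Both the generic tests in `schema.yml` and the singular test above follow the same rule: a test is a query that returns the failing rows, and it passes when that query returns nothing. The sketch below runs hand-written equivalents in `sqlite3`. The SQL in `CHECKS` approximates what such tests check and is not the exact SQL dbt compiles; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table dim_users (user_id integer, email text);
    create table fct_orders (order_id integer, user_id integer, amount real);
    insert into dim_users values (1, 'a@example.com'), (2, 'b@example.com');
    -- order 11 has a non-positive amount; order 12 points at a missing user
    insert into fct_orders values (10, 1, 5.0), (11, 2, 0.0), (12, 3, 7.5);
    """
)

# Each query selects the rows that violate the rule.
CHECKS = {
    "unique_order_id": (
        "select order_id from fct_orders group by order_id having count(*) > 1"
    ),
    "not_null_order_id": "select * from fct_orders where order_id is null",
    "relationships_user_id": (
        "select o.user_id from fct_orders o "
        "left join dim_users u on o.user_id = u.user_id "
        "where o.user_id is not null and u.user_id is null"
    ),
    "positive_amount": "select * from fct_orders where amount <= 0",
}


def run_checks(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the number of failing rows per check (0 means the check passes)."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()}


results = run_checks(conn)
failed = sorted(name for name, failures in results.items() if failures > 0)
print(results, failed)
```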
### Macro (reuse)
```sql
-- macros/clean_string.sql
{% macro clean_string(col) %}
    trim(lower({{ col }}))
{% endmacro %}

-- Usage, in a model file
select {{ clean_string('email') }} as email_clean
from {{ ref('stg_users') }}
```
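### Materialization types
```
view:        query re-runs on every SELECT; small or frequently changing models
table:       full rebuild on each run; small to medium models
incremental: only new or changed rows are processed; large fact tables
ephemeral:   inlined as a CTE, nothing is created in the warehouse; intermediate transformations
snapshot:    SCD Type 2 (history); a separate resource type built by `dbt snapshot`
```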
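
The practical difference between `view` and `table` is when the SELECT runs. A small `sqlite3` sketch, with hypothetical table names, shows a view reflecting new source rows immediately while a table stays stale until it is rebuilt:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
FILTERED = "select * from raw_orders where status != 'cancelled'"
conn.executescript(
    f"""
    create table raw_orders (order_id integer, status text);
    insert into raw_orders values (1, 'paid'), (2, 'cancelled');
    create view stg_orders_view as {FILTERED};
    create table stg_orders_table as {FILTERED};
    """
)


def count(name: str) -> int:
    """Row count of a table or view (name is trusted, example-only input)."""
    return conn.execute(f"select count(*) from {name}").fetchone()[0]


conn.execute("insert into raw_orders values (3, 'paid')")
view_rows = count("stg_orders_view")    # view re-runs the SELECT, sees the new row
table_rows = count("stg_orders_table")  # table still holds the old result

# A "run" of a table model amounts to rebuilding it from the SELECT.
conn.executescript(
    f"drop table stg_orders_table; create table stg_orders_table as {FILTERED};"
)
rebuilt_rows = count("stg_orders_table")
print(view_rows, table_rows, rebuilt_rows)
```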
### Snapshot (SCD Type 2 — history)
```sql
-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
{{ config(
target_schema='snapshots',
unique_key='user_id',
strategy='check',
check_cols=['email', 'plan'],
) }}
select user_id, email, plan, updated_at from {{ source('raw', 'users') }}
{% endsnapshot %}
```
```bash
dbt snapshot
```
→ Change tracking: each version of a row gets `dbt_valid_from` and `dbt_valid_to`; the current version has a null `dbt_valid_to`.
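
As a rough illustration of the `check` strategy, the sketch below keeps an in-memory history and closes the current version whenever one of the checked columns changes. The `SnapshotRow` class and `apply_snapshot` function are hypothetical, string dates stand in for timestamps, and hard deletes are not handled.

```python
from dataclasses import dataclass
from typing import Optional

CHECK_COLS = ("email", "plan")  # same columns as check_cols in the snapshot config


@dataclass
class SnapshotRow:
    user_id: int
    email: str
    plan: str
    dbt_valid_from: str
    dbt_valid_to: Optional[str] = None  # None marks the current version


def apply_snapshot(
    history: list[SnapshotRow], source_rows: list[dict], run_at: str
) -> list[SnapshotRow]:
    """Close changed versions and append new ones; mutates and returns history."""
    current = {row.user_id: row for row in history if row.dbt_valid_to is None}
    for src in source_rows:
        existing = current.get(src["user_id"])
        changed = existing is not None and any(
            getattr(existing, col) != src[col] for col in CHECK_COLS
        )
        if changed:
            existing.dbt_valid_to = run_at  # close the old version
        if existing is None or changed:
            history.append(
                SnapshotRow(src["user_id"], src["email"], src["plan"], run_at)
            )
    return history


history: list[SnapshotRow] = []
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}], "2026-05-01")
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}], "2026-05-02")
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}], "2026-05-03")
print(history)
```

The second call adds nothing because no checked column changed; the third closes the `free` version and opens a `pro` version.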
### Seed (small lookup table)
`seeds/country_codes.csv` (CSV has no comment syntax, so the file name stays outside the file):
```csv
country_code,country_name
US,United States
KR,Korea
JP,Japan
```
```bash
dbt seed
```
```sql
select * from {{ ref('country_codes') }}
```
### Run / test
```bash
dbt run                        # build all models
dbt run --select stg_orders+   # stg_orders and everything downstream
dbt run --select +dim_users    # dim_users and everything upstream
dbt test
dbt test --select dim_users
# incremental models: the first build is a full load, later builds are incremental
dbt build                      # seeds, models, snapshots, and tests in DAG order
```
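
The `+` graph operators can be read as "walk the lineage in one direction". Below is a small sketch of that idea over a hand-written parent map that mirrors the models in this note. The `select` function is hypothetical and covers only the leading and trailing `+` forms, not dbt's full selector syntax.

```python
# Parent map mirroring the ref() calls in this note's example models.
PARENTS = {
    "stg_orders": set(),
    "stg_users": set(),
    "dim_users": {"stg_users"},
    "fct_orders": {"stg_orders", "dim_users"},
}


def children_map(parents: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a parent map into a child map."""
    children: dict[str, set[str]] = {name: set() for name in parents}
    for child, its_parents in parents.items():
        for parent in its_parents:
            children[parent].add(child)
    return children


def walk(start: str, edges: dict[str, set[str]]) -> set[str]:
    """Return every node reachable from start by following edges."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select(selector: str, parents: dict[str, set[str]] = PARENTS) -> set[str]:
    """Resolve 'model', '+model', 'model+', or '+model+' to a set of model names."""
    name = selector.strip("+")
    selected = {name}
    if selector.startswith("+"):
        selected |= walk(name, parents)  # upstream
    if selector.endswith("+"):
        selected |= walk(name, children_map(parents))  # downstream
    return selected


print(select("stg_orders+"), select("+dim_users"))
```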
### Docs
```bash
dbt docs generate
dbt docs serve # serves the generated docs site locally (web UI)
# Includes the lineage graph and column descriptions
```
### CI
```yaml
- run: dbt deps
- run: dbt seed --target ci
- run: dbt run --target ci
- run: dbt test --target ci
- run: dbt source freshness # check source freshness
```
```yaml
# .github/workflows/dbt.yml
- run: dbt build --select state:modified+ --defer --state ./prod-manifest
```
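→ Builds only the modified models and their downstream dependents; `--defer` resolves refs to unbuilt upstream models against the production state.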
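
Conceptually, `state:modified+` compares the current project against a saved production state and then expands downstream. The sketch below imitates that with plain dictionaries and SHA-256 checksums of the SQL text. This is a simplification: the data structures and SQL strings are hypothetical, and dbt's real comparison also considers configuration and other properties, not just the file body.

```python
import hashlib


def checksum(sql: str) -> str:
    """Hash a model body so changes can be detected cheaply."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


# Child map for the example lineage in this note.
CHILDREN = {
    "stg_orders": {"fct_orders"},
    "stg_users": {"dim_users"},
    "dim_users": {"fct_orders"},
    "fct_orders": set(),
}

# Hypothetical production SQL, reduced to one-liners.
PROD_SQL = {
    "stg_orders": "select id as order_id from raw.orders",
    "stg_users": "select id as user_id, email from raw.users",
    "dim_users": "select * from stg_users",
    "fct_orders": "select * from stg_orders join dim_users using (user_id)",
}


def modified_plus(
    prod_state: dict[str, str],
    current_sql: dict[str, str],
    children: dict[str, set[str]],
) -> set[str]:
    """Return new or changed models plus everything downstream of them."""
    modified = {
        name for name, sql in current_sql.items()
        if checksum(sql) != prod_state.get(name)  # missing in prod counts as new
    }
    selected, stack = set(modified), list(modified)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected


prod_state = {name: checksum(sql) for name, sql in PROD_SQL.items()}
current_sql = dict(PROD_SQL, stg_users="select id as user_id, lower(email) as email from raw.users")
selected = modified_plus(prod_state, current_sql, CHILDREN)
print(sorted(selected))
```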
### Adapter (warehouse)
```
dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, dbt-duckdb
```
→ The same project code runs on different warehouses, apart from dialect-specific SQL.
### profiles.yml (connection)
```yaml
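my_project:
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev
      password: "{{ env_var('DB_PW') }}"
      dbname: dev
      schema: dbt_dev
    prod:
      type: postgres
      host: prod.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DB_PW') }}"  # placeholder; the original omits prod credentials
      dbname: prod                        # placeholder database name
      schema: analytics
  target: dev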
```
### Performance
```sql
-- Incremental + clustering / partition
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
partition_by={'field': 'date', 'data_type': 'date'},
cluster_by=['user_id'],
) }}
```
→ `cluster_by` applies on BigQuery and Snowflake; the dict-style `partition_by` shown here is BigQuery-specific.
### dbt Cloud / Mesh (large organizations)
- Managed dbt runs.
- Cross-project references (dbt Mesh).
- IDE in the browser.
- Scheduling.
→ Alternatively, have Airflow / Dagster invoke dbt.
### Dagster + dbt
```python
from pathlib import Path
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest=Path("target/manifest.json"))
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```
→ Each dbt model is automatically exposed as a Dagster asset.
## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| SQL transform | dbt |
| Python ML pipeline | Dagster / Airflow |
| Streaming | Flink / Spark Structured Streaming |
| Schema migration | Outside dbt's scope (use a dedicated migration tool); for row history, `dbt snapshot` |
| Small / one-off SQL | Run it directly |
| Large organization | dbt Cloud / Dagster |
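## ❌ Anti-patterns
- **Every model `materialized='table'`**: full rebuilds get expensive. Use incremental for large tables.
- **Models without tests**: data quality problems go unnoticed.
- **No source freshness checks**: stale data goes unnoticed.
- **No `schema.yml`**: no column descriptions in the docs.
- **Macro overuse**: hurts readability.
- **Running `dbt run` against production by hand**: no scheduler or orchestrator in control.
- **Large seeds (>10MB)**: load that data through the EL pipeline instead.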
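## 🤖 LLM usage hints
- Keep the standard staging / marts split.
- Always use `ref` / `source`; never hard-code table names.
- Tests (`unique`, `not_null`, `relationships`) are mandatory.
- Use incremental materialization for large tables.
## 🔗 Related documents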
- [[Data_Eng_Airflow_Dagster]]
- [[Data_Eng_Lakehouse]]
- [[DB_ClickHouse_OLAP]]