---
id: data-eng-dbt
title: dbt - SQL Transform / Test / Doc
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [data-engineering, dbt, sql, vibe-coding]
tech_stack: { language: "SQL / Jinja", applicable_to: ["Data Engineering"] }
applied_in: []
aliases: [dbt, dbt-core, model, source, seed, snapshot, macro]
---

# dbt (data build tool)

> The de facto modern standard for SQL transformation. **Model = SELECT, ref / source = lineage, test = data quality, docs = auto-generated**. dbt is the T in ELT (Extract, Load, Transform).

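## 📖 Core concepts
- Model: a `.sql` file containing one SELECT, materialized as a table or view.
- Source: an external raw table that dbt reads but does not build.
- Ref: a reference to another model; these references form the dependency graph.
- Test: a data quality check.
- Snapshot: row history in SCD Type 2 form.
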
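To make the "ref = lineage" idea concrete, here is a minimal Python sketch. It is not dbt's actual parser: it only pulls `ref('...')` names out of model SQL with a regular expression and orders the models with the standard library's `graphlib`. The `MODELS` dict and `build_order` function are hypothetical, and the assumption that `dim_users` selects from `stg_users` is made up for illustration.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model SQL keyed by model name.
# Assumption: dim_users is built from stg_users.
MODELS = {
    "stg_orders": "select * from {{ source('raw', 'orders') }}",
    "stg_users": "select * from {{ source('raw', 'users') }}",
    "dim_users": "select * from {{ ref('stg_users') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "left join {{ ref('dim_users') }} u using (user_id)"
    ),
}

# Matches ref('model_name') but not source('schema', 'table').
REF_PATTERN = re.compile(r"ref\(\s*'([^']+)'\s*\)")


def build_order(models):
    """Return model names so that every model comes after the models it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

Because the ordering is derived from the `ref` calls alone, nobody has to maintain a run order by hand; hard-coding a table name instead of using `ref` would silently drop that edge from the graph.
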
## 💻 Code patterns

### Folder structure
```
dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── marts/
│   │   ├── core/
│   │   │   ├── dim_users.sql
│   │   │   └── fct_orders.sql
│   │   └── finance/
│   │       └── revenue_daily.sql
│   └── schema.yml
├── macros/
├── tests/
├── seeds/
└── snapshots/
```

### Model
```sql
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

select
    id as order_id,
    user_id,
    amount,
    status,
    created_at
from {{ source('raw', 'orders') }}
where status != 'cancelled'
```

```sql
-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
) }}

select
    o.order_id,
    o.user_id,
    u.email,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }} o
left join {{ ref('dim_users') }} u using (user_id)

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what the target table already holds
where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
```

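The `is_incremental()` branch above can be sketched outside dbt. The following Python example uses an in-memory SQLite database as a stand-in for the warehouse; `run_fct_orders` is a hypothetical helper, and the real `is_incremental()` also takes the materialization and the `--full-refresh` flag into account, which this sketch ignores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table stg_orders (order_id integer, amount real, created_at text);
    insert into stg_orders values (1, 10.0, '2026-05-01'), (2, 20.0, '2026-05-02');
""")


def run_fct_orders(conn):
    """Build or update fct_orders and return its total row count."""
    exists = conn.execute(
        "select 1 from sqlite_master where type = 'table' and name = 'fct_orders'"
    ).fetchone()
    if exists is None:
        # First run: the target does not exist yet, so load everything.
        conn.execute("create table fct_orders as select * from stg_orders")
    else:
        # Later runs: same select plus the watermark filter on created_at.
        conn.execute("""
            insert into fct_orders
            select * from stg_orders
            where created_at > (select max(created_at) from fct_orders)
        """)
    return conn.execute("select count(*) from fct_orders").fetchone()[0]


first = run_fct_orders(conn)                       # full load
conn.execute("insert into stg_orders values (3, 30.0, '2026-05-03')")
second = run_fct_orders(conn)                      # picks up only order 3
third = run_fct_orders(conn)                       # nothing new, nothing added
print(first, second, third)
```

One design consequence worth noting: a strict `>` watermark skips rows that arrive late with an older `created_at`, so a pipeline with late-arriving data may need a lookback window instead.
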
### sources / schema
```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: production
    schema: public
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }
      - name: users
```

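As a rough illustration of what the `warn_after` / `error_after` thresholds mean, the Python sketch below classifies a source by the age of its newest `_ingested_at` value. `freshness_status` and `PERIODS` are hypothetical names, and this is a simplification of the comparison, not dbt's implementation.

```python
from datetime import datetime, timedelta, timezone

PERIODS = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
}


def freshness_status(max_loaded_at, now, warn_after, error_after):
    """Classify a source as 'pass', 'warn', or 'error' from the age of its newest row."""
    age = now - max_loaded_at
    if age > error_after["count"] * PERIODS[error_after["period"]]:
        return "error"
    if age > warn_after["count"] * PERIODS[warn_after["period"]]:
        return "warn"
    return "pass"


# Thresholds copied from the orders source above.
WARN = {"count": 1, "period": "hour"}
ERROR = {"count": 4, "period": "hour"}

now = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
for minutes_old in (30, 120, 300):
    loaded = now - timedelta(minutes=minutes_old)
    print(minutes_old, freshness_status(loaded, now, WARN, ERROR))
```
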
### Tests
```yaml
# models/marts/core/schema.yml
version: 2

models:
  - name: dim_users
    description: User dimension
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']

  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
```

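Each of these tests boils down to a query that selects the rows violating a rule, and the test passes when that query returns nothing. The Python sketch below shows the idea against an in-memory SQLite database with deliberately bad sample data; the SQL is hand-written for illustration and is not the exact SQL that dbt generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_users (user_id integer, email text, plan text);
    insert into dim_users values
        (1, 'a@example.com', 'free'),
        (2, 'b@example.com', 'pro'),
        (2, null,            'gold');
    create table fct_orders (order_id integer, user_id integer);
    insert into fct_orders values (10, 1), (11, 99);
""")

# Each check selects the offending rows: a duplicate key, a null email,
# an unknown plan, and an order pointing at a user that does not exist.
CHECKS = {
    "unique(user_id)": """
        select user_id from dim_users
        where user_id is not null
        group by user_id having count(*) > 1""",
    "not_null(email)": "select * from dim_users where email is null",
    "accepted_values(plan)": """
        select * from dim_users
        where plan not in ('free', 'pro', 'enterprise')""",
    "relationships(fct_orders.user_id)": """
        select o.user_id from fct_orders o
        left join dim_users u on o.user_id = u.user_id
        where o.user_id is not null and u.user_id is null""",
}

failures = {name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()}
print(failures)
```
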
### Custom test
```sql
-- tests/positive_amount.sql
-- Singular test: it fails if this query returns any rows
select * from {{ ref('fct_orders') }}
where amount <= 0
```

### Macro (reuse)
```sql
-- macros/clean_string.sql
{% macro clean_string(col) %}
    trim(lower({{ col }}))
{% endmacro %}

-- Usage (in a model file)
select {{ clean_string('email') }} as email_clean
from {{ ref('stg_users') }}
```

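### Materialization types
```
view:        query re-run on every SELECT; small or frequently changing models
table:       full rebuild on every run; small to medium models
incremental: only new or changed rows are processed; large fact tables
ephemeral:   inlined as a CTE; intermediate transformations
snapshot:    SCD Type 2 (history); a separate resource type built with `dbt snapshot`
```
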
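The practical difference between `view` and `table` is when the SELECT runs. A small Python sketch with SQLite, using hypothetical table names, shows that a view reflects rows loaded after the build while a table stays frozen until the next rebuild:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table raw_orders (order_id integer, amount real);
    insert into raw_orders values (1, 10.0);
    create view  orders_view  as select * from raw_orders;  -- like materialized='view'
    create table orders_table as select * from raw_orders;  -- like materialized='table'
""")

# New raw data arrives after the build step.
conn.execute("insert into raw_orders values (2, 20.0)")

view_rows = conn.execute("select count(*) from orders_view").fetchone()[0]
table_rows = conn.execute("select count(*) from orders_table").fetchone()[0]
print(view_rows, table_rows)
```

The view pays the query cost on every read; the table pays it once per build, which is the trade-off behind the size guidance in the list above.
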
### Snapshot (SCD Type 2, history)
```sql
-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='check',
    check_cols=['email', 'plan']
) }}

select user_id, email, plan, updated_at from {{ source('raw', 'users') }}
{% endsnapshot %}
```

```bash
dbt snapshot
```

→ Changes are tracked through the `dbt_valid_from` and `dbt_valid_to` columns.

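To show what those two columns record, here is a simplified Python sketch of a `check`-style snapshot run. `snapshot_check` is a hypothetical function working on in-memory dicts; it ignores deleted source rows and the other metadata columns dbt adds, so treat it as an illustration of the SCD Type 2 idea rather than dbt's algorithm.

```python
def snapshot_check(history, current, unique_key, check_cols, now):
    """Apply one snapshot run in place and return the history list.

    history: rows carrying dbt_valid_from / dbt_valid_to
    current: rows as they look in the source right now
    """
    open_rows = {r[unique_key]: r for r in history if r["dbt_valid_to"] is None}
    for row in current:
        old = open_rows.get(row[unique_key])
        if old is not None and all(old[c] == row[c] for c in check_cols):
            continue  # nothing changed in the checked columns
        if old is not None:
            old["dbt_valid_to"] = now  # close the previous version
        history.append({**row, "dbt_valid_from": now, "dbt_valid_to": None})
    return history


history = []
snapshot_check(history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}],
               "user_id", ["email", "plan"], now="2026-05-01")
snapshot_check(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}],
               "user_id", ["email", "plan"], now="2026-05-02")
for r in history:
    print(r)
```

After the second run the `free` row is closed with a `dbt_valid_to` value and a new `pro` row is open, so a query for "the plan as of a given date" becomes a range filter on the two columns.
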
### Seed (small lookup tables)
`seeds/country_codes.csv` (the header row must be the first line, since CSV has no comment syntax):
```csv
country_code,country_name
US,United States
KR,Korea
JP,Japan
```

```bash
dbt seed
```

```sql
select * from {{ ref('country_codes') }}
```

### Run / test
```bash
dbt run                        # build all models
dbt run --select stg_orders+   # stg_orders and everything downstream
dbt run --select +dim_users    # dim_users and everything upstream

dbt test
dbt test --select dim_users

# Incremental models: the first build is a full load, later builds are incremental
dbt build                      # seeds, models, snapshots, and tests in DAG order
```

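The `+` in a selector is a graph traversal: a trailing `+` walks to descendants and a leading `+` walks to ancestors. The Python sketch below resolves the two selectors used above against a hypothetical lineage; the `PARENTS` edges (for example `revenue_daily` depending on `fct_orders`) are assumptions for illustration, and real dbt selectors support far more syntax than this.

```python
# Hypothetical lineage: each model maps to the models it depends on (its parents).
PARENTS = {
    "stg_orders": set(),
    "stg_users": set(),
    "dim_users": {"stg_users"},
    "fct_orders": {"stg_orders", "dim_users"},
    "revenue_daily": {"fct_orders"},
}


def _walk(start, edges):
    """Collect every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select(selector, parents=PARENTS):
    """Resolve 'model', '+model', 'model+', or '+model+' to a set of model names."""
    name = selector.strip("+")
    # Invert the parent map to get child edges.
    children = {n: {c for c, ps in parents.items() if n in ps} for n in parents}
    chosen = {name}
    if selector.startswith("+"):
        chosen |= _walk(name, parents)
    if selector.endswith("+"):
        chosen |= _walk(name, children)
    return chosen


print(sorted(select("stg_orders+")))
print(sorted(select("+dim_users")))
```
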
### Docs
```bash
dbt docs generate
dbt docs serve   # web UI

# Lineage graph + column descriptions
```

### CI
```yaml
- run: dbt deps
- run: dbt seed --target ci
- run: dbt run --target ci
- run: dbt test --target ci
- run: dbt source freshness   # check source freshness
```

```yaml
# .github/workflows/dbt.yml
- run: dbt build --select state:modified+ --defer --state ./prod-manifest
```

→ Builds only the modified models and their downstream dependents.

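Conceptually, `state:modified+` compares the current project against a saved production manifest and then follows the graph downstream. The Python sketch below imitates that with simplified dicts; `modified_plus` and the node shape (`checksum`, `depends_on`) are assumptions for illustration, and dbt's real comparison reads `manifest.json` and considers more than a file checksum.

```python
def modified_plus(prod_nodes, ci_nodes):
    """Return nodes that are new or changed in ci_nodes, plus their descendants.

    Both arguments map node name -> {"checksum": str, "depends_on": [names]}.
    """
    changed = {
        name for name, node in ci_nodes.items()
        if name not in prod_nodes or prod_nodes[name]["checksum"] != node["checksum"]
    }
    selected = set(changed)
    grew = True
    while grew:  # keep adding nodes that have an already-selected parent
        grew = False
        for name, node in ci_nodes.items():
            if name not in selected and selected & set(node["depends_on"]):
                selected.add(name)
                grew = True
    return selected


prod = {
    "stg_orders": {"checksum": "aaa", "depends_on": []},
    "dim_users": {"checksum": "bbb", "depends_on": []},
    "fct_orders": {"checksum": "ccc", "depends_on": ["stg_orders", "dim_users"]},
}
# The pull request edits stg_orders only.
ci = {**prod, "stg_orders": {"checksum": "zzz", "depends_on": []}}

print(sorted(modified_plus(prod, ci)))
```

Here `dim_users` is left out of the build; with `--defer`, references to it would resolve to the production copy instead of being rebuilt in the CI schema.
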
### Adapter (warehouse)
```
dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, dbt-duckdb
```

→ The same project code can run against different warehouses, apart from dialect-specific SQL and configs.

### profiles.yml (connection)
```yaml
my_project:
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev
      password: "{{ env_var('DB_PW') }}"
      dbname: dev
      schema: dbt_dev
    prod:
      type: postgres
      host: prod.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DB_PW') }}"  # set per environment, never committed
      dbname: analytics                    # placeholder database name
      schema: analytics
  target: dev
```

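The point of `env_var` is that the secret lives in the environment, not in the file. A minimal Python sketch of the lookup behavior, assuming a hypothetical `env_var` helper that mirrors the idea (return the variable, fall back to a default, fail loudly otherwise):

```python
import os


def env_var(name, default=None):
    """Return the environment variable, or the default; raise if neither exists."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Env var required but not provided: '{name}'")
    return value


os.environ["DB_PW"] = "s3cret"        # normally set by the shell or a CI secret store
print(env_var("DB_PW"))
print(env_var("DB_PORT", "5432"))     # hypothetical variable with a fallback
```

Failing on a missing variable is the safer default: a connection attempt with an empty password tends to produce a far less obvious error later.
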
### Performance
```sql
-- Incremental + clustering / partition
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'date', 'data_type': 'date'},
    cluster_by=['user_id']
) }}
```

→ `partition_by` in this dict form is BigQuery syntax; `cluster_by` works on both BigQuery and Snowflake.

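The `merge` strategy with a `unique_key` behaves like an upsert: matching keys are updated, new keys are inserted. The Python sketch below uses SQLite's `on conflict ... do update` (available in SQLite 3.24 and later) as a stand-in for a warehouse `MERGE` statement; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table fct_orders (order_id integer primary key, amount real);
    insert into fct_orders values (1, 10.0), (2, 20.0);
""")

# New batch: order 2 was corrected, order 3 is new.
batch = [(2, 25.0), (3, 30.0)]

# Upsert keyed on order_id, standing in for MERGE on unique_key='order_id'.
conn.executemany(
    """
    insert into fct_orders (order_id, amount) values (?, ?)
    on conflict (order_id) do update set amount = excluded.amount
    """,
    batch,
)

rows = conn.execute(
    "select order_id, amount from fct_orders order by order_id"
).fetchall()
print(rows)
```

Without the key, the corrected order 2 would be appended as a second row, which is exactly the duplicate the `unique` test on `order_id` is meant to catch.
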
### dbt Cloud / Mesh (large organizations)
- Managed dbt runs.
- Cross-project references (dbt Mesh).
- Browser-based IDE.
- Scheduling.

→ Alternatively, have Airflow or Dagster invoke dbt.

### Dagster + dbt
```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets


# Every dbt model in the manifest is loaded as a Dagster asset
@dbt_assets(manifest=Path('target/manifest.json'))
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(['build'], context=context).stream()
```

→ dbt models are exposed as Dagster assets automatically.

## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| SQL transform | dbt |
| Python ML pipeline | Dagster / Airflow |
| Streaming | Flink / Spark Structured Streaming |
| Schema migration | dbt snapshot or dbt-history |
| Small / single SQL query | Run it directly |
| Large organization | dbt Cloud / Dagster |

## ❌ Anti-patterns
- **Every model set to `materialized='table'`**: full rebuilds are expensive. Use incremental for large tables.
- **Models without tests**: data quality is unknown.
- **No source freshness checks**: stale data goes unnoticed.
- **No `schema.yml`**: columns have no descriptions.
- **Macro overuse**: hurts readability.
- **Running `dbt run` directly against production**: runs happen without a scheduler.
- **Large seeds (>10 MB)**: load that data through the ETL pipeline instead.

## 🤖 LLM usage hints
- Separate staging and marts as the standard layout.
- Always use ref / source instead of hard-coded table names.
- Tests (unique, not_null, relationships) are mandatory.
- Use incremental for large tables.

## 🔗 Related documents
- [[Data_Eng_Airflow_Dagster]]
- [[Data_Eng_Lakehouse]]
- [[DB_ClickHouse_OLAP]]