| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| data-eng-dbt | dbt — SQL Transform / Test / Doc | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
dbt (data build tool)
The modern standard for SQL transforms. Model = SELECT, ref / source = lineage, test = data quality, docs = auto-generated. The T in ELT (Extract, Load, Transform).
📖 Core concepts
- Model: a .sql file = one SELECT → table / view.
- Source: an external raw table.
- Ref: a reference to another model.
- Test: a data quality check.
- Snapshot: SCD Type 2.
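The "ref = lineage" idea can be sketched in a few lines of Python: scan each model body for ref() calls, treat them as graph edges, and sort topologically to get a build order. This is a simplified illustration, not how dbt itself parses projects; the model bodies in MODELS and the helper names parse_refs and build_order are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical, simplified model bodies keyed by model name.
MODELS = {
    "stg_orders": "select * from {{ source('raw', 'orders') }}",
    "stg_users": "select * from {{ source('raw', 'users') }}",
    "dim_users": "select * from {{ ref('stg_users') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "left join {{ ref('dim_users') }} u using (user_id)"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def parse_refs(sql: str) -> set[str]:
    """Return the model names referenced through ref() in one model body."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Order models so that every model comes after the models it refs."""
    graph = {name: parse_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

Because source() calls are not matched, the staging models have no parents here and always sort before the marts that ref them.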
💻 Code patterns
Folder structure
dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── marts/
│   │   ├── core/
│   │   │   ├── dim_users.sql
│   │   │   └── fct_orders.sql
│   │   └── finance/
│   │       └── revenue_daily.sql
│   └── schema.yml
├── macros/
├── tests/
├── seeds/
└── snapshots/
Model
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}
select
    id as order_id,
    user_id,
    amount,
    status,
    created_at
from {{ source('raw', 'orders') }}
where status != 'cancelled'

-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
) }}
select
    o.order_id,
    o.user_id,
    u.email,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }} o
left join {{ ref('dim_users') }} u using (user_id)
{% if is_incremental() %}
where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
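The is_incremental() filter can be imitated with sqlite3 to see what each run actually inserts. This is a rough sketch under stated assumptions: an empty target stands in for "first run" (dbt instead checks that the target table exists and that no full refresh was requested), and the table layout and the run_incremental helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table stg_orders (order_id integer, amount real, created_at text);
    create table fct_orders (order_id integer primary key, amount real, created_at text);
""")


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("select count(*) from fct_orders").fetchone()[0]


def run_incremental(conn: sqlite3.Connection) -> int:
    """Insert only rows newer than the target's high-water mark; return rows added."""
    before = row_count(conn)
    (high_water,) = conn.execute("select max(created_at) from fct_orders").fetchone()
    if high_water is None:
        # First run: nothing in the target yet, so load everything.
        conn.execute("insert into fct_orders select * from stg_orders")
    else:
        # Later runs: the equivalent of the is_incremental() branch.
        conn.execute(
            "insert into fct_orders select * from stg_orders where created_at > ?",
            (high_water,),
        )
    return row_count(conn) - before


conn.executemany(
    "insert into stg_orders values (?, ?, ?)",
    [(1, 10.0, "2026-05-01"), (2, 20.0, "2026-05-02")],
)
first = run_incremental(conn)   # full load
conn.execute("insert into stg_orders values (3, 30.0, '2026-05-03')")
second = run_incremental(conn)  # only the new row
third = run_incremental(conn)   # nothing new
print(first, second, third)
```

A strict `>` comparison on created_at skips late-arriving rows that share the current maximum timestamp, which is one reason real projects often filter with a lookback window.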
sources / schema
# models/staging/sources.yml
version: 2
sources:
  - name: raw
    database: production
    schema: public
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }
      - name: users
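The freshness thresholds above boil down to comparing the age of the newest loaded row against two limits. A minimal sketch of that comparison, assuming the 1 hour / 4 hour values from sources.yml; the function name freshness_status and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds mirroring warn_after / error_after in sources.yml.
WARN_AFTER = timedelta(hours=1)
ERROR_AFTER = timedelta(hours=4)


def freshness_status(max_loaded_at: datetime, now: datetime) -> str:
    """Classify a source as 'pass', 'warn', or 'error' from its newest row's age."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


now = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
print(freshness_status(datetime(2026, 5, 9, 11, 30, tzinfo=timezone.utc), now))
print(freshness_status(datetime(2026, 5, 9, 10, 0, tzinfo=timezone.utc), now))
print(freshness_status(datetime(2026, 5, 9, 7, 0, tzinfo=timezone.utc), now))
```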
Tests
# models/marts/core/schema.yml
version: 2
models:
  - name: dim_users
    description: User dimension
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
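Each generic test amounts to a query that returns the offending rows; zero rows means the test passes. The sketch below approximates four of them against sqlite3. The SQL is a hand-written approximation of the idea, not the exact SQL dbt compiles, and the sample rows (including the deliberately bad 'trial' plan and the orphan order) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_users (user_id integer, email text, plan text);
    create table fct_orders (order_id integer, user_id integer, amount real);
    insert into dim_users values (1, 'a@example.com', 'free'), (2, 'b@example.com', 'trial');
    insert into fct_orders values (10, 1, 5.0), (11, 3, 7.5);
""")

# Each query selects the rows that violate the rule.
TESTS = {
    "unique_dim_users_user_id": """
        select user_id from dim_users
        where user_id is not null
        group by user_id having count(*) > 1
    """,
    "not_null_dim_users_email": "select * from dim_users where email is null",
    "accepted_values_dim_users_plan": """
        select plan from dim_users
        where plan not in ('free', 'pro', 'enterprise')
    """,
    "relationships_fct_orders_user_id": """
        select o.user_id from fct_orders o
        left join dim_users u on o.user_id = u.user_id
        where o.user_id is not null and u.user_id is null
    """,
}


def run_tests(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the number of failing rows per test."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in TESTS.items()}


results = run_tests(conn)
failed = sorted(name for name, failures in results.items() if failures > 0)
print(results)
print(failed)
```

A singular test such as tests/positive_amount.sql follows the same convention: the SELECT lists violations, and any returned row is a failure.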
Custom test
-- tests/positive_amount.sql
select * from {{ ref('fct_orders') }}
where amount <= 0
Macro (reuse)
-- macros/clean_string.sql
{% macro clean_string(col) %}
    trim(lower({{ col }}))
{% endmacro %}
-- Usage
select {{ clean_string('email') }} as email_clean
from {{ ref('stg_users') }}
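A macro is text substitution at compile time: it returns a SQL fragment, and the warehouse only ever sees the expanded query. A plain Python function can stand in for the macro to make that visible. This sketch assumes sqlite3 as the engine and a hypothetical one-row stg_users table; it does not use dbt's Jinja rendering.

```python
import sqlite3


def clean_string(col: str) -> str:
    """Python stand-in for the clean_string macro: returns a SQL fragment."""
    return f"trim(lower({col}))"


# The "compiled" query: the macro call is replaced by its output.
sql = f"select {clean_string('email')} as email_clean from stg_users"
print(sql)

conn = sqlite3.connect(":memory:")
conn.execute("create table stg_users (email text)")
conn.execute("insert into stg_users values ('  Alice@Example.COM ')")
rows = conn.execute(sql).fetchall()
print(rows)
```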
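Materialization types
view: the SELECT re-runs on every query; for small or frequently changing models
table: full rebuild on every run; for small to medium tables
incremental: adds only new or changed rows; for large fact tables
ephemeral: inlined as a CTE; for intermediate transformations
snapshot: SCD Type 2 (history)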
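The practical difference between view and table shows up when the upstream data changes. A small sqlite3 sketch, with hypothetical table names: the view reflects new rows immediately, while the table stays stale until it is rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table raw_orders (order_id integer, amount real);
    insert into raw_orders values (1, 10.0);
    -- 'view' materialization: stores the query, not the data
    create view orders_view as select * from raw_orders;
    -- 'table' materialization: stores a copy of the data at build time
    create table orders_table as select * from raw_orders;
    insert into raw_orders values (2, 20.0);
""")


def row_count(name: str) -> int:
    return conn.execute(f"select count(*) from {name}").fetchone()[0]


view_rows = row_count("orders_view")    # sees the new row
table_rows = row_count("orders_table")  # still the build-time copy

# A later run rebuilds the table from scratch.
conn.executescript("""
    drop table orders_table;
    create table orders_table as select * from raw_orders;
""")
rebuilt_rows = row_count("orders_table")
print(view_rows, table_rows, rebuilt_rows)
```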
Snapshot (SCD Type 2 — history)
-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='check',
    check_cols=['email', 'plan'],
) }}
select user_id, email, plan, updated_at from {{ source('raw', 'users') }}
{% endsnapshot %}
dbt snapshot
→ Tracks changes via dbt_valid_from and dbt_valid_to.
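The check strategy can be sketched as: compare the checked columns against the current version of each key, close the old version when they differ, and append a new one. This is a simplified in-memory illustration; the SnapshotRow class and apply_snapshot helper are hypothetical, and deleted source rows are not handled.

```python
from dataclasses import dataclass
from typing import Optional

CHECK_COLS = ("email", "plan")


@dataclass
class SnapshotRow:
    user_id: int
    email: str
    plan: str
    dbt_valid_from: str
    dbt_valid_to: Optional[str] = None  # None marks the current version


def apply_snapshot(history: list[SnapshotRow], source: list[dict], run_at: str) -> None:
    """Close changed rows and append new versions, modifying history in place."""
    current = {row.user_id: row for row in history if row.dbt_valid_to is None}
    for rec in source:
        live = current.get(rec["user_id"])
        if live is not None and all(getattr(live, c) == rec[c] for c in CHECK_COLS):
            continue  # nothing changed in the checked columns
        if live is not None:
            live.dbt_valid_to = run_at  # close the superseded version
        history.append(SnapshotRow(rec["user_id"], rec["email"], rec["plan"], run_at))


history: list[SnapshotRow] = []
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}], "2026-05-01")
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}], "2026-05-02")
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}], "2026-05-03")
for row in history:
    print(row)
```

The third run adds nothing because neither email nor plan changed, so the history keeps exactly one closed row and one open row.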
Seed (small lookup table)
# seeds/country_codes.csv
country_code,country_name
US,United States
KR,Korea
JP,Japan
dbt seed
select * from {{ ref('country_codes') }}
Run / test
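dbt run                        # build all models
dbt run --select stg_orders+   # stg_orders and everything downstream
dbt run --select +dim_users    # dim_users and everything upstream
dbt test
dbt test --select dim_users
# Incremental models: the first build is full, later runs are incremental
dbt build                      # run + test in one command (also covers seeds and snapshots)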
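The `+` graph operators are plain graph traversals over the lineage. A minimal sketch of how `model+` and `+model` could be resolved, assuming a hypothetical PARENTS map (the revenue_daily edge in particular is an assumption, since its SQL is not shown); dbt's real selector syntax supports far more than this.

```python
# Hypothetical lineage: each model maps to the models it refs (its parents).
PARENTS = {
    "stg_orders": set(),
    "stg_users": set(),
    "dim_users": {"stg_users"},
    "fct_orders": {"stg_orders", "dim_users"},
    "revenue_daily": {"fct_orders"},
}

# Invert the edges to get each model's direct children.
CHILDREN = {name: set() for name in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].add(child)


def walk(start: str, edges: dict[str, set[str]]) -> set[str]:
    """Return every node reachable from start by following edges."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select(selector: str) -> set[str]:
    """Resolve 'model', 'model+', '+model', or '+model+' to a set of models."""
    name = selector.strip("+")
    selected = {name}
    if selector.startswith("+"):
        selected |= walk(name, PARENTS)   # upstream
    if selector.endswith("+"):
        selected |= walk(name, CHILDREN)  # downstream
    return selected


print(sorted(select("stg_orders+")))
print(sorted(select("+dim_users")))
```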
Docs
dbt docs generate
dbt docs serve # web UI
# Lineage graph + column descriptions
CI
- run: dbt deps
- run: dbt seed --target ci
- run: dbt run --target ci
- run: dbt test --target ci
- run: dbt source freshness # check source freshness
# .github/workflows/dbt.yml
- run: dbt build --select state:modified+ --defer --state ./prod-manifest
→ Builds only the modified models and their downstream dependents.
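Conceptually, state:modified+ compares the current project against a saved production state and then expands the changed set downstream. The sketch below imitates that with SQL checksums. It is a simplification under stated assumptions: dbt reads this from manifest.json and also considers config changes, and the model bodies, CHILDREN map, and modified_plus helper are hypothetical.

```python
import hashlib


def checksum(sql: str) -> str:
    return hashlib.sha256(sql.encode()).hexdigest()


def modified_plus(
    prod: dict[str, str],
    current: dict[str, str],
    children: dict[str, set[str]],
) -> set[str]:
    """Return new or changed models plus everything downstream of them."""
    changed = {
        name
        for name, sql in current.items()
        if name not in prod or checksum(prod[name]) != checksum(sql)
    }
    selected = set(changed)
    stack = list(changed)
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected


# Hypothetical production state vs. the state on a feature branch.
PROD = {"stg_orders": "select 1", "dim_users": "select 2", "fct_orders": "select 3"}
CURRENT = {"stg_orders": "select 1", "dim_users": "select 2, 'x'", "fct_orders": "select 3"}
CHILDREN = {"stg_orders": {"fct_orders"}, "dim_users": {"fct_orders"}}

to_build = modified_plus(PROD, CURRENT, CHILDREN)
print(sorted(to_build))
```

Unchanged parents such as stg_orders are not rebuilt in CI; with --defer, references to them resolve to the existing production relations instead.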
Adapter (warehouse)
dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, dbt-duckdb
→ Largely the same project code runs on different warehouses (dialect-specific SQL aside).
profiles.yml (connection)
my_project:
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev
      password: "{{ env_var('DB_PW') }}"
      dbname: dev
      schema: dbt_dev
    prod:
      type: postgres
      host: prod.example.com
      user: dbt_prod
      schema: analytics
      # port, password, and dbname are also required for a working prod target
  target: dev
Performance
-- Incremental + clustering / partition
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
partition_by={'field': 'date', 'data_type': 'date'},
cluster_by=['user_id'],
) }}
→ Warehouse-specific: partition_by is a BigQuery config; cluster_by applies on both BigQuery and Snowflake.
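What unique_key with the merge strategy buys is upsert semantics: a re-delivered or updated row replaces the existing one instead of duplicating it. A tiny in-memory sketch contrasting append with merge; the function names and sample rows are hypothetical.

```python
def append(target: list[dict], batch: list[dict]) -> None:
    """Append-only load: an updated row becomes a duplicate."""
    target.extend(batch)


def merge(target: dict[int, dict], batch: list[dict], unique_key: str = "order_id") -> None:
    """Upsert batch rows into target keyed by unique_key, in place."""
    for row in batch:
        target[row[unique_key]] = row


existing = {"order_id": 1, "amount": 10.0}
batch = [{"order_id": 1, "amount": 12.0}, {"order_id": 2, "amount": 20.0}]

appended = [dict(existing)]
append(appended, batch)

merged = {1: dict(existing)}
merge(merged, batch)

print(len(appended), len(merged), merged[1]["amount"])
```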
dbt Cloud / Mesh (large organizations)
- Managed dbt runs.
- Cross-project (dbt Mesh).
- IDE in browser.
- Scheduling.
→ Alternatively, Airflow or Dagster invokes dbt.
Dagster + dbt
from pathlib import Path

from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest=Path('target/manifest.json'))
def my_dbt_assets(context, dbt: DbtCliResource):
    # Run `dbt build` and stream its events back to Dagster.
    yield from dbt.cli(['build'], context=context).stream()
→ Each dbt model automatically becomes a Dagster asset.
🤔 Decision criteria
| Task | Recommendation |
|---|---|
| SQL transform | dbt |
| Python ML pipeline | Dagster / Airflow |
| Streaming | Flink / Spark Structured |
| Schema migration | dbt-snapshot or dbt-history |
| Small / single SQL query | Run it directly |
| Large organization | dbt Cloud / Dagster |
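❌ Anti-patterns
- Every model materialized='table': rebuilds are expensive. Use incremental for large tables.
- Models without tests: data quality is unknown.
- No source freshness checks: stale data goes unnoticed.
- No schema.yml: no column descriptions.
- Macro overuse: readability suffers.
- Running dbt run by hand in production: no scheduler behind it.
- Large data in seeds (>10MB): load it through ETL instead.
🤖 Hints for LLM use
- Separating staging from marts is the standard layout.
- Always use ref / source (never hard-code table names).
- Tests (unique, not_null, relationships) are mandatory.
- Use incremental for large tables.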