---
id: data-eng-dbt
title: dbt - SQL Transform / Test / Doc
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [data-engineering, dbt, sql, vibe-coding]
tech_stack: { language: "SQL / Jinja", applicable_to: ["Data Engineering"] }
applied_in: []
aliases: [dbt, dbt-core, model, source, seed, snapshot, macro]
---

# dbt (data build tool)

> The modern standard for SQL transformation. **Model = SELECT, ref / source = lineage, test = data quality, docs = auto-generated**. The T in ELT (Extract, Load, Transform).

## 📖 Core Concepts

- Model: a `.sql` file containing one SELECT, materialized as a table or view.
- Source: an external raw table, declared in YAML.
- Ref: a reference to another model; these references define the lineage graph.
- Test: a data quality check.
- Snapshot: SCD Type 2 history tracking.

## 💻 Code Patterns

### Folder structure

```
dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   └── stg_users.sql
│   ├── marts/
│   │   ├── core/
│   │   │   ├── dim_users.sql
│   │   │   └── fct_orders.sql
│   │   └── finance/
│   │       └── revenue_daily.sql
│   └── schema.yml
├── macros/
├── tests/
├── seeds/
└── snapshots/
```

### Model

```sql
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

select
    id as order_id,
    user_id,
    amount,
    status,
    created_at
from {{ source('raw', 'orders') }}
where status != 'cancelled'
```

```sql
-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='fail'
) }}

select
    o.order_id,
    o.user_id,
    u.email,
    o.amount,
    o.created_at
from {{ ref('stg_orders') }} o
left join {{ ref('dim_users') }} u using (user_id)

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already loaded
where o.created_at > (select max(created_at) from {{ this }})
{% endif %}
```

### sources / schema

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: production
    schema: public
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 4, period: hour }
      - name: users
```

### Tests

```yaml
# models/marts/core/schema.yml
version: 2

models:
  - name: dim_users
    description: User dimension
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
          - unique
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'pro', 'enterprise']

  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_users')
              field: user_id
```

On dbt 1.8 and later, `data_tests:` is the preferred key; `tests:` is still accepted.

### Custom test

A singular test is a query that returns the failing rows; the test passes when it returns zero rows.

```sql
-- tests/positive_amount.sql
select *
from {{ ref('fct_orders') }}
where amount <= 0
```

### Macro (reuse)

```sql
-- macros/clean_string.sql
{% macro clean_string(col) %}
    trim(lower({{ col }}))
{% endmacro %}

-- Usage (in a model file, not in macros/)
select {{ clean_string('email') }} as email_clean
from {{ ref('stg_users') }}
```

### Materialization types

```
view:         re-runs the SELECT on every query; small or frequently changing models
table:        full rebuild on every run; small to medium models
incremental:  processes only new / changed rows; large fact tables
ephemeral:    inlined as a CTE; intermediate transformations
snapshot:     SCD Type 2 history (a separate resource type built by `dbt snapshot`, not a `materialized=` option)
```

### Snapshot (SCD Type 2 history)

```sql
-- snapshots/users_snapshot.sql
{% snapshot users_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='check',
    check_cols=['email', 'plan']
) }}

select user_id, email, plan, updated_at
from {{ source('raw', 'users') }}

{% endsnapshot %}
```

```bash
dbt snapshot
```

→ Changes are tracked through the `dbt_valid_from` and `dbt_valid_to` columns; the current version of a row has `dbt_valid_to` set to null.
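To make the `check` strategy concrete, here is a minimal Python sketch of how the validity columns evolve across three snapshot runs. It is a simplified illustration, not dbt's implementation (dbt does this in SQL inside the warehouse and adds further metadata columns); `SnapshotRow`, `apply_snapshot`, and the sample data are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

CHECK_COLS = ("email", "plan")  # mirrors check_cols in the snapshot config above


@dataclass
class SnapshotRow:
    user_id: int
    email: str
    plan: str
    dbt_valid_from: str
    dbt_valid_to: Optional[str] = None  # None means "current version"


def apply_snapshot(history: list[SnapshotRow], source: list[dict], run_at: str) -> None:
    """Simplified 'check' strategy: close changed rows, append new versions."""
    current = {r.user_id: r for r in history if r.dbt_valid_to is None}
    for src in source:
        row = current.get(src["user_id"])
        changed = row is not None and any(getattr(row, c) != src[c] for c in CHECK_COLS)
        if changed:
            row.dbt_valid_to = run_at  # close the previous version
        if row is None or changed:
            history.append(SnapshotRow(src["user_id"], src["email"], src["plan"], run_at))


history: list[SnapshotRow] = []
# Run 1: user appears; run 2: plan changes; run 3: nothing changes.
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "free"}], "2026-01-01")
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}], "2026-02-01")
apply_snapshot(history, [{"user_id": 1, "email": "a@example.com", "plan": "pro"}], "2026-03-01")

for r in history:
    print(r.user_id, r.plan, r.dbt_valid_from, r.dbt_valid_to)
```

The third run adds nothing because none of the checked columns changed, so the history ends with one closed `free` row and one open `pro` row.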

### Seed (small lookup table)

`seeds/country_codes.csv` (the first line of the file must be the header row; CSV has no comment syntax):

```csv
country_code,country_name
US,United States
KR,Korea
JP,Japan
```

```bash
dbt seed
```

```sql
select * from {{ ref('country_codes') }}
```

### Run / test

```bash
dbt run                        # build all models
dbt run --select stg_orders+   # stg_orders and everything downstream
dbt run --select +dim_users    # dim_users and everything upstream
dbt test
dbt test --select dim_users

# first build = full, later builds = incremental
dbt build                      # run + test (plus seeds and snapshots) in one pass
```

### Docs

```bash
dbt docs generate
dbt docs serve   # web UI
# Lineage graph + column descriptions
```

### CI

The steps below assume a `ci` target is defined in `profiles.yml`.

```yaml
- run: dbt deps
- run: dbt seed --target ci
- run: dbt run --target ci
- run: dbt test --target ci
- run: dbt source freshness   # source freshness check
```

```yaml
# .github/workflows/dbt.yml
- run: dbt build --select state:modified+ --defer --state ./prod-manifest
```

→ Builds only the modified models and their downstream models. `--state` points at a directory holding the production `manifest.json`, and `--defer` resolves unbuilt upstream refs against production.

### Adapter (warehouse)

```
dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-databricks, dbt-duckdb
```

→ The same project runs on different warehouses, apart from dialect-specific SQL and configs.

### profiles.yml (connection)

```yaml
my_project:
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev
      password: "{{ env_var('DB_PW') }}"
      dbname: dev
      schema: dbt_dev
    prod:
      type: postgres
      host: prod.example.com
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DB_PW_PROD') }}"  # assumed variable name
      dbname: prod                              # assumed database name
      schema: analytics
  target: dev
```

### Performance

```sql
-- Incremental + clustering / partitioning
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'date', 'data_type': 'date'},
    cluster_by=['user_id']
) }}
```

→ `cluster_by` applies to BigQuery and Snowflake; the dictionary form of `partition_by` shown here is the BigQuery syntax.

### dbt Cloud / Mesh (large organizations)

- Managed dbt runs.
- Cross-project references (dbt Mesh).
- Browser-based IDE.
- Scheduling.

→ Alternatively, have Airflow / Dagster invoke dbt.

### Dagster + dbt

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets


@dbt_assets(manifest=Path("target/manifest.json"))
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream per-model events back to Dagster
    yield from dbt.cli(["build"], context=context).stream()
```

→ Each dbt model automatically becomes a Dagster asset.
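
### Graph selectors (illustration)

As a rough picture of what the `+` operators in the Run / test commands (`stg_orders+`, `+dim_users`) resolve to, the sketch below walks a small lineage graph in plain Python. It is an illustration only, not dbt's selector engine, which also handles sources, tests, tags, and more. The edges `stg_users -> dim_users` and `fct_orders -> revenue_daily` are assumptions, since those models are not shown in this note; `EDGES`, `_walk`, and `select` are hypothetical names.

```python
# Edges point from parent to child.
# stg_users -> dim_users and fct_orders -> revenue_daily are assumed.
EDGES: dict[str, list[str]] = {
    "stg_orders": ["fct_orders"],
    "stg_users": ["dim_users"],
    "dim_users": ["fct_orders"],
    "fct_orders": ["revenue_daily"],
    "revenue_daily": [],
}


def _walk(start: str, graph: dict[str, list[str]]) -> set[str]:
    """Return start plus every node reachable from it in graph."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def select(spec: str) -> set[str]:
    """Resolve 'model', 'model+', '+model' or '+model+' against EDGES."""
    name = spec.strip("+")
    # Invert the graph so upstream traversal can reuse _walk.
    parents: dict[str, list[str]] = {n: [] for n in EDGES}
    for parent, children in EDGES.items():
        for child in children:
            parents[child].append(parent)
    selected = {name}
    if spec.endswith("+"):
        selected |= _walk(name, EDGES)    # downstream
    if spec.startswith("+"):
        selected |= _walk(name, parents)  # upstream
    return selected


print(sorted(select("stg_orders+")))  # ['fct_orders', 'revenue_daily', 'stg_orders']
print(sorted(select("+dim_users")))   # ['dim_users', 'stg_users']
```

A trailing `+` follows child edges and a leading `+` follows parent edges, so `+fct_orders+` would select every model in this small graph.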

## 🤔 Decision Criteria

| Task | Recommendation |
|---|---|
| SQL transformation | dbt |
| Python ML pipeline | Dagster / Airflow |
| Streaming | Flink / Spark Structured Streaming |
| Schema migration | Not a dbt feature; `dbt snapshot` only records row history (SCD Type 2) |
| Small / one-off SQL | Run it directly |
| Large organization | dbt Cloud / Dagster |

## ❌ Anti-patterns

- **Every model `materialized='table'`**: full rebuilds are expensive. Use incremental for large tables.
- **Models without tests**: data quality is unknown.
- **No source freshness checks**: stale data goes unnoticed.
- **No schema.yml**: no column descriptions.
- **Macro overuse**: hurts readability.
- **Running `dbt run` against production by hand**: no scheduler involved.
- **Large seeds (>10MB)**: load them through the EL pipeline instead.

## 🤖 LLM Usage Hints

- Follow the standard staging / marts split.
- Always use ref / source instead of hard-coded table names.
- Tests (unique, not_null, relationships) are mandatory.
- Use incremental for large tables.

## 🔗 Related Documents

- [[Data_Eng_Airflow_Dagster]]
- [[Data_Eng_Lakehouse]]
- [[DB_ClickHouse_OLAP]]