id: wiki-2026-0508-snowflake-data-warehousing
title: Snowflake Data Warehousing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Snowflake, Snowflake DW, Snowflake Cloud Data Platform
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: database, data-warehouse, cloud, analytics
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: sql, framework: snowflake

Snowflake Data Warehousing

In one line

"Storage is separated; compute is elastic." Snowflake is a cloud data warehouse built on a multi-cluster, shared-data architecture: columnar micro-partition storage, virtual warehouses, zero-copy clone, Time Travel, and native Iceberg support (2026). Its main competitors are Databricks, BigQuery, and Redshift.

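Core concepts

Architecture (3 layers)

  • Storage: data lives in S3, GCS, or Azure Blob as compressed, columnar micro-partitions (FDN format), each holding 50-500 MB of uncompressed data.
  • Compute (Virtual Warehouses): independent compute clusters, sized X-Small through 6X-Large and billed per second (60-second minimum per start).
  • Cloud Services: metadata, query optimization, and authentication; the stateless "brain" of the system.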

Key features

  • Zero-copy clone: instant DB/schema/table copy via metadata.
  • Time Travel: query data as it existed at a past point in time. Retention defaults to 1 day and can be extended up to 90 days on Enterprise edition.
  • Streams + Tasks: CDC + scheduled SQL = native pipeline.
  • Snowpark: Python/Scala/Java in-DB compute.
  • Iceberg tables (2026): external open-table format.
  • Cortex AI: built-in LLM functions.

Applications

  1. Analytical workloads (OLAP, BI).
  2. Data sharing (Secure Data Share — no copy).
  3. ELT with dbt.
  4. ML feature engineering (Snowpark + Cortex).

💻 Patterns

Warehouse sizing & auto-suspend

CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';
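To make the effect of AUTO_SUSPEND on cost concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the standard credit rates (X-Small = 1 credit/hour, doubling per size), per-second billing with a 60-second minimum per resume, and a hypothetical workload of 24 five-minute ETL bursts per day. The function name and the workload are illustrative only, not part of any Snowflake API.

```python
import math

# Credits per hour for standard warehouses; each size doubles the previous one.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}

MIN_BILLED_SECONDS = 60  # every start or resume bills at least 60 seconds


def credits_for_runs(size, busy_seconds, auto_suspend=60, clusters=1):
    """Estimate credits for separate warehouse runs (one resume per run).

    Each run is billed for its busy time plus the idle tail that elapses
    before AUTO_SUSPEND kicks in, with a 60-second floor.
    """
    billed_seconds = sum(
        max(MIN_BILLED_SECONDS, math.ceil(busy + auto_suspend))
        for busy in busy_seconds
    )
    return billed_seconds * CREDITS_PER_HOUR[size] * clusters / 3600


# Hypothetical day: 24 ETL bursts of 5 minutes each, AUTO_SUSPEND = 60.
bursty = credits_for_runs("MEDIUM", [300] * 24, auto_suspend=60)
# Same warehouse left running for the full 24 hours (never suspends).
always_on = credits_for_runs("MEDIUM", [24 * 3600], auto_suspend=0)

print(f"auto-suspend: {bursty:.1f} credits/day, always-on: {always_on:.1f} credits/day")
```

Under these assumptions the bursty schedule comes to about 9.6 credits per day against 96 for the always-on warehouse. The model ignores multi-cluster scale-out and cloud services charges.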

Copy from S3 (bulk load)

CREATE STAGE my_stage URL='s3://bucket/path/' STORAGE_INTEGRATION = my_int;
COPY INTO orders
  FROM @my_stage/orders/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = 'CONTINUE';

Zero-copy clone for testing

-- Instant; no extra storage until the clone diverges (copy-on-write)
CREATE DATABASE prod_clone CLONE prod;
-- Common pattern for dbt CI runs
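Why a clone is instant and initially free can be illustrated with a toy Python model. It assumes only what the note states: tables reference immutable micro-partitions, a clone copies metadata, and writes are copy-on-write. The class and function names are hypothetical and do not correspond to any Snowflake API.

```python
from dataclasses import dataclass, field
from itertools import count

_partition_ids = count(1)  # toy source of unique micro-partition ids


@dataclass
class ToyTable:
    """A table is modeled as a list of references to immutable partitions."""
    partitions: list = field(default_factory=list)

    def insert_batch(self):
        # New data always lands in a brand-new partition.
        self.partitions.append(next(_partition_ids))

    def rewrite_partition(self, index):
        # Updates never modify a partition in place; they swap in a new one.
        self.partitions[index] = next(_partition_ids)

    def clone(self):
        # Metadata-only operation: copy the reference list, not the data.
        return ToyTable(list(self.partitions))


def stored_partitions(*tables):
    """Distinct partitions that actually occupy storage across the tables."""
    return len({p for t in tables for p in t.partitions})


prod = ToyTable()
for _ in range(3):
    prod.insert_batch()

dev = prod.clone()
storage_after_clone = stored_partitions(prod, dev)  # clone added no storage

dev.rewrite_partition(0)                            # the clone diverges
storage_after_write = stored_partitions(prod, dev)  # one extra partition
```

In this model storage grows only when one side rewrites a partition, and the other side keeps pointing at the original, which is why a CI run against a clone cannot disturb production data.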

Time travel + undrop

SELECT * FROM orders AT (OFFSET => -60*5);          -- 5 minutes ago
SELECT * FROM orders BEFORE (STATEMENT => '01a...');
UNDROP TABLE orders;                                 -- only within the retention period
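The semantics of AT (OFFSET => ...) can be sketched with a toy versioned table in Python: each commit keeps a snapshot, and a query resolves to the latest snapshot at or before the requested time, provided it falls within retention. This is a simplified mental model with hypothetical names, not a description of Snowflake's internal implementation.

```python
from bisect import bisect_right


class ToyVersionedTable:
    """Keeps (commit_time, rows) snapshots, oldest first (toy model)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._times = []
        self._snapshots = []

    def commit(self, at_time, rows):
        # Commits must arrive in time order so bisect can search them.
        if self._times and at_time < self._times[-1]:
            raise ValueError("commits must be in chronological order")
        self._times.append(at_time)
        self._snapshots.append(list(rows))

    def at(self, now, offset_seconds):
        """Rows as of now + offset_seconds (negative, as in OFFSET => -300)."""
        if -offset_seconds > self.retention_seconds:
            raise ValueError("requested time is outside the retention period")
        idx = bisect_right(self._times, now + offset_seconds) - 1
        if idx < 0:
            raise ValueError("table did not exist at the requested time")
        return self._snapshots[idx]


orders = ToyVersionedTable(retention_seconds=86_400)  # 1-day retention
orders.commit(0, [{"id": 1}])
orders.commit(600, [{"id": 1}, {"id": 2}])

five_min_ago = orders.at(now=800, offset_seconds=-60 * 5)  # resolves to t=500
current = orders.at(now=800, offset_seconds=0)
```

Here the query five minutes back sees only the first commit, while the current view sees both rows. A request beyond the retention window raises an error rather than returning stale data.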

Streams + Tasks (CDC pipeline)

CREATE STREAM orders_stream ON TABLE orders;
CREATE TASK orders_etl
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
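  INSERT INTO orders_silver
  -- SELECT * on a stream also returns the METADATA$ACTION, METADATA$ISUPDATE,
  -- and METADATA$ROW_ID columns, so orders_silver must have columns for them.
  SELECT *, CURRENT_TIMESTAMP() AS ingest_ts
  FROM orders_stream;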
ALTER TASK orders_etl RESUME;
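The interplay between the stream offset and the task's WHEN clause can be sketched in Python. The toy model below treats a stream as a bookmark into an append-only change log. Real streams track table versions rather than storing a separate log, so this only illustrates the consume-and-advance behavior. All class and function names are hypothetical.

```python
class ToyChangeLog:
    """Append-only list of row changes on a source table (toy model)."""

    def __init__(self):
        self.changes = []

    def insert(self, row):
        self.changes.append({"METADATA$ACTION": "INSERT", **row})


class ToyStream:
    """Tracks an offset into the change log, like a bookmark."""

    def __init__(self, log):
        self.log = log
        self.offset = len(log.changes)  # a new stream starts at "now"

    def has_data(self):
        # Plays the role of SYSTEM$STREAM_HAS_DATA in the WHEN clause.
        return self.offset < len(self.log.changes)

    def consume(self):
        # Reading the stream in a committed DML statement advances the offset.
        pending = self.log.changes[self.offset:]
        self.offset = len(self.log.changes)
        return pending


def run_task(stream, target):
    """One scheduled tick: skip when the stream is empty, else load the delta."""
    if not stream.has_data():
        return False
    target.extend(stream.consume())
    return True


log = ToyChangeLog()
stream = ToyStream(log)
silver = []

log.insert({"order_id": 1})
log.insert({"order_id": 2})

first_tick = run_task(stream, silver)   # loads both pending changes
second_tick = run_task(stream, silver)  # nothing new, so the tick is skipped
```

The skipped second tick corresponds to the WHEN condition evaluating to false, in which case the task run is skipped and the warehouse does not need to resume.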

Snowpark Python (in-DB compute)

from snowflake.snowpark import Session, functions as F

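# cfg is assumed to be a dict of connection parameters
# (account, user, credentials, role, warehouse, database, schema).
sess = Session.builder.configs(cfg).create()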
df = sess.table('orders') \
  .filter(F.col('amount') > 100) \
  .group_by('customer_id') \
  .agg(F.sum('amount').alias('total'))
df.write.save_as_table('top_customers', mode='overwrite')

Cortex AI (LLM in SQL)

SELECT order_id,
  SNOWFLAKE.CORTEX.SUMMARIZE(review_text) AS summary,
  SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment
FROM reviews;

-- Free-text classify
SELECT SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
  ticket_body,
  ['billing','technical','refund','other']
) FROM tickets;

Iceberg external table (2026)

CREATE ICEBERG TABLE events
  CATALOG = my_glue
  EXTERNAL_VOLUME = my_s3_vol
  CATALOG_TABLE_NAME = 'analytics.events';
-- Snowflake, Spark, and Trino all read the same data.

Cost optimization (Resource Monitor)

CREATE RESOURCE MONITOR rm_dev
  WITH CREDIT_QUOTA = 100
  TRIGGERS
    ON 80 PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = rm_dev;
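The trigger logic of the monitor above can be sketched as a small Python function: given a quota and the credits used so far, it returns every action whose percentage threshold has been reached. This is an illustrative model of the thresholds only, and the function name is hypothetical.

```python
def monitor_actions(credit_quota, used_credits, triggers):
    """Return the actions whose percentage threshold has been reached."""
    percent_used = 100 * used_credits / credit_quota
    return [
        action
        for threshold, action in sorted(triggers.items())
        if percent_used >= threshold
    ]


# Mirrors rm_dev: 100-credit quota, notify at 80%, suspend at 100%.
RM_DEV_TRIGGERS = {80: "NOTIFY", 100: "SUSPEND"}

at_50 = monitor_actions(100, 50, RM_DEV_TRIGGERS)
at_85 = monitor_actions(100, 85, RM_DEV_TRIGGERS)
at_100 = monitor_actions(100, 100, RM_DEV_TRIGGERS)
```

Note that SUSPEND lets running statements finish before the warehouse stops; SUSPEND_IMMEDIATE is the trigger action that cancels them.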

Decision criteria

| Situation | Choice |
|---|---|
| BI / dashboards | Snowflake + dbt |
| Open lakehouse | Iceberg + Snowflake/Databricks |
| Spark-heavy ML | Databricks |
| GCP-native | BigQuery |
| Sub-second OLAP | ClickHouse / Druid |
| Small data (<100 GB) | Postgres + DuckDB |

Default: Snowflake + dbt + Iceberg (open + managed).

🔗 Graph

🤖 Using LLMs

When to use: SQL tuning suggestions, dbt model scaffolding, Cortex function selection. When not to: running LLM-generated queries directly against production. Run EXPLAIN and a governance review first.

Anti-patterns

  • Always-on warehouse: AUTO_SUSPEND is not set, so costs explode.
  • SELECT * on a wide table: throws away the benefit of columnar storage.
  • One huge warehouse: no workload isolation, so ETL and BI contend for the same compute.
  • No clustering on a huge table: partition pruning cannot work, so queries fall back to full scans.
  • Copying data instead of using Data Share: incurs governance and cost penalties.
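The clustering anti-pattern can be made concrete with a toy pruning model in Python. It assumes that each micro-partition carries min/max metadata for a column and that a range filter can skip any partition whose range does not overlap the filter. The column name order_day and the partition layouts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMeta:
    """Min/max of one column for one micro-partition (toy metadata)."""
    min_value: int
    max_value: int


def partitions_to_scan(parts, lo, hi):
    """Count partitions whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for p in parts if p.max_value >= lo and p.min_value <= hi)


# Well clustered: each partition covers a narrow, non-overlapping range.
clustered = [PartitionMeta(d, d + 9) for d in range(0, 100, 10)]
# Poorly clustered: every partition spans the whole value range.
scattered = [PartitionMeta(0, 99) for _ in range(10)]

# Filter: WHERE order_day BETWEEN 20 AND 29
scan_clustered = partitions_to_scan(clustered, 20, 29)
scan_scattered = partitions_to_scan(scattered, 20, 29)
```

With the clustered layout the filter touches a single partition. With the scattered layout all ten must be read, which is the full-scan behavior described in the list.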

🧪 Verification / Duplicates

  • Verified (Snowflake docs 2026; Dageville et al. SIGMOD 2016; Snowflake: The Definitive Guide 2nd ed).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (architecture + 9 patterns) |