---
id: wiki-2026-0508-snowflake-data-warehousing
title: Snowflake Data Warehousing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Snowflake, Snowflake DW, Snowflake Cloud Data Platform]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [database, data-warehouse, cloud, analytics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: sql
  framework: snowflake
---
# Snowflake Data Warehousing
## In one line
> **"Storage is separated; compute is elastic."** Snowflake is a cloud data warehouse built on a multi-cluster, shared-data architecture: columnar micro-partition storage, virtual warehouses, zero-copy clone, Time Travel, and native Iceberg support (2026). It competes with the big three: Databricks, BigQuery, and Redshift.
## Core concepts
### Architecture (3 layers)
- **Storage**: data lives in S3, GCS, or Azure Blob Storage as compressed, columnar micro-partitions (FDN format), each holding 50–500 MB of uncompressed data.
- **Compute (Virtual Warehouses)**: independent compute clusters sized X-Small to 6X-Large, billed per second (60-second minimum each time a warehouse starts).
- **Cloud Services**: metadata, query optimization, and authentication; the stateless "brain" of the platform.
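The per-second billing of the compute layer can be sketched with a little arithmetic. This is a rough estimator, not an official pricing tool: it assumes standard warehouses whose credits per hour double with each size (1 for X-Small up to 512 for 6X-Large) and a 60-second minimum per start. The helper name `credits_for_run` is hypothetical.

```python
# Credits per hour for a standard warehouse: doubles with each size step.
SIZES = ["XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE",
         "2XLARGE", "3XLARGE", "4XLARGE", "5XLARGE", "6XLARGE"]
CREDITS_PER_HOUR = {size: 2 ** i for i, size in enumerate(SIZES)}

MIN_BILLED_SECONDS = 60  # each start or resume is billed for at least 60 seconds


def credits_for_run(size: str, running_seconds: float, clusters: int = 1) -> float:
    """Estimate credits for one continuous run of a warehouse (hypothetical helper)."""
    billed_seconds = max(running_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * clusters * billed_seconds / 3600


if __name__ == "__main__":
    # A 10-second query on a freshly resumed MEDIUM warehouse is billed as 60 s.
    print(credits_for_run("MEDIUM", 10))
    # One full hour on MEDIUM.
    print(credits_for_run("MEDIUM", 3600))
```

The estimate makes the trade-off visible: a larger warehouse costs more per second, so it only pays off when it shortens the run roughly in proportion.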
### Key features
- **Zero-copy clone**: instant DB/schema/table copy via metadata.
- **Time Travel**: query data as of a past point in time; retention defaults to 1 day and extends up to 90 days on Enterprise Edition.
- **Streams + Tasks**: CDC + scheduled SQL = native pipeline.
- **Snowpark**: Python/Scala/Java in-DB compute.
- **Iceberg tables (2026)**: external open-table format.
- **Cortex AI**: built-in LLM functions.
### Use cases
1. Analytical workloads (OLAP, BI).
2. Data sharing (Secure Data Share — no copy).
3. ELT with dbt.
4. ML feature engineering (Snowpark + Cortex).
## 💻 Patterns
### Warehouse sizing & auto-suspend
```sql
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4
SCALING_POLICY = 'STANDARD';
```
### Copy from S3 (bulk load)
```sql
CREATE STAGE my_stage URL='s3://bucket/path/' STORAGE_INTEGRATION = my_int;
COPY INTO orders
FROM @my_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = 'CONTINUE';
```
### Zero-copy clone for testing
```sql
-- Instant, and no extra storage until the clone diverges (copy-on-write)
CREATE DATABASE prod_clone CLONE prod;
-- Common pattern for dbt CI
```
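Why a clone is instant and initially free can be illustrated with a toy model of copy-on-write over immutable partitions. This is a conceptual sketch only, not Snowflake's implementation; the `PartitionStore` and `Table` classes are hypothetical.

```python
import itertools


class PartitionStore:
    """Immutable partitions shared by every table that references them."""

    def __init__(self):
        self._partitions = {}
        self._ids = itertools.count(1)

    def add(self, rows):
        pid = next(self._ids)
        self._partitions[pid] = tuple(rows)  # never mutated after creation
        return pid

    def rows(self, pid):
        return self._partitions[pid]

    def __len__(self):
        return len(self._partitions)


class Table:
    def __init__(self, store, partition_ids=()):
        self.store = store
        self.partition_ids = list(partition_ids)

    def insert(self, rows):
        self.partition_ids.append(self.store.add(rows))

    def clone(self):
        # Metadata-only: copy the list of partition ids, not the data.
        return Table(self.store, self.partition_ids)

    def update_partition(self, index, new_rows):
        # Copy-on-write: store a new partition and repoint only this table.
        self.partition_ids[index] = self.store.add(new_rows)

    def scan(self):
        return [row for pid in self.partition_ids for row in self.store.rows(pid)]


if __name__ == "__main__":
    store = PartitionStore()
    prod = Table(store)
    prod.insert([1, 2])
    prod.insert([3, 4])
    clone = prod.clone()                 # no new partitions are written
    clone.update_partition(0, [10, 20])  # only now does storage grow
    print(prod.scan(), clone.scan(), len(store))
```

In this model the clone costs nothing until it diverges, and changes to the clone never touch the partitions the source still points at.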
### Time travel + undrop
```sql
SELECT * FROM orders AT (OFFSET => -60*5); -- 5 minutes ago
SELECT * FROM orders BEFORE (STATEMENT => '01a...');
UNDROP TABLE orders; -- works only within the retention period
```
### Streams + Tasks (CDC pipeline)
```sql
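CREATE STREAM orders_stream ON TABLE orders;
-- The WHEN clause skips a scheduled run when the stream has no new changes.
-- SELECT * on a stream also returns the METADATA$ACTION, METADATA$ISUPDATE, and
-- METADATA$ROW_ID columns, so orders_silver must have matching columns.
CREATE TASK orders_etl
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO orders_silver
  SELECT *, CURRENT_TIMESTAMP() AS ingest_ts
  FROM orders_stream;
-- Tasks are created suspended; resume to start the schedule.
ALTER TASK orders_etl RESUME;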
```
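The stream-plus-task pattern above relies on the stream holding an offset into the table's change history: consuming the stream in a committed DML statement advances the offset, so the next run sees only newer changes. A toy model of that behaviour, with hypothetical class and function names, might look like this:

```python
class SourceTable:
    """Append-only change log standing in for a table's version history."""

    def __init__(self):
        self.changes = []  # each entry: (action, row)

    def insert(self, row):
        self.changes.append(("INSERT", row))


class Stream:
    """Tracks an offset into the source table's change log."""

    def __init__(self, table):
        self.table = table
        self.offset = len(table.changes)  # a new stream starts at the current version

    def has_data(self):
        return self.offset < len(self.table.changes)

    def consume(self):
        """Return pending changes and advance the offset, as a committed DML would."""
        pending = self.table.changes[self.offset:]
        self.offset = len(self.table.changes)
        return pending


def run_task(stream, target):
    """One scheduled run: skip when the stream is empty, else load the new rows."""
    if not stream.has_data():
        return 0
    rows = [row for action, row in stream.consume() if action == "INSERT"]
    target.extend(rows)
    return len(rows)


if __name__ == "__main__":
    orders = SourceTable()
    stream = Stream(orders)
    silver = []
    orders.insert({"order_id": 1})
    print(run_task(stream, silver))  # loads the pending row
    print(run_task(stream, silver))  # nothing new, so the run is skipped
```

The model ignores updates and deletes, which a real stream also records.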
### Snowpark Python (in-DB compute)
```python
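from snowflake.snowpark import Session, functions as F

# cfg: dict of connection parameters (account, user, warehouse, ...), defined elsewhere
sess = Session.builder.configs(cfg).create()
# Total order amount per customer, counting only orders above 100
df = (
    sess.table('orders')
    .filter(F.col('amount') > 100)
    .group_by('customer_id')
    .agg(F.sum('amount').alias('total'))
)
# The DataFrame is lazy; the write triggers execution inside Snowflake
df.write.save_as_table('top_customers', mode='overwrite')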
```
### Cortex AI (LLM in SQL)
```sql
SELECT order_id,
SNOWFLAKE.CORTEX.SUMMARIZE(review_text) AS summary,
SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment
FROM reviews;
-- Free-text classify
SELECT SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
ticket_body,
['billing','technical','refund','other']
) FROM tickets;
```
### Iceberg external table (2026)
```sql
CREATE ICEBERG TABLE events
CATALOG = my_glue
EXTERNAL_VOLUME = my_s3_vol
CATALOG_TABLE_NAME = 'analytics.events';
-- Snowflake, Spark, and Trino all read the same data.
```
### Cost optimization (Resource Monitor)
```sql
CREATE RESOURCE MONITOR rm_dev
WITH CREDIT_QUOTA = 100
TRIGGERS
ON 80 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = rm_dev;
```
## Decision criteria
| Situation | Choice |
|---|---|
| BI / dashboards | Snowflake + dbt |
| Open lakehouse | Iceberg + Snowflake/Databricks |
| Spark-heavy ML | Databricks |
| GCP-native | BigQuery |
| Sub-second OLAP | ClickHouse / Druid |
| Tiny data <100GB | Postgres + DuckDB |
**Default**: Snowflake + dbt + Iceberg (open + managed).
## 🔗 Graph
- Parent: [[Data Warehouse]] · [[Cloud Native]]
- Variant: [[ClickHouse]]
- Application: [[Feature Store]]
- Adjacent: [[Apache Iceberg]] · [[dbt]] · [[Principles of Data Connect]]
## 🤖 LLM usage
**When**: SQL tuning suggestions, dbt model scaffolding, Cortex function selection.
**When not**: running production queries directly. Go through EXPLAIN and a governance review first.
## ❌ Anti-patterns
- **Always-on warehouse**: leaving AUTO_SUSPEND unset makes costs explode.
- **SELECT * on wide table**: throws away the benefit of columnar storage.
- **One huge warehouse**: no workload isolation, so ETL and BI contend for the same compute.
- **No clustering on huge table**: partition pruning stops working and queries fall back to full scans.
- **Copy data instead of Data Share**: governance and cost penalty.
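The clustering anti-pattern is easiest to see with a small simulation of min/max pruning. The sketch below is a simplified model, not Snowflake's optimizer: each partition records the minimum and maximum of one column, and a range query scans only the partitions whose range overlaps the predicate. The function names are hypothetical.

```python
def build_partitions(values, rows_per_partition):
    """Split values into fixed-size partitions and record min/max metadata."""
    parts = []
    for i in range(0, len(values), rows_per_partition):
        chunk = values[i:i + rows_per_partition]
        parts.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return parts


def partitions_scanned(parts, lo, hi):
    """Count partitions whose [min, max] range overlaps lo <= value <= hi."""
    return sum(1 for p in parts if p["max"] >= lo and p["min"] <= hi)


if __name__ == "__main__":
    # Well clustered: rows arrive sorted, so each partition covers a narrow range.
    clustered = build_partitions(list(range(10_000)), 100)

    # Poorly clustered: every partition holds values spread across the whole domain.
    scattered_values = [v for start in range(100) for v in range(start, 10_000, 100)]
    scattered = build_partitions(scattered_values, 100)

    print(partitions_scanned(clustered, 500, 599))  # → 1
    print(partitions_scanned(scattered, 500, 599))  # → 100
```

Both layouts hold the same 10,000 values in 100 partitions; only the physical ordering differs, and that alone decides whether the query reads 1% or 100% of the data.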
## 🧪 Verification / duplicates
- Verified (Snowflake docs 2026; Dageville et al. SIGMOD 2016; *Snowflake: The Definitive Guide* 2nd ed).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — full content (architecture + 9 patterns) |