[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,24 +2,179 @@
id: wiki-2026-0508-snowflake-data-warehousing
title: Snowflake Data Warehousing
category: 10_Wiki/Topics
status: merged
redirect_to: 데이터_엔지니어링_및_가상_인프라_표준
canonical_id: wiki-2026-0508-001
aliases: []
status: verified
canonical_id: self
aliases: [Snowflake, Snowflake DW, Snowflake Cloud Data Platform]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
tags: [uncategorized]
confidence_score: 0.9
verification_status: applied
tags: [database, data-warehouse, cloud, analytics]
raw_sources: []
last_reinforced: 2026-05-08
last_reinforced: 2026-05-10
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
language: unspecified
framework: unspecified
language: sql
framework: snowflake
---
# Redirect
# Snowflake Data Warehousing
This document has been merged into the canonical document [[데이터_엔지니어링_및_가상_인프라_표준]].
See the link above for all current knowledge and details.
## In one line
> **"Storage separated, compute elastic."** Snowflake is a cloud data warehouse built on a multi-cluster, shared-data architecture: columnar micro-partition storage, virtual warehouses, zero-copy cloning, Time Travel, and native Iceberg support (2026). Its main competitors are Databricks, BigQuery, and Redshift.
## Core concepts
### Architecture (3 layers)
- **Storage**: cloud object storage (S3/GCS/Azure Blob) holding compressed, columnar (FDN) micro-partitions, each covering roughly 50 to 500 MB of uncompressed data.
- **Compute (Virtual Warehouses)**: independent compute clusters sized from X-Small to 6X-Large, billed per second (with a 60-second minimum each time a warehouse starts).
- **Cloud Services**: metadata, query optimization, and authentication; the stateless "brain" of the system.
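
The interplay between micro-partition storage and metadata can be illustrated with a toy model. The sketch below is not Snowflake's implementation; it only assumes that each partition carries min/max statistics for a column, so a filter can skip partitions without reading their rows. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    """Toy stand-in for one micro-partition holding a single column."""
    rows: list

    @property
    def max_value(self):
        # In a real system this statistic lives in metadata, not in the data file.
        return max(self.rows)


def scan_greater_than(partitions, threshold):
    """Return (matching values, number of partitions actually scanned)."""
    matches, scanned = [], 0
    for part in partitions:
        if part.max_value <= threshold:
            continue  # pruned: metadata proves no row can match
        scanned += 1
        matches.extend(v for v in part.rows if v > threshold)
    return matches, scanned


partitions = [
    MicroPartition([5, 12, 30]),
    MicroPartition([45, 60, 80]),
    MicroPartition([110, 150, 90]),
    MicroPartition([200, 240, 310]),
]
matches, scanned = scan_greater_than(partitions, 100)
```

Here only two of the four partitions are read. When values are scattered randomly across partitions, every partition's range overlaps the filter and nothing can be skipped, which is why data layout matters on very large tables.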
### Key features
- **Zero-copy clone**: instant DB/schema/table copy via metadata.
- **Time Travel**: query data as it existed at a past point in time, up to 90 days back on Enterprise Edition (default retention: 1 day).
- **Streams + Tasks**: CDC + scheduled SQL = native pipeline.
- **Snowpark**: Python/Scala/Java in-DB compute.
- **Iceberg tables (2026)**: external open-table format.
- **Cortex AI**: built-in LLM functions.
### Applications
1. Analytical workloads (OLAP, BI).
2. Data sharing (Secure Data Share — no copy).
3. ELT with dbt.
4. ML feature engineering (Snowpark + Cortex).
## 💻 Patterns
### Warehouse sizing & auto-suspend
```sql
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4
SCALING_POLICY = 'STANDARD';
```
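
To get a feel for what sizing and auto-suspend mean for cost, here is a rough estimator in Python. It assumes the commonly documented rate for standard warehouses (1 credit per hour for X-Small, doubling with each size) and the 60-second minimum charge per start; treat both as assumptions to check against current pricing. The function name and size labels are hypothetical.

```python
# Sizes in ascending order; the rate is assumed to double at each step.
SIZES = ["XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE",
         "2XLARGE", "3XLARGE", "4XLARGE", "5XLARGE", "6XLARGE"]
CREDITS_PER_HOUR = {size: 2 ** i for i, size in enumerate(SIZES)}
MIN_BILLED_SECONDS = 60  # assumed minimum charge each time a warehouse starts


def estimate_credits(size, running_seconds, clusters=1):
    """Estimate credits for one continuous run of a warehouse.

    running_seconds should include idle time before auto-suspend kicks in.
    """
    billed = max(running_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * clusters * billed / 3600


fifteen_min_medium = estimate_credits("MEDIUM", 900)             # 1.0 credit
short_query_medium = estimate_credits("MEDIUM", 10)              # billed as 60 s
scaled_out_medium = estimate_credits("MEDIUM", 900, clusters=4)  # 4.0 credits
```

With `MAX_CLUSTER_COUNT = 4`, the worst case for a sustained burst is four times the single-cluster rate, which is the figure to budget against.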
### Copy from S3 (bulk load)
```sql
CREATE STAGE my_stage URL='s3://bucket/path/' STORAGE_INTEGRATION = my_int;
COPY INTO orders
FROM @my_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = 'CONTINUE';
```
### Zero-copy clone for testing
```sql
-- Instant, with no additional storage until the data diverges (copy-on-write)
CREATE DATABASE prod_clone CLONE prod;
-- A common pattern for dbt CI
```
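
The copy-on-write behaviour behind cloning can be modelled in a few lines of Python. This is a conceptual sketch only: it assumes a table is a list of references to immutable partitions, so a clone copies references rather than data. The names `Table`, `rewrite_partition`, and `storage_units` are hypothetical.

```python
class Table:
    """Toy table: an ordered list of references to immutable partitions."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def clone(self):
        # Only the list of references is copied; partition data is shared.
        return Table(self.partitions)

    def rewrite_partition(self, index, new_rows):
        # Copy-on-write: a change creates a new partition for this table only.
        self.partitions[index] = tuple(new_rows)


def storage_units(*tables):
    """Count the distinct partition objects referenced by the given tables."""
    return len({id(p) for t in tables for p in t.partitions})


prod = Table([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
prod_clone = prod.clone()
units_before = storage_units(prod, prod_clone)   # partitions fully shared

prod_clone.rewrite_partition(0, [1, 2, 30])
units_after = storage_units(prod, prod_clone)    # one extra partition stored
```

Storage grows only by the partitions that diverge, and the source table is unaffected by writes to the clone.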
### Time travel + undrop
```sql
SELECT * FROM orders AT (OFFSET => -60*5); -- 5 minutes ago
SELECT * FROM orders BEFORE (STATEMENT => '01a...');
UNDROP TABLE orders; -- only within the retention period
```
### Streams + Tasks (CDC pipeline)
```sql
CREATE STREAM orders_stream ON TABLE orders;
CREATE TASK orders_etl
WAREHOUSE = etl_wh
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
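-- Note: SELECT * on a stream also returns METADATA$ACTION, METADATA$ISUPDATE,
-- and METADATA$ROW_ID, so orders_silver needs matching columns.
INSERT INTO orders_silver
SELECT *, CURRENT_TIMESTAMP() AS ingest_ts
FROM orders_stream;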
ALTER TASK orders_etl RESUME;
```
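
The reason the task above does not reprocess the same rows every five minutes is that a stream tracks an offset, and consuming it in a DML statement advances that offset. The Python sketch below is a deliberately simplified model of that idea (a real stream exposes the net change since the offset, not a raw log); all class and function names are hypothetical.

```python
class ChangeLog:
    """Toy append-only record of rows inserted into a source table."""

    def __init__(self):
        self.changes = []

    def append(self, row):
        self.changes.append(row)


class Stream:
    """Toy stream: remembers how far into the change log it has been consumed."""

    def __init__(self, log):
        self.log = log
        self.offset = len(log.changes)  # only later changes are visible

    def has_data(self):
        return self.offset < len(self.log.changes)

    def consume(self):
        pending = self.log.changes[self.offset:]
        self.offset = len(self.log.changes)  # advance, as a committed DML would
        return pending


def run_task(stream, target):
    """Mimic one scheduled task run; returns the number of rows moved."""
    if not stream.has_data():  # plays the role of the WHEN clause
        return 0
    rows = stream.consume()
    target.extend(rows)
    return len(rows)


orders = ChangeLog()
orders_stream = Stream(orders)
orders_silver = []

orders.append({"order_id": 1})
orders.append({"order_id": 2})
first_run = run_task(orders_stream, orders_silver)
second_run = run_task(orders_stream, orders_silver)
orders.append({"order_id": 3})
third_run = run_task(orders_stream, orders_silver)
```

The second run moves nothing because the offset already points at the end of the log, which is the behaviour the `WHEN` guard relies on to skip empty runs.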
### Snowpark Python (in-DB compute)
```python
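from snowflake.snowpark import Session, functions as F

# cfg is assumed to be a dict of connection parameters (account, user, role, ...) defined elsewhere.
sess = Session.builder.configs(cfg).create()
# Total amount per customer over orders above 100; the query runs inside Snowflake.
df = sess.table('orders') \
    .filter(F.col('amount') > 100) \
    .group_by('customer_id') \
    .agg(F.sum('amount').alias('total'))
# Materialize the result, replacing any existing table.
df.write.save_as_table('top_customers', mode='overwrite')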
```
### Cortex AI (LLM in SQL)
```sql
SELECT order_id,
SNOWFLAKE.CORTEX.SUMMARIZE(review_text) AS summary,
SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment
FROM reviews;
-- Free-text classify
SELECT SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
ticket_body,
['billing','technical','refund','other']
) FROM tickets;
```
### Iceberg external table (2026)
```sql
CREATE ICEBERG TABLE events
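-- Assumes an AWS Glue catalog integration: the Glue database goes in
-- CATALOG_NAMESPACE and the table name in CATALOG_TABLE_NAME.
CATALOG = 'my_glue'
EXTERNAL_VOLUME = 'my_s3_vol'
CATALOG_NAMESPACE = 'analytics'
CATALOG_TABLE_NAME = 'events';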
-- Snowflake, Spark, and Trino can all read the same data.
```
### Cost optimization (Resource Monitor)
```sql
CREATE RESOURCE MONITOR rm_dev
WITH CREDIT_QUOTA = 100
TRIGGERS
ON 80 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = rm_dev;
```
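
The trigger logic of a resource monitor is simple threshold evaluation. The Python sketch below mirrors the two triggers defined above so the behaviour at different usage levels is easy to reason about; it is an illustration, not how Snowflake evaluates monitors internally, and `fired_actions` is a hypothetical helper.

```python
# (threshold percent, action) pairs mirroring the monitor defined above.
TRIGGERS = [(80, "NOTIFY"), (100, "SUSPEND")]


def fired_actions(credit_quota, credits_used, triggers=TRIGGERS):
    """Return the actions whose percentage threshold has been reached."""
    percent_used = 100 * credits_used / credit_quota
    return [action for threshold, action in sorted(triggers)
            if percent_used >= threshold]


at_half = fired_actions(100, 50)     # nothing fires yet
at_85 = fired_actions(100, 85)       # notification only
at_quota = fired_actions(100, 100)   # notification and suspend
```

Placing a notify threshold well below the suspend threshold leaves time to react before workloads are stopped.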
## Decision criteria
| Situation | Choice |
|---|---|
| BI / dashboards | Snowflake + dbt |
| Open lakehouse | Iceberg + Snowflake/Databricks |
| Spark-heavy ML | Databricks |
| GCP-native | BigQuery |
| Sub-second OLAP | ClickHouse / Druid |
| Tiny data <100GB | Postgres + DuckDB |
**Default**: Snowflake + dbt + Iceberg (open + managed).
## 🔗 Graph
- Parent: [[Data Warehouse]] · [[Cloud Native]]
- Variants: [[BigQuery]] · [[Databricks]] · [[Redshift]] · [[ClickHouse]]
- Applications: [[ELT Pattern]] · [[Data Sharing]] · [[Feature Store]]
- Adjacent: [[Apache Iceberg]] · [[dbt]] · [[Snowpark]] · [[Principles of Data Connect]]
## 🤖 LLM usage
**When to use**: SQL tuning suggestions, dbt model scaffolding, Cortex function selection.
**When not to use**: running production queries directly. Require EXPLAIN and a governance review first.
## ❌ Anti-patterns
- **Always-on warehouse**: without AUTO_SUSPEND set, costs balloon.
- **SELECT * on wide table**: throws away the benefit of columnar storage.
- **One huge warehouse**: no workload isolation, so ETL and BI contend for the same compute.
- **No clustering on huge table**: partition pruning cannot work, forcing a full scan.
- **Copy data instead of Data Share**: governance and cost penalty.
## 🧪 Verification / duplicates
- Verified (Snowflake docs 2026; Dageville et al. SIGMOD 2016; *Snowflake: The Definitive Guide* 2nd ed).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — full content (architecture + 9 patterns) |