id: db-clickhouse-olap
title: ClickHouse: OLAP / columnar / fast aggregation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: database, clickhouse, olap, analytics, vibe-coding
tech_stack: language: SQL / ClickHouse; applicable_to: Backend
applied_in:
aliases: ClickHouse, OLAP, columnar, MergeTree, materialized view, aggregating

ClickHouse

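A columnar database for analytics, metrics, and logs. A GROUP BY over billions of rows can finish in seconds, which a row store such as Postgres generally cannot match for analytical workloads. The trade-off: it handles updates and small single-row operations poorly, so use it for analytics only.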

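📖 Key concepts

  • Columnar: data is stored per column, so GROUP BY and aggregates read only the columns they need.
  • MergeTree: the standard engine family. Data is kept sorted by the ORDER BY key (often time-based) and compressed automatically.
  • Materialized view: works like an insert trigger. Each inserted block is transformed and written to a target table, so aggregates are precomputed.
  • Distributed: sharding is built in through the Distributed engine.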
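
To make the last bullet concrete, here is a minimal sketch of a Distributed table layered over a per-shard local table. The cluster name `example_cluster` and the table name `events_dist` are hypothetical. The sketch assumes the cluster is defined in the server's `remote_servers` configuration and that a local `events` table (such as the one defined under "Table (MergeTree)") exists on every shard.

```sql
-- Assumption: `example_cluster` is defined under remote_servers in the server config,
-- and a local table named `events` already exists on every shard.
CREATE TABLE events_dist AS events
ENGINE = Distributed(
    'example_cluster',              -- hypothetical cluster name
    currentDatabase(),              -- database that holds the local table
    'events',                       -- local table on each shard
    cityHash64(toString(user_id))   -- sharding key: keeps one user's rows on one shard
);

-- Reads fan out to all shards; partial results are merged on the initiating node.
SELECT event, count() AS cnt
FROM events_dist
GROUP BY event;
```

Sharding by a hash of `user_id` is a design choice rather than a requirement: it keeps per-user queries (funnels, distinct users) local to one shard, at the cost of possible skew if a few users dominate traffic.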

💻 코드 패턴

Table (MergeTree)

CREATE TABLE events (
    ts        DateTime64(3),
    event     LowCardinality(String),
    user_id   UUID,
    country   LowCardinality(String),
    revenue   Decimal64(2),
    metadata  Map(String, String)
)
ENGINE = MergeTree()
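ORDER BY (event, ts, user_id)          -- sort key (also the primary index)
PARTITION BY toYYYYMM(ts)              -- monthly partitions
TTL toDateTime(ts) + INTERVAL 90 DAY;  -- rows older than 90 days are removed during merges
                                       -- (cast to DateTime: some versions reject a DateTime64 TTL expression)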
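
Because the table is sorted by `(event, ts, user_id)`, a filter on the leading key columns lets ClickHouse skip most of the data. One way to check this, sketched here on the assumption that the `events` table above exists and holds data, is `EXPLAIN indexes = 1`, which reports how many parts and granules each index selected.

```sql
-- Filter on the leading sort-key columns: the primary index can narrow the scan
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE event = 'purchase'
  AND ts >= now() - INTERVAL 7 DAY;

-- Filter on a column outside the sort key: the primary index cannot help here
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE country = 'KR';
```

Comparing the selected granule counts of the two plans gives a quick signal on whether the sort key fits the most common filters.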

Insert (batching recommended)

INSERT INTO events VALUES
  (now64(3), 'page_view', generateUUIDv4(), 'KR', 0, {}),
  ...;
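
// HTTP interface: `rows` is an array of objects whose keys match the column names
const res = await fetch('http://clickhouse:8123/', {
  method: 'POST',
  body: 'INSERT INTO events FORMAT JSONEachRow\n' +
    rows.map(r => JSON.stringify(r)).join('\n'),
});
// ClickHouse reports failures through the HTTP status, with the message in the body
if (!res.ok) throw new Error(await res.text());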
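
When the client cannot batch on its own (many small producers, for example), server-side batching through asynchronous inserts is one alternative. The following is a sketch, not a tuned configuration: it uses the `async_insert` and `wait_for_async_insert` settings, and the row values are made-up sample data.

```sql
-- The server buffers small inserts and flushes them together as one part.
-- wait_for_async_insert = 1 makes the client wait until the buffer is flushed,
-- so an error during the flush is still reported to the caller.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2026-05-09 12:00:00.000', 'page_view', '61f0c404-5cb3-11e7-907b-a6006ad3dba0', 'KR', 0, {'source': 'web'});
```

Client-side batches of a thousand rows or more remain the simpler option when the application already collects rows in one place.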

Aggregation (the main strength)

-- daily revenue
SELECT
  toDate(ts) AS day,
  sum(revenue) AS rev,
  count() AS events
FROM events
WHERE ts >= now() - INTERVAL 30 DAY
  AND event = 'purchase'
GROUP BY day
ORDER BY day;

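-- user cohorts: number of users by the week of their first event
SELECT
  cohort_week,
  count() AS users
FROM (
  -- one row per user, tagged with the Monday of the week they first appeared
  SELECT user_id, toMonday(toDate(min(ts))) AS cohort_week
  FROM events
  GROUP BY user_id
)
GROUP BY cohort_week
ORDER BY cohort_week;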

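→ Even at 100M+ rows, queries like these can finish in under a second, depending on hardware and on how well the filter matches the sort key.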

LowCardinality

-- few unique values (status, country) → dictionary encoding + smaller storage
status LowCardinality(String)

Materialized view (automatic aggregation)

CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
ORDER BY (day, event)
AS
SELECT
  toDate(ts) AS day,
  event,
  count() AS cnt,
  sum(revenue) AS rev
FROM events
GROUP BY day, event;

-- every INSERT into events also updates events_daily automatically
-- (only rows inserted after the view is created are picked up)
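
One detail worth spelling out: SummingMergeTree collapses rows with the same key only during background merges, so at any moment the view can hold several partial rows per `(day, event)`. A minimal sketch of a read that stays correct regardless of merge state re-aggregates at query time:

```sql
-- Always sum again when reading: rows with the same (day, event) may not be merged yet
SELECT
  day,
  event,
  sum(cnt) AS cnt,
  sum(rev) AS rev
FROM events_daily
WHERE day >= today() - 30
GROUP BY day, event
ORDER BY day, event;
```

The 30-day filter is only an illustration; the essential part is `sum(...)` with `GROUP BY` on the view's sort key.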

AggregatingMergeTree (for states such as uniq)

CREATE MATERIALIZED VIEW events_daily_users
ENGINE = AggregatingMergeTree()
ORDER BY day
AS
SELECT
  toDate(ts) AS day,
  uniqState(user_id) AS users_state
FROM events
GROUP BY day;

-- merge the partial states at query time
SELECT day, uniqMerge(users_state) AS users
FROM events_daily_users
GROUP BY day;
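
A common variant of the same pattern separates the storage table from the view with `TO`, which makes the state column's type explicit and allows a backfill of historical data. This is a sketch under assumptions: the names `events_daily_users_agg` and `events_daily_users_mv` are hypothetical, and `user_id` is assumed to be `UUID` as in the `events` table above.

```sql
-- Explicit target table: the state column stores partial uniq() states
CREATE TABLE events_daily_users_agg
(
    day         Date,
    users_state AggregateFunction(uniq, UUID)
)
ENGINE = AggregatingMergeTree()
ORDER BY day;

-- The view only transforms new inserts and writes them into the target table
CREATE MATERIALIZED VIEW events_daily_users_mv TO events_daily_users_agg
AS
SELECT
  toDate(ts) AS day,
  uniqState(user_id) AS users_state
FROM events
GROUP BY day;

-- One-off backfill of rows that existed before the view was created
INSERT INTO events_daily_users_agg
SELECT
  toDate(ts) AS day,
  uniqState(user_id) AS users_state
FROM events
GROUP BY day;
```

Reading works the same way as before: `uniqMerge(users_state)` with `GROUP BY day` on the target table. Because distinct-count states ignore repeated users, a backfill that overlaps with live inserts does not inflate the result.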

Funnel (windowFunnel)

SELECT
  user_id,
  windowFunnel(3600)(toDateTime(ts),   -- 1-hour window; cast to DateTime so the window unit is seconds
    event = 'page_view',
    event = 'add_to_cart',
    event = 'purchase'
  ) AS step
FROM events
GROUP BY user_id;

SELECT step, count() FROM (...) GROUP BY step ORDER BY step;  -- (...) stands for the per-user query above
-- step 0 = never viewed, 1 = first step only, 2 = reached step 2, 3 = completed the funnel
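
The per-step counts above are exclusive (a user who purchased is counted only under step 3). For conversion rates it is often more convenient to count users who reached at least each step. The following is a sketch of that rollup; the 7-day filter and the output column names are assumptions made for illustration.

```sql
SELECT
    countIf(step >= 1) AS viewed,
    countIf(step >= 2) AS added_to_cart,
    countIf(step >= 3) AS purchased,
    round(purchased / viewed, 3) AS view_to_purchase  -- overall conversion rate
FROM
(
    -- deepest funnel step each user reached within a 1-hour window
    SELECT
        user_id,
        windowFunnel(3600)(toDateTime(ts),
            event = 'page_view',
            event = 'add_to_cart',
            event = 'purchase'
        ) AS step
    FROM events
    WHERE ts >= now() - INTERVAL 7 DAY
    GROUP BY user_id
);
```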

Probabilistic (uniq, quantile)

SELECT
  toDate(ts) AS day,
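  uniq(user_id) AS dau,                       -- approximate (adaptive sampling); uniqCombined / uniqHLL12 are the HyperLogLog-based variants
  uniqExact(user_id) AS dau_exact,            -- exact, but uses more memory
  quantile(0.95)(latency_ms) AS p95           -- approximate; assumes a latency_ms column, which the events table above does not define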
FROM events
GROUP BY day;

CDC ingestion (Debezium → Kafka → ClickHouse)

CREATE TABLE events_kafka (...) 
ENGINE = Kafka()
SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'events',
  kafka_group_name = 'ch-consumer',
  kafka_format = 'JSONEachRow';

CREATE MATERIALIZED VIEW events_mv TO events
AS SELECT * FROM events_kafka;

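Compression / disk usage

ClickHouse compresses data automatically: LZ4 by default, with ZSTD available as an alternative codec.
Ratios on the order of 10-100x are achievable for time-sorted data with LowCardinality columns, though the result depends heavily on the data.
As a rough guide, 1B rows take on the order of 10-100 GB.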
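
Since the real ratio depends on the data, it is worth measuring rather than assuming. A minimal sketch, assuming the `events` table lives in the current database, reads the per-column sizes from `system.columns`:

```sql
-- Per-column on-disk size and compression ratio for the events table
SELECT
  name,
  formatReadableSize(data_compressed_bytes)   AS compressed,
  formatReadableSize(data_uncompressed_bytes) AS uncompressed,
  round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'events'
ORDER BY data_compressed_bytes DESC;
```

Columns that dominate the compressed size are the natural candidates for a different type (for example LowCardinality) or a different codec.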

TTL / expiry

ALTER TABLE events MODIFY TTL toDateTime(ts) + INTERVAL 90 DAY;
-- rows older than 90 days are deleted automatically during background merges

🤔 Decision criteria

| Data | Recommendation |
| --- | --- |
| Analytics / logs / metrics | ClickHouse |
| OLTP (transactions) | Postgres / MySQL |
| Time-series, small | TimescaleDB |
| Time-series, huge | ClickHouse |
| Real-time analytics | ClickHouse + Kafka |
| Data warehouse | Snowflake / BigQuery (managed) |

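Anti-patterns

  • Row-level UPDATE: a weak spot in ClickHouse. Use the replacement pattern (insert a new version of the row, e.g. with ReplacingMergeTree).
  • Single-row INSERT: creates too many parts. Batch instead (1000+ rows per insert).
  • Using it like an OLTP database: locking and transaction semantics differ. Analytics only.
  • Wrong sort key: every query becomes a full scan. Put frequently filtered columns in the sort key.
  • Partitions too fine-grained: creates too many parts. Monthly or weekly is about right.
  • JOIN between large tables: keep one side small, and put it on the right.
  • No TTL with unbounded growth: the disk fills up.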
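
The replacement pattern from the first bullet can be sketched as follows. The table `user_profiles`, its columns, and the sample values are hypothetical. An "update" is expressed as a plain insert of a newer version, and ReplacingMergeTree keeps the row with the highest `updated_at` per sort key once parts are merged.

```sql
CREATE TABLE user_profiles
(
    user_id    UUID,
    plan       LowCardinality(String),
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)   -- version column: the highest value wins
ORDER BY user_id;                         -- rows with the same user_id replace each other

-- "Update" = insert a newer version of the same key
INSERT INTO user_profiles VALUES
  ('61f0c404-5cb3-11e7-907b-a6006ad3dba0', 'free', '2026-05-01 00:00:00');
INSERT INTO user_profiles VALUES
  ('61f0c404-5cb3-11e7-907b-a6006ad3dba0', 'pro',  '2026-05-09 00:00:00');

-- Deduplication happens during background merges, so reads must not rely on it.
-- Option 1: FINAL deduplicates at query time
SELECT user_id, plan
FROM user_profiles FINAL
WHERE user_id = toUUID('61f0c404-5cb3-11e7-907b-a6006ad3dba0');

-- Option 2: pick the latest version explicitly
SELECT user_id, argMax(plan, updated_at) AS plan
FROM user_profiles
GROUP BY user_id;
```

`FINAL` is the simpler read but adds work at query time; `argMax` is more verbose and tends to fit better inside larger aggregations.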

🤖 Hints for LLM use

  • Batch INSERTs.
  • Always define the sort key, partitioning, and TTL.
  • Precompute with materialized views.

🔗 Related documents