Files
2nd/10_Wiki/Topics/Coding/CS_MapReduce_Patterns.md
T
2026-05-09 22:47:42 +09:00

7.7 KiB

id, title, category, status, source_trust_level, verification_status, created_at, updated_at, tags, tech_stack, applied_in, aliases
id title category status source_trust_level verification_status created_at updated_at tags tech_stack applied_in aliases
cs-mapreduce-patterns MapReduce / Distributed Compute — Spark / DuckDB / Beam Coding draft B conceptual 2026-05-09 2026-05-09
cs
mapreduce
distributed
vibe-coding
language applicable_to
Python / SQL
Backend
Data
MapReduce
Spark
Beam
dataflow
shuffle
partitioning
distributed compute

MapReduce / Distributed Compute

큰 data set 을 여러 worker 로 분산. Map (변환) → Shuffle (재분배) → Reduce (집계). 모던: Spark / Beam / DuckDB / DataFusion. SQL 가 더 간단할 때가 많음.

📖 핵심 개념

  • Map: 입력 → key-value pairs.
  • Shuffle: key 별 grouping.
  • Reduce: key 당 집계.
  • Skew: hot key 가 worker 하나만 쥐어주면 느려짐.

💻 코드 패턴

Conceptual MapReduce

입력: ['the cat sat', 'the dog ran']

Map → [('the', 1), ('cat', 1), ('sat', 1), ('the', 1), ('dog', 1), ('ran', 1)]

Shuffle → {'the': [1,1], 'cat': [1], 'sat': [1], 'dog': [1], 'ran': [1]}

Reduce → {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1, 'ran': 1}

Spark (PySpark)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wc').getOrCreate()

# RDD 식 (low-level)
rdd = spark.sparkContext.textFile('s3://bucket/logs/*.txt')
counts = (
    rdd
    .flatMap(lambda line: line.split())
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
)
counts.saveAsTextFile('s3://bucket/output/')

# DataFrame (high-level, 권장)
df = spark.read.text('s3://bucket/logs/*.txt')
counts = (
    df.selectExpr('explode(split(value, " ")) as word')
      .groupBy('word').count()
)
counts.write.parquet('s3://bucket/output/')

Spark SQL (가장 간단)

df.createOrReplaceTempView('logs')
spark.sql("""
SELECT word, COUNT(*) as cnt
FROM logs LATERAL VIEW explode(split(value, ' ')) t AS word
GROUP BY word
ORDER BY cnt DESC
LIMIT 100
""").show()

DuckDB (single-node, large)

import duckdb
con = duckdb.connect()

# Parquet 직접 query
result = con.execute("""
SELECT date, region, SUM(amount) as total
FROM 's3://bucket/sales/*.parquet'
GROUP BY date, region
ORDER BY total DESC
""").fetchdf()

→ TB 까지 single node OK. Spark 보다 simple, 빠름.

Apache Beam (portable, runners)

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://bucket/*.txt')
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'Pair' >> beam.Map(lambda w: (w, 1))
     | 'Group' >> beam.CombinePerKey(sum)
     | 'Write' >> beam.io.WriteToText('gs://bucket/out')
    )

→ Beam = code + runner (Dataflow, Flink, Spark) 분리.

Partitioning (parallelism)

# Spark
df.repartition(200, 'date')   # 200 partition by date
df.coalesce(10)                # 줄임 (no shuffle)

# 큰 partition 적게 vs 작은 많이?
# Aim: ~128 MB / partition (Spark default)

Shuffle (가장 비싼 operation)

GroupBy / Join / Distinct = shuffle.

Tips:
- Pre-aggregate before shuffle (combiner)
- Broadcast join (작은 table 모든 worker)
- Bucket-aligned tables (sort-merge join, no shuffle)
# Spark broadcast join
from pyspark.sql.functions import broadcast

big.join(broadcast(small), 'key')   # small 가 모든 worker 로 복제

Skew (불균형)

key 'X' 가 90% rows = worker 하나만 일.

해결:
1. Salting: key + random suffix → 분산
2. Skew join hint
3. 작은 키 따로 처리

# Spark 3
df.hint('skew', 'user_id')
# Salting 예
import random
df = df.withColumn('salt', (rand() * 10).cast('int'))
df.groupBy('key', 'salt').agg(...)
  .groupBy('key').agg(...)  # 다시 합치기

File format

- Parquet: columnar, compress, predicate push-down (default)
- ORC: 비슷
- Avro: row-based + schema (Kafka)
- CSV: 텍스트 — 큰 data 비효율
- JSON: 큰 → 비효율

→ Analytics = Parquet 거의 항상.

Predicate pushdown

-- DuckDB / Spark
SELECT * FROM 's3://b/*.parquet'
WHERE date = '2026-05-09'  -- 파일 metadata 로 skip
  AND region = 'US'         -- column scan

→ 안 읽음. 빠름.

Iceberg / Delta / Hudi (table format)

# Apache Iceberg
spark.sql("""
CREATE TABLE catalog.db.events (
    id bigint, ts timestamp, payload string
) USING iceberg
PARTITIONED BY (days(ts))
""")

# Time travel
spark.read.option('snapshot-id', '12345').table('catalog.db.events')

# Schema evolution
spark.sql('ALTER TABLE catalog.db.events ADD COLUMN region string')

→ Parquet 위 ACID + version + schema 진화.

Ray (modern alternative)

import ray

@ray.remote
def process(chunk):
    return [x * 2 for x in chunk]

ray.init()
data = list(range(10_000))
chunks = [data[i:i+1000] for i in range(0, len(data), 1000)]

futures = [process.remote(c) for c in chunks]
results = ray.get(futures)

→ Spark 보다 일반 Python 친화. ML pipeline 에 강함.

Polars (single-node, modern)

import polars as pl

df = pl.scan_parquet('s3://bucket/*.parquet')
result = (
    df
    .filter(pl.col('date') == '2026-05-09')
    .group_by('user_id')
    .agg(pl.col('amount').sum())
    .collect()  # lazy → eager
)

→ Pandas 보다 10x 빠름 (Rust + Arrow).

Dataflow patterns

- Batch: 큰 데이터, 한 번에 처리 (nightly job)
- Streaming: 실시간 (click events, IoT)
- Windowing: streaming → batch-like (1 분 window)
- Watermark: late event 처리 시점

Beam streaming

(p
 | beam.io.ReadFromKafka(...)
 | beam.WindowInto(beam.window.FixedWindows(60))   # 1 min
 | beam.GroupByKey()
 | beam.io.WriteToBigQuery(...)
)

dbt (SQL-based ETL)

-- models/daily_revenue.sql
{{ config(materialized='incremental') }}

SELECT date, SUM(amount) as revenue
FROM {{ ref('orders') }}
{% if is_incremental() %}
WHERE date > (SELECT MAX(date) FROM {{ this }})
{% endif %}
GROUP BY date

→ Spark / Python 안 써도 됨. SQL → DAG.

ETL vs ELT

ETL (옛): Extract → Transform → Load to warehouse.
ELT (현): Extract → Load (raw) → Transform in warehouse.

ELT = warehouse 가 SQL 강하니 거기서 변환. dbt + Snowflake / BigQuery / DuckDB 가 default.

Job orchestration

- Airflow: 가장 인기, 무거움
- Dagster: 모던, asset-aware
- Prefect: 모던, simple
- Argo Workflows: K8s-native
- Temporal: workflow + business logic
- Cron: 작은 job

Cost

Spark on EMR: 큰 cluster $ — TB 안 넘으면 과해.
DuckDB on single VM: TB 까지 OK $$.
BigQuery: pay per GB scanned $$$.
Snowflake: pay per second compute $$$.

When NOT 분산

< 100 GB: Pandas / Polars / DuckDB (single node).
100 GB - 10 TB: DuckDB / Spark on 1 node.
10 TB+: Spark / BigQuery / Snowflake cluster.

→ "Big data is dead" — 대부분 single node 로 충분.

🤔 의사결정 기준

Size 추천
< 1 GB Pandas / Polars
1-100 GB Polars / DuckDB
100 GB - 10 TB DuckDB on big VM
> 10 TB Spark / BigQuery
Streaming Beam / Flink / Materialize
ML pipeline Ray
SQL preferable dbt + warehouse

안티패턴

  • 모든 거 Spark: 작은 dataset 도 Spark = 느림 + 비싼.
  • CSV in production: parquet 가 10x 빠름.
  • Repartition 너무 많이: shuffle 비싼.
  • Skew 무시: 1 worker 가 다 함.
  • Broadcast 큰 table: OOM.
  • Local file: HDFS / S3 / GCS.
  • dbt 없이 SQL 흩어짐: 종속성 안 보임.

🤖 LLM 활용 힌트

  • Map / Shuffle / Reduce 의 cost 인지.
  • DuckDB / Polars 가 modern (single node 만으로 충분).
  • Parquet + S3 표준.
  • dbt 가 SQL workflow 답.

🔗 관련 문서