id: wiki-2026-0508-mapreduce
title: MapReduce
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: 맵리듀스, Hadoop MR, Map-Reduce
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: distributed, big-data, parallel, hadoop, batch
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: hadoop-spark

MapReduce

One-line summary

"split → map → shuffle → reduce". MapReduce (Dean & Ghemawat, Google, 2004) is a functional-style programming model for large-scale batch processing. As of 2026, raw Hadoop MR is legacy; Spark, Flink, BigQuery, and Beam are the standard successors.

Core concepts

The four phases

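  • Split: input → fixed-size shards (typically one HDFS block: 64 MB in Hadoop 1.x, 128 MB by default since 2.x).
  • Map: (k1, v1) → list[(k2, v2)]. Stateless, so shards can be processed in parallel.
  • Shuffle/Sort: all values with the same k2 are routed to the same reducer, sorted by key.
  • Reduce: (k2, list[v2]) → list[(k3, v3)].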
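
The four phases above can be sketched as a single-process driver. This is a minimal illustration, not how Hadoop schedules work: `split`, `run_job`, `year_temp`, and `max_temp` are hypothetical names, the shard size is arbitrary, and shards are processed sequentially rather than on separate workers.

```python
from collections import defaultdict

def split(records, shard_size):
    """Split: cut the input list into fixed-size shards."""
    for i in range(0, len(records), shard_size):
        yield records[i:i + shard_size]

def run_job(records, mapper, reducer, shard_size=2):
    # Map: each shard is independent, so a real cluster would run these in parallel.
    intermediate = []
    for shard in split(records, shard_size):
        for k1, v1 in shard:
            intermediate.extend(mapper(k1, v1))
    # Shuffle/Sort: collect every v2 that shares a k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: visit keys in sorted order, one call per key.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def year_temp(_offset, line):
    # (k1, v1) = (line offset, "year,temp") -> (k2, v2) = (year, temp)
    year, temp = line.split(",")
    yield (year, int(temp))

def max_temp(year, temps):
    # (k2, list[v2]) -> (k3, v3) = (year, hottest reading)
    yield (year, max(temps))

records = [(0, "1950,22"), (1, "1950,31"), (2, "1951,18"),
           (3, "1951,25"), (4, "1950,-5")]
hottest = run_job(records, year_temp, max_temp)
```

With the sample records, `hottest` holds one `(year, max)` pair per year, in key order.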

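Design principles

  • Data locality: move the computation to the data, not the data to the computation.
  • Fault tolerance: re-execute failed tasks, which is safe because map and reduce are idempotent.
  • Speculative execution: launch backup copies of slow tasks (stragglers) and take whichever finishes first.
  • Immutable inputs: any task can be re-run on the same input.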
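
Fault tolerance by re-execution can be sketched as follows. This is a simplified illustration under stated assumptions: `run_with_retries` and `FlakyWorker` are hypothetical names, failures are simulated with `RuntimeError`, and a real framework would reschedule the task on a different node instead of looping in place.

```python
def run_with_retries(task, shard, max_attempts=3):
    """Re-execute a failed task on the same immutable shard."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(shard), attempt
        except RuntimeError as err:
            last_error = err  # remember the failure and try again
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

class FlakyWorker:
    """Hypothetical worker that crashes on its first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, shard):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("simulated worker crash")
        # The computation itself depends only on the shard, so a retry
        # produces the same answer as an uninterrupted first run.
        return sum(shard)

shard = (1, 2, 3)  # a tuple, so the input cannot be mutated between attempts
result, attempts = run_with_retries(FlakyWorker(failures=2), shard)
```

The retry is only safe because the task's output is determined by its input alone; a task with side effects could apply them more than once.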

Applications

  1. Log analysis / web indexing (original use case).
  2. ETL pipelines.
  3. ML feature aggregation.
  4. Data warehouse build.

💻 Patterns

Word count (canonical)

from collections import defaultdict
from itertools import groupby

def map_phase(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    yield (word, sum(counts))

def mapreduce(docs):
    # Map
    pairs = [kv for did, t in docs for kv in map_phase(did, t)]
    # Shuffle
    pairs.sort(key=lambda x: x[0])
    grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda x: x[0])}
    # Reduce
    return dict(kv for k, vs in grouped.items() for kv in reduce_phase(k, vs))

Combiner (local reduce)

def map_with_combiner(doc_id, text):
    local = defaultdict(int)
    for word in text.split():
        local[word.lower()] += 1
    for w, c in local.items():
        yield (w, c)
# Pre-aggregating inside the mapper reduces the volume of data shuffled over the network.
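
To make the saving concrete, the sketch below counts the pairs a mapper would send to the shuffle with and without local aggregation. The helper names (`map_plain`, `map_combined`, `total_counts`) and the sample sentence are assumptions for illustration only.

```python
from collections import Counter

def map_plain(text):
    # One (word, 1) pair per token.
    return [(w.lower(), 1) for w in text.split()]

def map_combined(text):
    # One (word, count) pair per distinct word in this document.
    return list(Counter(w.lower() for w in text.split()).items())

def total_counts(pairs):
    # Stand-in for the reduce phase: sum the values per word.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

text = "the cat and the hat and the bat"
plain = map_plain(text)        # 8 pairs cross the network
combined = map_combined(text)  # 5 pairs cross the network
```

Both variants reduce to the same totals. Local aggregation is only valid because addition is associative and commutative; an operation such as a mean needs a different intermediate form, for example (sum, count) pairs.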

Spark RDD equivalent

from pyspark import SparkContext
sc = SparkContext()

counts = (sc.textFile("hdfs:///logs/*.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w.lower(), 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/wc")

Inverted index

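def map_idx(doc_id, text):
    # Lowercase before deduplicating so "The" and "the" emit a single pair.
    for word in {w.lower() for w in text.split()}:
        yield (word, doc_id)

def reduce_idx(word, doc_ids):
    # Posting list: sorted, de-duplicated document ids for this word.
    yield (word, sorted(set(doc_ids)))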

Secondary sort

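# Composite key for sort-within-group
NUM_REDUCERS = 4  # illustrative value; set by the job configuration in practice

def map_temp(line):
    parts = line.split(",")
    year, temp = parts[0], int(parts[1])
    # Negate temp so the framework's ascending key sort yields descending temps.
    yield ((year, -temp), None)

def partitioner(key):
    # Partition by year only. Python salts str hashes per process,
    # so a real job needs a hash that is stable across workers.
    return hash(key[0]) % NUM_REDUCERS

def grouping_comparator(a, b):
    return (a[0] > b[0]) - (a[0] < b[0])  # compare by year only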
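
How the three pieces cooperate can be shown with a local simulation. This is a sketch under assumptions: `secondary_sort` and `partition` are hypothetical names, the temperature is negated in the composite key to get descending order, `zlib.crc32` stands in for a process-stable hash, and `itertools.groupby` plays the role of the grouping comparator.

```python
from itertools import groupby
from zlib import crc32

NUM_REDUCERS = 2  # illustrative value

def map_temp(line):
    year, temp = line.split(",")
    return (year, -int(temp))  # negated so ascending sort gives descending temps

def partition(key):
    # Route by year only, so every reading for a year reaches the same reducer.
    return crc32(key[0].encode()) % NUM_REDUCERS

def secondary_sort(lines):
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for line in lines:
        key = map_temp(line)
        buckets[partition(key)].append(key)
    result = {}
    for bucket in buckets:
        bucket.sort()  # the framework sorts by the full composite key
        # Grouping by year only: one reduce call per year, values already ordered.
        for year, keys in groupby(bucket, key=lambda k: k[0]):
            result[year] = [-neg for _, neg in keys]
    return result

lines = ["1950,22", "1951,18", "1950,31", "1951,25", "1950,-5"]
temps_by_year = secondary_sort(lines)
```

Each year's list comes out hottest first without the reducer sorting anything itself, which is the point of the pattern: the ordering work is pushed into the shuffle.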

Join (reduce-side)

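def map_users(row):
    yield (row["user_id"], ("user", row))

def map_orders(row):
    yield (row["user_id"], ("order", row))

def reduce_join(uid, tagged):
    # Materialize first: reducer values are often a one-shot iterator,
    # and this function scans them twice.
    tagged = list(tagged)
    user = next((r for tag, r in tagged if tag == "user"), None)
    if user is None:
        return  # inner join: no matching user, emit nothing
    for tag, r in tagged:
        if tag == "order":
            yield {**user, **r}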
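
The reduce-side join can be exercised end to end with a small local driver. This is a sketch with made-up sample rows; `reduce_side_join` is a hypothetical helper that folds the map, shuffle, and reduce steps into one function and assumes inner-join semantics.

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    # Map + shuffle: tag each row with its source table and group by user_id.
    groups = defaultdict(list)
    for row in users:
        groups[row["user_id"]].append(("user", row))
    for row in orders:
        groups[row["user_id"]].append(("order", row))
    # Reduce: pair the user row with each of that user's orders.
    joined = []
    for uid in sorted(groups):
        tagged = groups[uid]
        user = next((r for tag, r in tagged if tag == "user"), None)
        if user is None:
            continue  # inner join: drop orders that have no matching user
        joined.extend({**user, **r} for tag, r in tagged if tag == "order")
    return joined

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]
orders = [
    {"user_id": 1, "order_id": "A1"},
    {"user_id": 1, "order_id": "A2"},
    {"user_id": 3, "order_id": "C1"},  # no such user
]
rows = reduce_side_join(users, orders)
```

User 2 has no orders and order C1 has no user, so neither appears in `rows`. Every row for a key passes through the shuffle, which is why this join is expensive when one side is large and skewed.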

Decision criteria

| Situation | Approach |
|---|---|
| Batch ETL on TB+ data | Spark (Hadoop MR is legacy) |
| Streaming | Flink / Spark Structured Streaming |
| SQL-shaped query | BigQuery / Athena / Presto |
| Cross-cloud portability | Apache Beam |
| Educational | Raw MR pseudocode |

Default: Spark for new projects; Hadoop MR only for maintaining legacy systems.

🤖 LLM usage

When to use: pipeline design review, Spark migration guidance, query optimization. When not to use: real-time, low-latency workloads, where batch processing is the wrong paradigm.

Anti-patterns

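  • Many small files: overloads the HDFS NameNode, which keeps metadata for every file and block in memory. Compaction is required.
  • Skewed keys: a single reducer becomes a hotspot; mitigate with key salting or a combiner.
  • Stateful map: breaks idempotency, so re-executed tasks can produce different results and fault recovery fails.
  • Re-implementing SQL: use BigQuery or Spark SQL instead.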
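
Key salting for skewed keys can be sketched as a two-stage aggregation. The names (`salt`, `stage1`, `stage2`), the salt count, and the sample data are assumptions for illustration. Random salts are also common; a deterministic round-robin salt is used here so the example is reproducible.

```python
from collections import Counter

NUM_SALTS = 4  # illustrative: how many reducers share one hot key

def salt(key, record_index):
    # Spread one logical key over NUM_SALTS physical keys.
    return (key, record_index % NUM_SALTS)

def stage1(records):
    # First job: aggregate per salted key, so no reducer sees the whole hot key.
    partial = Counter()
    for i, (key, value) in enumerate(records):
        partial[salt(key, i)] += value
    return partial

def stage2(partial):
    # Second job: strip the salt and merge the few partial sums per key.
    totals = Counter()
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)

records = [("hot", 1)] * 8 + [("cold", 1)] * 2
partial = stage1(records)
totals = stage2(partial)
```

In the first stage the eight "hot" records are split four ways, so the largest single group holds two records instead of eight. The cost is a second pass, which is cheap because it merges at most `NUM_SALTS` partial results per key.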

🧪 Verification / duplicates

  • Verified (Dean & Ghemawat OSDI 2004, Spark NSDI 2012, Hadoop docs 3.x).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: word-count, Spark, secondary-sort, and join patterns |