---
id: wiki-2026-0508-mapreduce
title: MapReduce
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [맵리듀스, Hadoop MR, Map-Reduce]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [distributed, big-data, parallel, hadoop, batch]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: hadoop-spark
---

# MapReduce

## One-line summary

> **"split → map → shuffle → reduce"**. MapReduce (Dean & Ghemawat, Google, 2004) is a functional-style programming model for large-scale batch processing. As of 2026, raw Hadoop MR is legacy; Spark, Flink, BigQuery, and Beam are the successor standards.

## Core

### Four phases

- **Split**: input → fixed-size shards (HDFS block, 64-128 MB).
- **Map**: (k1, v1) → list[(k2, v2)]. Stateless and parallelizable.
- **Shuffle/Sort**: all pairs with the same k2 are routed to the same reducer and grouped.
- **Reduce**: (k2, list[v2]) → list[(k3, v3)].

### Design principles

- **Data locality**: move code to the data, not data to the code.
- **Fault tolerance**: re-execute failed tasks, which requires idempotent map and reduce functions.
- **Speculative execution**: run backup copies of slow tasks and keep whichever finishes first.
- **Immutable inputs**: any task can be re-run safely.

### Applications

1. Log analysis / web indexing (original use case).
2. ETL pipelines.
3. ML feature aggregation.
4. Data warehouse build.

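
The Shuffle/Sort phase described above can be sketched as hash partitioning across several reducers. This is a minimal single-process illustration, not Hadoop's implementation: the function names `partition` and `shuffle` are hypothetical, and `zlib.crc32` is an assumed stand-in for whatever stable hash a real framework uses.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Route a key to a reducer index using a process-independent hash."""
    # zlib.crc32 gives the same value in every process. Python's built-in
    # hash() is salted per process for str keys, so separate workers
    # could disagree on where a key belongs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group (k2, v2) pairs into one {key: [values]} bucket per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, as in Shuffle/Sort.
    return [dict(sorted(bucket.items())) for bucket in buckets]
```

Because the reducer index depends only on the key, every value for a given k2 lands in exactly one bucket, which is what lets each reducer run independently of the others.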
## 💻 Patterns

### Word count (canonical)

```python
from itertools import groupby


def map_phase(doc_id, text):
    # Emit (word, 1) for every token in the document.
    for word in text.split():
        yield (word.lower(), 1)


def reduce_phase(word, counts):
    yield (word, sum(counts))


def mapreduce(docs):
    # Map: docs is an iterable of (doc_id, text) pairs.
    pairs = [kv for did, t in docs for kv in map_phase(did, t)]
    # Shuffle: sort by key, then group the values per key.
    pairs.sort(key=lambda x: x[0])
    grouped = {k: [v for _, v in g]
               for k, g in groupby(pairs, key=lambda x: x[0])}
    # Reduce
    return dict(kv for k, vs in grouped.items() for kv in reduce_phase(k, vs))
```

### Combiner (local reduce)

```python
from collections import defaultdict


def map_with_combiner(doc_id, text):
    # Pre-aggregate counts inside the mapper before emitting.
    local = defaultdict(int)
    for word in text.split():
        local[word.lower()] += 1
    for w, c in local.items():
        yield (w, c)  # reduces the volume shuffled over the network
```

### Spark RDD equivalent

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///logs/*.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w.lower(), 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/wc")
```

### Inverted index

```python
def map_idx(doc_id, text):
    # Lowercase before deduplicating so "The" and "the" emit one pair per doc.
    for word in {w.lower() for w in text.split()}:
        yield (word, doc_id)


def reduce_idx(word, doc_ids):
    yield (word, sorted(set(doc_ids)))
```

### Secondary sort

```python
# Composite key (year, temp): the framework sorts by the full key,
# but partitions and groups by year only.
def map_temp(line):
    parts = line.split(",")
    year, temp = parts[0], int(parts[1])
    yield ((year, temp), None)  # emit -temp instead for descending order


def partitioner(key, num_reducers):
    # Partition by year only so one reducer receives every record for a year.
    # Python salts hash() for str per process; use a stable hash across workers.
    return hash(key[0]) % num_reducers


def grouping_comparator(a, b):
    # Compare by year only; returns -1, 0, or 1.
    return (a[0] > b[0]) - (a[0] < b[0])
```

### Join (reduce-side)

```python
def map_users(row):
    yield (row["user_id"], ("user", row))


def map_orders(row):
    yield (row["user_id"], ("order", row))


def reduce_join(uid, tagged):
    tagged = list(tagged)  # materialize: the values may be a one-shot iterator
    user = next((r for tag, r in tagged if tag == "user"), None)
    if user is None:
        return  # inner join: drop orders that have no matching user
    for tag, r in tagged:
        if tag == "order":
            yield {**user, **r}
```

## Decision criteria

| Situation | Approach |
|---|---|
| Batch ETL on TB+ | Spark (Hadoop MR is legacy) |
| Streaming | Flink / Spark Structured Streaming |
| SQL-shaped query | BigQuery / Athena / Presto |
| Cross-cloud portability | Apache Beam |
| Educational | Raw MR pseudocode |

**Default**: Spark for new projects; keep Hadoop MR for legacy maintenance only.

## 🔗 Graph

- Parent: [[Distributed Systems]] · [[Parallel-Computing]]
- Variants: [[Spark]] · [[Apache Flink]]
- Applications: [[Data Pipeline]] · [[ETL]]

## 🤖 LLM usage

**When**: pipeline design review, Spark migration guides, query optimization.
**When not**: real-time, low-latency workloads (wrong paradigm).

## ❌ Anti-patterns

- **Many small files**: overloads the HDFS NameNode, which holds metadata for every file and block in memory. Compact small files first.
- **Skewed keys**: one reducer becomes a hotspot. Mitigate with key salting or a combiner.
- **Stateful map**: breaks idempotency, so re-executed tasks can produce different output and fault recovery fails.
- **Re-implementing SQL**: use BigQuery or Spark SQL instead.

## 🧪 Verification / Duplicates

- Verified (Dean & Ghemawat OSDI 2004, Spark NSDI 2012, Hadoop docs 3.x).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: word-count, Spark, secondary-sort, and join patterns |