| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-mapreduce | MapReduce | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
MapReduce
In one line
"split → map → shuffle → reduce." MapReduce (Dean & Ghemawat, Google, 2004) is a functional-programming-style model for large-scale batch processing. From a 2026 perspective, raw Hadoop MR is legacy; Spark, Flink, BigQuery, and Beam are the successor standards.
Core concepts
The 4 phases
- Split: input → fixed-size shards (HDFS block 64-128MB).
- Map: (k1, v1) → list[(k2, v2)]. Stateless, parallelizable.
- Shuffle/Sort: same k2 grouped to same reducer.
- Reduce: (k2, list[v2]) → list[(k3, v3)].
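The four phases above can be sketched as a single-process driver. This is a minimal simulation under stated assumptions, not how Hadoop implements them: `run_mapreduce`, `length_mapper`, and `collect_reducer` are hypothetical names, and partitioning by `crc32` of the key is just one stable choice.

```python
from collections import defaultdict
import zlib

def run_mapreduce(records, mapper, reducer, num_reducers=2):
    """Single-process simulation of split -> map -> shuffle/sort -> reduce."""
    # Map: apply the mapper to every (k1, v1) record independently.
    # Shuffle: route each (k2, v2) pair to a partition chosen from k2 alone,
    # so all values for one key end up at the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            p = zlib.crc32(repr(k2).encode()) % num_reducers
            partitions[p][k2].append(v2)
    # Sort + Reduce: each partition processes its keys in sorted order.
    output = []
    for part in partitions:
        for k2 in sorted(part):
            output.extend(reducer(k2, part[k2]))
    return output

def length_mapper(doc_id, text):
    # (k1, v1) = (doc_id, text) -> list of (k2, v2) = (word length, word)
    for word in text.split():
        yield (len(word), word)

def collect_reducer(length, words):
    # (k2, list[v2]) -> (k3, v3) = (length, distinct words of that length)
    yield (length, sorted(set(words)))

records = [("d1", "map and reduce"), ("d2", "shuffle and sort")]
result = dict(run_mapreduce(records, length_mapper, collect_reducer))
```

The mapper never sees more than one record and the reducer never sees more than one key, which is what lets a real framework run both in parallel.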
Design principles
- Data locality: move the code to the data, not the data to the code.
- Fault tolerance: re-execute failed tasks (requires idempotent map/reduce functions).
- Speculative execution: launch backup copies of slow tasks (stragglers) and take whichever finishes first.
- Immutable inputs: re-runnable.
Applications
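Why idempotent tasks and immutable inputs make re-execution safe can be shown with a small retry loop. This is a sketch, not a scheduler: `run_task_with_retries` and `flaky_count` are hypothetical names, and a `RuntimeError` stands in for a worker crash.

```python
def run_task_with_retries(task, split, max_attempts=3):
    """Re-run a failed task from its immutable input split."""
    last_error = None
    for _attempt in range(max_attempts):
        try:
            # The task's result depends only on the split, so a retry after
            # a crash cannot double-count or corrupt the output.
            return task(split)
        except RuntimeError as err:  # stand-in for a worker failure
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

# Failure injection only: this counter simulates a crash on the first call.
attempts = {"n": 0}

def flaky_count(split):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated worker crash")
    # The actual computation is pure: count the words in the split.
    return sum(len(line.split()) for line in split)

split = ("a b c", "d e")  # immutable input shard
total = run_task_with_retries(flaky_count, split)
```

If the task wrote to an external counter or database as a side effect, the second attempt would apply that effect twice, which is the failure mode the idempotency requirement rules out.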
- Log analysis / web indexing (original use case).
- ETL pipelines.
- ML feature aggregation.
- Data warehouse build.
💻 Patterns
Word count (canonical)
```python
from collections import defaultdict  # used by the combiner variant
from itertools import groupby

def map_phase(doc_id, text):
    # Emit (word, 1) for every token in the document.
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    yield (word, sum(counts))

def mapreduce(docs):
    # Map: docs is an iterable of (doc_id, text) pairs
    pairs = [kv for did, t in docs for kv in map_phase(did, t)]
    # Shuffle: sort by key so groupby sees each key as one contiguous run
    pairs.sort(key=lambda x: x[0])
    grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda x: x[0])}
    # Reduce
    return dict(kv for k, vs in grouped.items() for kv in reduce_phase(k, vs))
```
Combiner (local reduce)
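```python
from collections import defaultdict

def map_with_combiner(doc_id, text):
    # Pre-aggregate counts inside the mapper before emitting anything.
    local = defaultdict(int)
    for word in text.split():
        local[word.lower()] += 1
    for w, c in local.items():
        yield (w, c)
# Reduces the volume of data sent over the network during shuffle
```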
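To make the shuffle saving concrete, the sketch below counts the pairs each mapper variant would emit for one document. `map_plain` and `map_combined` are hypothetical names for illustration. Local pre-aggregation is safe here because summation is associative and commutative; an operation such as a plain average would not combine correctly this way.

```python
from collections import Counter

def map_plain(doc_id, text):
    # One (word, 1) pair per token.
    for word in text.split():
        yield (word.lower(), 1)

def map_combined(doc_id, text):
    # One (word, count) pair per distinct word in the document.
    yield from Counter(w.lower() for w in text.split()).items()

doc = "to be or not to be"
plain_pairs = list(map_plain("d1", doc))        # 6 pairs cross the network
combined_pairs = list(map_combined("d1", doc))  # 4 pairs cross the network
```

Both variants carry the same total count, so the reducer output is unchanged; only the number of shuffled records drops.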
Spark RDD equivalent
```python
from pyspark import SparkContext

sc = SparkContext()
counts = (sc.textFile("hdfs:///logs/*.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w.lower(), 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/wc")
```
Inverted index
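```python
def map_idx(doc_id, text):
    # Lowercase before deduplicating so "The" and "the" emit a single pair.
    for word in {w.lower() for w in text.split()}:
        yield (word, doc_id)

def reduce_idx(word, doc_ids):
    # Posting list: sorted, distinct document ids containing the word.
    yield (word, sorted(set(doc_ids)))
```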
Secondary sort
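```python
import zlib

# Composite key (year, temp): the framework sorts on the full key, so values
# reach the reducer already ordered by temp within each year.
def map_temp(line):
    parts = line.split(",")
    year, temp = parts[0], int(parts[1])
    yield ((year, temp), None)  # ascending; emit (year, -temp) for descending

def partitioner(key, num_reducers):
    # Partition on year only. crc32 is stable across processes, unlike str hash().
    return zlib.crc32(key[0].encode()) % num_reducers

def grouping_comparator(a, b):
    # Compare on year only so one reduce call sees every temp for that year.
    return (a[0] > b[0]) - (a[0] < b[0])
```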
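The effect of the composite key plus year-only grouping can be checked locally. The sketch below is a single-process simulation, assuming input lines of the form `year,temp`; `secondary_sort` is a hypothetical helper and the readings are made-up sample values.

```python
from itertools import groupby

def secondary_sort(lines):
    """Group readings by year; temps within each year come out ascending."""
    # Composite keys: the framework would sort on (year, temp).
    keys = []
    for line in lines:
        year, temp = line.split(",")
        keys.append((year, int(temp)))
    keys.sort()
    # Group on year only, mimicking the grouping comparator.
    return {year: [t for _, t in group]
            for year, group in groupby(keys, key=lambda k: k[0])}

readings = ["1950,22", "1949,78", "1950,-11", "1949,111", "1950,0"]
by_year = secondary_sort(readings)
```

The reducer never sorts anything itself: ordering comes from the shuffle, which is the point of the pattern when a year's values do not fit in reducer memory.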
Join (reduce-side)
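```python
def map_users(row):
    yield (row["user_id"], ("user", row))

def map_orders(row):
    yield (row["user_id"], ("order", row))

def reduce_join(uid, tagged):
    tagged = list(tagged)  # materialize: the values are scanned twice
    user = next((r for tag, r in tagged if tag == "user"), None)
    if user is None:
        return  # inner join: drop orders without a matching user
    for tag, r in tagged:
        if tag == "order":
            yield {**user, **r}
```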
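A local run of the reduce-side join shows what each reducer call receives and emits. This is a sketch assuming inner-join semantics and at most one user row per `user_id`; `reduce_side_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    """Inner join of users and orders on user_id, simulated in one process."""
    # Map + shuffle: tag each row with its source and group by the join key.
    groups = defaultdict(list)
    for row in users:
        groups[row["user_id"]].append(("user", row))
    for row in orders:
        groups[row["user_id"]].append(("order", row))
    # Reduce: pair the user row with each of that user's orders.
    joined = []
    for uid, tagged in groups.items():
        user = next((r for tag, r in tagged if tag == "user"), None)
        if user is None:
            continue  # order with no matching user: dropped by an inner join
        joined.extend({**user, **r} for tag, r in tagged if tag == "order")
    return joined

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "lamp"},
]
rows = reduce_side_join(users, orders)
```

User 2 has no orders and order 3 has no user, so neither appears in the output.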
Decision criteria
| Situation | Approach |
|---|---|
| Batch ETL on TB+ | Spark (Hadoop MR is legacy) |
| Streaming | Flink / Spark Structured Streaming |
| SQL-shaped query | BigQuery / Athena / Presto |
| Cross-cloud portability | Apache Beam |
| Educational | Raw MR pseudocode |
Default: Spark for new projects; keep Hadoop MR only for maintaining legacy jobs.
🔗 Graph
- Parent: Distributed Systems · Parallel-Computing
- Variants: Spark · Apache Flink
- Applications: Data Pipeline · ETL
🤖 LLM usage
When to use: pipeline design review, Spark migration guides, query optimization. When not to use: real-time, low-latency workloads, where batch MapReduce is the wrong paradigm.
❌ Anti-patterns
- Many small files: overloads the HDFS NameNode, which tracks every file and block in memory. Compaction is required.
- Skewed keys: one reducer becomes a hotspot; mitigate with salting or a combiner.
- Stateful map: breaks idempotency, so fault recovery fails.
- Re-implementing SQL: use BigQuery or Spark SQL instead.
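The salting fix for skewed keys can be sketched as a two-stage count. This is a single-process illustration, assuming an aggregation that can be merged from partial results (a sum here); `NUM_SALTS` and `salted_count` are hypothetical names.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical fan-out for hot keys

def salted_count(words, rng):
    # Stage 1: spread each key over NUM_SALTS sub-keys and count partially,
    # so a hot key is shared by up to NUM_SALTS reducers instead of one.
    partial = defaultdict(int)
    for word in words:
        partial[(word, rng.randrange(NUM_SALTS))] += 1
    # Stage 2: strip the salt and merge the partial counts.
    totals = defaultdict(int)
    for (word, _salt), count in partial.items():
        totals[word] += count
    return dict(totals), len(partial)

words = ["the"] * 1000 + ["rare"] * 3
totals, num_subkeys = salted_count(words, random.Random(0))
```

The cost is a second aggregation pass, so salting is usually applied only to keys known to be hot.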
🧪 Verification / Duplicates
- Verified (Dean & Ghemawat OSDI 2004, Spark NSDI 2012, Hadoop docs 3.x).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: word-count, Spark, secondary-sort, and join patterns |