Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

4.7 KiB

MapReduce

One-line summary

"split → map → shuffle → reduce". MapReduce (Dean & Ghemawat, Google, 2004) is a functional-programming-style model for large-scale batch processing. As of 2026, raw Hadoop MR is legacy; Spark, Flink, BigQuery, and Beam are the standard successors.

Core

The 4 phases

Split: input → fixed-size shards (HDFS block 64-128MB).
Map: (k1, v1) → list[(k2, v2)]. Stateless, parallelizable.
Shuffle/Sort: same k2 grouped to same reducer.
Reduce: (k2, list[v2]) → list[(k3, v3)].
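
The shuffle phase can be sketched as hash partitioning: each intermediate key is routed to one of R reducers, so every value for a given key ends up in the same place. This is a minimal local sketch, not any framework's implementation; `partition` and `shuffle` are hypothetical helpers, and `zlib.crc32` is assumed here only as a convenient stable hash.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Hypothetical helper: map a key to a reducer index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route (k2, v2) pairs to reducers, grouping values by key within each."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, mirroring the sort step.
    return [dict(sorted(b.items())) for b in buckets]

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1)]
reducer_inputs = shuffle(pairs, num_reducers=2)
```

Which reducer a key lands on depends on the hash, but a key never appears on more than one reducer.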

Design principles

Data locality: code → data, not data → code.
Fault tolerance: re-execute failed tasks (idempotent map/reduce).
Speculative execution: launch backup copies of slow tasks (stragglers).
Immutable inputs: re-runnable.
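
The fault-tolerance principle above can be illustrated with a small retry loop: because the input split is immutable and the task is deterministic, re-running a failed task on the same split is safe. This is a sketch under stated assumptions; `run_with_retry` and `make_flaky_task` are hypothetical names, and a `RuntimeError` stands in for a worker crash.

```python
def run_with_retry(task, split, max_attempts=3):
    """Re-run a deterministic task on the same immutable split until it succeeds."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(split)
        except RuntimeError as err:  # stands in for a worker crash
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def make_flaky_task(failures):
    """Hypothetical task that crashes `failures` times, then counts words."""
    state = {"left": failures}  # failure injection only; the word count itself is stateless
    def task(split):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated worker crash")
        return len(split.split())
    return task

split = "a b c d"
result = run_with_retry(make_flaky_task(failures=2), split)
```

The third attempt succeeds and returns the same count a crash-free run would have produced.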

Applications

Log analysis / web indexing (original use case).
ETL pipelines.
ML feature aggregation.
Data warehouse build.

💻 Patterns

Word count (canonical)

from collections import defaultdict
from itertools import groupby

def map_phase(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    yield (word, sum(counts))

def mapreduce(docs):
    # Map
    pairs = [kv for did, t in docs for kv in map_phase(did, t)]
    # Shuffle
    pairs.sort(key=lambda x: x[0])
    grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda x: x[0])}
    # Reduce
    return dict(kv for k, vs in grouped.items() for kv in reduce_phase(k, vs))

Combiner (local reduce)

def map_with_combiner(doc_id, text):
    local = defaultdict(int)
    for word in text.split():
        local[word.lower()] += 1
    for w, c in local.items():
        yield (w, c)
# Reduces the volume of data sent over the network during the shuffle.
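
The function above folds counts inside the mapper itself. A framework-style combiner is usually a separate function applied to each mapper's output before the shuffle. The sketch below assumes a hypothetical `combine` helper; it is valid here only because summation is associative and commutative.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)

def combine(pairs):
    """Hypothetical combiner: pre-sum one mapper's output before the shuffle."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

raw = list(map_phase("d1", "to be or not to be"))
combined = combine(raw)  # fewer pairs leave the mapper than were emitted
```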

Spark RDD equivalent

from pyspark import SparkContext
sc = SparkContext()

counts = (sc.textFile("hdfs:///logs/*.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w.lower(), 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/wc")

Inverted index

def map_idx(doc_id, text):
    # Lowercase before deduplicating so "The" and "the" emit one pair per document.
    for word in {w.lower() for w in text.split()}:
        yield (word, doc_id)

def reduce_idx(word, doc_ids):
    yield (word, sorted(set(doc_ids)))
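
A small local driver makes the inverted-index pattern easier to follow end to end. This is an illustrative sketch: `build_index` is a hypothetical helper that stands in for the framework's shuffle, and the two documents are made-up sample data.

```python
from collections import defaultdict

def map_idx(doc_id, text):
    for word in {w.lower() for w in text.split()}:
        yield (word, doc_id)

def reduce_idx(word, doc_ids):
    yield (word, sorted(set(doc_ids)))

def build_index(docs):
    """Local driver: map, group by word, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs:
        for word, did in map_idx(doc_id, text):
            grouped[word].append(did)
    return dict(kv for w, ids in grouped.items() for kv in reduce_idx(w, ids))

index = build_index([(1, "Big data"), (2, "big cluster")])
```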

Secondary sort

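# Composite key for sort-within-group
NUM_REDUCERS = 4  # illustrative value; set per job

def map_temp(line):
    parts = line.split(",")
    year, temp = parts[0], int(parts[1])
    # Sorts ascending by temp within a year; emit (year, -temp) for descending.
    yield ((year, temp), None)

def partitioner(key):
    # Partition on year only, so every record for a year reaches the same reducer.
    # Python salts str hashes per process; a real job needs a stable hash.
    return hash(key[0]) % NUM_REDUCERS

def grouping_comparator(a, b):
    # Compare on year only, so one reduce call sees all temps for that year.
    return (a[0] > b[0]) - (a[0] < b[0])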
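
The effect of the composite key can be simulated locally: sort on the full key, then group on the year alone. This is a sketch with made-up sample lines, using `itertools.groupby` in place of the framework's grouping comparator.

```python
from itertools import groupby

def map_temp(line):
    year, temp = line.split(",")
    yield ((year, int(temp)), None)

lines = ["1950,22", "1949,78", "1950,0", "1949,111", "1950,-11"]
keys = [k for line in lines for k, _ in map_temp(line)]

# Sort comparator: full composite key (year, then temp ascending).
keys.sort()
# Grouping comparator: year only, so each group corresponds to one reduce call.
by_year = {year: [t for _, t in group]
           for year, group in groupby(keys, key=lambda k: k[0])}
```

Each reducer call receives its temperatures already ordered, without buffering and sorting them in memory itself.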

Join (reduce-side)

def map_users(row):
    yield (row["user_id"], ("user", row))

def map_orders(row):
    yield (row["user_id"], ("order", row))

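def reduce_join(uid, tagged):
    tagged = list(tagged)  # materialize: the values are scanned twice
    user = next((r for tag, r in tagged if tag == "user"), None)
    if user is None:
        return  # inner join: drop orders with no matching user
    for tag, r in tagged:
        if tag == "order":
            yield {**user, **r}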
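
A local run shows how the tagged rows meet at the reducer. This is a sketch with made-up rows; the `reduce_join` used here is a defensive variant that materializes its values and skips keys with no user row, and the `defaultdict` grouping stands in for the shuffle.

```python
from collections import defaultdict

def map_users(row):
    yield (row["user_id"], ("user", row))

def map_orders(row):
    yield (row["user_id"], ("order", row))

def reduce_join(uid, tagged):
    tagged = list(tagged)  # values are scanned twice
    user = next((r for tag, r in tagged if tag == "user"), None)
    if user is None:
        return  # inner join: no matching user
    for tag, r in tagged:
        if tag == "order":
            yield {**user, **r}

users = [{"user_id": 1, "name": "ana"}, {"user_id": 2, "name": "bo"}]
orders = [{"user_id": 1, "order_id": 10}, {"user_id": 1, "order_id": 11},
          {"user_id": 3, "order_id": 12}]

grouped = defaultdict(list)
for row in orders:  # orders first, to show arrival order does not matter
    for uid, tagged_row in map_orders(row):
        grouped[uid].append(tagged_row)
for row in users:
    for uid, tagged_row in map_users(row):
        grouped[uid].append(tagged_row)

joined = [out for uid, tagged in grouped.items() for out in reduce_join(uid, tagged)]
```

User 2 has no orders and order 12 has no user, so neither appears in the inner-join result.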

Decision criteria

Situation	Approach
Batch ETL on TB+	Spark (Hadoop MR is legacy)
Streaming	Flink / Spark Structured Streaming
SQL-shaped query	BigQuery / Athena / Presto
Cross-cloud portability	Apache Beam
Educational	Raw MR pseudocode

Default: Spark for new projects; Hadoop MR for legacy maintenance only.

🔗 Graph

Parent: Distributed Systems · Parallel-Computing
Variants: Spark · Apache Flink · Apache Beam
Applications: Data Pipeline · ETL
Adjacent: HDFS · Hadoop YARN

🤖 LLM usage

When: pipeline design review, Spark migration guides, query optimization. When not: real-time, low-latency workloads (wrong paradigm).

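❌ Anti-patterns

Many small files: overloads the HDFS NameNode, which keeps metadata for every file and block in memory. Compaction is required.
Skewed keys: one reducer becomes a hotspot. Mitigate with salting or a combiner.
Stateful map: breaks idempotency, so re-executed tasks can produce different output and fault recovery fails.
Re-implementing SQL: use BigQuery / Spark SQL instead.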
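
The salting mitigation for skewed keys can be sketched as a two-stage aggregation: stage 1 spreads a hot key over several salted sub-keys, and stage 2 strips the salt and merges the partial results. This is a minimal sketch with made-up data; `NUM_SALTS`, `map_salted`, and `sum_by_key` are hypothetical names, and it assumes the aggregate (a sum here) can be merged from partials.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # illustrative; tune to the degree of skew

def map_salted(word, count, rng):
    """Stage 1 map: spread each key over NUM_SALTS sub-keys."""
    yield ((word, rng.randrange(NUM_SALTS)), count)

def sum_by_key(pairs):
    """Stand-in for shuffle + reduce: sum the values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

rng = random.Random(0)
records = [("the", 1)] * 1000 + [("rare", 1)] * 3

# Stage 1: partial sums per (word, salt); the hot key now spans several reducers.
partial = sum_by_key(kv for w, c in records for kv in map_salted(w, c, rng))
# Stage 2: strip the salt and sum the partials per word.
final = sum_by_key((word, total) for (word, _salt), total in partial.items())
```

The cost is a second pass over the (much smaller) partial results, in exchange for removing the single-reducer bottleneck.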

🧪 Verification / Duplicates

Verified (Dean & Ghemawat OSDI 2004, Spark NSDI 2012, Hadoop docs 3.x).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: word-count + Spark + secondary-sort + join patterns
