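koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)

Replaces [[wikilinks]] that differ only in name (spelling variants) with the target document's canonical title,
reconnecting 1,200 broken links. Only normalized title/filename matches are applied; alias matching is
excluded because of over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs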

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-08 12:24:15 +09:00

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Pareto Principle

One-line summary

"80% of effects come from 20% of causes." Vilfredo Pareto (1896) based it on an observation about land ownership in Italy. Modern applications: bug triage (roughly the top 20% of bugs cause 80% of crashes), customer revenue (the top 20% of customers pay about 80%), feature importance (the top 20% of features carry about 80% of model signal). It is the default prioritization heuristic.

Core

Origin

Pareto 1896: 80% of Italian land was owned by 20% of the population.
Juran 1940s: quality control, "vital few vs trivial many".
Member of the power-law family: the distribution is linear on a log-log plot.
80/20 is only a mnemonic; actual ratios vary (90/10, 70/30, etc.).
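
How much the ratio moves can be sketched under one strong assumption: the data follow an idealized Pareto (Type I) distribution with shape parameter alpha > 1. In that case the top fraction p holds p ** (1 - 1/alpha) of the total. The name top_share and the sample alpha values are illustrative and not part of the original note.

```python
import math

def top_share(p: float, alpha: float) -> float:
    """Share of the total held by the top fraction p, assuming an
    idealized Pareto (Type I) distribution with shape alpha > 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be > 1 so that the mean is finite")
    return p ** (1 - 1 / alpha)

# Shape that reproduces the classic 80/20 split: log(5) / log(4), about 1.16.
ALPHA_80_20 = math.log(5) / math.log(4)

# Compare the 80/20 shape with a few lighter-tailed (hypothetical) shapes.
for alpha in (ALPHA_80_20, 1.5, 2.0, 3.0):
    print(f"alpha={alpha:.2f}: top 20% hold {top_share(0.2, alpha):.0%}")
```

Under this model, a heavier tail (alpha closer to 1) concentrates more of the total in the top 20%, and a lighter tail spreads it out. Real data rarely fit the model exactly, so the measured ratio should take precedence.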

Key insight

Effects are NOT uniformly distributed across causes.
Sorting by impact reveals the long tail.
ROI: fix the top 20% of causes → solve 80% of the problem with 20% of the effort.
Caveat: the remaining 20% of effects can still be important (safety, compliance).
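
One way to respect the caveat is to pin must-do items before applying the cut. The sketch below is a minimal illustration: the Item class, the mandatory flag, and the sample data are all hypothetical, and the 80% cut is applied to optional items only.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    impact: float
    mandatory: bool = False  # hypothetical flag for safety/compliance work

def select_work(items: list[Item], threshold: float = 0.8) -> list[Item]:
    """Keep every mandatory item, then add optional items by descending
    impact until they cover `threshold` of the total optional impact."""
    mandatory = [it for it in items if it.mandatory]
    optional = sorted(
        (it for it in items if not it.mandatory),
        key=lambda it: it.impact,
        reverse=True,
    )
    total = sum(it.impact for it in optional)
    chosen: list[Item] = []
    running = 0.0
    for it in optional:
        # Stop once the optional items already selected reach the threshold.
        if total <= 0 or running / total >= threshold:
            break
        chosen.append(it)
        running += it.impact
    return mandatory + chosen

# Illustrative data only.
items = [
    Item("crash-on-login", 50),
    Item("slow-search", 30),
    Item("typo", 10),
    Item("dark-mode", 5),
    Item("misc-cleanup", 5),
    Item("gdpr-export", 1, mandatory=True),
]
selected = [it.name for it in select_work(items)]
print(selected)
```

The mandatory item is selected even though its impact score is the lowest in the list, which is the point of the caveat.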

Software / ML context

Bug triage: a small set of bugs causes most crashes.
Performance hotspots: a small fraction of the code (the note's rule of thumb: 5%) accounts for most of the CPU time (95%).
Feature importance: the top features dominate model signal.
Customer revenue: a tiny number of enterprise accounts contributes most of the revenue.
Test coverage: roughly 80% of bugs sit in 20% of code paths.

Applications

Backlog prioritization (impact × ease).
Performance profiling (optimize hot path first).
Feature engineering (drop low-importance features).
Customer success (focus on high-value accounts).
Bug fixing (top crash signatures first).
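
The "impact × ease" ranking from the first item can be sketched as follows. The 1-10 scores, the task names, and the function name rank_backlog are hypothetical; how impact and ease get scored is left to the team.

```python
def rank_backlog(tasks: list[dict]) -> list[dict]:
    """Sort tasks by impact * ease, highest score first."""
    return sorted(tasks, key=lambda t: t["impact"] * t["ease"], reverse=True)

# Illustrative backlog; impact and ease are made-up 1-10 scores.
backlog = [
    {"task": "rewrite billing service", "impact": 8, "ease": 2},
    {"task": "fix checkout crash", "impact": 9, "ease": 7},
    {"task": "add tooltip", "impact": 2, "ease": 9},
    {"task": "cache search results", "impact": 6, "ease": 6},
]

for t in rank_backlog(backlog):
    print(f'{t["impact"] * t["ease"]:>3}  {t["task"]}')
```

Multiplying the two scores means a very hard task drops down the list even when its impact is high. Dividing impact by an effort estimate is a common alternative and may suit some teams better.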

💻 Patterns

Pareto chart for bug triage

import pandas as pd
import matplotlib.pyplot as plt

# Example crash counts per bug
df = pd.DataFrame({"bug": ["A","B","C","D","E","F"],
                   "occurrences": [500, 300, 100, 50, 30, 20]})
df = df.sort_values("occurrences", ascending=False)
df["cum_pct"] = df["occurrences"].cumsum() / df["occurrences"].sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(df["bug"], df["occurrences"])           # bars: occurrences per bug
ax1.set_ylabel("Occurrences")
ax2 = ax1.twinx()                                # second y-axis for the cumulative line
ax2.plot(df["bug"], df["cum_pct"], "r-o")
ax2.set_ylim(0, 105)
ax2.set_ylabel("Cumulative %")
ax2.axhline(80, color="gray", linestyle="--")   # 80% reference line
plt.show()

Find the "vital 20%"

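def vital_few(values, threshold=0.8):
    """Return (count, items): the fewest largest values whose sum reaches `threshold` of the total."""
    sorted_vals = sorted(values, reverse=True)
    total = sum(sorted_vals)
    if total <= 0:
        return 0, []  # empty or all-zero input: nothing to rank
    cumsum = 0
    for i, v in enumerate(sorted_vals, 1):
        cumsum += v
        if cumsum / total >= threshold:
            return i, sorted_vals[:i]
    return len(sorted_vals), sorted_vals  # reached only if threshold > 1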

Feature importance pruning

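import pandas as pd
import xgboost as xgb

# Assumes X is a pandas DataFrame of features and y the matching labels.
model = xgb.XGBClassifier().fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
cumulative = importance.cumsum() / importance.sum()
# Keep features up to and including the one that crosses 80%
# (a plain `cumulative <= 0.8` mask drops that feature and can come back empty).
n_keep = min(int((cumulative < 0.8).sum()) + 1, len(importance))
top_features = importance.index[:n_keep]  # vital few
print(f"{len(top_features)} of {len(X.columns)} features carry at least 80% of importance")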

Revenue concentration analysis

import pandas as pd

# Assumes customers.csv has a numeric "revenue" column.
customers = pd.read_csv("customers.csv").sort_values("revenue", ascending=False)
customers["cum_revenue_pct"] = customers["revenue"].cumsum() / customers["revenue"].sum()
n_top = max(1, int(len(customers) * 0.2))  # at least one customer, even for tiny files
top_20 = customers.head(n_top)
print(f"Top 20% generate {top_20['revenue'].sum() / customers['revenue'].sum():.0%}")

Profiling hot path (Python)

import cProfile, pstats

# run_workload() stands in for the code being profiled.
profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)  # top 20 entries by cumulative time; the hot path usually shows up here

LLM cost: top tokens

# Track prompt token spend per template.
# Assumes logs is an iterable of dicts with "prompt_template", "tokens", "cost_per_token".
from collections import Counter

spend = Counter()
for log in logs:
    spend[log["prompt_template"]] += log["tokens"] * log["cost_per_token"]
total = sum(spend.values())
running = 0
for template, cost in spend.most_common():
    running += cost
    print(f"{template}: ${cost:.2f}, cumulative {running/total:.0%}")
    if running / total >= 0.8:  # stop once the vital few are listed
        break

Decision criteria

Situation	Approach
Backlog overload	Pareto-rank by impact, ship top 20%
Slow application	Profile, fix the hot path first
Too many features	Importance-based pruning
Customer support	Tier by revenue, allocate AE coverage
Long bug list	Triage by frequency × severity
Compliance / safety	Pareto NOT applicable (100% coverage required)

Default: sort by impact, take the top items until cumulative ≥ 80%.

🔗 Graph

Parent: Power-Law
Variants: 80/20 Rule · Long-Tail
Applications: Feature-Importance

🤖 LLM usage

When: backlog prioritization, optimization scope, feature selection, customer segmentation. When not: safety-critical / compliance work, where the long tail cannot be ignored.

❌ Anti-patterns

Treating 80/20 literally: the actual ratio varies. Measure, don't assume.
Ignoring the long tail entirely: some long-tail items are high-leverage (zero-day, churn-risk customer).
Cause/effect confusion: 20% of features accounting for 80% of accuracy does not mean keeping only those is safe (interactions matter).
Static analysis: Pareto rankings shift over time; recompute weekly.
Pareto in safety domains: medical, finance, security require 100% coverage.
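
The "recompute weekly" point can be checked with a small drift measure: compute the vital few for two periods and compare the sets. This is a sketch under assumptions: the function names, the use of Jaccard overlap as the drift measure, and the weekly counts are all illustrative.

```python
def vital_few_keys(counts: dict[str, int], threshold: float = 0.8) -> set[str]:
    """Keys of the largest entries whose counts reach `threshold` of the total."""
    total = sum(counts.values())
    if total <= 0:
        return set()
    keys: set[str] = set()
    running = 0
    for key, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        keys.add(key)
        running += n
        if running / total >= threshold:
            break
    return keys

def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap: 1.0 means identical sets, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Made-up crash counts per bug signature for two consecutive weeks.
last_week = {"A": 500, "B": 300, "C": 100, "D": 50, "E": 30, "F": 20}
this_week = {"A": 100, "B": 350, "C": 90, "G": 400, "E": 40, "F": 20}

before = vital_few_keys(last_week)
after = vital_few_keys(this_week)
print(sorted(before), sorted(after), f"overlap={overlap(before, after):.2f}")
```

A low overlap suggests the old ranking is stale and the priority list should be rebuilt. What counts as "low" is a judgment call for each team.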

🧪 Verification / duplicates

Verified (Pareto 1896 Cours d'économie politique, Juran 1951 Quality Handbook).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — Pareto applications, charts, anti-patterns
