---
id: wiki-2026-0508-pareto-principle
title: Pareto Principle
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [80/20 Rule, Pareto Distribution, Power Law]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [pareto, 80-20, prioritization, decision-making]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pandas, numpy
---

# Pareto Principle

## In one line

> **"80% of effects come from 20% of causes."** Vilfredo Pareto (1896), from his observation of land ownership in Italy. Modern applications: bug triage (roughly the top 20% of bugs cause 80% of crashes), customer revenue (roughly the top 20% of customers pay 80%), feature importance (roughly the top 20% of features carry 80% of model signal). It is the default prioritization heuristic.

## Key points

### Origin

- Pareto, 1896: about 80% of Italian land was owned by about 20% of the population.
- Juran, 1940s: applied the idea to quality control as the "vital few vs. trivial many".
- Member of the power-law family: the distribution appears as a straight line on a log-log plot.
- 80/20 is only a mnemonic. Actual ratios vary (90/10, 70/30, and so on).

### Core insight

- Effects are NOT uniformly distributed across causes.
- Sorting by impact reveals the long tail.
- ROI: fixing the top ~20% of causes solves roughly 80% of the problem with roughly 20% of the effort.
- Caveat: the remaining 20% of effects can still be important (safety, compliance).

### Software / ML context

- **Bug triage**: a small set of bugs causes most crashes.
- **Performance hotspots**: a small fraction of the code (often quoted as ~5%) accounts for most of the CPU time (often quoted as ~95%).
- **Feature importance**: the top features dominate model signal.
- **Customer revenue**: a tiny number of enterprise accounts generates most of the revenue.
- **Test coverage**: roughly 80% of bugs sit in 20% of code paths.

### Applications

1. Backlog prioritization (impact × ease).
2. Performance profiling (optimize the hot path first).
3. Feature engineering (drop low-importance features).
4. Customer success (focus on high-value accounts).
5. Bug fixing (top crash signatures first).

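
The first application above, backlog prioritization by impact × ease, can be sketched as follows. This is a minimal sketch, assuming each item carries hand-estimated `impact` and `ease` scores on a hypothetical 1-10 scale; `BacklogItem`, `rank_backlog`, and `top_until_share` are illustrative names, not part of any library or of the original note.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    impact: float  # estimated benefit (assumed 1-10 scale)
    ease: float    # inverse of effort (assumed 1-10 scale)


def rank_backlog(items):
    """Return items sorted by impact * ease, highest score first."""
    return sorted(items, key=lambda it: it.impact * it.ease, reverse=True)


def top_until_share(items, threshold=0.8):
    """Take ranked items until their combined score reaches `threshold` of the total."""
    ranked = rank_backlog(items)
    total = sum(it.impact * it.ease for it in ranked)
    if total <= 0:
        return []  # empty backlog or all-zero scores
    picked, running = [], 0.0
    for it in ranked:
        picked.append(it)
        running += it.impact * it.ease
        if running / total >= threshold:
            break
    return picked


# Hypothetical backlog; scores are made up for illustration.
backlog = [
    BacklogItem("fix login crash", impact=9, ease=8),  # score 72
    BacklogItem("rewrite billing", impact=8, ease=2),  # score 16
    BacklogItem("dark mode", impact=3, ease=6),        # score 18
    BacklogItem("update footer", impact=1, ease=9),    # score 9
]
selected = top_until_share(backlog)
print([it.name for it in selected])
```

Multiplying impact by ease is only one possible scoring rule. The point is that the cut-off comes from the measured cumulative share rather than from a fixed "top 20%" count.
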
## 💻 Patterns

### Pareto chart for bug triage

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"bug": ["A", "B", "C", "D", "E", "F"],
                   "occurrences": [500, 300, 100, 50, 30, 20]})
df = df.sort_values("occurrences", ascending=False)
df["cum_pct"] = df["occurrences"].cumsum() / df["occurrences"].sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(df["bug"], df["occurrences"])            # bars: raw counts per bug
ax2 = ax1.twinx()                                # second y-axis for the cumulative line
ax2.plot(df["bug"], df["cum_pct"], "r-o")        # line: cumulative percentage
ax2.axhline(80, color="gray", linestyle="--")    # 80% reference line
plt.show()
```

### Find the "vital 20%"

```python
def vital_few(values, threshold=0.8):
    """Return (count, items) for the largest values that together reach `threshold` of the total."""
    sorted_vals = sorted(values, reverse=True)
    total = sum(sorted_vals)
    if total <= 0:
        return 0, []  # empty or all-zero input: nothing to rank
    cumsum = 0
    for i, v in enumerate(sorted_vals, 1):
        cumsum += v
        if cumsum / total >= threshold:
            return i, sorted_vals[:i]
    return len(sorted_vals), sorted_vals
```

### Feature importance pruning

```python
import pandas as pd
import xgboost as xgb

# Assumes X is a pandas DataFrame of features and y is the label vector.
model = xgb.XGBClassifier().fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
cumulative = importance.cumsum() / importance.sum()

# Keep every feature whose preceding cumulative share is still below 80%,
# so the feature that crosses the threshold is included.
top_features = importance[cumulative.shift(fill_value=0) < 0.8].index  # vital few
print(f"{len(top_features)} of {len(X.columns)} features carry at least 80% of importance")
```

### Revenue concentration analysis

```python
import pandas as pd

# Assumes customers.csv has a numeric "revenue" column.
customers = pd.read_csv("customers.csv").sort_values("revenue", ascending=False)
customers["cum_revenue_pct"] = customers["revenue"].cumsum() / customers["revenue"].sum()

n_top = max(1, int(len(customers) * 0.2))  # at least one customer for small files
top_20 = customers.head(n_top)
print(f"Top 20% generate {top_20['revenue'].sum() / customers['revenue'].sum():.0%}")
```

### Profiling hot path (Python)

```python
import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()  # placeholder for the code under test
profiler.disable()

# "cumulative" includes time spent in callees; use "tottime" to rank by self time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)  # show the 20 most expensive entries; the hot path is usually among them
```

### LLM cost: spend by prompt template

```python
# Track prompt token spend per template.
# Assumes `logs` is an iterable of dicts with "prompt_template", "tokens", "cost_per_token".
from collections import Counter

spend = Counter()
for log in logs:
    spend[log["prompt_template"]] += log["tokens"] * log["cost_per_token"]

total = sum(spend.values())
running = 0
for template, cost in spend.most_common():
    running += cost
    print(f"{template}: ${cost:.2f}, cumulative {running/total:.0%}")
    if running / total >= 0.8:  # stop once the vital few are listed
        break
```

## Decision criteria

| Situation | Approach |
|---|---|
| Backlog overload | Pareto-rank by impact, ship top 20% |
| Slow application | Profile, fix the hot path first |
| Too many features | Importance-based pruning |
| Customer support | Tier by revenue, allocate AE coverage |
| Long bug list | Triage by frequency × severity |
| Compliance / safety | Pareto NOT applicable (100% coverage required) |

**Default**: sort by impact, take from the top until the cumulative share is ≥ 80%.

## 🔗 Graph

- Parent: [[Power-Law]]
- Variants: [[80-20-Rule]] · [[Long-Tail]]
- Applications: [[Feature-Importance]]

## 🤖 LLM usage

**When**: backlog prioritization, optimization scope, feature selection, customer segmentation.

**When not**: safety-critical or compliance work, where the long tail cannot be ignored.

## ❌ Anti-patterns

- **Treating 80/20 literally**: the actual ratio varies. Measure, don't assume.
- **Ignoring the long tail entirely**: some long-tail items are high-leverage (zero-day, churn-risk customer).
- **Cause/effect confusion**: "20% of features account for 80% of accuracy" does not mean you can keep only those (interactions matter).
- **Static ranking**: the Pareto ranking shifts over time. Recompute regularly (for example, weekly).
- **Pareto in safety domains**: medical, finance, security all require 100% coverage.

## 🧪 Verification / duplicates

- Verified (Pareto 1896, *Cours d'économie politique*; Juran 1951, *Quality Control Handbook*).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Pareto applications, charts, anti-patterns |