---
id: wiki-2026-0508-pareto-principle
title: Pareto Principle
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [80/20 Rule, Pareto Distribution, Power Law]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [pareto, 80-20, prioritization, decision-making]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pandas, numpy
---
# Pareto Principle
## One-line summary
> **"Roughly 80% of effects come from 20% of causes."** Vilfredo Pareto (1896) derived this from his observation of land ownership in Italy. Modern applications include bug triage (the top 20% of bugs cause 80% of crashes), customer revenue (the top 20% of customers pay 80%), and feature importance (the top 20% of features carry 80% of model signal). It is the default prioritization heuristic.
## Core concepts
### Origin
- Pareto, 1896: about 80% of Italian land was owned by about 20% of the population.
- Juran, 1940s: applied the idea to quality control as the "vital few vs. trivial many".
- Part of the power-law family: the tail of the distribution is roughly a straight line on a log-log plot.
- 80/20 is only a mnemonic; actual ratios vary (90/10, 70/30, and so on).
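
To make the "actual ratios vary" point concrete: for an idealized Pareto (Type I) distribution with shape parameter α > 1, the share of the total held by the top fraction p is p^(1 − 1/α). The sketch below only illustrates that closed form. The function names `top_share` and `alpha_for_split` are hypothetical, and real data rarely follows a clean Pareto distribution.

```python
import math


def top_share(p: float, alpha: float) -> float:
    """Share of the total held by the top fraction p under a Pareto(alpha) model."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be > 1 so that the mean is finite")
    return p ** (1 - 1 / alpha)


def alpha_for_split(p: float, share: float) -> float:
    """Shape parameter for which the top fraction p holds the given share."""
    if not 0 < p < share < 1:
        raise ValueError("expected 0 < p < share < 1")
    # Solve share = p ** (1 - 1/alpha) for alpha.
    return 1 / (1 - math.log(share) / math.log(p))


alpha_80_20 = alpha_for_split(0.2, 0.8)
print(f"alpha for an 80/20 split: {alpha_80_20:.3f}")
for alpha in (1.1, alpha_80_20, 1.5, 2.0):
    print(f"alpha={alpha:.3f}: top 20% hold {top_share(0.2, alpha):.0%}")
```

Under this model the classic 80/20 split corresponds to α ≈ 1.16, while α = 2 leaves the top 20% with only about 45% of the total, which is why the ratio should be measured rather than assumed.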
### Key insight
- Effects are NOT uniformly distributed across causes.
- Sorting by impact reveals the long tail.
- ROI: fixing the top 20% of causes can solve about 80% of the problem for about 20% of the effort, provided the effort per cause is comparable.
- Caveat: the remaining 20% of effects can still be important (safety, compliance).
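### Software / ML context
- **Bug triage**: a small set of bugs causes most crashes.
- **Performance hotspots**: a small fraction of the code accounts for most of the CPU time (figures such as 5% of code for 95% of time are often quoted, but measure your own workload).
- **Feature importance**: a few top features dominate the model signal.
- **Customer revenue**: a tiny number of enterprise accounts generates most of the revenue.
- **Test coverage**: roughly 80% of bugs sit in 20% of code paths.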
### Applications
1. Backlog prioritization (impact × ease).
2. Performance profiling (optimize hot path first).
3. Feature engineering (drop low-importance features).
4. Customer success (focus on high-value accounts).
5. Bug fixing (top crash signatures first).
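
As a rough illustration of item 1, backlog items can be ranked by the product of impact and ease. This is a minimal sketch, not a prescribed scoring model: the `BacklogItem` class, the 1-10 scales, and the example data are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    impact: float  # estimated benefit on an assumed 1-10 scale
    ease: float    # inverse of effort on an assumed 1-10 scale

    @property
    def score(self) -> float:
        return self.impact * self.ease


def rank_backlog(items):
    """Return items sorted by impact x ease, highest score first."""
    return sorted(items, key=lambda item: item.score, reverse=True)


# Made-up example data, for illustration only.
backlog = [
    BacklogItem("Fix login crash", impact=9, ease=6),
    BacklogItem("Redesign settings page", impact=4, ease=2),
    BacklogItem("Cache product queries", impact=7, ease=8),
    BacklogItem("Update footer links", impact=1, ease=10),
]

ranked = rank_backlog(backlog)
for item in ranked:
    print(f"{item.name}: score {item.score:g}")
```

A product score rewards items that are both valuable and cheap. A very easy but low-impact item (the footer links here) still ranks low, which is usually the desired behavior.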
## 💻 Patterns
### Pareto chart for bug triage
```python
import pandas as pd
import matplotlib.pyplot as plt

# Example data: crash counts per bug signature.
df = pd.DataFrame({"bug": ["A", "B", "C", "D", "E", "F"],
                   "occurrences": [500, 300, 100, 50, 30, 20]})
df = df.sort_values("occurrences", ascending=False)
df["cum_pct"] = df["occurrences"].cumsum() / df["occurrences"].sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(df["bug"], df["occurrences"])          # bars: raw counts
ax1.set_ylabel("Occurrences")
ax2 = ax1.twinx()                              # second y-axis for cumulative %
ax2.plot(df["bug"], df["cum_pct"], "r-o")
ax2.set_ylim(0, 105)
ax2.set_ylabel("Cumulative %")
ax2.axhline(80, color="gray", linestyle="--")  # 80% reference line
plt.show()
```
### Find the "vital 20%"
```python
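def vital_few(values, threshold=0.8):
    """Return (count, items): the smallest top-ranked prefix whose sum
    reaches `threshold` of the total."""
    sorted_vals = sorted(values, reverse=True)
    total = sum(sorted_vals)
    if total <= 0:
        # Nothing to rank, or shares are undefined.
        return 0, []
    cumsum = 0
    for i, v in enumerate(sorted_vals, 1):
        cumsum += v
        if cumsum / total >= threshold:
            return i, sorted_vals[:i]
    return len(sorted_vals), sorted_vals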
```
### Feature importance pruning
```python
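import pandas as pd
import xgboost as xgb

# X (a pandas DataFrame of features) and y (labels) are assumed to exist.
model = xgb.XGBClassifier().fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
cumulative = importance.cumsum() / importance.sum()
# Keep each feature whose preceding cumulative share is still below 80%,
# so the feature that crosses the threshold is included.
top_features = importance[cumulative.shift(fill_value=0) < 0.8].index  # vital few
print(f"{len(top_features)} of {len(X.columns)} features carry at least 80% of importance")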
```
### Revenue concentration analysis
```python
import pandas as pd

# Assumes customers.csv has a numeric "revenue" column.
customers = pd.read_csv("customers.csv").sort_values("revenue", ascending=False)
customers["cum_revenue_pct"] = customers["revenue"].cumsum() / customers["revenue"].sum()
top_20 = customers.head(max(1, int(len(customers) * 0.2)))  # at least one row
print(f"Top 20% generate {top_20['revenue'].sum() / customers['revenue'].sum():.0%}")
```
### Profiling hot path (Python)
```python
import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()  # placeholder for the code being profiled
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)  # top 20 entries by cumulative time; the hot path usually shows up here
```
### LLM cost: top tokens
```python
# Track prompt-token spend per template.
from collections import Counter

# `logs` is assumed to be an iterable of dicts with the keys used below.
spend = Counter()
for log in logs:
    spend[log["prompt_template"]] += log["tokens"] * log["cost_per_token"]
total = sum(spend.values())
running = 0
for template, cost in spend.most_common():
    running += cost
    print(f"{template}: ${cost:.2f}, cumulative {running/total:.0%}")
    if running / total >= 0.8:
        break
```
## Decision criteria
| Situation | Approach |
|---|---|
| Backlog overload | Pareto-rank by impact, ship top 20% |
| Slow application | Profile, fix the hot path first |
| Too many features | Importance-based pruning |
| Customer support | Tier by revenue, allocate AE coverage |
| Long bug list | Triage by frequency × severity |
| Compliance / safety | Pareto NOT applicable (100% coverage is required) |
**Default**: sort by impact, take the top items until cumulative impact ≥ 80%.
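
The "Long bug list" and "Compliance / safety" rows can be combined in one pass: rank by frequency × severity, cut at the cumulative threshold, and then keep anything flagged as mandatory no matter where it ranks. The sketch below assumes a hypothetical item shape (dicts with `name`, `frequency`, `severity`, and `mandatory` keys) and made-up data. It is one possible way to encode the rule, not a standard algorithm.

```python
def pareto_cut_with_overrides(items, threshold=0.8):
    """Select items by descending frequency x severity until the cumulative
    share reaches `threshold`, then add every item flagged as mandatory.

    Each item is a dict with hypothetical keys: "name", "frequency",
    "severity", and "mandatory" (True for safety/compliance work).
    """
    def score(it):
        return it["frequency"] * it["severity"]

    ranked = sorted(items, key=score, reverse=True)
    total = sum(score(it) for it in ranked)
    selected, running = [], 0.0
    for it in ranked:
        if total > 0 and running / total < threshold:
            # Still below the threshold: this item belongs to the vital few.
            selected.append(it["name"])
            running += score(it)
        elif it["mandatory"]:
            # Past the cut, but safety/compliance items are never dropped.
            selected.append(it["name"])
    return selected


# Made-up example data, for illustration only.
bugs = [
    {"name": "crash-on-save", "frequency": 500, "severity": 3, "mandatory": False},
    {"name": "slow-search", "frequency": 300, "severity": 2, "mandatory": False},
    {"name": "typo-in-footer", "frequency": 100, "severity": 1, "mandatory": False},
    {"name": "audit-log-gap", "frequency": 5, "severity": 4, "mandatory": True},
]

print(pareto_cut_with_overrides(bugs))
```

Here the audit-log item contributes under 1% of the total score, yet it survives the cut because it is flagged as mandatory.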
## 🔗 Graph
- Parent: [[Power-Law]]
- Variants: [[80-20-Rule]] · [[Long-Tail]]
- Applications: [[Feature-Importance]]
## 🤖 LLM usage
**When to use**: backlog prioritization, optimization scope, feature selection, customer segmentation.
**When not to use**: safety-critical or compliance work, where the long tail cannot be ignored.
## ❌ Anti-patterns
- **Treating 80/20 literally**: the actual ratio varies; measure, don't assume.
- **Ignoring the long tail entirely**: some long-tail items are high-leverage (a zero-day, a churn-risk customer).
- **Cause/effect confusion**: "20% of features account for 80% of accuracy" does not mean you can keep only those features, because interactions matter.
- **Static analysis**: Pareto rankings shift over time; recompute regularly (for example, weekly).
- **Pareto in safety domains**: medical, finance, and security work requires 100% coverage.
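
To avoid the "static analysis" trap, the vital few can be recomputed per time window. The sketch below groups events by ISO week and finds, for each week, the smallest set of top signatures covering the threshold. The function name `vital_few_by_week`, the `(date, signature)` input shape, and the sample events are assumptions for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def vital_few_by_week(events, threshold=0.8):
    """Map each ISO (year, week) to the smallest list of top signatures whose
    counts reach `threshold` of that week's events.

    `events` is an iterable of (date, signature) pairs (an assumed shape).
    """
    counts = defaultdict(Counter)
    for day, signature in events:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][signature] += 1

    result = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        running, top = 0, []
        for signature, n in counter.most_common():
            top.append(signature)
            running += n
            if running / total >= threshold:
                break
        result[week] = top
    return result


# Made-up crash events spanning two consecutive weeks.
events = (
    [(date(2026, 5, 5), "A")] * 8
    + [(date(2026, 5, 6), "B")] * 2
    + [(date(2026, 5, 12), "B")] * 7
    + [(date(2026, 5, 13), "C")] * 2
    + [(date(2026, 5, 13), "A")] * 1
)

weekly = vital_few_by_week(events)
for week, signatures in weekly.items():
    print(week, signatures)
```

In this sample, signature A dominates the first week but drops out of the vital few in the second, which is the kind of shift a one-off ranking would miss.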
## 🧪 Verification / duplicates
- Verified (Pareto 1896, *Cours d'économie politique*; Juran 1951, *Quality Control Handbook*).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Pareto applications, charts, anti-patterns |