---
id: wiki-2026-0508-pareto-principle
title: Pareto Principle
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [80/20 Rule, Pareto Distribution, Power Law]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [pareto, 80-20, prioritization, decision-making]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pandas, numpy
---
# Pareto Principle
## One-line summary
> **"80% of effects come from 20% of causes."** Vilfredo Pareto (1896) observed the pattern in Italian land ownership. Modern applications include bug triage (roughly the top 20% of bugs cause 80% of crashes), customer revenue (the top 20% of customers pay 80%), and feature importance (the top 20% of features carry 80% of model signal). It is the default prioritization heuristic.
## Key points
### Origin
- Pareto 1896: 80% of Italian land was owned by 20% of the population.
- Juran 1940s: quality control, "vital few vs trivial many".
- Member of the power-law family: the distribution is roughly linear on a log-log plot.
- 80/20 is only a mnemonic; actual ratios vary (90/10, 70/30, etc.).
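The link between the mnemonic and the power-law family can be made concrete. For an idealized Pareto (Type I) distribution with shape parameter `alpha > 1`, the share of the total held by the top fraction `p` is `p ** (1 - 1/alpha)`. The sketch below is illustrative only: `top_share` and `ALPHA_80_20` are hypothetical names, and real data rarely follows the ideal distribution exactly.

```python
import math

def top_share(p: float, alpha: float) -> float:
    """Share of the total held by the top fraction p under an ideal Pareto distribution.

    Assumes a Pareto Type I distribution with shape alpha > 1 (finite mean).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be > 1 for the mean to exist")
    return p ** (1 - 1 / alpha)

# Shape parameter at which the top 20% hold exactly 80% (about 1.16).
ALPHA_80_20 = math.log(5) / math.log(4)

for alpha in (ALPHA_80_20, 1.5, 2.0):
    print(f"alpha={alpha:.2f}: top 20% hold {top_share(0.2, alpha):.0%}")
```

A heavier tail (smaller `alpha`) concentrates more of the total in the top group, which is why measured ratios differ from 80/20.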
### Key insight
- Effects are NOT uniformly distributed across causes.
- Sorting by impact reveals the long tail.
- ROI: fixing the top 20% of causes can solve roughly 80% of the problem for about 20% of the effort.
- Caveat: the remaining 20% of effects can still be important (safety, compliance).
### Software / ML context
- **Bug triage**: a small set of bugs causes most crashes.
- **Performance hotspots**: a small fraction of the code (e.g. 5%) often accounts for most of the CPU time (e.g. 95%).
- **Feature importance**: the top features dominate model signal.
- **Customer revenue**: a tiny number of enterprise customers accounts for a large share of revenue.
- **Test coverage**: roughly 80% of bugs sit in 20% of code paths.
### Applications
1. Backlog prioritization (impact × ease).
2. Performance profiling (optimize hot path first).
3. Feature engineering (drop low-importance features).
4. Customer success (focus on high-value accounts).
5. Bug fixing (top crash signatures first).
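Item 1 (impact × ease) can be sketched as a simple scoring pass. This is a minimal example, not a prescribed method: `BacklogItem`, `rank_backlog`, the 1 to 10 scales, and the sample items are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BacklogItem:
    name: str
    impact: float  # estimated benefit, e.g. 1 (low) to 10 (high)
    ease: float    # how easy it is to ship, e.g. 1 (hard) to 10 (trivial)

    @property
    def score(self) -> float:
        return self.impact * self.ease

def rank_backlog(items):
    """Return items sorted by impact x ease, highest score first."""
    return sorted(items, key=lambda item: item.score, reverse=True)

# Made-up backlog for illustration.
backlog = [
    BacklogItem("Fix login crash", impact=9, ease=6),
    BacklogItem("Redesign settings page", impact=4, ease=2),
    BacklogItem("Cache dashboard query", impact=7, ease=8),
    BacklogItem("Rename internal module", impact=1, ease=9),
]

for item in rank_backlog(backlog):
    print(f"{item.score:5.0f}  {item.name}")
```

Multiplying the two scores pushes easy, high-impact work to the top; a weighted sum would be an equally reasonable choice if ease should count for less.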
## 💻 Patterns
### Pareto chart for bug triage
```python
import pandas as pd
import matplotlib.pyplot as plt

# Example data: crash occurrences per bug
df = pd.DataFrame({"bug": ["A", "B", "C", "D", "E", "F"],
                   "occurrences": [500, 300, 100, 50, 30, 20]})
df = df.sort_values("occurrences", ascending=False)
df["cum_pct"] = df["occurrences"].cumsum() / df["occurrences"].sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(df["bug"], df["occurrences"])          # bars: raw counts, largest first
ax1.set_ylabel("Occurrences")
ax2 = ax1.twinx()                              # second y-axis for the cumulative line
ax2.plot(df["bug"], df["cum_pct"], "r-o")
ax2.set_ylim(0, 105)
ax2.set_ylabel("Cumulative %")
ax2.axhline(80, color="gray", linestyle="--")  # 80% reference line
plt.show()
```
### Find the "vital 20%"
```python
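def vital_few(values, threshold=0.8):
    """Return (count, items) for the largest values that reach `threshold` of the total."""
    sorted_vals = sorted(values, reverse=True)
    total = sum(sorted_vals)
    if total <= 0:  # empty or all-zero input: nothing to rank
        return 0, []
    cumsum = 0
    for i, v in enumerate(sorted_vals, 1):
        cumsum += v
        if cumsum / total >= threshold:
            return i, sorted_vals[:i]
    return len(sorted_vals), sorted_vals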
```
### Feature importance pruning
```python
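import pandas as pd
import xgboost as xgb

# X (a pandas DataFrame of features) and y (labels) are assumed to be defined already
model = xgb.XGBClassifier().fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
cumulative = importance.cumsum() / importance.sum()
# Keep each feature whose preceding cumulative share is still below 80%,
# so the feature that crosses the threshold is included
top_features = importance[cumulative.shift(fill_value=0) < 0.8].index  # vital few
print(f"{len(top_features)} of {len(X.columns)} features carry at least 80% of importance")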
```
### Revenue concentration analysis
```python
import pandas as pd

# Assumes customers.csv has a numeric "revenue" column
customers = pd.read_csv("customers.csv").sort_values("revenue", ascending=False)
customers["cum_revenue_pct"] = customers["revenue"].cumsum() / customers["revenue"].sum()
top_20 = customers.head(max(1, int(len(customers) * 0.2)))  # at least one customer
print(f"Top 20% generate {top_20['revenue'].sum() / customers['revenue'].sum():.0%}")
```
### Profiling hot path (Python)
```python
import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()  # placeholder: replace with the code you want to profile
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)  # top 20 entries by cumulative time; these often cover most of the runtime
```
### LLM cost: top tokens
```python
# Track prompt token spend per template
# Assumes `logs` is an iterable of dicts with "prompt_template", "tokens", "cost_per_token"
from collections import Counter

spend = Counter()
for log in logs:
    spend[log["prompt_template"]] += log["tokens"] * log["cost_per_token"]
total = sum(spend.values())
running = 0
for template, cost in spend.most_common():
    running += cost
    print(f"{template}: ${cost:.2f}, cumulative {running/total:.0%}")
    if running / total >= 0.8:
        break
```
## Decision criteria
| Situation | Approach |
|---|---|
| Backlog overload | Pareto-rank by impact, ship top 20% |
| Slow application | Profile, fix the hot path first |
| Too many features | Importance-based pruning |
| Customer support | Tier by revenue, allocate AE coverage |
| Long bug list | Triage by frequency × severity |
| Compliance / safety | Pareto NOT applicable (100% required) |
**Default**: sort by impact, take the top items until the cumulative share is ≥ 80%.
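The "long bug list" and "compliance / safety" rows can be combined in one pass: rank ordinary bugs by frequency × severity up to the 80% cutoff, and always include items flagged as mandatory. This is a sketch under stated assumptions; `triage` and the dict keys (`id`, `frequency`, `severity`, `must_fix`) are hypothetical, and the sample data is made up.

```python
def triage(bugs, threshold=0.8):
    """Pick bugs to fix first.

    Each bug is a dict with hypothetical keys: "id", "frequency", "severity",
    and "must_fix" (True for safety/compliance items that bypass the ranking).
    """
    mandatory = [b for b in bugs if b["must_fix"]]
    ranked = sorted(
        (b for b in bugs if not b["must_fix"]),
        key=lambda b: b["frequency"] * b["severity"],
        reverse=True,
    )
    total = sum(b["frequency"] * b["severity"] for b in ranked)
    selected, running = [], 0.0
    for bug in ranked:
        # Stop once the bugs already selected cover the threshold share.
        if total == 0 or running / total >= threshold:
            break
        selected.append(bug)
        running += bug["frequency"] * bug["severity"]
    return mandatory + selected

bugs = [
    {"id": "B1", "frequency": 500, "severity": 3, "must_fix": False},
    {"id": "B2", "frequency": 40, "severity": 5, "must_fix": False},
    {"id": "B3", "frequency": 2, "severity": 1, "must_fix": True},  # compliance item
    {"id": "B4", "frequency": 300, "severity": 2, "must_fix": False},
    {"id": "B5", "frequency": 10, "severity": 1, "must_fix": False},
]
print([b["id"] for b in triage(bugs)])
```

Keeping mandatory items outside the ranking means a rare compliance bug is never dropped just because its score is low.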
## 🔗 Graph
- Parent: [[Power-Law]]
- Variants: [[80/20 Rule]] · [[Long-Tail]]
- Applications: [[Feature-Importance]]
## 🤖 LLM usage
**When**: backlog prioritization, optimization scope, feature selection, customer segmentation.
**When not**: safety-critical / compliance work, where the long tail cannot be ignored.
## ❌ Anti-patterns
- **Treating 80/20 literally**: the actual ratio varies; measure, don't assume.
- **Ignoring the long tail entirely**: some long-tail items are high-leverage (zero-day, churn-risk customer).
- **Cause/effect confusion**: finding that 20% of features account for 80% of accuracy does not mean you should keep only those; feature interactions matter.
- **Static analysis**: the Pareto ranking shifts over time; recompute it weekly.
- **Pareto in safety domains**: medical, finance, security; 100% coverage is required.
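The "static analysis" anti-pattern can be checked by recomputing the vital set each period and comparing it with the previous one. The sketch below is one possible approach, not an established procedure: `vital_set`, `overlap`, and the crash counts are hypothetical, and Jaccard similarity is just one reasonable drift measure.

```python
def vital_set(counts, threshold=0.8):
    """Return the keys that together reach `threshold` of the total count."""
    total = sum(counts.values())
    selected, running = set(), 0
    for key, value in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Stop once the keys already selected cover the threshold share.
        if total == 0 or running / total >= threshold:
            break
        selected.add(key)
        running += value
    return selected

def overlap(previous, current):
    """Jaccard similarity between two vital sets (1.0 = identical)."""
    union = previous | current
    return len(previous & current) / len(union) if union else 1.0

# Made-up crash counts per signature for two consecutive weeks.
last_week = {"sig_a": 500, "sig_b": 300, "sig_c": 100, "sig_d": 50}
this_week = {"sig_a": 120, "sig_b": 310, "sig_c": 90, "sig_e": 480}

prev, curr = vital_set(last_week), vital_set(this_week)
print(f"dropped out: {sorted(prev - curr)}, newly vital: {sorted(curr - prev)}")
print(f"overlap: {overlap(prev, curr):.2f}")
```

A low overlap suggests that priorities set from the earlier ranking are stale and should be revisited.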
## 🧪 Verification / duplicates
- Verified (Pareto 1896 Cours d'économie politique, Juran 1951 Quality Handbook).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Pareto applications, charts, anti-patterns |