

id: wiki-2026-0508-pareto-principle
title: Pareto Principle
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: 80/20 Rule, Pareto Distribution, Power Law
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: pareto, 80-20, prioritization, decision-making
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pandas, numpy

Pareto Principle

One-line summary

"Roughly 80% of effects come from 20% of causes." Vilfredo Pareto (1896) observed the pattern in Italian land ownership. Modern applications: bug triage (the top 20% of bugs cause 80% of crashes), customer revenue (the top 20% of customers pay 80%), feature importance (the top 20% of features carry 80% of the model signal). It is the default prioritization heuristic.

Key points

Origin

  • Pareto 1896: about 80% of Italian land was owned by 20% of the population.
  • Juran 1940s: applied the idea to quality control as "vital few vs trivial many".
  • Part of the power-law family: the tail of the distribution is a straight line on log-log axes.
  • 80/20 is only a mnemonic; actual ratios vary (90/10, 70/30, and so on).
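
The link between the power-law family and the 80/20 mnemonic can be made concrete. For an idealized Pareto distribution with shape parameter alpha > 1, the share of the total held by the top fraction p is p^(1 - 1/alpha). The sketch below uses that closed form; the function names are illustrative, and real data rarely follows the ideal distribution exactly.

```python
import math

def top_share(p: float, alpha: float) -> float:
    """Share of the total held by the top fraction p under an ideal Pareto(alpha)."""
    if not (0 < p <= 1) or alpha <= 1:
        raise ValueError("need 0 < p <= 1 and alpha > 1")
    return p ** (1 - 1 / alpha)

def alpha_for_split(share: float, p: float) -> float:
    """Shape parameter for which the top fraction p holds `share` of the total."""
    if not (0 < p < share < 1):
        raise ValueError("need 0 < p < share < 1")
    return 1 / (1 - math.log(share) / math.log(p))

alpha_80_20 = alpha_for_split(0.8, 0.2)  # about 1.16

# Different splits correspond to different shape parameters.
for share, p in [(0.7, 0.3), (0.8, 0.2), (0.9, 0.1)]:
    a = alpha_for_split(share, p)
    print(f"{share:.0%}/{p:.0%} split -> alpha = {a:.3f}")
```

A smaller alpha means a heavier tail and a more extreme concentration, which is one way to see why 80/20 is a single point on a continuum and not a law.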

Key insight

  • Effects are NOT uniformly distributed across causes.
  • Sorting by impact reveals the long tail.
  • ROI: fixing the top 20% of causes solves roughly 80% of the problem with roughly 20% of the effort.
  • Caveat: the remaining 20% of effects can still be important (safety, compliance).
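
How uneven a given dataset really is can be measured directly. The following is a minimal sketch, assuming the effects are available as a plain list of non-negative numbers; `share_of_top` and the sample counts are hypothetical and only illustrate the calculation.

```python
import math

def share_of_top(values, fraction=0.2):
    """Return the share of the total contributed by the top `fraction` of items."""
    if not values:
        raise ValueError("values must be non-empty")
    total = sum(values)
    if total <= 0:
        raise ValueError("total must be positive")
    ranked = sorted(values, reverse=True)
    # round() guards against float noise such as 15 * 0.2 == 3.0000000000000004
    k = max(1, math.ceil(round(len(ranked) * fraction, 9)))
    return sum(ranked[:k]) / total

# Hypothetical crash counts per bug, for illustration only.
crashes = [500, 300, 100, 50, 30, 20, 10, 5, 3, 2]
print(f"Top 20% of bugs -> {share_of_top(crashes):.0%} of crashes")
```

With a perfectly uniform input the function returns the fraction itself (0.2), so the distance from that baseline indicates how concentrated the effects are.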

Software / ML context

  • Bug triage: a small set of bugs causes most crashes.
  • Performance hotspots: a small fraction of the code (often quoted as 5%) accounts for most of the CPU time (often quoted as 95%).
  • Feature importance: the top features dominate the model signal.
  • Customer revenue: a tiny number of enterprise accounts generates most of the revenue.
  • Test coverage: about 80% of bugs sit in 20% of code paths.

Applications

  1. Backlog prioritization (impact × ease).
  2. Performance profiling (optimize hot path first).
  3. Feature engineering (drop low-importance features).
  4. Customer success (focus on high-value accounts).
  5. Bug fixing (top crash signatures first).
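
The "impact × ease" ranking from the first item can be sketched as follows. The 1 to 10 scales, the `Item` class, and the backlog entries are assumptions made for illustration; any consistent scoring scheme would work the same way.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    impact: float  # estimated benefit, e.g. 1 (low) to 10 (high)
    ease: float    # inverse of effort, e.g. 1 (hard) to 10 (easy)

    @property
    def score(self) -> float:
        return self.impact * self.ease

def prioritize(items):
    """Sort backlog items by impact x ease, highest score first."""
    return sorted(items, key=lambda it: it.score, reverse=True)

# Hypothetical backlog; names and scores are made up for illustration.
backlog = [
    Item("Fix checkout crash", impact=9, ease=6),
    Item("Redesign settings page", impact=4, ease=3),
    Item("Cache product images", impact=7, ease=8),
    Item("Rename internal module", impact=2, ease=9),
]
for item in prioritize(backlog):
    print(f"{item.score:>5.0f}  {item.name}")
```

Multiplying the two factors favors items that are both valuable and cheap; an easy but low-impact item such as the rename still ranks below the crash fix.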

💻 Patterns

Pareto chart for bug triage

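import pandas as pd
import matplotlib.pyplot as plt

# Example counts per bug, sorted so the largest contributors come first
df = pd.DataFrame({"bug": ["A","B","C","D","E","F"],
                   "occurrences": [500, 300, 100, 50, 30, 20]})
df = df.sort_values("occurrences", ascending=False)
df["cum_pct"] = df["occurrences"].cumsum() / df["occurrences"].sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(df["bug"], df["occurrences"])           # bars: raw counts
ax1.set_ylabel("occurrences")
ax2 = ax1.twinx()                               # second y-axis for the cumulative line
ax2.plot(df["bug"], df["cum_pct"], "r-o")
ax2.set_ylabel("cumulative %")
ax2.set_ylim(0, 105)
ax2.axhline(80, color="gray", linestyle="--")   # 80% reference line
plt.show()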

Find the "vital 20%"

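def vital_few(values, threshold=0.8):
    """Return (count, items) for the smallest top-ranked set reaching `threshold` of the total."""
    sorted_vals = sorted(values, reverse=True)
    total = sum(sorted_vals)
    if total <= 0:  # empty or all-zero input: nothing to rank
        return 0, []
    cumsum = 0
    for i, v in enumerate(sorted_vals, 1):
        cumsum += v
        if cumsum / total >= threshold:
            return i, sorted_vals[:i]
    return len(sorted_vals), sorted_vals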

Feature importance pruning

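import pandas as pd
import xgboost as xgb

# X: pandas DataFrame of features, y: labels (both assumed to be defined upstream)
model = xgb.XGBClassifier().fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
cumulative = importance.cumsum() / importance.sum()
# Keep every feature up to and including the one that crosses 80%;
# a plain `cumulative <= 0.8` would drop that feature and can return nothing at all.
top_features = importance[cumulative.shift(fill_value=0) < 0.8].index  # vital few
print(f"{len(top_features)} of {len(X.columns)} features carry 80% importance")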

Revenue concentration analysis

import pandas as pd

# customers.csv is expected to contain a numeric "revenue" column
customers = pd.read_csv("customers.csv").sort_values("revenue", ascending=False)
customers["cum_revenue_pct"] = customers["revenue"].cumsum() / customers["revenue"].sum()
n_top = max(1, int(len(customers) * 0.2))  # at least one customer, even for tiny files
top_20 = customers.head(n_top)
print(f"Top 20% generate {top_20['cum_revenue_pct'].iloc[-1]:.0%} of revenue")

Profiling hot path (Python)

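import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()  # placeholder for the code path being profiled
profiler.disable()

# "cumulative" includes time spent in callees; sort by "tottime" to rank by self time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)  # the 20 most expensive entries usually cover most of the runtime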

LLM cost: top prompt templates

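# Track token spend per prompt template
from collections import Counter

spend = Counter()
for log in logs:  # logs: iterable of dicts with prompt_template, tokens, cost_per_token
    spend[log["prompt_template"]] += log["tokens"] * log["cost_per_token"]

total = sum(spend.values())
running = 0.0
for template, cost in spend.most_common():
    running += cost
    share = running / total if total else 1.0  # avoid dividing by zero spend
    print(f"{template}: ${cost:.2f}, cumulative {share:.0%}")
    if share >= 0.8:  # stop once the top templates cover 80% of spend
        break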

Decision criteria

| Situation | Approach |
| --- | --- |
| Backlog overload | Pareto-rank by impact, ship top 20% |
| Slow application | Profile, fix the hot path first |
| Too many features | Importance-based pruning |
| Customer support | Tier by revenue, allocate AE coverage |
| Long bug list | Triage by frequency × severity |
| Compliance / safety | Pareto NOT applicable (100% required) |

Default: sort by impact and take items from the top until the cumulative share reaches 80%.
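
The default rule and the compliance exception can be combined in one selection step. The sketch below is one possible reading, not a prescribed procedure: items flagged as mandatory are always kept, and the rest are ranked by frequency × severity and taken until they cover the threshold. The dict keys and the sample bugs are hypothetical.

```python
def select_work(items, threshold=0.8):
    """Pick mandatory items plus top-scoring items up to `threshold` of total score.

    Each item is a dict with hypothetical keys: name, frequency, severity, mandatory.
    """
    def score(it):
        return it["frequency"] * it["severity"]

    mandatory = [it for it in items if it["mandatory"]]
    optional = sorted((it for it in items if not it["mandatory"]), key=score, reverse=True)
    total = sum(score(it) for it in optional)
    picked, running = [], 0.0
    for it in optional:
        # Stop once the items already picked cover the threshold.
        if total <= 0 or running / total >= threshold:
            break
        picked.append(it)
        running += score(it)
    return mandatory + picked

bugs = [
    {"name": "login timeout", "frequency": 400, "severity": 3, "mandatory": False},
    {"name": "typo in footer", "frequency": 50, "severity": 1, "mandatory": False},
    {"name": "PII in logs", "frequency": 2, "severity": 5, "mandatory": True},
    {"name": "slow search", "frequency": 120, "severity": 2, "mandatory": False},
    {"name": "export fails", "frequency": 30, "severity": 4, "mandatory": False},
]
for bug in select_work(bugs):
    print(bug["name"])
```

The compliance item is selected even though its score is tiny, which is the point of keeping such items outside the Pareto cut.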

🔗 Graph

🤖 LLM usage

When to use: backlog prioritization, optimization scoping, feature selection, customer segmentation.
When not to use: safety-critical or compliance work, where the long tail cannot be ignored.

Anti-patterns

  • Treating 80/20 literally: the actual ratio varies, so measure instead of assuming.
  • Ignoring the long tail entirely: some long-tail items are high-leverage (a zero-day, a churn-risk customer).
  • Cause/effect confusion: "20% of features account for 80% of accuracy" does not mean keeping only those features is safe, because feature interactions matter.
  • Static analysis: Pareto rankings shift over time, so recompute them regularly (weekly, for example).
  • Pareto in safety domains: medical, finance, and security work requires 100% coverage.
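
The "static analysis" anti-pattern can be checked by recomputing the vital few per period and comparing the results. This is a minimal sketch, assuming events arrive as (period, cause) pairs; the week labels and cause names are made up for illustration.

```python
from collections import Counter, defaultdict

def vital_few_by_period(events, threshold=0.8):
    """Map each period to the set of causes covering `threshold` of that period's events."""
    counts = defaultdict(Counter)
    for period, cause in events:
        counts[period][cause] += 1
    result = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        running, vital = 0, set()
        for cause, n in counter.most_common():
            vital.add(cause)
            running += n
            if running / total >= threshold:
                break
        result[period] = vital
    return result

# Hypothetical event log: cause A dominates one week, cause C the next.
events = (
    [("2026-W18", "A")] * 9 + [("2026-W18", "B")] * 1
    + [("2026-W19", "C")] * 7 + [("2026-W19", "A")] * 2 + [("2026-W19", "D")] * 1
)
weekly = vital_few_by_period(events)
for week in sorted(weekly):
    print(week, sorted(weekly[week]))
```

When the sets differ between periods, as they do in this sample, a ranking computed once would have pointed at the wrong causes a week later.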

🧪 Verification / duplicates

  • Verified against Pareto (1896), Cours d'économie politique, and Juran (1951), Quality Control Handbook.
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Pareto applications, charts, anti-patterns |