Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
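Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections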

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-long-tail
title: Long Tail
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Long-Tail Distribution, Power Law, Pareto, Heavy Tail]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [distribution, power-law, pareto, imbalance, recommendation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: Python, framework: numpy/scipy/pandas }
---
# Long Tail
## One-liner
> **"Long tail = a few head items + countless tail items."** Closer to 50/50 than 80/20: the tail's combined total can rival the head's.
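## Core
### Distributions
- **Power law**: density P(x) ∝ x^(-α) for x ≥ x_min. For α ∈ (2, 3) the mean is finite but the variance is infinite.
- **Pareto**: P(X > x) = (x_m/x)^α for x ≥ x_m. Here α is the tail (CCDF) exponent, so the density falls off as x^(-(α+1)). Typical examples: wealth distribution, city populations.
- **Zipf**: rank × frequency ≈ const, i.e. frequency ∝ 1/rank. Typical examples: word frequencies, web page popularity.
- **Lognormal**: log(X) ~ Normal. Heavy-tailed, but every moment is finite, so the tail is lighter than a power law's.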
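
To make the head/tail split concrete, the sketch below computes cumulative shares under an idealized Zipf law (frequency ∝ 1/rank^s). The catalog size `n_items` and exponent `s = 1` are illustrative assumptions, not measurements from any real catalog.

```python
from itertools import accumulate


def zipf_shares(n_items: int, s: float = 1.0) -> list:
    """Cumulative share of total frequency by rank, for frequency ∝ 1 / rank**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [c / total for c in accumulate(weights)]


n_items = 10_000                 # hypothetical catalog size
cum = zipf_shares(n_items)

# Share of total frequency captured by the top 20% of items.
head_share = cum[n_items // 5 - 1]

# Smallest number of top-ranked items whose combined share reaches 50%.
k_half = next(k for k, c in enumerate(cum, start=1) if c >= 0.5)

print(f"top 20% of items hold {head_share:.1%} of the mass")
print(f"top {k_half} items hold half; the remaining {n_items - k_half} hold the other half")
```

Under these assumptions a very small head carries half of the mass and a very large number of tail items carries the other half, which is the "50/50" picture in the one-liner.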
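### Business (Anderson 2006)
- Lower digital distribution costs make tail items profitable too. Examples: Amazon, Netflix.
- Head: bestsellers. Tail: niche items. Anderson's claim is that the tail, summed, can rival or exceed the head.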
### ML problems
- **Long-tail classification**: head classes have abundant samples, tail classes are sparse (iNaturalist, ImageNet-LT).
- **Cold-start / recommendation**: tail items have too few interactions.
- **Search/IR**: tail (rare) queries account for a large share of all queries, by some estimates 50% or more.
### Mitigation strategies
1. **Re-sampling**: oversample the tail, undersample the head
2. **Re-weighting**: class-balanced loss (Cui 2019), focal loss
3. **Decoupling** (Kang 2020): learn the representation with instance-balanced sampling, then retrain the classifier with class-balanced sampling
4. **Logit adjustment**: correct the logits by the log class prior
5. **Two-stage**: pretrain on the head → fine-tune on the tail
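
The re-sampling and decoupling strategies above differ mainly in how often each class is drawn. Kang et al. (2020) describe this as a family of sampling probabilities p_j = n_j^q / Σ_i n_i^q, where q = 1 is instance-balanced, q = 0 is class-balanced, and q = 0.5 is square-root sampling. A minimal sketch, with hypothetical class sizes:

```python
def sampling_probs(class_counts, q):
    """Probability of drawing each class: p_j = n_j**q / sum_i n_i**q.

    q = 1.0 -> instance-balanced, q = 0.0 -> class-balanced,
    q = 0.5 -> square-root sampling.
    """
    powered = [n ** q for n in class_counts]
    total = sum(powered)
    return [p / total for p in powered]


class_counts = [5000, 500, 50, 5]   # hypothetical long-tailed class sizes

for q in (1.0, 0.5, 0.0):
    probs = sampling_probs(class_counts, q)
    print(f"q={q}: {[round(p, 3) for p in probs]}")
```

In the decoupled recipe, the backbone would be trained with q = 1 and only the classifier retrained with q = 0.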
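## 💻 Patterns
### Power law fit (powerlaw package)
```python
import powerlaw

data = [...]  # 1-D sequence of positive observations
fit = powerlaw.Fit(data)  # estimates xmin, then fits alpha by maximum likelihood above it
print(fit.power_law.alpha, fit.power_law.xmin)
# Log-likelihood ratio test: R > 0 favors the power law, R < 0 the lognormal;
# p is the significance of R.
R, p = fit.distribution_compare("power_law", "lognormal")
```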
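
As a sanity check on what such a fit estimates, the sketch below draws synthetic power-law samples with a known exponent and recovers it with the closed-form maximum-likelihood estimator for the continuous case with known x_min. It uses only the standard library; `alpha_true`, `xmin`, and the sample size are illustrative assumptions, and it does not estimate x_min from the data.

```python
import math
import random

rng = random.Random(0)                    # fixed seed for reproducibility
alpha_true, xmin, n = 2.5, 1.0, 50_000    # hypothetical values

# Inverse-transform sampling: with U uniform on (0, 1],
# x = xmin * U ** (-1 / (alpha - 1)) has density ∝ x ** -alpha for x >= xmin.
samples = [
    xmin * (1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0))
    for _ in range(n)
]

# Continuous MLE with known xmin: alpha_hat = 1 + n / sum(ln(x_i / xmin)).
alpha_hat = 1.0 + n / sum(math.log(x / xmin) for x in samples)
std_err = (alpha_hat - 1.0) / math.sqrt(n)   # large-sample standard error

print(f"alpha_hat = {alpha_hat:.3f} +/- {std_err:.3f}")
```

Here α is the density exponent, the same convention as P(x) ∝ x^(-α), not the Pareto CCDF exponent.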
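### Class imbalance diagnosis
```python
import pandas as pd

# df is assumed to be a DataFrame with one row per sample and a "label" column.
counts = df["label"].value_counts()            # sorted by count, descending
imbalance = counts.iloc[0] / counts.iloc[-1]   # largest class / smallest class
# tail = labels whose count is below the median class count
tail = counts[counts < counts.median()].index
```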
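### Class-balanced loss (Cui 2019)
```python
import numpy as np
import torch
import torch.nn.functional as F

# counts[c] = training samples in class c, ordered by class index so the
# weights line up with the logit columns (value_counts() orders by frequency).
counts = np.asarray(counts, dtype=np.float64)
# effective number of samples: (1 - β^n) / (1 - β)
beta = 0.999
eff_num = (1.0 - beta**counts) / (1.0 - beta)
weights = 1.0 / eff_num
weights = weights / weights.sum() * len(weights)  # rescale to sum to the number of classes
weight_t = torch.as_tensor(weights, dtype=torch.float32, device=logits.device)
loss = F.cross_entropy(logits, y, weight=weight_t)
```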
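
The re-weighting strategy also lists focal loss. The sketch below is a plain-Python illustration of the multiclass softmax form, FL = -(1 - p_t)^γ · log(p_t), without the optional α class-weighting term; it is meant for checking numbers by hand, not as a training implementation. The example logits and targets are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(batch_logits, targets, gamma=2.0):
    """Mean of -(1 - p_t)**gamma * log(p_t); gamma = 0 reduces to cross-entropy."""
    losses = []
    for logits, t in zip(batch_logits, targets):
        p_t = softmax(logits)[t]          # probability assigned to the true class
        losses.append(-((1.0 - p_t) ** gamma) * math.log(p_t))
    return sum(losses) / len(losses)


# One easy sample (confidently correct) and one hard sample (true class scored low).
batch_logits = [[4.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
targets = [0, 0]

ce = focal_loss(batch_logits, targets, gamma=0.0)   # plain cross-entropy
fl = focal_loss(batch_logits, targets, gamma=2.0)   # easy sample is down-weighted
print(f"cross-entropy = {ce:.4f}, focal (gamma=2) = {fl:.4f}")
```

The factor (1 - p_t)^γ shrinks the contribution of well-classified samples, which in a long-tailed dataset are mostly head-class samples.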
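### Logit adjustment
```python
import torch

# Menon 2021 (post-hoc variant): subtract the scaled log prior at inference.
# class_freq: per-class training counts, ordered by class index.
tau = 1.0  # scaling hyperparameter; a starting value, tune on validation data
class_freq = torch.as_tensor(class_freq, dtype=torch.float32)
log_prior = torch.log(class_freq / class_freq.sum())
adjusted_logits = logits - tau * log_prior
pred = adjusted_logits.argmax(-1)
```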
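### Resampling sampler
```python
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each sample by the inverse of its class count, so every class is
# drawn about equally often.
counts = df["label"].value_counts()
sample_weights = (1.0 / df["label"].map(counts)).to_numpy()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(df), replacement=True)
# ds: a Dataset aligned row-for-row with df
loader = DataLoader(ds, batch_size=64, sampler=sampler)
```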
### Recommendation: tail boost
```python
# Popularity debiasing: divide the score by item_popularity ** gamma.
# gamma = 0 leaves scores unchanged; a larger gamma boosts tail items more.
gamma = 0.5
score_debiased = score / (item_popularity ** gamma)
```
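
A small worked example of the same idea, showing how the ranking changes with `gamma`. The item names, scores, popularity counts, and the additive smoothing term `eps` (to avoid dividing by zero for unseen items) are all hypothetical, and the division assumes non-negative scores.

```python
def debiased_ranking(scores, popularity, gamma=0.5, eps=1.0):
    """Rank items by score / (popularity + eps) ** gamma, best first."""
    adjusted = {
        item: s / ((popularity[item] + eps) ** gamma)
        for item, s in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


scores = {"blockbuster": 0.90, "mid": 0.60, "niche": 0.30}      # raw model scores
popularity = {"blockbuster": 10_000, "mid": 400, "niche": 9}    # interaction counts

print(debiased_ranking(scores, popularity, gamma=0.0))  # ranking by raw score
print(debiased_ranking(scores, popularity, gamma=0.5))  # ranking after debiasing
```

With `gamma = 0` the order follows the raw scores; with `gamma = 0.5` the popularity penalty is strong enough here to reverse it. In practice `gamma` is a knob to tune against a relevance metric, since too large a value surfaces irrelevant tail items.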
## Decision criteria
| Situation | Approach |
|---|---|
| Mild imbalance (10:1) | class weights, focal loss |
| Severe imbalance (100:1+) | class-balanced loss, decoupling |
| Recommendation cold-start | content features, popularity debias |
| Sales / inventory | Pareto 80/20 → ABC analysis |
| Rare search queries | semantic retrieval, query expansion |
**Default**: class-balanced CE → if that falls short, decoupling.
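
For the sales / inventory row, ABC analysis ranks items by contribution and cuts the cumulative share into classes. The sketch below assumes the conventional 80% / 95% cut points and made-up SKU figures; both are assumptions to adjust per business.

```python
def abc_classes(sales, a_cut=0.80, b_cut=0.95):
    """Assign each item to class A, B, or C by cumulative share of total sales."""
    total = sum(sales.values())
    ranked = sorted(sales.items(), key=lambda kv: kv[1], reverse=True)
    classes, running = {}, 0.0
    for item, amount in ranked:
        # Classify by the cumulative share *before* this item, so the item
        # that crosses a cut point stays in the higher class.
        share_before = running / total
        if share_before < a_cut:
            classes[item] = "A"
        elif share_before < b_cut:
            classes[item] = "B"
        else:
            classes[item] = "C"
        running += amount
    return classes


# Hypothetical sales per SKU.
sales = {"sku1": 450, "sku2": 300, "sku3": 120, "sku4": 70, "sku5": 40, "sku6": 20}
print(abc_classes(sales))
```

Class C is the long tail of the catalog: many items, small individual contribution.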
## 🔗 Graph
- Parent: [[Class-Imbalance]]
- Variants: [[Power-Law]], [[Pareto-Distribution]]
- Applications: [[Recommendation-Systems]], [[Search-Ranking]]
- Adjacent: [[Focal-Loss]], [[Sampling-Strategies]]
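## 🤖 LLM usage
**When**: imbalance diagnosis, guidance on choosing a loss or sampler, business cases.
**When not**: domain-specific tail definitions (regulatory or revenue thresholds) belong to domain experts.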
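## ❌ Anti-patterns
- Reducing long-tail to imbalance (distribution shape vs. class counts)
- Measuring only accuracy while ignoring the tail (biased toward the head)
- Relying on oversampling alone (overfits)
- Confusing Pareto 80/20 with the long tail (they differ in degree)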
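
The accuracy anti-pattern above is easy to reproduce. In this sketch a degenerate classifier that always predicts the head class scores high on plain accuracy while its per-class recall exposes the failure; the 95/5 split and label names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


y_true = ["head"] * 95 + ["tail"] * 5
y_pred = ["head"] * 100            # always predicts the head class

print("accuracy:", accuracy(y_true, y_pred))
print("per-class recall:", per_class_recall(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

Reporting per-class recall, or metrics split into head and tail groups, keeps tail failures visible.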
## 🧪 Verification / duplicates
- Verified (Anderson "The Long Tail", Cui 2019, Kang 2020 decoupling). Trust level A.
- Duplicates: none.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: 매 prefix, added ML imbalance strategies |