2nd/10_Wiki/Topics/AI_and_ML/Long-Tail.md

---
id: wiki-2026-0508-long-tail
title: Long Tail
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Long-Tail Distribution, Power Law, Pareto, Heavy Tail]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [distribution, power-law, pareto, imbalance, recommendation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: Python, framework: numpy/scipy/pandas }
---

# Long Tail

## In One Line
> **"Long tail = a small head + a very large number of tail items."** The split is closer to 50/50 than 80/20: added together, the tail can weigh about as much as the head.

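## Core Concepts
### Distributions
- **Power law**: P(x) ∝ x^(-α), where α is the density exponent. For α ∈ (2, 3) the mean is finite but the variance is infinite.
- **Pareto**: P(X > x) = (x_m/x)^α for x ≥ x_m. Here α is the tail (CCDF) exponent, so the density decays as x^(-(α+1)). Typical examples: wealth, city populations.
- **Zipf**: rank · frequency ≈ const. Typical examples: word frequencies, web page popularity.
- **Lognormal**: log(X) ~ Normal. Heavy-tailed, but not a power law: all of its moments are finite.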
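
As a quick sanity check on the Pareto tail formula, the sketch below draws samples by inverse-transform sampling and compares the empirical tail probability with (x_m/x)^α. The function names, the seed, and the parameter values (α = 2.5, x_m = 1) are illustrative assumptions, not part of the original note.

```python
import numpy as np

def sample_pareto(alpha, x_m, size, rng):
    """Inverse-transform sampling: solve u = 1 - (x_m / x)**alpha for x."""
    u = rng.random(size)  # uniform on [0, 1), so 1 - u is in (0, 1]
    return x_m * (1.0 - u) ** (-1.0 / alpha)

def empirical_ccdf(samples, x):
    """Fraction of samples strictly greater than x."""
    return float(np.mean(samples > x))

rng = np.random.default_rng(0)          # fixed seed for reproducibility
alpha, x_m = 2.5, 1.0                   # hypothetical tail exponent and scale
samples = sample_pareto(alpha, x_m, 200_000, rng)

theory = (x_m / 2.0) ** alpha           # P(X > 2) from the Pareto CCDF
print(empirical_ccdf(samples, 2.0), theory)
```

The two printed numbers should agree closely; the gap shrinks as the sample size grows.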

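### Business (Anderson 2006)
- Lower digital distribution costs make tail items profitable too (Amazon, Netflix).
- Head: bestsellers. Tail: niche items. Anderson's claim is that the tail, taken together, can outweigh the head.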
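
To make the head-versus-tail argument concrete, here is a minimal sketch that assumes demand follows an idealized Zipf law (demand for rank r proportional to r^(-s)) and a fixed "shelf" of 100 head items. Both the Zipf assumption and the numbers are illustrative, not data from Anderson.

```python
import numpy as np

def head_share(n_items, s, n_head):
    """Share of total demand captured by the top `n_head` items
    when demand for rank r is proportional to r**(-s)."""
    ranks = np.arange(1, n_items + 1, dtype=float)
    demand = ranks ** (-s)
    return float(demand[:n_head].sum() / demand.sum())

# Keep the head fixed at 100 items and let the catalog grow.
for n_items in (1_000, 100_000, 1_000_000):
    share = head_share(n_items, s=1.0, n_head=100)
    print(f"catalog={n_items:>9,}  head share={share:.3f}  tail share={1 - share:.3f}")
```

Under these assumptions the head's share falls as the catalog grows, which is the mechanism behind the claim that a large enough tail can rival the head.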

### ML Problems
- **Long-tail classification**: head classes have abundant samples, tail classes have few (iNaturalist, ImageNet-LT).
- **Cold-start / recommendation**: tail items have too few interactions.
- **Search/IR**: tail (rare) queries make up a large share of all queries, by some estimates 50% or more.

### Mitigation Strategies
1. **Re-sampling**: oversample tail classes, undersample head classes
2. **Re-weighting**: class-balanced loss (Cui 2019), focal loss
3. **Decoupling** (Kang 2020): learn the representation with instance-balanced sampling, then train the classifier with class-balanced sampling
4. **Logit adjustment**: correct the logits by the log class prior
5. **Two-stage**: pretrain on head → fine-tune on tail
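
Focal loss is named in the list above but not shown elsewhere in this note. The following is a minimal NumPy sketch of the multi-class form, FL = -(1 - p_t)^γ · log(p_t), without the optional per-class α weighting; the helper names and example logits are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def focal_loss(logits, targets, gamma=2.0):
    """Mean focal loss; gamma = 0 reduces to plain cross-entropy."""
    p = softmax(logits)
    p_t = p[np.arange(len(targets)), targets]   # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

logits = np.array([[4.0, 0.0, 0.0],    # easy example: confident and correct
                   [0.2, 0.0, 0.1]])   # hard example: nearly uniform
targets = np.array([0, 2])

print("cross-entropy:", focal_loss(logits, targets, gamma=0.0))
print("focal (γ=2): ", focal_loss(logits, targets, gamma=2.0))
```

The (1 - p_t)^γ factor shrinks the contribution of well-classified examples, so the hard example dominates the loss more than it does under cross-entropy.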

## 💻 Patterns
### Power law fit (`powerlaw` package)
```python
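import powerlaw

data = [...]  # placeholder: 1-D sequence of positive observations
fit = powerlaw.Fit(data)    # estimates xmin, then fits the tail x >= xmin
print(fit.alpha, fit.xmin)  # alpha is the exponent of the fitted density
# R > 0 favors the first distribution (power law); p is the significance of R
R, p = fit.distribution_compare("power_law", "lognormal")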
```

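### Diagnosing class imbalance
```python
import pandas as pd

# df: a DataFrame with a "label" column (assumed to exist)
counts = df["label"].value_counts()           # sorted by count, descending
imbalance = counts.iloc[0] / counts.iloc[-1]  # most frequent / least frequent
# tail = labels with fewer samples than the median class
tail = counts[counts < counts.median()].index
```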

### Class-balanced loss (Cui 2019)
```python
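import torch
import torch.nn.functional as F

# Effective number of samples for a class with n samples: (1 - β^n) / (1 - β)
beta = 0.999
# value_counts() sorts by frequency; reorder so position i holds class i's count
# (assumes integer labels 0..C-1, each present at least once)
n_per_class = counts.sort_index().to_numpy()
eff_num = (1 - beta**n_per_class) / (1 - beta)
weights = 1.0 / eff_num
weights = weights / weights.sum() * len(weights)  # normalize so the weights sum to C
loss = F.cross_entropy(logits, y, weight=torch.as_tensor(weights, dtype=torch.float32))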
```

### Logit adjustment
```python
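# Menon 2021 (post-hoc variant): subtract the scaled log prior at inference
# class_freq: per-class training counts as a NumPy array, ordered by class index
tau = 1.0  # adjustment strength; tau = 0 leaves the logits unchanged
log_prior = torch.log(torch.tensor(class_freq / class_freq.sum()))
adjusted_logits = logits - tau * log_prior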
pred = adjusted_logits.argmax(-1)
```
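
To see the effect of the adjustment on a single prediction, here is a small NumPy sketch with a made-up two-class problem (950 head samples, 50 tail samples) and made-up logits. All values and the `adjust_logits` helper are hypothetical.

```python
import numpy as np

def adjust_logits(logits, class_counts, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) per class."""
    prior = class_counts / class_counts.sum()
    return logits - tau * np.log(prior)

class_counts = np.array([950, 50])   # hypothetical head / tail training counts
logits = np.array([[2.0, 1.5]])      # the model slightly prefers the head class

raw_pred = logits.argmax(-1)
adj_pred = adjust_logits(logits, class_counts, tau=1.0).argmax(-1)
print("raw:", raw_pred, "adjusted:", adj_pred)
```

Because log(prior) is far more negative for the tail class, subtracting it lifts the tail logit enough to flip this borderline prediction; a smaller `tau` gives a gentler correction.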

### Resampling sampler
```python
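from torch.utils.data import DataLoader, WeightedRandomSampler

# Per-sample weight = 1 / (size of that sample's class), so classes are drawn roughly uniformly
sample_weights = 1.0 / df["label"].map(counts).to_numpy()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(df), replacement=True)
# ds: a Dataset aligned row-for-row with df (assumed); do not also pass shuffle=True
loader = DataLoader(ds, batch_size=64, sampler=sampler)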
```

### Recommendation: tail boost
```python
# Popularity debiasing: divide each score by item_popularity ** gamma
gamma = 0.5  # 0 = no debiasing; larger values boost tail items more
score_debiased = score / (item_popularity ** gamma)
```
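
A small worked example of the same idea, with three made-up items, shows how the ranking changes. The scores, popularity counts, and the `debias` helper are hypothetical, and the right `gamma` is something to tune per system.

```python
import numpy as np

def debias(score, item_popularity, gamma=0.5):
    """Divide each relevance score by popularity ** gamma."""
    return score / (item_popularity ** gamma)

score = np.array([0.9, 0.8, 0.5])                  # hypothetical model scores
item_popularity = np.array([10_000.0, 100.0, 4.0])  # hypothetical interaction counts

rank_raw = np.argsort(-score)                                       # best first
rank_debiased = np.argsort(-debias(score, item_popularity, gamma=0.5))
print("raw ranking:     ", rank_raw)
print("debiased ranking:", rank_debiased)
```

With `gamma = 0.5` the least popular item moves to the top here, which illustrates the trade-off: stronger debiasing surfaces more of the tail at the cost of raw relevance.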

## Decision Criteria
| Situation | Approach |
|---|---|
| Mild imbalance (10:1) | class weights, focal loss |
| Severe imbalance (100:1+) | class-balanced loss, decoupling |
| Recommendation cold-start | content features, popularity debiasing |
| Sales / inventory | Pareto 80/20 → ABC analysis |
| Rare search queries | semantic retrieval, query expansion |

**Default**: class-balanced CE; if that is not enough, move to decoupling.

## 🔗 Graph
- Parent: [[Class-Imbalance]]
- Variants: [[Power-Law]], [[Pareto-Distribution]]
- Applications: [[Recommendation-Systems]], [[Search-Ranking]]
- Adjacent: [[Focal-Loss]], [[Sampling-Strategies]]

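## 🤖 LLM Usage
**When to use**: diagnosing imbalance, guidance on choosing a loss or sampler, business examples.
**When not to**: domain-specific definitions of the tail (regulatory or revenue thresholds) belong to domain experts.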

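## ❌ Anti-patterns
- Reducing long tail to "imbalance" (one is a distribution shape, the other a class-count ratio)
- Reporting only accuracy and ignoring the tail (the metric is dominated by head classes)
- Relying on oversampling alone (overfits the repeated tail samples)
- Conflating Pareto 80/20 with the long tail (they differ in degree)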
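
The accuracy anti-pattern is easy to demonstrate. The sketch below uses a made-up 95:5 dataset and a degenerate model that always predicts the head class, then compares plain accuracy with per-class recall and its mean (balanced accuracy). The data and the helper name are illustrative assumptions.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Recall for each class; NaN for classes absent from y_true."""
    recalls = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recalls[c] = np.mean(y_pred[mask] == c)
    return recalls

y_true = np.array([0] * 95 + [1] * 5)   # 95 head samples, 5 tail samples
y_pred = np.zeros_like(y_true)          # degenerate model: always predicts head

accuracy = float(np.mean(y_pred == y_true))
recalls = per_class_recall(y_true, y_pred, n_classes=2)
balanced_accuracy = float(np.nanmean(recalls))
print(accuracy, recalls, balanced_accuracy)
```

Accuracy looks strong while the tail class is never predicted correctly, which is why per-class or balanced metrics are worth reporting alongside it.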

## 🧪 Verification / Duplicates
- Verified (Anderson "The Long Tail", Cui 2019, Kang 2020 decoupling). Trust level A.
- Duplicates: none.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: "매" prefix, added ML imbalance strategies |