---
id: wiki-2026-0508-long-tail
title: Long Tail
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Long-Tail Distribution, Power Law, Pareto, Heavy Tail]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [distribution, power-law, pareto, imbalance, recommendation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: Python, framework: numpy/scipy/pandas }
---

# Long Tail

## In One Line
> **"Long tail = a few head items + countless tail items."** Not 80/20 but closer to 50/50: the combined tail is about as large as the head.

## Core Concepts

### Distributions
- **Power law**: density P(x) ∝ x^(-α). For α ∈ (2, 3) the mean is finite but the variance is infinite.
- **Pareto**: P(X > x) = (x_m/x)^α for x ≥ x_m. Here α is the tail (CCDF) exponent, so the matching density exponent is α + 1. Examples: wealth distribution, city populations.
- **Zipf**: rank · frequency ≈ const. Examples: word frequencies, web page popularity.
- **Lognormal**: log(X) ~ Normal. Heavy-tailed, but the tail eventually decays faster than a power law.

### Business (Anderson 2006)
- Lower digital distribution costs make tail items profitable too. Examples: Amazon, Netflix.
- Head: bestsellers. Tail: niche items. Taken together, the tail can rival or exceed the head.

### ML Problems
- **Long-tail classification**: head classes are abundant, tail classes are sparse (iNaturalist, ImageNet-LT).
- **Cold-start / recommendation**: tail items have too few interactions.
- **Search/IR**: tail queries (rare queries) can account for 50%+ of all queries.

### Mitigation Strategies
1. **Re-sampling**: oversample the tail, undersample the head
2. **Re-weighting**: class-balanced loss (Cui 2019), focal loss
3. **Decoupling** (Kang 2020): learn the representation with instance-balanced sampling, the classifier with class-balanced sampling
4. **Logit adjustment**: correct logits by the log prior
5. **Two-stage**: pretrain on the head → fine-tune on the tail

## 💻 Patterns

### Power law fit (powerlaw package)
```python
import powerlaw

data = [...]  # positive observations (placeholder)
fit = powerlaw.Fit(data)
print(fit.alpha, fit.xmin)  # fitted exponent and lower cut-off
# R > 0 favors the power law, R < 0 favors the lognormal; p is the significance of R
R, p = fit.distribution_compare("power_law", "lognormal")
```

### Class imbalance diagnosis
```python
import pandas as pd

# df: DataFrame with a "label" column
counts = df["label"].value_counts()            # sorted by count, descending
imbalance = counts.iloc[0] / counts.iloc[-1]   # largest class / smallest class
# Tail = labels with fewer samples than the median class count
tail = counts[counts < counts.median()].index
```

### Class-balanced loss (Cui 2019)
```python
import torch
import torch.nn.functional as F

# Sort by label so position i matches class index i
# (assumes integer labels 0..C-1; value_counts() orders by count instead).
n = counts.sort_index().to_numpy()

# Effective number of samples per class: (1 - β^n) / (1 - β)
beta = 0.999
eff_num = (1 - beta**n) / (1 - beta)
weights = 1.0 / eff_num
weights = weights / weights.sum() * len(weights)  # normalize so weights sum to C
loss = F.cross_entropy(logits, y, weight=torch.tensor(weights, dtype=torch.float32))
```

### Logit adjustment
```python
import torch

# Menon 2021: subtract the scaled log prior at inference time.
# class_freq: NumPy array of per-class training counts.
tau = 1.0  # scaling hyperparameter; 1.0 is used here as a starting point
log_prior = torch.log(torch.tensor(class_freq / class_freq.sum(), dtype=torch.float32))
adjusted_logits = logits - tau * log_prior
pred = adjusted_logits.argmax(-1)
```

### Resampling sampler
```python
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each sample by the inverse frequency of its class.
sample_weights = (1.0 / df["label"].map(counts)).to_numpy()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(df), replacement=True)
loader = DataLoader(ds, batch_size=64, sampler=sampler)  # ds: Dataset aligned with df rows
```

### Recommendation: tail boost
```python
# Popularity-debiased score: divide by item popularity^gamma
gamma = 0.5
score_debiased = score / (item_popularity ** gamma)
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Mild imbalance (10:1) | class weights, focal loss |
| Severe imbalance (100:1+) | class-balanced loss, decoupling |
| Recommendation cold-start | content features, popularity debiasing |
| Sales / inventory | Pareto 80/20 → ABC analysis |
| Rare search queries | semantic retrieval, query expansion |

**Default**: class-balanced CE → if that is not enough, decoupling.

## 🔗 Graph
- Parent: [[Class-Imbalance]]
- Variants: [[Power-Law]], [[Pareto-Distribution]]
- Applications: [[Recommendation-Systems]], [[Search-Ranking]]
- Adjacent: [[Focal-Loss]], [[Sampling-Strategies]]

## 🤖 LLM Usage
**When to use**: imbalance diagnosis, guidance on choosing a loss or sampler, business examples.

**When not to use**: domain-specific tail definitions (regulatory or revenue thresholds) belong to domain experts.
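
The imbalance thresholds in the decision table above can be turned into a small diagnostic helper. The sketch below is a minimal illustration, not a definitive rule: the function name `suggest_approach` is hypothetical, and treating every ratio below 100:1 as "mild" is an assumption that fills the gap between the table's 10:1 and 100:1+ rows.

```python
from collections import Counter


def suggest_approach(labels, severe_ratio=100.0):
    """Return (imbalance_ratio, suggestion) for an iterable of class labels.

    Hypothetical helper: the suggestions mirror the decision table, and the
    cut-off between "mild" and "severe" (severe_ratio) is an assumption.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    # Ratio of the largest class count to the smallest class count.
    ratio = max(counts.values()) / min(counts.values())
    if ratio >= severe_ratio:
        return ratio, "class-balanced loss, decoupling"
    return ratio, "class weights, focal loss"


# Toy label list (made up for illustration): 500 head, 50 mid, 5 tail samples.
labels = ["head"] * 500 + ["mid"] * 50 + ["tail"] * 5
ratio, suggestion = suggest_approach(labels)
print(ratio, suggestion)
```

The ratio only captures the two extreme classes, so it is best read alongside the full count distribution rather than as a replacement for it.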
## ❌ Anti-Patterns
- Reducing long tail to imbalance (distribution shape vs. class counts)
- Ignoring the tail and measuring only accuracy (biased toward the head)
- Relying on oversampling alone (overfits)
- Confusing Pareto 80/20 with the long tail (they differ in degree)

## 🧪 Verification / Duplicates
- Verified (Anderson "The Long Tail", Cui 2019, Kang 2020 decoupling). Trust level A.
- Duplicates: none.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: 매 prefix, added ML imbalance strategies |
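
One of the anti-patterns listed earlier is measuring only accuracy. A minimal sketch of why, using made-up toy labels (the data and the helper name `per_class_recall` are illustrative assumptions): a classifier that always predicts the head class still scores high accuracy while its tail recall is zero.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / true instances of that class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}


# Toy data: 9 head examples and 1 tail example; the model always predicts "head".
y_true = ["head"] * 9 + ["tail"]
y_pred = ["head"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)  # unweighted mean over classes
print(accuracy, recall, macro_recall)
```

Reporting per-class or macro-averaged recall next to accuracy makes head bias visible, since every class contributes equally to the average regardless of its size.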