

id: wiki-2026-0508-long-tail
title: Long Tail
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Long-Tail Distribution, Power Law, Pareto, Heavy Tail
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: distribution, power-law, pareto, imbalance, recommendation
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: numpy/scipy/pandas

Long Tail

One-line summary

"Long tail = a small head + a countless tail." The split is closer to 50/50 than 80/20: in aggregate, the tail is as large as the head.

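Key points

Distributions

  • Power law: density p(x) ∝ x^(-α). For α ∈ (2, 3) the mean is finite and the variance is infinite.
  • Pareto: P(X > x) = (x_m/x)^α for x ≥ x_m. Here α is the tail index, so the density exponent is α + 1. Examples: wealth distribution, city populations.
  • Zipf: rank · frequency ≈ const, i.e. frequency ∝ 1/rank. Examples: word frequencies, web page popularity.
  • Lognormal: log(X) ~ Normal. Heavy-tailed, but not a power law: all moments are finite.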
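
The moment conditions in the power-law bullet follow from a single integral. A short derivation, assuming a pure power-law density on [x_min, ∞) with normalizing constant C (the symbols x_min, C and k are introduced here for illustration only):

```latex
\mathbb{E}[X^k]
  = \int_{x_{\min}}^{\infty} x^{k}\, C x^{-\alpha}\, dx
  = C \int_{x_{\min}}^{\infty} x^{k-\alpha}\, dx
  \quad \text{is finite} \iff k - \alpha < -1 \iff \alpha > k + 1 .
```

The mean (k = 1) therefore needs α > 2 and the second moment (k = 2) needs α > 3, which is why α ∈ (2, 3) gives a finite mean with infinite variance. For the Pareto form, written with the exponent of P(X > x), the same thresholds become 1 and 2.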

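Business (Anderson 2006)

  • Lower digital distribution costs → tail items become profitable too. Examples: Amazon, Netflix.
  • Head: bestsellers. Tail: niche items. Summed up, the tail can rival or exceed the head.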
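
To make the head-versus-tail claim concrete, the sketch below computes how much demand the head captures in an idealized Zipf catalogue. The catalogue size, the head size and the exact 1/rank demand curve are assumptions chosen for illustration, not figures from Anderson.

```python
# Sketch: share of total demand captured by the head of a Zipf catalogue.
# Assumptions (not from the source): demand for the item at rank r is
# proportional to 1/r, the catalogue has 1,000,000 items, and the "head"
# is the top 1,000 items.

def head_share(n_items: int, head_size: int) -> float:
    """Fraction of total demand that falls on the top `head_size` ranks."""
    weights = [1.0 / rank for rank in range(1, n_items + 1)]
    return sum(weights[:head_size]) / sum(weights)

share = head_share(n_items=1_000_000, head_size=1_000)
print(f"head share: {share:.3f}, tail share: {1 - share:.3f}")
```

Under these assumptions the top 0.1% of items and the remaining 99.9% each carry roughly half of the demand, which is the "50/50" picture in the one-line summary. Real catalogues deviate from a pure 1/rank curve, so the split should be measured rather than assumed.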

ML problems

  • Long-tail classification: head classes have abundant samples, tail classes have few (iNaturalist, ImageNet-LT).
  • Cold-start / recommendation: tail items have too few interactions.
  • Search/IR: tail queries (rare queries) make up 50%+ of all queries.

Mitigation strategies

  1. Re-sampling: oversample the tail, undersample the head
  2. Re-weighting: class-balanced loss (Cui 2019), focal loss
  3. Decoupling (Kang 2020): learn the representation with instance-balanced sampling, then re-train the classifier with class-balanced sampling
  4. Logit adjustment: correct the logits by the log class prior
  5. Two-stage: pretrain on the head → fine-tune on the tail
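
Focal loss is listed under re-weighting but has no pattern of its own on this page. A minimal NumPy sketch of the multi-class form, -(1 - p_t)^γ · log(p_t), follows. The function name, the toy logits and the default γ = 2 are assumptions for illustration, and the optional per-class α weighting is left out.

```python
import numpy as np

def focal_loss(logits: np.ndarray, targets: np.ndarray, gamma: float = 2.0) -> float:
    """Mean multi-class focal loss: -(1 - p_t)^gamma * log(p_t)."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Log-probability assigned to the true class of each sample.
    log_pt = log_probs[np.arange(len(targets)), targets]
    pt = np.exp(log_pt)
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))

# Toy batch: the first sample is confidently correct, the second is misclassified.
logits = np.array([[4.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
targets = np.array([0, 2])
ce = focal_loss(logits, targets, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
fl = focal_loss(logits, targets, gamma=2.0)  # well-classified samples are down-weighted
print(ce, fl)
```

Because the factor (1 - p_t)^γ shrinks the contribution of samples the model already classifies well, the gradient is dominated by hard samples, which in long-tailed data are often tail-class samples.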

💻 Patterns

Power law fit (powerlaw package)

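import powerlaw  # pip install powerlaw

data = [...]  # replace with a 1-D sequence of positive observations
fit = powerlaw.Fit(data)  # estimates xmin and fits the tail x >= xmin
print(fit.alpha, fit.xmin)  # density exponent and tail cutoff
# R > 0 favors power_law, R < 0 favors lognormal; p is the significance of R
R, p = fit.distribution_compare("power_law", "lognormal")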

Class imbalance diagnosis

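import pandas as pd

# df: DataFrame with one row per sample and a "label" column
counts = df["label"].value_counts()  # sorted from most to least frequent
imbalance = counts.iloc[0] / counts.iloc[-1]  # largest / smallest class size
# tail = labels with < median count
tail = counts[counts < counts.median()].index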

Class-balanced loss (Cui 2019)

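import torch
import torch.nn.functional as F

# Per-class sample counts ordered by class index (labels assumed to be 0..C-1);
# value_counts() sorts by frequency, so re-sort by label before building weights.
n = torch.tensor(counts.sort_index().to_numpy(), dtype=torch.float32)
beta = 0.999
eff_num = (1 - beta**n) / (1 - beta)  # effective number: (1-β^n)/(1-β)
weights = 1.0 / eff_num
weights = weights / weights.sum() * len(weights)  # normalize so the weights sum to C
# logits: (batch, C) model outputs; y: (batch,) integer class labels
loss = F.cross_entropy(logits, y, weight=weights)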

Logit adjustment

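# Menon 2021 (post-hoc): subtract tau * log prior from the logits at inference
tau = 1.0  # common default; tune on validation data
prior = torch.as_tensor(class_freq, dtype=torch.float32)  # per-class training counts
log_prior = torch.log(prior / prior.sum())
adjusted_logits = logits - tau * log_prior
pred = adjusted_logits.argmax(-1)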

Resampling sampler

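from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each sample by the inverse frequency of its class
sample_weights = 1.0 / df["label"].map(counts).to_numpy()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(df), replacement=True)
# ds must yield samples in the same order as the rows of df
loader = DataLoader(ds, batch_size=64, sampler=sampler)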

Recommendation: tail boost

# popularity-debiased: divide score by item_popularity ** gamma
gamma = 0.5  # 0 = no debiasing, 1 = full normalization by popularity
score_debiased = score / (item_popularity ** gamma)
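
The effect of the exponent is easiest to see on a ranking. The sketch below applies the same division to five hypothetical items; the scores, the interaction counts and the helper name rank_items are made up for illustration.

```python
import numpy as np

# Hypothetical scores and interaction counts for five items (illustrative values).
score = np.array([0.90, 0.85, 0.80, 0.60, 0.55])
item_popularity = np.array([10_000, 5_000, 100, 50, 10])

def rank_items(score: np.ndarray, popularity: np.ndarray, gamma: float) -> np.ndarray:
    """Return item indices sorted by score / popularity**gamma, best first."""
    debiased = score / (popularity ** gamma)
    return np.argsort(-debiased)

print(rank_items(score, item_popularity, gamma=0.0))  # plain score order
print(rank_items(score, item_popularity, gamma=0.5))  # tail items move up
```

With these values gamma = 0.5 already pushes the least popular item to the top, so gamma acts as a dial between relevance and tail exposure and is best tuned against an offline or online metric.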

Decision criteria

| Situation | Approach |
| --- | --- |
| Mild imbalance (10:1) | class weights, focal loss |
| Severe imbalance (100:1+) | class-balanced loss, decoupling |
| Recommendation cold-start | content features, popularity debiasing |
| Sales / inventory | Pareto 80/20 → ABC analysis |
| Search, rare queries | semantic retrieval, query expansion |
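
For the sales / inventory row, ABC analysis can be sketched as a cumulative-revenue cut. The 80% and 95% thresholds, the SKU names and the revenue figures below are assumptions for illustration; conventions for the cut-offs and for the item that crosses a threshold vary between organizations.

```python
# Sketch: ABC classification by cumulative revenue share.
# Assumptions (not from the source): class A covers items up to 80% of
# cumulative revenue, B up to 95%, C the rest; the revenue figures are made up.

def abc_classes(revenue: dict[str, float], a_cut: float = 0.80, b_cut: float = 0.95) -> dict[str, str]:
    """Label each item A, B or C by its position on the cumulative revenue curve."""
    total = sum(revenue.values())
    labels, cumulative = {}, 0.0
    # Walk items from highest to lowest revenue.
    for item, value in sorted(revenue.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += value
        share = cumulative / total
        labels[item] = "A" if share <= a_cut else "B" if share <= b_cut else "C"
    return labels

revenue = {"sku1": 500.0, "sku2": 250.0, "sku3": 150.0, "sku4": 60.0, "sku5": 25.0, "sku6": 15.0}
print(abc_classes(revenue))
```

Here two of six items land in class A and three in class C, the usual shape of a Pareto-style inventory: a few items deserve tight control while the long tail is managed in bulk.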

Default: class-balanced CE → if that is not enough, decoupling.
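
Decoupling keeps the learned representation and only re-balances the classifier. One classifier variant described by Kang et al. is τ-normalization, which divides each class weight vector by its norm raised to τ, on the premise that head classes tend to end up with larger weight norms. A minimal NumPy sketch, assuming a bias-free linear classifier with weight matrix W of shape (classes, features); the array values and the function name are illustrative:

```python
import numpy as np

def tau_normalize(W: np.ndarray, tau: float) -> np.ndarray:
    """Rescale each class weight vector w_i to w_i / ||w_i||**tau."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / norms ** tau

# Illustrative weights: the head class (row 0) has a much larger norm than the tail class (row 1).
W = np.array([[3.0, 4.0], [0.3, 0.4]])
features = np.array([[0.5, 0.5]])

for tau in (0.0, 0.5, 1.0):
    logits = features @ tau_normalize(W, tau).T
    print(tau, logits.round(3))
```

With tau = 0 the classifier is unchanged, and with tau = 1 every class vector has unit norm, so the head class loses the advantage it had from norm alone. Intermediate values of tau are usually picked on a validation set.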

🔗 Graph

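🤖 LLM usage

When to use: imbalance diagnosis, guidance on choosing a loss or sampler, business cases. When not to use: domain-specific tail definitions (regulatory or revenue thresholds) should come from domain experts.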

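Anti-patterns

  • Reducing the long tail to class imbalance (distribution shape vs. class counts)
  • Ignoring the tail and measuring only accuracy (biased toward the head)
  • Relying on oversampling alone (overfits the tail)
  • Confusing Pareto 80/20 with the long tail (they differ in degree)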
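
The accuracy anti-pattern is easy to reproduce. The sketch below compares plain accuracy with the mean of per-class recall (balanced accuracy) for a model that always predicts the head class; the labels and the helper name per_class_recall are made up for illustration.

```python
import numpy as np

def per_class_recall(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> np.ndarray:
    """Recall of each class; NaN for classes absent from y_true."""
    recall = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recall[c] = (y_pred[mask] == c).mean()
    return recall

# Toy data: class 0 is the head (8 samples), classes 1 and 2 are the tail (1 each).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
y_pred = np.zeros(10, dtype=int)  # a model that always predicts the head class

accuracy = (y_true == y_pred).mean()
recall = per_class_recall(y_true, y_pred, n_classes=3)
balanced_accuracy = np.nanmean(recall)
print(accuracy, recall, balanced_accuracy)
```

Plain accuracy rewards the head-only model, while per-class recall shows that both tail classes are never predicted. Reporting metrics per class, or per head and tail group, keeps that failure visible.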

🧪 Verification / duplicates

  • Verified (Anderson "The Long Tail", Cui 2019, Kang 2020 decoupling). Trust level A.
  • Duplicates: none.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: "매" prefix, added ML imbalance strategies |