Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

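Major cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirects)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections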

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00


Long Tail

In one line

"Long tail = a small head + a countless tail." Rather than 80/20, think closer to 50/50: the tail's combined total can be as large as the head's.

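Core

Distributions

Power law: P(x) ∝ x^(-α). For α ∈ (2, 3) the mean is finite but the variance is infinite.
Pareto: P(X > x) = (x_m/x)^α for x ≥ x_m. Here α is the exponent of the tail probability, so the density decays as x^(-(α+1)). Wealth distribution, city populations.
Zipf: rank · frequency ≈ const. Word frequencies, web page popularity.
Lognormal: log(X) ~ Normal. Heavy-tailed, though all of its moments are finite.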
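The Pareto tail formula above can be checked numerically. The following is a minimal sketch, assuming inverse-CDF sampling with the standard library only; `sample_pareto` and `empirical_ccdf` are hypothetical helper names, and the parameters (x_m = 1, α = 1.5) are chosen purely for illustration.

```python
import random

def sample_pareto(x_m, alpha, n, seed=0):
    """Draw n samples with P(X > x) = (x_m / x)**alpha via inverse-CDF sampling."""
    rng = random.Random(seed)
    # If U ~ Uniform(0, 1], then x_m * U**(-1/alpha) follows the Pareto law above.
    return [x_m * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def empirical_ccdf(samples, x):
    """Fraction of samples strictly greater than x."""
    return sum(s > x for s in samples) / len(samples)

samples = sample_pareto(x_m=1.0, alpha=1.5, n=100_000)
for x in (2.0, 10.0):
    theory = (1.0 / x) ** 1.5  # (x_m / x)**alpha with x_m = 1
    print(x, round(empirical_ccdf(samples, x), 4), round(theory, 4))
```

With α = 1.5 the variance is infinite, so the sample variance will not settle down as n grows, even though the empirical tail fractions track the formula closely.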

Business (Anderson 2006)

Lower digital distribution costs → tail items become profitable too. Amazon, Netflix.
Head: bestsellers. Tail: niche items. In aggregate the tail can rival or even exceed the head.
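To make the head-versus-tail comparison concrete, here is a rough sketch that assumes demand follows an idealized Zipf law (weight 1/rank). The catalog size, the head cutoff, and the function name `head_share` are all hypothetical; real catalogs will deviate from this shape.

```python
def head_share(n_items, head_size, s=1.0):
    """Share of total demand captured by the top `head_size` items
    when item weights are 1 / rank**s (idealized Zipf)."""
    weights = [1.0 / rank ** s for rank in range(1, n_items + 1)]
    return sum(weights[:head_size]) / sum(weights)

# Hypothetical catalog: 1,000,000 items, with the "head" defined as the top 1,000.
share = head_share(1_000_000, 1_000)
print(f"head share = {share:.3f}, tail share = {1 - share:.3f}")
```

Under these assumptions the top 0.1% of items and the remaining 99.9% each take roughly half of total demand, which is the "closer to 50/50" picture rather than 80/20.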

ML problems

Long-tail classification: head classes have abundant data, tail classes are sparse (iNaturalist, ImageNet-LT).
Cold-start / recommendation: tail items have too few interactions.
Search/IR: tail queries (rare queries) are often cited as making up 50%+ of all queries.

Mitigation strategies

Re-sampling: oversample the tail, undersample the head
Re-weighting: class-balanced loss (Cui 2019), focal loss
Decoupling (Kang 2020): learn the representation with instance-balanced sampling, then retrain the classifier with class-balanced sampling
Logit adjustment: correct the logits by the log prior
Two-stage: pretrain on the head → fine-tune on the tail
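Focal loss is listed above without a formula. As a minimal single-example sketch in plain Python (no α class-weighting term, and `focal_loss` is a hypothetical helper, not a library function), it scales cross-entropy by (1 - p_t)^γ so that well-classified examples contribute much less:

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = [0.95, 0.05]  # confident, correct prediction for true class 0
hard = [0.30, 0.70]  # poorly classified example whose true class is 0

for name, probs in (("easy", easy), ("hard", hard)):
    ce = focal_loss(probs, 0, gamma=0.0)  # gamma = 0 recovers plain cross-entropy
    fl = focal_loss(probs, 0, gamma=2.0)
    print(name, round(ce, 4), round(fl, 4), round(fl / ce, 4))
```

The ratio column shows how strongly each example is down-weighted: the easy example is suppressed far more than the hard one, which is why the loss tends to shift training signal toward under-learned (often tail) classes.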

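💻 Patterns

Power law fit (powerlaw package)

import powerlaw
data = [...]  # positive observations, e.g. per-item counts
fit = powerlaw.Fit(data)  # estimates xmin and alpha together
print(fit.alpha, fit.xmin)
# R > 0 favors the power law, R < 0 the lognormal; p is the significance of that sign
R, p = fit.distribution_compare("power_law", "lognormal")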

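Class imbalance diagnosis

import pandas as pd
# df: DataFrame with one row per sample and a "label" column
counts = df["label"].value_counts()  # sorted from most to least frequent
imbalance = counts.iloc[0] / counts.iloc[-1]  # largest class / smallest class
# tail = labels whose count is below the median class count
tail = counts[counts < counts.median()].index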

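Class-balanced loss (Cui 2019)

import torch
import torch.nn.functional as F
# effective number of samples: E_n = (1 - β^n) / (1 - β)
beta = 0.999
# counts must be ordered by class index 0..C-1 (value_counts sorts by frequency instead)
n = counts.sort_index().to_numpy()
eff_num = (1 - beta**n) / (1 - beta)
weights = 1.0 / eff_num
weights = weights / weights.sum() * len(weights)  # normalize so the weights sum to C
weight_t = torch.tensor(weights, dtype=torch.float32, device=logits.device)
loss = F.cross_entropy(logits, y, weight=weight_t)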
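To see what β does, the following plain-Python sketch recomputes the same weights for a hypothetical set of class counts. `class_balanced_weights` is an illustrative helper name; the counts and β values are made up for the example.

```python
def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to 1 / E_n with E_n = (1 - beta**n) / (1 - beta),
    rescaled so that the weights sum to the number of classes."""
    eff_num = [(1 - beta ** n) / (1 - beta) for n in counts]
    raw = [1.0 / e for e in eff_num]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

counts_by_class = [5000, 500, 50, 5]  # hypothetical counts, indexed by class id
for beta in (0.0, 0.99, 0.999, 0.9999):
    weights = class_balanced_weights(counts_by_class, beta)
    print(beta, [round(w, 3) for w in weights])
```

At β = 0 every class gets weight 1 (no re-weighting), and as β approaches 1 the weights approach inverse class frequency, so β acts as a dial between the two.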

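Logit adjustment

import torch
# Menon 2021: subtract the scaled log prior at inference time
tau = 1.0  # adjustment strength; tau = 0 leaves the logits unchanged
class_freq = torch.as_tensor(class_freq, dtype=logits.dtype, device=logits.device)  # per-class training counts
log_prior = torch.log(class_freq / class_freq.sum())
adjusted_logits = logits - tau * log_prior
pred = adjusted_logits.argmax(-1)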
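A small worked example helps show the effect of subtracting the log prior. This is a plain-Python sketch with hypothetical class counts and logits; `adjust_logits` and `argmax` are illustrative helpers, not library functions.

```python
import math

def adjust_logits(logits, class_counts, tau=1.0):
    """Subtract tau * log(prior) from each logit; priors come from training counts."""
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

class_counts = [9000, 900, 100]  # hypothetical training counts (head, mid, tail)
logits = [2.0, 1.5, 0.5]         # hypothetical raw model outputs for one input

raw_pred = argmax(logits)
adjusted_pred = argmax(adjust_logits(logits, class_counts, tau=1.0))
print(raw_pred, adjusted_pred)
```

Because log(prior) is most negative for rare classes, subtracting it raises tail logits the most; here the prediction moves from the head class to the tail class. In practice tau is usually tuned on a validation set with a balanced metric.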

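Resampling sampler

from torch.utils.data import DataLoader, WeightedRandomSampler
# weight each sample by the inverse frequency of its class
sample_weights = (1.0 / df["label"].map(counts)).to_numpy()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(df), replacement=True)
loader = DataLoader(ds, batch_size=64, sampler=sampler)  # the sampler takes the place of shuffle=True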

Recommendation: tail boost

# popularity debiasing: divide the score by item_popularity ** gamma
gamma = 0.5  # 0 = no correction; larger values push tail items up harder
score_debiased = score / (item_popularity ** gamma)
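As an illustration of how the exponent changes a ranking, here is a plain-Python sketch. The item ids, scores, and interaction counts are invented, `debiased_ranking` is a hypothetical helper, and the division only makes sense for positive scores.

```python
def debiased_ranking(scores, popularity, gamma=0.5):
    """Rank item ids by score / popularity**gamma (gamma = 0 keeps the raw ranking)."""
    adjusted = {item: s / (popularity[item] ** gamma) for item, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)

scores = {"hit": 0.90, "mid": 0.60, "niche": 0.40}     # hypothetical model scores
popularity = {"hit": 10_000, "mid": 400, "niche": 25}  # hypothetical interaction counts

raw_order = debiased_ranking(scores, popularity, gamma=0.0)
debiased_order = debiased_ranking(scores, popularity, gamma=0.5)
print(raw_order)
print(debiased_order)
```

With gamma = 0.5 the niche item overtakes the hit, so gamma is best treated as a tunable trade-off between relevance and tail exposure rather than a fixed constant.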

Decision criteria

Situation	Approach
Mild imbalance (10:1)	class weights, focal loss
Severe imbalance (100:1+)	class-balanced loss, decoupling
Recommendation cold-start	content features, popularity debiasing
Sales / inventory	Pareto 80/20 → ABC analysis
Search rare queries	semantic retrieval, query expansion

Default: class-balanced CE; if that falls short, move to decoupling.

🔗 Graph

Parent: Class-Imbalance
Variants: Power-Law, Pareto-Distribution
Applications: Recommendation-Systems, Search-Ranking
Adjacent: Focal-Loss, Sampling-Strategies

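🤖 LLM usage

When to use: imbalance diagnosis, guidance on choosing a loss or sampler, business examples. When not to: domain-specific definitions of the tail (regulatory or revenue thresholds) belong with a domain expert.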

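❌ Anti-patterns

Reducing long tail to imbalance (distribution shape vs. class counts)
Ignoring the tail and measuring only accuracy (biased toward the head)
Relying on oversampling alone (overfits the tail)
Confusing Pareto 80/20 with the long tail (they differ in degree)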
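The accuracy-only anti-pattern is easy to demonstrate. The sketch below uses a hypothetical test set and a degenerate model that always predicts the head class; `per_class_recall` is an illustrative helper written with the standard library.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / total[c] for c in total}

# Hypothetical test set: 95 head examples, 5 tail examples.
y_true = ["head"] * 95 + ["tail"] * 5
y_pred = ["head"] * 100  # model that never predicts the tail class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
balanced_accuracy = sum(recall.values()) / len(recall)
print(accuracy, recall, balanced_accuracy)
```

Plain accuracy looks strong while tail recall is zero; reporting per-class recall or balanced accuracy alongside it makes that failure visible.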

🧪 Verification / duplicates

Verified (Anderson "The Long Tail", Cui 2019, Kang 2020 decoupling). Confidence: A.
Duplicates: none.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: 매 prefix, added ML imbalance strategies
