| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-data-augmentation | Data Augmentation Strategies | 10_Wiki/Topics | verified | self | | none | A | 0.93 | applied | | | 2026-05-10 | pending | |
Data Augmentation
One-line summary
"Not more data, but more varied views of the same data." Augmentation teaches the model invariances and curbs overfitting. Vision: rotation, flip, crop, MixUp, CutMix, AutoAugment. NLP: back-translation, LLM-aided paraphrasing. Modern: generative augmentation (Stable Diffusion).
Core strategies
Computer Vision
- Geometric: rotation, flip, crop, scale.
- Color: brightness, contrast, hue, saturation.
- Noise: Gaussian, salt-pepper.
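- MixUp: linear combination of two images and of their labels.
- CutMix: paste a patch from one image into another and mix the labels by area.
- Cutout: mask a random region.
- AutoAugment / RandAugment: a learned policy (AutoAugment) or randomly sampled operations with two tunable parameters (RandAugment).
- TrivialAugment: one random operation at a random strength, with no tuning.
- Mosaic (YOLOv4, YOLOv5+): four images tiled into one grid.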
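Cutout from the list above can be sketched without any library: pick a random centre, then mask a square patch clipped to the image bounds. The function name `cutout`, the nested-list image, and the `size` / `fill` / `rng` parameters are assumptions for illustration only; a real pipeline would apply the same idea to tensors.

```python
import random

def cutout(image, size, fill=0, rng=random):
    """Return a copy of a 2-D image (list of rows) with one square patch masked.

    The patch is centred on a random pixel and clipped at the borders, so the
    masked area can be smaller than size x size near an edge.
    """
    h, w = len(image), len(image[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = [row[:] for row in image]  # copy so the caller's image is untouched
    for y in range(y1, y2):
        for x in range(x1, x2):
            out[y][x] = fill
    return out

image = [[1] * 8 for _ in range(8)]
masked = cutout(image, size=4, rng=random.Random(0))
```

The label is left unchanged, which is what separates Cutout from CutMix: only pixels are removed, nothing is mixed in.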
NLP
- Synonym replacement (SR).
- Random insertion / deletion / swap (EDA).
- Back-translation: en → fr → en.
- Paraphrase: generated by an LLM.
- Token noise: BERT MLM-style masking and replacement.
- Mixup-NLP: mix hidden representations rather than raw tokens.
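Two of the EDA operations listed above, random swap and random deletion, fit in a few lines of plain Python. This is a minimal sketch on whitespace tokens; the function names and the `rng` parameter are assumptions for illustration, and synonym replacement or insertion would additionally need a thesaurus.

```python
import random

def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1, rng=random):
    """Drop each token with probability p, but keep at least one token."""
    tokens = list(tokens)
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
swapped = random_swap(tokens, n_swaps=2, rng=rng)
shortened = random_deletion(tokens, p=0.2, rng=rng)
```

Both operations can change meaning on short sentences, so small values of `p` and `n_swaps` are the safer starting point.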
Audio
- Speed / pitch shift.
- SpecAugment: time and frequency masking on the spectrogram.
- Noise injection.
- Reverb / EQ.
- Mixup.
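For the noise-injection item above, a common way to control strength is a target signal-to-noise ratio. The sketch below assumes white Gaussian noise and a waveform held in a plain list; `add_noise`, `snr_db`, and the synthetic 440 Hz tone are illustrative choices, not part of the original note.

```python
import math
import random

def add_noise(signal, snr_db, rng=random):
    """Add white Gaussian noise so the result has roughly the requested SNR.

    SNR (dB) = 10 * log10(P_signal / P_noise), so the noise power to aim for
    is P_signal / 10 ** (snr_db / 10).
    """
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

# One second of a 440 Hz tone at 16 kHz, used as a stand-in for real audio.
sr = 16000
clean = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
noisy = add_noise(clean, snr_db=20, rng=random.Random(0))
```

Sampling `snr_db` from a range per example, rather than fixing it, gives the model a spread of noise levels to learn from.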
Tabular
- SMOTE: synthetic samples for the minority class, interpolated between neighbours.
- Feature noise (Gaussian).
- Mixup-tabular.
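The interpolation step behind SMOTE can be illustrated in plain Python: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. This is a simplified sketch, not the imbalanced-learn implementation; `smote_like_sample`, `k`, and the toy data are assumptions.

```python
import random

def smote_like_sample(minority, k=2, rng=random):
    """Create one synthetic point between a minority sample and a near neighbour."""
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Because every synthetic point lies between two real minority points, the method never extrapolates outside the region the minority class already occupies.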
Modern (Generative)
- Diffusion-based augmentation: variations generated with Stable Diffusion.
- GAN-based.
- LLM-aided text: paraphrase or extend existing examples.
- Domain randomization (sim → real).
Task-specific
- Detection: bbox-aware augment (Albumentations).
- Segmentation: mask-aware.
- Pose: keypoint-aware.
- OCR: distortion + perspective.
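What "bbox-aware" means for detection can be shown with a horizontal flip: the image is mirrored and each box's x coordinates are mirrored with it. The sketch assumes pascal_voc-style pixel boxes `(x_min, y_min, x_max, y_max)` with `x_max` treated as the exclusive right edge; `hflip_with_boxes` is a hypothetical helper, and libraries such as Albumentations handle this internally.

```python
def hflip_with_boxes(image, boxes):
    """Flip an image left-right and remap its boxes to match.

    image: list of rows; boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # x coordinates mirror around the image width; min and max trade places.
    new_boxes = [(width - x_max, y_min, width - x_min, y_max)
                 for (x_min, y_min, x_max, y_max) in boxes]
    return flipped, new_boxes

# A 4x6 image with an object occupying columns 0-1 of rows 1-2.
image = [[0] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (0, 1):
        image[y][x] = 1
flipped, boxes = hflip_with_boxes(image, [(0, 1, 2, 3)])
```

Flipping the image without remapping the boxes would leave the label pointing at empty background, which is the failure the "no bbox-aware" anti-pattern describes.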
Modern best practices
- Strong but realistic: do not over-augment.
- Test-time augmentation (TTA): average predictions over multiple views at inference.
- AutoML for augmentation: search for a task-specific policy.
- Curriculum: ramp from weak to strong augmentation.
- Domain awareness: whether vertical / horizontal flips are valid depends on the task.
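The curriculum item above (weak to strong) can be expressed as a schedule that maps the epoch to an augmentation magnitude. The linear ramp, the `start` / `end` values, and the name `augment_magnitude` are assumptions chosen for illustration; other schedules are equally plausible.

```python
def augment_magnitude(epoch, total_epochs, start=2.0, end=9.0, warmup_frac=0.5):
    """Linearly ramp augmentation strength from `start` to `end`, then hold."""
    ramp_epochs = max(1, int(total_epochs * warmup_frac))
    progress = min(epoch / ramp_epochs, 1.0)
    return start + progress * (end - start)

schedule = [round(augment_magnitude(e, total_epochs=10), 1) for e in range(10)]
# schedule == [2.0, 3.4, 4.8, 6.2, 7.6, 9.0, 9.0, 9.0, 9.0, 9.0]
```

The value would typically be rounded to an integer and used as the `magnitude` of a RandAugment transform rebuilt at the start of each epoch.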
💻 Patterns
torchvision (vision)
import torch
from torchvision import transforms

# Training-time pipeline for PIL images; use a deterministic resize/crop for validation.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomRotation(15),
    transforms.RandAugment(num_ops=2, magnitude=9),  # common modern default
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
Albumentations (detection / segmentation)
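import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    # Positional (height, width) matches older Albumentations releases;
    # newer releases expect size=(224, 224) instead.
    A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.OneOf([
        A.GaussianBlur(),
        A.MotionBlur(),
    ], p=0.3),
    A.RandomBrightnessContrast(p=0.5),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

# Image and boxes are transformed together, so the labels stay aligned:
# out = transform(image=image, bboxes=bboxes, class_labels=class_labels)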
MixUp (loss-level)
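import numpy as np
import torch
import torch.nn.functional as F

def mixup_data(x, y, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1 - lam) * x[idx]
    return mixed_x, y, y[idx], lam

def mixup_loss(criterion, pred, y_a, y_b, lam):
    """Weight the two targets' losses by the same lam used to mix the inputs."""
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Training loop (model, loader, and optimizer are assumed to be defined elsewhere)
for x, y in loader:
    mixed_x, y_a, y_b, lam = mixup_data(x, y)
    pred = model(mixed_x)
    loss = mixup_loss(F.cross_entropy, pred, y_a, y_b, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()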
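Why the two-term loss in `mixup_loss` is the right one: for cross-entropy, weighting two hard-target losses by `lam` equals a single cross-entropy against the mixed (soft) label. The pure-Python check below uses made-up logits and class indices; the helper names are assumptions for illustration.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy(logits, target):
    """Cross-entropy against a probability vector `target` (hard or soft)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))

def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]

logits = [2.0, 0.5, -1.0]  # illustrative model output for one mixed input
lam, y_a, y_b = 0.7, 0, 2

# Form used by mixup_loss: weight two hard-target losses.
two_term = (lam * cross_entropy(logits, one_hot(y_a, 3))
            + (1 - lam) * cross_entropy(logits, one_hot(y_b, 3)))

# Equivalent form: one loss against the mixed (soft) label.
soft = [lam * a + (1 - lam) * b for a, b in zip(one_hot(y_a, 3), one_hot(y_b, 3))]
one_term = cross_entropy(logits, soft)

# Keeping only the dominant hard label ignores the second image entirely.
hard_only = cross_entropy(logits, one_hot(y_a, 3))
```

`two_term` and `one_term` agree up to floating-point error, while `hard_only` differs, which is why training a mixed input against a single hard target optimizes the wrong objective.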
CutMix
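import numpy as np
import torch

def cutmix(x, y, alpha=1.0):
    """Paste one random box from a shuffled copy of the batch into each image."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)
    H, W = x.size(2), x.size(3)
    # Box side lengths are chosen so the box covers about (1 - lam) of the image.
    cut_w = int(W * (1 - lam) ** 0.5)
    cut_h = int(H * (1 - lam) ** 0.5)
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1, y1 = max(cx - cut_w // 2, 0), max(cy - cut_h // 2, 0)
    x2, y2 = min(cx + cut_w // 2, W), min(cy + cut_h // 2, H)
    x = x.clone()  # avoid modifying the caller's batch in place
    x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so the label weights match the pixels.
    lam = 1 - ((x2 - x1) * (y2 - y1) / (W * H))
    return x, y, y[idx], lam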
NLP — back-translation
from transformers import pipeline

en_to_fr = pipeline('translation', model='Helsinki-NLP/opus-mt-en-fr')
fr_to_en = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')

def back_translate(text):
    """Translate en -> fr -> en; the round trip yields a paraphrase."""
    fr = en_to_fr(text, max_length=512)[0]['translation_text']
    return fr_to_en(fr, max_length=512)[0]['translation_text']

original = 'The quick brown fox jumps over the lazy dog.'  # example input
augmented = back_translate(original)
nlpaug (NLP utility)
import nlpaug.augmenter.word as naw

# Contextual (BERT-based) word substitution
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute')
# Depending on the nlpaug version, the result may be a string or a list of strings.
augmented = aug.augment('The quick brown fox jumps')
LLM-aided augmentation
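def llm_paraphrase(text, n=5):
    """Ask an LLM for n paraphrases of `text`.

    `llm` is a placeholder client exposing generate(prompt) -> str.
    """
    prompt = f"""Paraphrase the following sentence in {n} different ways while preserving meaning:
Original: {text}
Output {n} paraphrases, each on a new line."""
    lines = llm.generate(prompt).split('\n')
    return [line.strip() for line in lines if line.strip()]  # drop blank lines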
Audio (SpecAugment)
import torch
import torchaudio.transforms as T

# mel_spec: a (..., n_mels, time) spectrogram tensor, assumed to be computed earlier.
aug = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=30),
    T.TimeMasking(time_mask_param=80),
)
mel_spec_augmented = aug(mel_spec)
SMOTE (tabular)
from imblearn.over_sampling import SMOTE

# Fit on the training split only; X, y are the training features and labels.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
Diffusion-based augmentation
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained('runwayml/stable-diffusion-v1-5').to('cuda')

# Variation of the original image, guided by a prompt.
# class_name and original_image (a PIL image) are assumed to be defined.
augmented = pipe(
    prompt=f'a {class_name} in different lighting / angle',  # f-string so the class name is filled in
    image=original_image,
    strength=0.3,  # low strength keeps the result close to the original
    num_inference_steps=20,
).images[0]
Test-time augmentation (TTA)
import torch

def tta_predict(model, image, augments, n=5):
    """Average the model's predictions over up to n augmented views of one input."""
    # augments: list of callables, e.g. [normal_transform, flip_transform, crop1_transform, ...]
    with torch.no_grad():
        preds = [model(aug(image)) for aug in augments[:n]]
    return torch.stack(preds).mean(dim=0)
Decision criteria
| Situation | Strategy |
|---|---|
| Image classification | RandAugment + MixUp |
| Detection | Albumentations + Mosaic |
| Segmentation | Mask-aware augment |
| NLP | Back-translation + LLM paraphrase |
| Audio | SpecAugment |
| Imbalanced tabular | SMOTE |
| Long-tail vision | Class-balanced augment |
| Generative augment | Diffusion (img2img) |
Defaults: RandAugment / TrivialAugment + MixUp/CutMix (vision). LLM paraphrase (NLP).
🔗 Graph
- Parents: Data-Engineering · L1-and-L2-Regularization
- Variants: MixUp · CutMix · AutoAugment · Back-Translation · SMOTE
- Adjacent: Bias vs Variance Trade-off · Cross-Entropy Loss · CV_Synthesis · Antifragility
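🤖 LLM usage
When: ML training, small datasets, class imbalance, or a need for robustness. When not: an already strong model with abundant data.
❌ Anti-patterns
- Augmenting the test set, or augmenting before the train/test split: evaluation stops reflecting real data and augmented copies can leak across splits.
- Over-augmentation: the training distribution drifts away from the test distribution.
- Wrong-domain augmentation (e.g., flipping a "B" into "ⳝ" produces invalid text).
- Ignoring bounding boxes in detection: the boxes no longer match the image, so the labels are wrong.
- Keeping hard targets with MixUp: the loss no longer matches the mixed input; mix the labels (or the losses) with the same lam.
- Out-of-distribution generative samples: they act as noise.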
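The first anti-pattern is avoided by fixing the order of operations: split first, then augment only the training part. The sketch below uses a toy dataset of `(sample_id, feature)` pairs; `split_then_augment` and `jitter` are hypothetical helpers, not functions from any library.

```python
import random

def split_then_augment(samples, augment, test_frac=0.2, copies=2, seed=0):
    """Split first, then augment the training part only."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Each training sample keeps its original and gains `copies` variants.
    train_aug = [(sid, x) for sid, x in train]
    train_aug += [(sid, augment(x, rng)) for sid, x in train for _ in range(copies)]
    return train_aug, test

def jitter(x, rng):
    """Hypothetical augmentation: add small uniform noise to a scalar feature."""
    return x + rng.uniform(-0.05, 0.05)

samples = [(i, float(i)) for i in range(10)]  # (sample_id, feature)
train_aug, test = split_then_augment(samples, jitter)
```

Because variants are created after the split, no test sample has a near-duplicate in the training set, and the test features stay untouched.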
🧪 Verification / duplicates
- Verified against the source papers (Cubuk et al., AutoAugment; Zhang et al., MixUp; DeVries and Taylor, Cutout; Park et al., SpecAugment).
- Trust level: A.
- Related: Bias vs Variance Trade-off · Cross-Entropy Loss · CV_Synthesis · Computer_Vision · Antifragility.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: strategy + torchvision / Albumentations / MixUp / back-translate / TTA code |