---
id: wiki-2026-0508-data-augmentation
title: Data Augmentation Strategies
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [data augmentation, AutoAugment, RandAugment, MixUp, CutMix, back translation, Mosaic]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [data-augmentation, vision, nlp, audio, regularization, autoaugment, mixup, cutmix, generative-augment]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: torchvision / Albumentations / Augly / nlpaug / Diffusers
---

# Data Augmentation

## In one line
> **"Not more data, but more varied views of the same data."** Augmentation teaches invariances and reduces overfitting. Vision: rotation, flip, crop, MixUp, CutMix, AutoAugment. NLP: back-translation, LLM-aided paraphrasing. Recent: generative augmentation (Stable Diffusion).

## Core strategies

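### Computer Vision
- **Geometric**: rotation, flip, crop, scale.
- **Color**: brightness, contrast, hue, saturation.
- **Noise**: Gaussian, salt-and-pepper.
- **MixUp**: linear combination of two images and their labels.
- **CutMix**: paste a patch from one image into another and mix the labels by area.
- **Cutout**: mask a random region.
- **AutoAugment / RandAugment**: AutoAugment searches for a learned policy; RandAugment replaces the search with random op selection controlled by two hyperparameters (number of ops, magnitude).
- **TrivialAugment**: one random op at a random strength, with nothing to tune.
- **Mosaic** (YOLOv4 and later): four images stitched into one grid.
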
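A minimal sketch of Cutout, assuming an `H x W x C` NumPy image. The function name `cutout`, the `size` parameter, and the zero fill value are illustrative choices, not any library's API.

```python
import numpy as np

def cutout(image, size, rng):
    """Return a copy of `image` with one square patch of side <= `size` zeroed out.

    The patch is centered on a random pixel and clipped at the borders,
    so patches near an edge are smaller than `size` x `size`.
    """
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()       # leave the caller's array untouched
    out[y1:y2, x1:x2] = 0    # assumed fill value; other fills are possible
    return out

rng = np.random.default_rng(0)
image = np.ones((32, 32, 3), dtype=np.float32)
masked = cutout(image, size=8, rng=rng)
```

Because the patch is clipped rather than resampled, the masked area varies from sample to sample.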
### NLP
- **Synonym replacement** (SR).
- **Random insertion / deletion / swap** (EDA).
- **Back-translation**: en → fr → en.
- **Paraphrase**: generated by an LLM.
- **Token noise**: BERT-style masked-token (MLM) corruption.
- **Mixup-NLP**: mix hidden representations instead of raw tokens.

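Two of the EDA operations, random deletion and random swap, can be sketched with the standard library alone. This is a minimal illustration that assumes whitespace tokenization; the function names and the `p` / `n_swaps` parameters are hypothetical, not taken from a specific package.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability `p`; never return an empty list."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, `n_swaps` times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
deleted = random_deletion(tokens, p=0.2, rng=rng)
swapped = random_swap(tokens, n_swaps=2, rng=rng)
```

Both operations can change the meaning of short sentences, so a small `p` and few swaps are the safer starting point.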
### Audio
- **Speed / pitch shift**.
- **SpecAugment**: time and frequency masking on the spectrogram.
- **Noise injection**.
- **Reverb / EQ**.
- **Mixup**.

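Noise injection is usually specified by a target signal-to-noise ratio. The sketch below assumes white Gaussian noise and a plain list of samples; `add_gaussian_noise` and `snr_db` are illustrative names, and a real pipeline would more likely operate on NumPy or torch arrays.

```python
import math
import random

def add_gaussian_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR (in dB)."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(0)
# One second of a 440 Hz tone at a 16 kHz sample rate (illustrative values).
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_gaussian_noise(clean, snr_db=20, rng=rng)
```

Sampling `snr_db` from a range per example, rather than fixing it, gives the model a spread of noise levels.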
### Tabular
- **SMOTE**: synthetic samples for the minority class.
- **Feature noise** (Gaussian).
- **Mixup-tabular**.

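The core idea behind SMOTE is interpolation between a minority sample and one of its nearest minority neighbors. The sketch below shows only that step, under simplifying assumptions (Euclidean distance, brute-force neighbor search); `smote_like_sample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import math
import random

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbors."""
    base = rng.choice(minority)
    # Nearest neighbors of `base` among the other minority points (Euclidean distance).
    others = sorted((p for p in minority if p is not base),
                    key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbor
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Every synthetic point lies on a segment between two real minority points, so it cannot leave the region those points span.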
### Modern (Generative)
- **Diffusion-based augmentation**: generate variations with Stable Diffusion.
- **GAN-based**.
- **LLM-aided text**: paraphrase or extend existing samples.
- **Domain randomization** (sim → real).

### Task-specific
- **Detection**: bbox-aware augmentation (Albumentations).
- **Segmentation**: mask-aware.
- **Pose**: keypoint-aware.
- **OCR**: distortion + perspective.

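"Bbox-aware" means the geometric transform applied to the image is also applied to the box coordinates. A minimal sketch for a horizontal flip, assuming `pascal_voc` boxes `(x_min, y_min, x_max, y_max)` in pixels; `hflip_bbox` is an illustrative helper, not a library function.

```python
def hflip_bbox(bbox, image_width):
    """Mirror a pascal_voc box (x_min, y_min, x_max, y_max) for a horizontally flipped image."""
    x_min, y_min, x_max, y_max = bbox
    # The old right edge becomes the new left edge, and vice versa.
    return (image_width - x_max, y_min, image_width - x_min, y_max)

flipped = hflip_bbox((10, 20, 50, 80), image_width=100)  # -> (50, 20, 90, 80)
```

Flipping the image without this coordinate update leaves the box pointing at the mirror-image location, which is how non-bbox-aware pipelines produce wrong labels.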
### Modern best practices
1. **Strong but realistic**: do not over-augment.
2. **Test-time augmentation** (TTA): predict on multiple views at inference and average.
3. **AutoML for augmentation**: search for a task-specific policy.
4. **Curriculum**: ramp from weak to strong augmentation.
5. **Domain awareness**: whether a vertical or horizontal flip is valid depends on the task.

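The curriculum item (weak → strong) can be sketched as a schedule that maps the epoch to an augmentation magnitude. The linear ramp, the function name, and the `start` / `end` defaults below are assumptions for illustration; other schedules are equally plausible.

```python
def magnitude_for_epoch(epoch, total_epochs, start=2.0, end=9.0):
    """Linearly ramp augmentation magnitude from `start` (first epoch) to `end` (last epoch)."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)

schedule = [magnitude_for_epoch(e, total_epochs=8) for e in range(8)]
```

One way to use it would be to round the value and rebuild the training transform with that magnitude at the start of each epoch.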
## 💻 Patterns

### torchvision (vision)
```python
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),   # brightness, contrast, saturation
    transforms.RandomRotation(15),
    transforms.RandAugment(num_ops=2, magnitude=9),  # common modern default
    transforms.ToTensor(),
    # ImageNet channel means and standard deviations
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```

### Albumentations (detection / segmentation)
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    # Positional (height, width) matches older Albumentations releases;
    # newer releases are believed to expect size=(224, 224) instead
    A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.OneOf([
        A.GaussianBlur(),
        A.MotionBlur(),
    ], p=0.3),
    A.RandomBrightnessContrast(p=0.5),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

# Image and boxes are transformed together, so the boxes stay aligned:
# out = transform(image=image, bboxes=bboxes, class_labels=class_labels)
```

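### MixUp (loss-level)
```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_data(x, y, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)                 # mixing weight in [0, 1]
    idx = torch.randperm(x.size(0), device=x.device)   # partner index per sample
    mixed_x = lam * x + (1 - lam) * x[idx]
    return mixed_x, y, y[idx], lam

def mixup_loss(criterion, pred, y_a, y_b, lam):
    """Weight the loss against both label sets by the same lam used for the inputs."""
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Training loop (model and loader are assumed to be defined elsewhere;
# optimizer steps are omitted)
for x, y in loader:
    mixed_x, y_a, y_b, lam = mixup_data(x, y)
    pred = model(mixed_x)
    loss = mixup_loss(F.cross_entropy, pred, y_a, y_b, lam)
    loss.backward()
```
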
### CutMix
```python
import numpy as np
import torch

def cutmix(x, y, alpha=1.0):
    """Paste one random rectangle from a shuffled copy of the batch into each image."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)

    H, W = x.size(2), x.size(3)
    cut_w = int(W * (1 - lam) ** 0.5)   # patch area is about (1 - lam) of the image
    cut_h = int(H * (1 - lam) ** 0.5)
    cx, cy = np.random.randint(W), np.random.randint(H)

    x1, y1 = max(cx - cut_w // 2, 0), max(cy - cut_h // 2, 0)
    x2, y2 = min(cx + cut_w // 2, W), min(cy + cut_h // 2, H)

    x = x.clone()  # avoid overwriting the caller's batch in place
    x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    # Recompute lam from the actual patch area, since clipping at the border shrinks it
    lam = 1 - ((x2 - x1) * (y2 - y1) / (W * H))
    return x, y, y[idx], lam
```

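The reason `lam` is recomputed at the end of `cutmix` is easiest to see with numbers. The sketch below repeats only the box arithmetic, with no tensors involved, for a 32 x 32 image and a sampled `lam` of 0.5. `cutmix_box` is an illustrative helper that takes the patch center as an argument instead of sampling it.

```python
def cutmix_box(W, H, lam, cx, cy):
    """Return the clipped patch (x1, y1, x2, y2) and the label weight after clipping."""
    cut_w = int(W * (1 - lam) ** 0.5)
    cut_h = int(H * (1 - lam) ** 0.5)
    x1, y1 = max(cx - cut_w // 2, 0), max(cy - cut_h // 2, 0)
    x2, y2 = min(cx + cut_w // 2, W), min(cy + cut_h // 2, H)
    lam_adj = 1 - (x2 - x1) * (y2 - y1) / (W * H)
    return (x1, y1, x2, y2), lam_adj

# Patch fully inside the image: box (5, 5, 27, 27), weight 0.52734375
center_box, center_lam = cutmix_box(32, 32, lam=0.5, cx=16, cy=16)
# Patch centered on the corner: box (0, 0, 11, 11), weight 0.8818359375
corner_box, corner_lam = cutmix_box(32, 32, lam=0.5, cx=0, cy=0)
```

With the same sampled `lam`, the corner patch covers about a quarter of the area of the centered one. Using the sampled value for the labels would therefore overstate the pasted image's share.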
### NLP: back-translation
```python
from transformers import pipeline

en_to_fr = pipeline('translation', model='Helsinki-NLP/opus-mt-en-fr')
fr_to_en = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')

def back_translate(text):
    """Translate en -> fr -> en; the round trip usually rewords the sentence."""
    fr = en_to_fr(text, max_length=512)[0]['translation_text']
    return fr_to_en(fr, max_length=512)[0]['translation_text']

# Acts as a paraphrase; `original` is assumed to be an English string defined elsewhere
augmented = back_translate(original)
```

### nlpaug (NLP utility)
```python
import nlpaug.augmenter.word as naw

# Contextual (BERT-based) substitution: replacements fit the context
# but are not guaranteed to be synonyms
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute')
augmented = aug.augment('The quick brown fox jumps')
```

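### LLM-aided augmentation
```python
def llm_paraphrase(text, n=5):
    """Ask an LLM for n paraphrases; `llm` is a placeholder client whose generate(prompt) returns a string."""
    prompt = f"""Paraphrase the following sentence in {n} different ways while preserving meaning:

Original: {text}

Output {n} paraphrases, each on a new line."""
    lines = llm.generate(prompt).split('\n')
    # Drop blank lines the model may emit between paraphrases
    return [line.strip() for line in lines if line.strip()]
```
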
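### Audio (SpecAugment)
```python
import torch
import torchaudio.transforms as T

aug = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=30),   # mask width bounded by 30 frequency bins
    T.TimeMasking(time_mask_param=80),        # mask width bounded by 80 time frames
)

# mel_spec: a (..., freq, time) spectrogram tensor, assumed to be computed elsewhere
mel_spec_augmented = aug(mel_spec)
```
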
### SMOTE (tabular)
```python
from imblearn.over_sampling import SMOTE

# X, y: features and labels of the training split only (resampling test data would leak)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```

### Diffusion-based augmentation
```python
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained('runwayml/stable-diffusion-v1-5').to('cuda')

# Variation of the original image, steered by the prompt.
# class_name (str) and original_image (PIL image) are assumed to be defined elsewhere.
augmented = pipe(
    prompt=f'a {class_name} in different lighting / angle',  # f-string so the class name is filled in
    image=original_image,
    strength=0.3,  # low strength keeps the result close to the original
    num_inference_steps=20,
).images[0]
```

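### Test-time augmentation (TTA)
```python
import torch

def tta_predict(model, image, n=5):
    """Average the model's predictions over up to n augmented views of one input."""
    # normal_transform, flip_transform, crop1_transform are assumed to be defined elsewhere
    augments = [normal_transform, flip_transform, crop1_transform]  # ... further views
    with torch.no_grad():  # inference only, no gradients needed
        preds = [model(aug(image)) for aug in augments[:n]]
    return torch.stack(preds).mean(dim=0)
```
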
## Decision criteria
| Situation | Strategy |
|---|---|
| Image classification | RandAugment + MixUp |
| Detection | Albumentations + Mosaic |
| Segmentation | Mask-aware augment |
| NLP | Back-translation + LLM paraphrase |
| Audio | SpecAugment |
| Imbalanced tabular | SMOTE |
| Long-tail vision | Class-balanced augment |
| Generative augment | Diffusion (img2img) |

**Default**: RandAugment / TrivialAugment + MixUp/CutMix (vision). LLM paraphrase (NLP).

## 🔗 Graph
- Parent: [[Data-Engineering]] · [[L1-and-L2-Regularization|Regularization]]
- Variants: [[MixUp]] · [[CutMix]] · [[AutoAugment]] · [[Back-Translation]] · [[SMOTE]]
- Adjacent: [[Bias vs Variance Trade-off]] · [[Cross-Entropy Loss]] · [[CV_Synthesis]] · [[Antifragility]]

## 🤖 LLM usage
**When**: ML training in general, especially with a small dataset, class imbalance, or a need for robustness.
**When not**: the model is already strong and data is abundant.

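## ❌ Anti-patterns
- **Augmenting the test set**: causes leakage and distorts the evaluation; deliberate TTA is the exception.
- **Over-augmenting**: the training distribution drifts away from the test distribution.
- **Wrong-domain augmentation** (e.g., flipping a "B" yields "ⳝ", which is no longer valid text).
- **Non-bbox-aware transforms** (detection): the boxes no longer match the image, so the labels are wrong.
- **Keeping hard targets with MixUp**: the loss is wrong; labels must be mixed with the same λ as the inputs.
- **Out-of-distribution generative samples**: they act as noise.
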
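The MixUp hard-target anti-pattern can be checked numerically. The sketch below uses made-up probabilities for one mixed sample. It shows that the two-term MixUp loss equals cross-entropy against the mixed soft label, and that keeping only the hard label `y_a` gives a different value. The helper names are illustrative.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

probs = [0.7, 0.2, 0.1]      # model output for one mixed sample (illustrative numbers)
y_a, y_b, lam = 0, 2, 0.6    # the two source labels and the mixing weight

# Two-term form: weight the loss for each source label by its share of the mix.
two_term = (lam * cross_entropy(probs, one_hot(y_a, 3))
            + (1 - lam) * cross_entropy(probs, one_hot(y_b, 3)))

# Equivalent single-term form against the mixed (soft) label.
soft = [lam * a + (1 - lam) * b for a, b in zip(one_hot(y_a, 3), one_hot(y_b, 3))]
soft_loss = cross_entropy(probs, soft)

# Anti-pattern: keeping only the hard label y_a ignores the (1 - lam) share of y_b.
hard_only = cross_entropy(probs, one_hot(y_a, 3))
```

Here `two_term` and `soft_loss` agree, while `hard_only` is much smaller because it never penalizes the low probability assigned to class `y_b`.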
## 🧪 Verification / duplicates
- Verified (Cubuk AutoAugment, Zhang MixUp, DeVries Cutout, SpecAugment).
- Trust level: A.
- Related: [[Bias vs Variance Trade-off]] · [[Cross-Entropy Loss]] · [[CV_Synthesis]] · [[Computer_Vision]] · [[Antifragility]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: strategy + torchvision / Albumentations / MixUp / back-translate / TTA code |