---
id: wiki-2026-0508-image-classification-mastery
title: Image Classification
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [image classification, ResNet, ViT, EfficientNet, ImageNet, CLIP]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [computer-vision, classification, resnet, vit, efficientnet, clip, imagenet]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / timm / Transformers
---

# Image Classification

## One-line summary

> **"Image → class label"**. ImageNet is the standard benchmark. Evolution: AlexNet (2012) → VGG → ResNet (2015) → EfficientNet → ViT (2020) → CLIP / DINOv2. The modern approach leans on zero-shot use of foundation models.

## Core

### Architecture evolution

- **AlexNet** (2012): sparked the deep learning revival.
- **VGG** (2014): a deeper network built from small 3×3 convolutions.
- **ResNet** (2015): skip connections.
- **EfficientNet** (2019): compound scaling of depth, width, and resolution.
- **ViT** (2020): a transformer applied to image patches.
- **CLIP** (2021): zero-shot classification from image-text pretraining.
- **ConvNeXt** (2022): a modernized CNN.
- **DINOv2** (2023): self-supervised features.

### Applications

1. **Medical** (X-ray, pathology).
2. **Industrial** (defect detection).
3. **Retail** (visual search).
4. **Wildlife** (camera traps).
5. **Content moderation**.
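
### EfficientNet compound scaling (sketch)

The "compound scaling" entry in the architecture list can be made concrete with a little arithmetic. The sketch below assumes the base coefficients reported in the EfficientNet paper (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution) and a 224-pixel baseline. `compound_scale` is a hypothetical helper written for illustration, not a timm or torchvision function, and released checkpoints use their own tuned settings, so treat the numbers as illustrative only.

```python
# Base coefficients reported in the EfficientNet paper (Tan & Le, 2019).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi: float, base_resolution: int = 224) -> dict:
    """Scale depth, width, and resolution together with one coefficient phi.

    Hypothetical helper for illustration; not part of any library.
    """
    depth_mult = ALPHA ** phi
    width_mult = BETA ** phi
    resolution = round(base_resolution * GAMMA ** phi)
    # FLOPs grow roughly with depth * width^2 * resolution^2,
    # so each unit of phi costs about (ALPHA * BETA^2 * GAMMA^2) times more compute.
    flops_mult = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return {
        "depth_mult": depth_mult,
        "width_mult": width_mult,
        "resolution": resolution,
        "flops_mult": flops_mult,
    }


# Print the scaling schedule for the first few values of phi.
for phi in range(4):
    print(phi, compound_scale(phi))
```

Because α·β²·γ² is close to 2, each increment of `phi` roughly doubles the compute budget, which is the design choice that lets a single knob trade accuracy against cost.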
## 💻 Patterns

### timm (modern model zoo)

```python
import timm

model = timm.create_model('vit_base_patch16_224', pretrained=True)
model.eval()  # inference mode

# Build the preprocessing pipeline that matches the pretrained weights
data_config = timm.data.resolve_data_config({}, model=model)
transforms = timm.data.create_transform(**data_config)
```

### Fine-tune (PyTorch)

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

n_classes = 10  # placeholder: set to the number of classes in your dataset

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# Freeze the backbone (transfer-learning baseline)
for p in model.parameters():
    p.requires_grad = False

# Replace the head; a freshly created layer is trainable by default
model.fc = torch.nn.Linear(model.fc.in_features, n_classes)

optim = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
```

### CLIP zero-shot

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')

image = Image.open('example.jpg')  # placeholder path
texts = ['a photo of a dog', 'a photo of a cat', 'a photo of a bird']

# padding=True keeps prompts of different token lengths batchable
inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # shape: (1, len(texts))
```

### DINOv2 (self-supervised features)

```python
import torch

dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2.eval()

# `image`: float tensor of shape (B, 3, H, W), with H and W multiples of the patch size 14
with torch.no_grad():
    features = dinov2(image)  # frozen embedding for downstream tasks
```

### Mixup

```python
import numpy as np
import torch

def mixup(x, y, alpha=0.4):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)
    # Train with: lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b)
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam
```

### CutMix

```python
import numpy as np
import torch

def cutmix(x, y, alpha=1.0):
    """Paste a random box from a shuffled partner image. Modifies x in place."""
    lam = np.random.beta(alpha, alpha)
    H, W = x.shape[-2:]
    cut_w = int(W * (1 - lam) ** 0.5)
    cut_h = int(H * (1 - lam) ** 0.5)
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1, y1 = max(0, cx - cut_w // 2), max(0, cy - cut_h // 2)
    x2, y2 = min(W, cx + cut_w // 2), min(H, cy + cut_h // 2)
    idx = torch.randperm(x.size(0), device=x.device)
    x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so it matches the pasted area
    return x, y, y[idx], 1 - (x2 - x1) * (y2 - y1) / (W * H)
```

### Augmentation (albumentations)

```python
import albumentations as A

augment = A.Compose([
    # Recent releases take size=(h, w); older ones take height=224, width=224
    A.RandomResizedCrop(size=(224, 224)),
    A.HorizontalFlip(),
    A.ColorJitter(0.2, 0.2, 0.2, 0.1),
    # CoarseDropout used here as the random-erasing-style transform
    A.CoarseDropout(p=0.25),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

### Test-Time Augmentation

```python
import torch

def tta_predict(model, image, n_aug=5):
    """Average softmax predictions over n_aug randomly augmented views."""
    predictions = []
    with torch.no_grad():
        for _ in range(n_aug):
            aug_img = random_augment(image)  # user-supplied augmentation, not defined here
            predictions.append(model(aug_img).softmax(-1))
    return torch.stack(predictions).mean(0)
```

### Top-K accuracy

```python
def topk_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest logits."""
    topk = logits.topk(k, dim=-1).indices
    correct = (topk == labels.unsqueeze(-1)).any(dim=-1).float().mean()
    return correct
```

### Model card (best practice)

```yaml
model: my-classifier-v2
backbone: vit_base_patch16_224
training_data: ImageNet-1k + custom 100k
classes: 1000
augmentation: RandAugment + Mixup + CutMix
test_top1: 84.5
test_top5: 97.2
calibration_ece: 0.034
inference_ms_a100: 8
```

### Modern recipe (DeiT, ViT)

```python
def modern_train_recipe():
    return {
        'optimizer': 'AdamW',
        'lr': 1e-3,
        'wd': 0.05,
        'scheduler': 'cosine + warmup 5 epochs',
        'epochs': 300,
        'augmentation': 'RandAugment + Mixup 0.8 + CutMix 1.0',
        'label_smoothing': 0.1,
        'stochastic_depth': 0.1,
        'ema': True,
    }
```

## Decision criteria

| Situation | Model |
|---|---|
| Need pretrained weights | timm |
| Best ImageNet accuracy | DeiT III / ViT-L |
| Mobile | MobileNetV3 / EfficientNet-Lite |
| Zero-shot | CLIP |
| Self-supervised | DINOv2 |
| Tiny | ResNet18 / EfficientNet-B0 |

**Default**: timm + pretrained ViT-B/L + the modern recipe (RandAugment + Mixup + CutMix + label smoothing) + TTA for critical evaluations.

## 🔗 Graph

- Parent: [[Computer Vision|Computer-Vision]]
- Variants: [[ResNet]] · [[ViT]] · [[EfficientNet]]
- Applications: [[CLIP]] · [[Image-Segmentation]]
- Adjacent: [[Foundation-Models]]

## 🤖 LLM usage

**When**: image-level tasks, visual search, medical imaging.

**When not**: segmentation or detection (these are different tasks).

## ❌ Anti-patterns

- **Train from scratch**: start from timm pretrained weights instead.
- **No augmentation**: the model overfits.
- **Top-1 only**: also report top-5 accuracy and calibration.
- **No TTA at eval**: can leave roughly 1-2% accuracy on the table.

## 🧪 Verification / Duplicates

- Verified (timm, He ResNet 2015, Dosovitskiy ViT 2020, Oquab DINOv2 2023).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: evolution + timm / CLIP / DINOv2 / Mixup / TTA code |