"매 CV 의 핵심: pixels → semantic understanding via learned hierarchical features". 매 1960s Roberts edge detector 로 시작, 매 2012 AlexNet 으로 deep-learning revolution, 매 2020 ViT, 매 2023 SAM. 매 2026 현재 multimodal foundation models (GPT-5V, Claude Opus 4.7, Qwen3-VL, Llama 4 Vision) 가 zero-shot 으로 detection / VQA / OCR 의 통합.
Key points
Task taxonomy
Classification: image → label.
Detection: bounding boxes + classes (YOLO, DETR, RT-DETR).
Segmentation: pixel-level labels (semantic, instance, panoptic). SAM2 is promptable.
Pose / Keypoint: human/object joints (RTMPose, ViTPose).
VLM: Claude Opus 4.7 Vision, GPT-5V, InternVL3, Qwen3-VL.
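One way to make the taxonomy concrete is to look at what each task returns. The sketch below is only illustrative: the dataclass names (`Classification`, `Detection`, `SegmentationMask`, `Keypoint`) are hypothetical and do not come from any library, and the `(x1, y1, x2, y2)` box layout is an assumption. It also includes intersection over union (IoU), the usual overlap measure for comparing two bounding boxes. Plain Python is used to keep it dependency-free.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


@dataclass
class Classification:
    label: str   # one label for the whole image
    score: float


@dataclass
class Detection:
    box: Box     # where the object is
    label: str   # what the object is
    score: float


@dataclass
class SegmentationMask:
    mask: List[List[int]]  # H x W grid, one class id per pixel (semantic case)


@dataclass
class Keypoint:
    name: str    # e.g. "left_elbow"
    x: float
    y: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width/height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` compares two 2x2 boxes that share a 1x1 corner, so the overlap is 1 out of a union of 7.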
Applications
Autonomous driving (Waymo, Tesla FSD perception).
Medical imaging (MONAI, nnU-Net for segmentation).
Document AI / OCR (Donut, Florence-2, GPT-5V).
Robotics (open-vocabulary manipulation, RT-2).
Content moderation, retail, agriculture.
💻 Patterns
Image classification (timm + finetune)
import timm
import torch

# Load a pretrained ConvNeXt V2 backbone with a fresh 10-class head
model = timm.create_model("convnextv2_tiny.fcmae_ft_in22k_in1k", pretrained=True, num_classes=10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
# train as usual
When to use: quick image-understanding tasks (VQA, OCR, captioning), dataset bootstrapping (label generation), and vision-pipeline scaffolding.
When not to use: high-throughput / low-latency production, where a specialized model is the better choice. For medical / safety-critical work, use validated models only.
❌ Anti-patterns
Re-inventing vs. timm/ultralytics: do not ignore well-tested baselines.
No domain-specific augmentation: applying ImageNet augmentation unchanged to medical/satellite imagery.
Ignoring image preprocessing: wrong normalization is the most common bug.
Uncritical trust in VLMs for fine-grained tasks: they hallucinate on small-object detection / counting.
No test-time augmentation for production: robustness is lost.
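The preprocessing and test-time augmentation items can be sketched in a few lines. This is a minimal illustration, not a production recipe: it works on a nested-list H x W x 3 RGB image in plain Python, uses the widely used ImageNet channel statistics (an assumption; the values must match whatever the pretrained checkpoint was trained with), and averages predictions over the original image and a horizontally flipped copy. `predict_fn` is a hypothetical stand-in for a real model, and flip averaging is only one form of TTA that assumes the task is flip-invariant.

```python
# Commonly used ImageNet RGB statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize(image, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale 0-255 RGB values to [0, 1], then standardize each channel.

    image: H x W x 3 nested lists of integer pixel values.
    """
    return [
        [
            [(px[c] / 255.0 - mean[c]) / std[c] for c in range(3)]
            for px in row
        ]
        for row in image
    ]


def hflip(image):
    """Mirror the image left-right by reversing each row of pixels."""
    return [row[::-1] for row in image]


def tta_predict(predict_fn, image):
    """Average class probabilities over the image and its horizontal flip.

    predict_fn is a hypothetical callable: image -> list of probabilities.
    """
    probs_orig = predict_fn(image)
    probs_flip = predict_fn(hflip(image))
    return [(a + b) / 2.0 for a, b in zip(probs_orig, probs_flip)]
```

Typical normalization mistakes include skipping the division by 255 or passing BGR data to a model that expects RGB; both run without error and quietly degrade accuracy, which is why they are easy to miss.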