---
id: wiki-2026-0508-computer-vision
title: Computer Vision
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [CV, computer vision, image classification, object detection, segmentation, ViT, CLIP, SAM, depth estimation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [computer-vision, deep-learning, cnn, vit, segmentation, detection, sam, clip, dino, image-classification]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / Transformers / Detectron2 / Ultralytics / SAM
---

# Computer Vision

## One-line summary

> **"The meaning of every pixel."** Classification → detection → segmentation → depth → generation. Modern practice: ViT is the dominant backbone, alongside foundation models (CLIP, SAM, DINOv2). These vision encoders are also the base of multi-modal LLMs.

## Core tasks

### Classification

- Image → class.
- ImageNet, ResNet, ViT.

### Detection

- Image → bounding boxes + classes.
- See [[Bounding-Box-Regression]].
- YOLO, DETR.

### Segmentation

- **Semantic**: pixel → class.
- **Instance**: pixel → object instance.
- **Panoptic**: semantic and instance combined.
- SAM (Segment Anything).

### Depth estimation

- **Monocular**: single image → depth.
- **Stereo**: two cameras.
- MiDaS, Depth Anything.

### Pose estimation

- **2D / 3D**.
- OpenPose, MediaPipe, ViTPose.

### Tracking

- Associates objects across video frames.
- ByteTrack, BoT-SORT.

### Generation

- GAN, Diffusion, Stable Diffusion.
- See [[AI 이미지 생성 및 편집 워크플로우 (AI Image Generation & Editing Workflow)]].

### OCR

- Text from images.
- PaddleOCR, Tesseract, GPT-4V.

### Action recognition

- Video understanding.

### Re-Identification

- Person / vehicle re-id.

### 3D vision

- NeRF, Gaussian Splatting.
- See [[Automated_Mapping]].

## Architecture history

### CNN era (2012-2020)

- AlexNet (2012) → ImageNet revolution.
- VGG, ResNet (skip connections), DenseNet, EfficientNet.
- Inductive bias: locality + translation equivariance (approximate translation invariance after pooling).
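
The locality and translation bias of convolutions noted above can be checked numerically. The following is a minimal NumPy sketch, not taken from any of the cited models: `circular_conv2d` is a hypothetical helper, and wrap-around borders and the fixed 3x3 kernel are assumptions chosen so the shift property holds exactly.

```python
import numpy as np

def circular_conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Cross-correlate `image` with `kernel`, wrapping around at the borders."""
    kh, kw = kernel.shape
    out = np.zeros_like(image, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            # Rolling by (-dy, -dx) brings pixel (y + dy, x + dx) to position (y, x)
            out += kernel[dy, dx] * np.roll(image, shift=(-dy, -dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.arange(1, 10, dtype=float).reshape(3, 3) / 45.0  # illustrative weights

# Translation equivariance: shifting the input shifts the feature map equally.
shift = (2, 3)
shifted_then_conv = circular_conv2d(np.roll(image, shift, axis=(0, 1)), kernel)
conv_then_shifted = np.roll(circular_conv2d(image, kernel), shift, axis=(0, 1))
equivariant = bool(np.allclose(shifted_then_conv, conv_then_shifted))

# Global max pooling discards position, which gives translation invariance.
invariant = bool(np.isclose(circular_conv2d(image, kernel).max(), shifted_then_conv.max()))

# Locality: perturbing one pixel only changes outputs inside a 3x3 window.
perturbed = image.copy()
perturbed[4, 4] += 1.0
diff = circular_conv2d(perturbed, kernel) - circular_conv2d(image, kernel)
changed = int(np.count_nonzero(~np.isclose(diff, 0.0)))

print(equivariant, invariant, changed)
```

With ordinary zero padding the equivariance only holds away from the image borders, which is one reason the property is usually described as approximate in practice.
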
### ViT era (2020+)

- ViT (Dosovitskiy et al., 2020): image patches fed to a transformer.
- Dominates when trained on large datasets.
- Swin, DeiT, MAE pre-training.

### Foundation model (2021+)

- **CLIP**: image-text contrastive learning.
- **DINO / DINOv2**: self-supervised.
- **MAE**: masked autoencoder.
- **SAM**: Segment Anything.
- **Depth Anything**: universal depth.

### Multi-modal (2023+)

- **GPT-4V, Claude vision, Gemini**: LLM + vision.
- **LLaVA, Qwen-VL**: open models.
- **Sora, Veo**: video generation.

## 💻 Patterns

### Image classification (ViT, HuggingFace)

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained('google/vit-large-patch16-384')
model = ViTForImageClassification.from_pretrained('google/vit-large-patch16-384')
model.eval()

image = Image.open('cat.jpg').convert('RGB')  # ensure 3 channels
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
predicted_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_idx])
```

### Object detection (YOLO)

```python
from ultralytics import YOLO

model = YOLO('yolov8x.pt')
results = model('image.jpg', conf=0.5)  # keep detections with confidence >= 0.5
for r in results:
    for box in r.boxes:
        label = model.names[int(box.cls)]
        print(f'{label}: {box.conf.item():.2f} at {box.xyxy[0].tolist()}')
```

### Segmentation (SAM)

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h.pth').to('cuda')
predictor = SamPredictor(sam)

# SamPredictor expects an RGB uint8 array by default; OpenCV loads BGR
image = cv2.cvtColor(cv2.imread('image.jpg'), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt: bounding box (XYXY) or points
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)
```

### Depth estimation (Depth Anything)

```python
from PIL import Image
from transformers import pipeline

pipe = pipeline('depth-estimation', model='depth-anything/Depth-Anything-V2-Large-hf')
image = Image.open('image.jpg')
depth = pipe(image)['depth']  # PIL image of the predicted depth map
depth.save('depth.png')
```

### CLIP zero-shot

```python
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()

candidates = ['a cat', 'a dog', 'a bird']
text = tokenizer(candidates)
img = preprocess(Image.open('cat.jpg')).unsqueeze(0)

with torch.no_grad():
    # Assumption: divide by the L2 norm (the usual CLIP recipe) so that the
    # dot product below is a cosine similarity
    img_feat = model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (100 * img_feat @ text_feat.T).softmax(-1)
```

### Pose estimation (MediaPipe)

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
image = cv2.imread('person.jpg')  # BGR; placeholder path

with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        for lm in results.pose_landmarks.landmark:
            # x and y are normalized by image width and height
            print(lm.x, lm.y, lm.z, lm.visibility)
```

### Tracking (ByteTrack)

```python
from argparse import Namespace
from yolox.tracker.byte_tracker import BYTETracker

# Settings read by BYTETracker; the values here are illustrative
args = Namespace(track_thresh=0.5, track_buffer=30, match_thresh=0.8, mot20=False)
tracker = BYTETracker(args)

# `video`, `detector`, and `frame_size` (height, width) are placeholders
for frame in video:
    detections = detector(frame)  # (N, 5): xyxy + conf, in frame coordinates
    # Passing frame_size twice means no rescaling of the boxes
    tracked = tracker.update(detections, frame_size, frame_size)
    for t in tracked:
        print(t.track_id, t.tlbr, t.score)
```

### OCR (PaddleOCR)

```python
from paddleocr import PaddleOCR

# Written against the PaddleOCR 2.x API; later major versions may differ
ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('document.jpg', cls=True)
for line in result[0] or []:  # result[0] may be None when no text is found
    bbox, (text, conf) = line
    print(text, conf)
```

### Multi-modal (GPT-4V via API)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
image_url = 'https://example.com/image.jpg'  # placeholder: any reachable image URL

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'What objects do you see, and where are they?'},
            {'type': 'image_url', 'image_url': {'url': image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```

### Self-supervised pre-train (MAE, simplified)

```python
import torch

def mae_pretrain(model, image, mask_ratio=0.75):
    # `patchify` and `insert_mask_tokens` are hypothetical helpers;
    # `model.encoder` / `model.decoder` are assumed attributes of a MAE-style model
    patches = patchify(image, patch_size=16)  # (N, patch_dim)
    n_visible = int(len(patches) * (1 - mask_ratio))
    perm = torch.randperm(len(patches))
    visible_idx, masked_idx = perm[:n_visible], perm[n_visible:]
    encoded = model.encoder(patches[visible_idx])  # encoder sees visible patches only
    full = insert_mask_tokens(encoded, visible_idx, total=len(patches))
    reconstructed = model.decoder(full)
    # Reconstruction loss is computed on the masked patches only
    loss = ((reconstructed[masked_idx] - patches[masked_idx]) ** 2).mean()
    return loss
```
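
The simplified MAE routine above relies on a `patchify` helper and a random visible/masked split that it does not define. The sketch below shows one way these two pieces could look in plain NumPy. It is an illustration, not the MAE authors' implementation: the channels-last `(H, W, C)` layout and the `random_mask` function name are assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def random_mask(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Return disjoint (visible_idx, masked_idx) index arrays."""
    n_visible = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return perm[:n_visible], perm[n_visible:]

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, patch_size=16)
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)
print(patches.shape, len(visible_idx), len(masked_idx))
```

For a 224x224 RGB image and 16x16 patches this yields 196 patches of 768 values each, of which a quarter stay visible at `mask_ratio=0.75`. Taking both index sets from a single permutation keeps them disjoint, which is what the masked-only loss depends on.
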
### NeRF (volumetric 3D)

```python
# See [[Automated_Mapping]] for NeRF / Gaussian Splatting code
```

## 🤔 Decision criteria

| Task | Tool |
|---|---|
| Classify | ViT / EfficientNet |
| Detect | YOLOv8 / DETR / Grounding DINO |
| Segment | SAM (promptable) / Mask2Former |
| Depth | Depth Anything V2 |
| Pose | MediaPipe / ViTPose |
| Track | ByteTrack |
| OCR | PaddleOCR / GPT-4V |
| Zero-shot | CLIP / SigLIP |
| Generate | Stable Diffusion / Flux |
| Edge | YOLOv8n / MobileNetV4 |
| Foundation feature | DINOv2 |

**Default**: a task-specific SOTA model, with CLIP / SAM as the zero-shot fallback.

## 🔗 Graph

- Parent: [[AI]] · [[Deep Learning]]
- Variants: [[CNN]] · [[ViT]] · [[CLIP]] · [[SAM]] · [[MAE]]
- Applications: [[Object-Detection]] · [[Bounding-Box-Regression]] · [[Automated_Mapping]] · [[Autonomous Vehicles]] · [[Algorithmic-Biology]]
- Adjacent: [[Diffusion-Models]] · [[CV_Synthesis]]

## 🤖 LLM usage

**When to use**: vision tasks, multimodal products, image search, autonomous systems.

**When not to use**: audio or pure-text problems, 1D signals.

## ❌ Anti-patterns

- **Custom CNN from scratch (small data)**: start from a pretrained model instead.
- **No augmentation**: the model fails to generalize.
- **ImageNet-only eval**: misses distribution shift.
- **No domain adaptation**: weak results on medical / satellite imagery.
- **Single model for all tasks**: specialized models perform better.

## 🧪 Verification / Duplicates

- Verified (ImageNet, ViT, CLIP, SAM papers).
- Trust level: A.
- Related: [[CLIP]] · [[Bounding-Box-Regression]] · [[Automated_Mapping]] · [[Autonomous Vehicles]] · [[CV_Synthesis]] · [[Algorithmic-Biology]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: task taxonomy + history + ViT / YOLO / SAM / Depth / CLIP / GPT-4V code |