| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-computer-vision | Computer Vision | 10_Wiki/Topics | verified | self | | none | A | 0.95 | applied | | | 2026-05-10 | pending | |
Computer Vision
One-line summary
"The meaning of every pixel." The field runs from classification → detection → segmentation → depth → generation. Today ViT is the dominant architecture, alongside foundation models (CLIP, SAM, DINOv2). These models form the base of the vision encoders in multi-modal LLMs.
Core tasks
Classification
- Image → class.
- ImageNet, ResNet, ViT.
Detection
- Image → bounding box + class.
- See Bounding-Box-Regression.
- YOLO, DETR.
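Detection outputs are usually scored against ground truth with intersection over union (IoU). The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel format; the function name `iou_xyxy` and the sample boxes are illustrative, not taken from any particular library.

```python
def iou_xyxy(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (100, 100, 400, 400)      # illustrative prediction
ground_truth = (200, 200, 500, 500)   # illustrative label
score = iou_xyxy(predicted, ground_truth)
print(round(score, 3))
```

A prediction is commonly counted as correct when its IoU with a ground-truth box of the same class exceeds a threshold such as 0.5.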
Segmentation
- Semantic: pixel → class.
- Instance: pixel → instance.
- Panoptic: the two combined.
- SAM (Segment Anything).
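To make the semantic / instance / panoptic distinction concrete, here is a minimal sketch that merges a semantic map and an instance map into one panoptic map. The encoding `class_id * 1000 + instance_id` is one common convention and is an assumption here, as are the tiny label maps and the name `to_panoptic`.

```python
LABEL_DIVISOR = 1000  # assumed encoding: panoptic_id = class_id * 1000 + instance_id


def to_panoptic(semantic, instance):
    """Combine per-pixel class ids and instance ids into panoptic ids."""
    return [
        [c * LABEL_DIVISOR + i for c, i in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]


# 0 = road ("stuff", no instances), 1 = car ("thing", two instances)
semantic = [[0, 0, 1, 1],
            [0, 0, 1, 1]]
instance = [[0, 0, 1, 2],
            [0, 0, 1, 2]]

panoptic = to_panoptic(semantic, instance)
print(panoptic)
```

Both cars share class 1 in the semantic map but receive distinct panoptic ids, while the road pixels keep a single id.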
Depth estimation
- Monocular: single image → depth.
- Stereo: two cameras.
- MiDaS, Depth Anything.
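For the stereo case, depth follows from disparity through the standard rectified pinhole relation Z = f · B / d. The sketch below assumes a rectified camera pair, a focal length in pixels, and a baseline in metres; the function name and the numbers are illustrative only.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Illustrative values: 700 px focal length, 12 cm baseline, 42 px disparity
depth_m = depth_from_disparity(42.0, 700.0, 0.12)
print(f"{depth_m:.2f} m")
```

Because depth is inversely proportional to disparity, a fixed matching error of one pixel costs far more accuracy for distant objects than for near ones.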
Pose estimation
- 2D / 3D.
- OpenPose, MediaPipe, ViTPose.
Tracking
- Following objects across the frames of a video.
- ByteTrack, BoT-SORT.
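The core of tracking-by-detection is associating boxes in the new frame with existing tracks. The sketch below is a deliberately simplified greedy IoU matcher, not ByteTrack or BoT-SORT themselves (those add motion prediction and further matching stages); all names, boxes, and the 0.3 threshold are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match detections to tracks by descending IoU.

    tracks: dict of track_id -> last known box
    detections: list of boxes in the current frame
    Returns dict of detection index -> track_id.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    assigned, used_tracks = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little
        if tid in used_tracks or j in assigned:
            continue
        assigned[j] = tid
        used_tracks.add(tid)
    return assigned


tracks = {1: (10, 10, 50, 50), 2: (100, 100, 150, 150)}
detections = [(102, 98, 152, 149), (12, 11, 52, 51), (300, 300, 340, 340)]
matches = associate(tracks, detections)
print(matches)
```

The unmatched third detection would start a new track in a full tracker, and tracks left unmatched for several frames would be dropped.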
Generation
- GAN, Diffusion, Stable Diffusion.
- See AI 이미지 생성 및 편집 워크플로우 (AI Image Generation & Editing Workflow).
OCR
- Text from images.
- PaddleOCR, Tesseract, GPT-4V.
Action recognition
- Video understanding.
Re-Identification
- Person / vehicle re-id.
3D vision
- NeRF, Gaussian Splatting.
- See Automated_Mapping.
Architecture history
CNN era (2012-2020)
- AlexNet (2012) → the ImageNet revolution.
- VGG, ResNet (skip connections), DenseNet, EfficientNet.
- Inductive bias: locality + translation invariance (strictly, convolution is translation-equivariant; pooling adds approximate invariance).
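The translation part of this inductive bias comes from weight sharing: sliding the same kernel over the input means a shifted input produces a shifted output. Below is a small 1D sketch of that property in plain Python; the signal, the kernel, and the helper names are illustrative assumptions.

```python
def correlate_valid(signal, kernel):
    """1D cross-correlation without padding (what a conv layer computes)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(signal, steps=1):
    """Shift a signal to the right, filling with zeros on the left."""
    return [0] * steps + signal[:len(signal) - steps]


signal = [0, 0, 3, 5, 2, 0, 0, 0]
kernel = [1, 0, -1]  # simple edge-like filter

original = correlate_valid(signal, kernel)
shifted = correlate_valid(shift_right(signal), kernel)

print(original)
print(shifted)
# Equivariance: the response moves with the input.
print(shifted[1:] == original[:-1])
# Global max pooling turns this into invariance for this example.
print(max(shifted) == max(original))
```

A fully connected layer has no such constraint and would have to learn the same pattern separately at every position.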
ViT era (2020+)
- ViT (Dosovitskiy 2020).
- Patches + transformer.
- Dominates when training data is large.
- Swin, DeiT, MAE pre-training.
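The "patches + transformer" idea starts by cutting the image into non-overlapping squares and flattening each one into a token. A minimal pure-Python sketch follows, assuming a single-channel image stored as a list of rows; this `patchify` is an illustrative helper, not a library function.

```python
def patchify(image, patch_size):
    """Split an H x W grid into flattened, non-overlapping patches (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ])
    return patches


# Tiny 4 x 4 image with pixel values 0..15
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(patches)

# Sequence length for a 224 x 224 input with 16 x 16 patches
num_tokens = (224 // 16) ** 2
print(num_tokens)
```

In an actual ViT each flattened patch is then linearly projected to the model width and given a position embedding before entering the transformer.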
Foundation model (2021+)
- CLIP: image-text contrastive learning.
- DINO / DINOv2: self-supervised.
- MAE: masked autoencoder.
- SAM: segment anything.
- Depth Anything: universal depth.
Multi-modal (2023+)
- GPT-4V, Claude vision, Gemini: LLM + vision.
- LLaVA, Qwen-VL: open models.
- Sora, Veo: video generation.
💻 Patterns
Image classification (ViT, HuggingFace)
import torch
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image

processor = ViTImageProcessor.from_pretrained('google/vit-large-patch16-384')
model = ViTForImageClassification.from_pretrained('google/vit-large-patch16-384')
image = Image.open('cat.jpg').convert('RGB')  # ensure 3 channels
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
predicted_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_idx])
Object detection (YOLO)
from ultralytics import YOLO

model = YOLO('yolov8x.pt')
results = model('image.jpg', conf=0.5)  # keep detections with confidence >= 0.5
for r in results:
    for box in r.boxes:
        print(f'{model.names[int(box.cls)]}: {box.conf.item():.2f} at {box.xyxy[0].tolist()}')
Segmentation (SAM)
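import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h.pth').to('cuda')
predictor = SamPredictor(sam)
# 'image.jpg' is a placeholder path; SamPredictor expects an RGB uint8 array
image = cv2.cvtColor(cv2.imread('image.jpg'), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Prompt with a bounding box (xyxy) or with points
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)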
Depth estimation (Depth Anything)
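from PIL import Image
from transformers import pipeline

pipe = pipeline('depth-estimation', model='depth-anything/Depth-Anything-V2-Large-hf')
image = Image.open('image.jpg')  # placeholder path
depth = pipe(image)['depth']  # PIL image of the predicted depth map
depth.save('depth.png')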
CLIP zero-shot
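import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-L-14')
candidates = ['a cat', 'a dog', 'a bird']
text = tokenizer(candidates)
image = Image.open('cat.jpg')  # placeholder path
img = preprocess(image).unsqueeze(0)
with torch.no_grad():
    # The source elides the divisor ("/ ..."); L2-normalizing each embedding is assumed here
    img_feat = model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities turned into probabilities over the candidates
    similarity = (100 * img_feat @ text_feat.T).softmax(-1)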
Pose estimation (MediaPipe)
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
image = cv2.imread('image.jpg')  # placeholder path; OpenCV loads BGR
with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
if results.pose_landmarks:
    for lm in results.pose_landmarks.landmark:
        print(lm.x, lm.y, lm.z, lm.visibility)
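MediaPipe reports landmark `x` and `y` normalized to the image width and height, so drawing them requires a conversion to pixels. The helper below is a sketch of that step on plain `(x, y)` tuples; `to_pixels` and the sample values are illustrative, and clamping out-of-frame points to the border is an assumed design choice.

```python
def to_pixels(landmarks, width, height):
    """Convert normalized (x, y) pairs to integer pixel coordinates.

    Values outside [0, 1] (landmarks estimated outside the frame)
    are clamped to the image border.
    """
    pixels = []
    for x, y in landmarks:
        px = min(max(int(x * width), 0), width - 1)
        py = min(max(int(y * height), 0), height - 1)
        pixels.append((px, py))
    return pixels


# Illustrative normalized landmarks for a 640 x 480 frame
normalized = [(0.5, 0.25), (1.0, 1.0), (-0.1, 0.5)]
points = to_pixels(normalized, width=640, height=480)
print(points)
```

With real results, the input list could be built as `[(lm.x, lm.y) for lm in results.pose_landmarks.landmark]`.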
Tracking (ByteTrack)
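from argparse import Namespace
from yolox.tracker.byte_tracker import BYTETracker

# Tracker settings are assumed values; tune them for your data
args = Namespace(track_thresh=0.5, track_buffer=30, match_thresh=0.8, mot20=False)
tracker = BYTETracker(args)
# video (an iterable of frames) and detector are placeholders for your own pipeline
for frame in video:
    detections = detector(frame)  # (N, 5): xyxy + conf
    frame_size = frame.shape[:2]  # (height, width)
    tracked = tracker.update(detections, frame_size, frame_size)
    for t in tracked:
        print(t.track_id, t.tlbr, t.score)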
OCR (PaddleOCR)
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('document.jpg', cls=True)
# result[0] holds the lines of the first image; it can be None when no text is found
for line in result[0] or []:
    bbox, (text, conf) = line
    print(text, conf)
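OCR output usually needs a post-processing pass before it is useful downstream. The sketch below assumes the same `[bbox, (text, conf)]` line structure as the loop above, drops low-confidence lines, and orders the rest top-to-bottom, then left-to-right. The sample data, the 0.8 threshold, and the name `extract_text` are made up for illustration.

```python
def extract_text(lines, min_conf=0.8):
    """Keep confident lines and return their text in reading order."""
    kept = [(bbox, text) for bbox, (text, conf) in lines if conf >= min_conf]
    # Sort by the top edge of each box, then by its left edge
    kept.sort(key=lambda item: (min(p[1] for p in item[0]),
                                min(p[0] for p in item[0])))
    return [text for _, text in kept]


# Illustrative lines: four corner points per box, then (text, confidence)
sample = [
    [[[10, 60], [200, 60], [200, 90], [10, 90]], ("Total: 42.00", 0.97)],
    [[[10, 10], [180, 10], [180, 40], [10, 40]], ("INVOICE", 0.99)],
    [[[220, 60], [260, 60], [260, 90], [220, 90]], ("~~", 0.31)],
]

texts = extract_text(sample)
print(texts)
```

Sorting by raw top coordinate is fragile for skewed scans or multi-column layouts, where grouping boxes into rows first tends to work better.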
Multi-modal (GPT-4V via API)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
image_url = 'https://example.com/image.jpg'  # placeholder URL
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'What objects do you see, and where are they?'},
            {'type': 'image_url', 'image_url': {'url': image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
Self-supervised pre-train (MAE, simplified)
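import torch

def mae_pretrain(model, image, mask_ratio=0.75):
    # patchify and insert_mask_tokens are placeholder helpers, not library calls;
    # model.encoder / model.decoder are assumed attributes of the model.
    patches = patchify(image, patch_size=16)  # (num_patches, patch_dim)
    n_visible = int(len(patches) * (1 - mask_ratio))
    perm = torch.randperm(len(patches))
    visible_idx, masked_idx = perm[:n_visible], perm[n_visible:]
    encoded = model.encoder(patches[visible_idx])  # encode visible patches only
    full = insert_mask_tokens(encoded, visible_idx, total=len(patches))
    reconstructed = model.decoder(full)
    # Reconstruction loss (MSE) on the masked patches only
    loss = ((reconstructed[masked_idx] - patches[masked_idx]) ** 2).mean()
    return loss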
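The random masking step can be tried on its own without a deep learning framework. This sketch uses only the standard library and assumes 196 patches (a 224 x 224 image with 16 x 16 patches) at a 0.75 mask ratio; `random_mask_split` is an illustrative name, not part of any MAE codebase.

```python
import random


def random_mask_split(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into (visible, masked) lists."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    n_visible = int(num_patches * (1 - mask_ratio))
    return sorted(order[:n_visible]), sorted(order[n_visible:])


visible, masked = random_mask_split(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Only the visible quarter of the patches passes through the encoder, which is why a high mask ratio makes this style of pre-training comparatively cheap.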
NeRF (volumetric 3D)
# See [[Automated_Mapping]] for NeRF / Gaussian Splatting code
🤔 Decision criteria
| Task | Tool |
|---|---|
| Classify | ViT / EfficientNet |
| Detect | YOLOv8 / DETR / Grounding DINO |
| Segment | SAM (open-vocab) / Mask2Former |
| Depth | Depth Anything V2 |
| Pose | MediaPipe / ViTPose |
| Track | ByteTrack |
| OCR | PaddleOCR / GPT-4V |
| Zero-shot | CLIP / SigLIP |
| Generate | Stable Diffusion / Flux |
| Edge | YOLOv8n / MobileNetV4 |
| Foundation feature | DINOv2 |
Default: a task-specific SOTA model, with CLIP / SAM as the zero-shot fallback.
🔗 Graph
- Parent: AI · Deep Learning
- Variants: CNN · ViT · CLIP · SAM · MAE
- Applications: Object-Detection · Bounding-Box-Regression · Automated_Mapping · Autonomous Vehicles · Algorithmic-Biology
- Adjacent: Diffusion-Models · CV_Synthesis
🤖 LLM usage
When to use: vision tasks, multimodal products, image search, autonomous systems. When not to use: audio or pure-text problems, 1D signals.
❌ Anti-patterns
- Custom CNN from scratch (small data): use a pretrained model instead.
- No augmentation: poor generalization.
- ImageNet-only evaluation: misses distribution shift.
- No domain adaptation: weak on medical / satellite imagery.
- Single model for every task: specialized models do better.
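On the "no augmentation" point: even the simplest augmentation has to transform the labels together with the pixels. Below is a minimal horizontal-flip sketch, assuming the image is a list of rows and boxes are `(x1, y1, x2, y2)` in pixel-edge coordinates (so `x2` is exclusive); the names and data are illustrative.

```python
def hflip(image, boxes):
    """Horizontally flip an image and its (x1, y1, x2, y2) boxes."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Mirror the x-coordinates and swap them so that x1 < x2 still holds
    flipped_boxes = [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped_image, flipped_boxes


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
boxes = [(0, 0, 1, 2)]  # covers the left-most column

flipped_image, flipped_boxes = hflip(image, boxes)
print(flipped_image)
print(flipped_boxes)
```

Flipping twice returns the original image and boxes, which makes a convenient sanity check for any augmentation pipeline.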
🧪 Verification / duplicates
- Verified (ImageNet, ViT, CLIP, SAM papers).
- Trust level A.
- Related: CLIP · Bounding-Box-Regression · Automated_Mapping · Autonomous Vehicles · CV_Synthesis · Algorithmic-Biology.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: task taxonomy + history + ViT / YOLO / SAM / Depth / CLIP / GPT-4V code |