[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,89 +1,279 @@
 ---
 id: wiki-2026-0508-clip
-title: CLIP
+title: CLIP (Contrastive Language-Image Pre-training)
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [CLIP-001]
+aliases: [CLIP, OpenCLIP, contrastive vision-language, zero-shot image, EVA-CLIP, SigLIP, image-text embedding]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [ai, Computer-Vision, nlp, multimodal, clip, openai]
+confidence_score: 0.95
+verification_status: applied
+tags: [clip, vision-language, multimodal, contrastive-learning, zero-shot, foundation-model, embedding, openai]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: open_clip / Transformers / CLIP
 ---

-# CLIP (Contrastive Language-Image Pre-training)
+# CLIP

-## 📌 One-Line Insight (The Karpathy Summary)
-> "Bind images and text into a single language to give AI visual literacy." A model proposed by OpenAI that learns visual concepts through language by contrastively training on a vast number of image-caption pairs from the internet; a groundbreaking multimodal model.
+## 📌 One-Line Insight
+> **"A shared embedding space for images and text."** Contrastive learning on internet-scale (image, caption) pairs yields zero-shot vision. Modern variants: SigLIP, EVA-CLIP, OpenCLIP. CLIP-style encoders underpin Stable Diffusion, DALL-E, and multimodal LLMs.

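-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** A vision-language alignment pattern that maps image embeddings and text embeddings into the same shared latent space, so that the image best matching a given text description can be found.
- **Key features:**
-    - **Contrastive Learning:** Trained to place related image-text pairs close together and unrelated pairs far apart.
-    - **Zero-shot Visual Recognition:** Can recognize new objects absent from the training data through text descriptions.
-    - **[[Robustness|Robustness]]:** Generalizes well to diverse real-world images instead of overfitting to a specific dataset (such as ImageNet).
-    - **Foundation for GenAI:** Serves as the core "eye" of text-to-image generation models such as DALL-E and Stable Diffusion.
+## 📖 Core Concepts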

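-## ⚠️ Contradictions & Updates
- **Conflict with past data:** A paradigm shift from learning images only through numeric class labels (e.g., 0=dog, 1=cat) to learning the rich context of images through natural-language descriptions.
- **Policy change:** The Antigravity project's "multimodal knowledge indexing" uses the CLIP architecture to surface images and diagrams in the wiki naturally within text search results.
+### Architecture
+- **Image encoder**: ViT or CNN (ResNet).
+- **Text encoder**: Transformer.
+- **Projection**: both encoder outputs are projected to the same embedding dimension.
+- **Similarity**: cosine similarity between the projected embeddings.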
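
+The pieces above can be sketched in a few lines of plain Python. This is a toy illustration, not CLIP's implementation: the feature vectors, the projection matrices `W_IMG` / `W_TXT`, and the helper names are hypothetical stand-ins for encoder outputs and learned projection heads.

+```python
+import math
+
+def project(features, weight):
+    """Linear projection; weight has shape (out_dim, in_dim)."""
+    return [sum(w * x for w, x in zip(row, features)) for row in weight]
+
+def l2_normalize(vec):
+    norm = math.sqrt(sum(v * v for v in vec))
+    return [v / norm for v in vec]
+
+def cosine_similarity(a, b):
+    a, b = l2_normalize(a), l2_normalize(b)
+    return sum(x * y for x, y in zip(a, b))
+
+# Hypothetical encoder outputs with different widths (image: 4-d, text: 3-d).
+image_features = [1.0, 0.0, 2.0, 1.0]
+text_features = [0.5, 1.0, 0.0]
+
+# Hypothetical projection heads mapping both modalities into a shared 2-d space.
+W_IMG = [[1.0, 0.0, 0.0, 1.0],
+         [0.0, 1.0, 1.0, 0.0]]
+W_TXT = [[2.0, 1.0, 0.0],
+         [0.0, 1.0, 1.0]]
+
+image_emb = project(image_features, W_IMG)  # [2.0, 2.0]
+text_emb = project(text_features, W_TXT)    # [2.0, 1.0]
+score = cosine_similarity(image_emb, text_emb)
+print(f'cosine similarity: {score:.3f}')
+```

+Only after projection do the two modalities live in the same space, which is why the similarity is computed on the projected vectors rather than on the raw encoder outputs.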

-## 🔗 Knowledge Links (Graph)
- [[Transformer-Architecture|Transformer-Architecture]], [[Zero-Shot-Learning|Zero-Shot-Learning]], [[Representation-Learning|Representation-Learning]], [[LLM|LLM]]
- **Raw Source:** 10_Wiki/Topics/AI/CLIP.md
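+### Training
+- Each batch holds N (image, text) pairs.
+- Cross-entropy is applied over the N×N similarity matrix.
+- Diagonal (matched) entries are maximized; off-diagonal entries are minimized.
+- Training data: internet-scale pairs, e.g. LAION-5B for OpenCLIP (the original CLIP used a 400M-pair web dataset).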

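-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### InfoNCE loss
+$$L_i = -\log\frac{\exp(\mathrm{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j) / \tau)}$$

-**When to use this knowledge:**
- *(TODO)*
+- τ is a learnable temperature and sim is cosine similarity. This is the image-to-text term for pair i; the full loss averages it with the symmetric text-to-image term over the batch.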
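
+A minimal sketch of the symmetric loss, assuming the N×N cosine-similarity matrix has already been computed. The matrices `aligned` and `shuffled` are made-up values for illustration, and τ is fixed here rather than learned.

+```python
+import math
+
+def log_softmax(row):
+    m = max(row)  # subtract the max for numerical stability
+    lse = m + math.log(sum(math.exp(x - m) for x in row))
+    return [x - lse for x in row]
+
+def info_nce(sim, tau=0.07):
+    """Symmetric InfoNCE over an N x N similarity matrix.
+
+    sim[i][j] is the similarity of image i and text j;
+    matched pairs sit on the diagonal.
+    """
+    n = len(sim)
+    logits = [[s / tau for s in row] for row in sim]
+    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
+    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
+    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
+    return (loss_i2t + loss_t2i) / 2
+
+# Made-up similarities: matched pairs score high on the diagonal...
+aligned = [[0.9, 0.1, 0.0],
+           [0.1, 0.8, 0.2],
+           [0.0, 0.2, 0.9]]
+# ...versus a batch where the first two captions are swapped.
+shuffled = [[0.1, 0.9, 0.0],
+            [0.8, 0.1, 0.2],
+            [0.2, 0.0, 0.9]]
+
+loss_aligned = info_nce(aligned)
+loss_shuffled = info_nce(shuffled)
+print(f'aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}')
+```

+When every entry of the matrix is equal, the loss reduces to log N, which is a handy sanity check for an untrained model.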

-**When not to use it:**
- *(TODO)*
+### Zero-shot classification
+1. Build one candidate text prompt per class: "a photo of a {class}".
+2. Embed the image.
+3. Select the text with the highest similarity.
+- Roughly 76% zero-shot top-1 on ImageNet (CLIP ViT-L), comparable to supervised baselines.

-## 🧪 Validation Status
+### Variants

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 auto-normalization. Body needs verification.)*
+#### OpenCLIP (LAION)
+- Open-source reproduction.
+- Available in many model sizes and training datasets.

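-## 🧬 Duplicate Check
+#### EVA-CLIP (BAAI)
+- Initialized from masked image modeling (MIM) pre-training.
+- Reported state-of-the-art results at large scale.

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (auto-normalization)
- **Reason:** Phase 1 normalization: fill in the old template and missing fields.
+#### SigLIP (Google 2023)
+- Pairwise sigmoid loss (instead of a batch-wide softmax).
+- Robust to batch size, since no normalization across the batch is needed.
+- Performs better at smaller batch sizes.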
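
+A rough sketch of the pairwise sigmoid loss, assuming a precomputed N×N similarity matrix. The scale `t` and bias `b` are learned in SigLIP; the fixed values below and the example matrices are chosen only for illustration.

+```python
+import math
+
+def log_sigmoid(x):
+    # Numerically stable log(sigmoid(x)).
+    if x >= 0:
+        return -math.log1p(math.exp(-x))
+    return x - math.log1p(math.exp(x))
+
+def sigmoid_pair_loss(sim, t=10.0, b=-10.0):
+    """Each (image i, text j) pair is an independent binary classification:
+    label +1 on the diagonal (matched), -1 everywhere else."""
+    n = len(sim)
+    total = 0.0
+    for i in range(n):
+        for j in range(n):
+            z = 1.0 if i == j else -1.0
+            total += log_sigmoid(z * (t * sim[i][j] + b))
+    return -total / n
+
+# Made-up similarity matrices for illustration.
+aligned = [[0.9, 0.1],
+           [0.2, 0.8]]
+swapped = [[0.1, 0.9],
+           [0.8, 0.2]]
+
+loss_aligned = sigmoid_pair_loss(aligned)
+loss_swapped = sigmoid_pair_loss(swapped)
+print(f'aligned: {loss_aligned:.4f}  swapped: {loss_swapped:.4f}')
+```

+No term depends on a sum over the rest of the batch, which is the property behind the batch-size robustness noted above.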

-## 🕓 Changelog
+#### MetaCLIP (Meta)
+- Makes the data curation recipe explicit.
+- Prioritizes data quality over quantity.

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+#### CLIP-LoRA fine-tune
+- Adapts CLIP to a target domain.
+- Works in few-shot settings.

-## 💻 Code Patterns
+### Applications

-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
+#### Zero-shot classification
+- ImageNet, medical, and satellite imagery.
+- No fine-tuning required.

-```text
-# TODO
+#### Image search
+- Text-to-image search (e.g., Pinterest).
+- Results are ranked by embedding similarity.
+
+#### Stable Diffusion / DALL-E
+- Serves as the text encoder.
+- Conditions generation through cross-attention.
+
+#### Multi-modal LLM (LLaVA, GPT-4V)
+- Serves as the vision encoder.
+- Its output features are fed to the LLM as input.
+
+#### CLIP score (eval)
+- Measures how well a generated image aligns with its prompt.
+
+#### OWL-ViT (open-vocab detection)
+- Extends CLIP to object detection.
+
+#### Retrieval (CLIP4Clip, ViCLIP)
+- Text-to-video retrieval.
+
+### Limitations
+1. **Compositional**: weak on prompts such as "red cube on blue ball".
+2. **Counting**: often wrong on prompts such as "3 dogs".
+3. **OCR**: fails on small text in some cases.
+4. **Spatial**: confuses relations such as left/right.
+5. **Fine-grained**: struggles with categories such as bird species.
+6. **Bias**: inherits the biases of web data.
+7. **Adversarial**: vulnerable to typographic attacks.
+
+## 💻 Patterns
+
+### Zero-shot classification (open_clip)
+```python
+import torch
+import open_clip
+from PIL import Image
+
+model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
+model.eval()  # inference mode
+tokenizer = open_clip.get_tokenizer('ViT-L-14')
+
+image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # add batch dim: (1, 3, H, W)
+classes = ['a photo of a cat', 'a photo of a dog', 'a photo of a fish']
+text = tokenizer(classes)
+
+with torch.no_grad():
+    image_features = model.encode_image(image)
+    text_features = model.encode_text(text)
+    # L2-normalize so the dot product equals cosine similarity
+    image_features /= image_features.norm(dim=-1, keepdim=True)
+    text_features /= text_features.norm(dim=-1, keepdim=True)
+
+    # Scale by 100 (inverse temperature), then softmax over the candidate prompts
+    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+for cls, prob in zip(classes, probs[0]):
+    print(f'{cls}: {prob.item():.3f}')
 ```

-## 🤔 Decision Criteria
+### Image-text retrieval
+```python
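+def search_images(query, image_index, top_k=5):
+    """Return indices of the top_k images most similar to a text query.
+
+    Reuses `model` and `tokenizer` from the zero-shot example.
+    image_index: (N, D) tensor of L2-normalized image embeddings.
+    """
+    text = tokenizer([query])
+    with torch.no_grad():
+        text_emb = model.encode_text(text)
+        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
+
+    similarity = text_emb @ image_index.T  # (1, N) cosine similarities
+    top_k_idx = similarity[0].topk(top_k).indices
+    return top_k_idx.tolist()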
+```

-**When to use option A:**
- *(TODO)*
+### Fine-tune (CLIP + LoRA)
+```python
+import torch
+import torch.nn.functional as F
+import open_clip
+from peft import LoraConfig, get_peft_model

-**When to use option B:**
- *(TODO)*
+model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

-**Default:**
-> *(TODO)*
+# LoRA wraps nn.Linear modules, not raw parameters such as attn.in_proj_weight.
+# Assumption: open_clip names the MLP linears in each ViT block `c_fc` / `c_proj`;
+# check with model.visual.named_modules() before relying on this.
+lora_config = LoraConfig(
+    r=16, lora_alpha=32,
+    target_modules=['c_fc', 'c_proj'],
+    lora_dropout=0.1,
+)
+model.visual = get_peft_model(model.visual, lora_config)

-## ❌ Anti-Patterns
+# Symmetric contrastive loss over a batch of matched (image, text) pairs
+def contrastive_loss(image_emb, text_emb, temp=0.07):
+    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
+    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
+
+    logits = (image_emb @ text_emb.T) / temp  # (N, N); matched pairs on the diagonal
+    labels = torch.arange(len(image_emb), device=image_emb.device)
+    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
+```

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+### CLIP score (eval generated image)
+```python
+def clip_score(image, prompt):
+    """Cosine similarity between a PIL image and a prompt (reuses model, preprocess, tokenizer)."""
+    img = preprocess(image).unsqueeze(0)
+    txt = tokenizer([prompt])
+    with torch.no_grad():
+        img_emb = model.encode_image(img)
+        txt_emb = model.encode_text(txt)
+        img_emb /= img_emb.norm(dim=-1, keepdim=True)
+        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
+    return (img_emb @ txt_emb.T).item()  # raw cosine similarity in [-1, 1]; higher means better aligned
+```
+
+### SigLIP (Google)
+```python
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModel
+
+processor = AutoProcessor.from_pretrained('google/siglip-large-patch16-384')
+model = AutoModel.from_pretrained('google/siglip-large-patch16-384')
+
+image = Image.open('cat.jpg')
+# SigLIP was trained with max-length padding, so use the same at inference
+inputs = processor(text=['a cat', 'a dog'], images=image, return_tensors='pt', padding='max_length')
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Sigmoid gives an independent probability per (image, text) pair; do not apply softmax
+probs = torch.sigmoid(outputs.logits_per_image)
+```
+
+### LLaVA-style (CLIP + LLM)
+```python
+import torch
+import torch.nn as nn
+from transformers import CLIPVisionModel, LlamaForCausalLM
+
+class LLaVA(nn.Module):
+    def __init__(self):
+        super().__init__()  # required before registering submodules
+        self.vision_encoder = CLIPVisionModel.from_pretrained('openai/clip-vit-large-patch14-336')
+        self.projector = nn.Linear(1024, 4096)  # CLIP ViT-L hidden size → LLM hidden size
+        self.llm = LlamaForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
+
+    def forward(self, image, text_input_ids):
+        # Per-token features: (B, num_patches + 1, 1024), CLS token included
+        image_features = self.vision_encoder(pixel_values=image).last_hidden_state
+        image_tokens = self.projector(image_features)
+
+        # Prepend the projected image tokens to the text token embeddings
+        text_embeds = self.llm.get_input_embeddings()(text_input_ids)
+        full_embeds = torch.cat([image_tokens, text_embeds], dim=1)
+
+        return self.llm(inputs_embeds=full_embeds)
+```
+
+### OWL-ViT (open-vocab detection)
+```python
+import torch
+from PIL import Image
+from transformers import OwlViTProcessor, OwlViTForObjectDetection
+
+processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
+model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')
+
+image = Image.open('cat.jpg')
+texts = [['a cat', 'a dog', 'a remote']]
+inputs = processor(text=texts, images=image, return_tensors='pt')
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# PIL reports (W, H); post-processing expects (H, W) to rescale boxes to the original image
+target_sizes = torch.tensor([image.size[::-1]])
+results = processor.post_process_object_detection(outputs=outputs, threshold=0.5, target_sizes=target_sizes)
+for box, score, label in zip(results[0]['boxes'], results[0]['scores'], results[0]['labels']):
+    print(f'{texts[0][label]}: {score.item():.2f} {box.tolist()}')
+```
+
+## 🤔 Decision Criteria
+| Application | Model |
+|---|---|
+| Zero-shot classification | OpenCLIP / SigLIP |
+| Image retrieval | CLIP + FAISS |
+| Generation conditioning | CLIP / T5 (newer) |
+| Multi-modal LLM | CLIP encoder + LLM |
+| Open-vocab detection | OWL-ViT / Grounding DINO |
+| Eval generated image | CLIP score |
+| Fine-grained | DINOv2 / SigLIP |
+| Domain adaptation | CLIP + LoRA |
+
+**Default**: SigLIP (modern) > OpenCLIP > original CLIP. For generation, rely on the text encoders built into SDXL / Flux.
+
+## 🔗 Graph
+- Parents: [[Multimodal-Learning]] · [[Contrastive-Learning]] · [[Foundation-Model]]
+- Variants: [[OpenCLIP]] · [[SigLIP]] · [[EVA-CLIP]] · [[MetaCLIP]] · [[ALIGN]]
+- Applications: [[Stable-Diffusion]] · [[DALL-E]] · [[LLaVA]] · [[GPT-4V]] · [[OWL-ViT]] · [[Grounding-DINO]]
+- Adjacent: [[Vision-Transformer]] · [[BERT]] · [[Sentence-Transformers]] · [[Zero-Shot-Learning]]
+
+## 🤖 LLM Usage
+**When to use**: multimodal tasks, image search, zero-shot classification, generation conditioning, vision input for an LLM.
+**When not to use**: fine-grained classification (use a specialist model), OCR-heavy tasks.
+
+## ❌ Anti-patterns
+- **Using CLIP for every task**: it is weak on fine-grained and OCR tasks.
+- **No domain adaptation**: off-the-shelf CLIP is weak on medical / satellite imagery.
+- **Expecting compositional reasoning**: prompts such as "red on blue" fail.
+- **Expecting counting**: counts are unreliable.
+- **Trusting adversarial input**: typographic attacks can flip predictions.
+- **Single prompt template**: an ensemble of templates usually does better.
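
+The last item can be sketched as follows: embed several prompt templates per class, average the normalized embeddings, and renormalize. `toy_embed_text` is a hypothetical stand-in for a real text encoder such as `model.encode_text`, and the template list is illustrative only.

+```python
+import math
+
+TEMPLATES = ['a photo of a {}.', 'a blurry photo of a {}.', 'a drawing of a {}.']
+
+def l2_normalize(vec):
+    norm = math.sqrt(sum(v * v for v in vec))
+    return [v / norm for v in vec]
+
+def ensemble_class_embedding(class_name, embed_text, templates=TEMPLATES):
+    """Average the normalized embeddings of several prompts, then renormalize."""
+    embs = [l2_normalize(embed_text(t.format(class_name))) for t in templates]
+    mean = [sum(col) / len(embs) for col in zip(*embs)]
+    return l2_normalize(mean)
+
+def classify(image_emb, class_embs):
+    """Pick the class whose ensembled text embedding is closest to the image."""
+    image_emb = l2_normalize(image_emb)
+    scores = {name: sum(a * b for a, b in zip(image_emb, emb))
+              for name, emb in class_embs.items()}
+    return max(scores, key=scores.get)
+
+def toy_embed_text(prompt):
+    # Hypothetical encoder: a class-dependent direction plus template-dependent noise.
+    base = [1.0, 0.1] if 'cat' in prompt else [0.1, 1.0]
+    jitter = (len(prompt) % 3) * 0.05
+    return [base[0] + jitter, base[1] - jitter]
+
+class_embs = {c: ensemble_class_embedding(c, toy_embed_text) for c in ['cat', 'dog']}
+prediction = classify([0.9, 0.2], class_embs)
+print(prediction)
+```

+Averaging happens once per class, so the ensembled embeddings can be cached and reused for every image at no extra inference cost.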
+
+## 🧪 Validation / Duplicates
+- Verified (Radford 2021 CLIP, Zhai 2023 SigLIP, OpenCLIP).
+- Trust level A.
+- Related: [[Stable-Diffusion]] · [[Foundation-Model]] · [[Multimodal-Learning]] · [[Vision-Transformer]] · [[Sentence-Transformers]].
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: variants + InfoNCE + OpenCLIP / SigLIP / LLaVA / OWL-ViT code |