---
id: wiki-2026-0508-clip
title: CLIP (Contrastive Language-Image Pre-training)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [CLIP, OpenCLIP, contrastive vision-language, zero-shot image, EVA-CLIP, SigLIP, image-text embedding]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [clip, vision-language, multimodal, contrastive-learning, zero-shot, foundation-model, embedding, openai]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: open_clip / Transformers / CLIP
---
# CLIP
## 📌 One-line insight
> **"A shared embedding space for images and text."** Contrastive learning on internet-scale (image, caption) pairs yields zero-shot vision. Modern variants: SigLIP, EVA-CLIP, OpenCLIP. CLIP is the base encoder behind Stable Diffusion, DALL-E, and the vision side of multimodal LLMs.
## 📖 Core
### Architecture
- **Image encoder**: ViT or CNN (ResNet).
- **Text encoder**: Transformer.
- **Projection**: both encoder outputs are projected to the same dimension.
- Image-text similarity is the cosine similarity of the projected embeddings.
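
The projection step can be sketched in plain Python with made-up numbers. The encoder outputs below have different widths (3 for the image, 2 for the text), and two hypothetical projection matrices `W_img` and `W_txt` (names and values are assumptions for illustration) map them into one shared 2-dimensional space where cosine similarity is defined. Real models learn these matrices and use hundreds of dimensions.

```python
import math

def project(vec, weight):
    """Linear projection; weight has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy encoder outputs with different widths (illustrative values only).
image_feature = [1.0, 0.0, 2.0]   # stands in for a ViT output, dim 3
text_feature = [0.5, 1.5]         # stands in for a text Transformer output, dim 2

# Hypothetical learned projections into a shared 2-d space.
W_img = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]         # 3 -> 2
W_txt = [[1.0, 0.0],
         [0.0, 1.0]]              # 2 -> 2

image_emb = project(image_feature, W_img)
text_emb = project(text_feature, W_txt)
score = cosine_similarity(image_emb, text_emb)
print(f"cosine similarity: {score:.4f}")
```

Because both embeddings are normalized before the dot product, the score depends only on direction, not on the magnitude of either encoder's output.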
### Training
- Each batch holds N (image, text) pairs.
- Cross-entropy is applied over the N×N similarity matrix.
- Maximize the diagonal (matched pairs); minimize the off-diagonal entries.
- Data: internet-scale pairs such as LAION-5B.
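### InfoNCE loss
$$L_i = -\log\frac{\exp(\mathrm{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j) / \tau)}$$
- τ = temperature (learnable).
- This is the image-to-text term for pair i; the text-to-image term swaps the roles of I and T, and the training loss averages both over the batch.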
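
A framework-free sketch of this loss, assuming the cosine similarities are already computed and that both directions (image-to-text and text-to-image) are averaged. The function names and the two toy matrices are illustrative, and `tau=0.07` is a fixed stand-in for the learnable temperature.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one row."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_loss(sim, tau=0.07):
    """Symmetric InfoNCE over an N x N cosine-similarity matrix.

    sim[i][j] = sim(I_i, T_j); matched pairs lie on the diagonal.
    """
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text -> image direction
    loss_i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = [[0.9, 0.1], [0.1, 0.9]]    # matched pairs score highest
confused = [[0.5, 0.5], [0.5, 0.5]]   # pairs are indistinguishable
loss_aligned = clip_loss(aligned)
loss_confused = clip_loss(confused)
print(loss_aligned, loss_confused)
```

With indistinguishable pairs the loss equals log N (here log 2), the value of a uniform guess; a small τ sharpens the softmax, so even a modest similarity gap drives the aligned loss close to zero.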
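### Zero-shot classification
1. Build one candidate text prompt per class: "a photo of a {class}".
2. Embed the image and every prompt.
3. Select the class whose prompt embedding is most similar to the image embedding.
- About 76% zero-shot top-1 on ImageNet (CLIP ViT-L), comparable to a supervised ResNet-50.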
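
The selection step can be illustrated without a model. The sketch below assumes unit-norm embeddings and a logit scale of 100 (a common choice with released checkpoints, treated here as an assumption); the embedding values are invented for illustration and `zero_shot_probs` is a hypothetical helper name.

```python
import math

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities (embeddings assumed unit-norm)."""
    logits = [logit_scale * sum(a * b for a, b in zip(image_emb, t))
              for t in text_embs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a fish"]
# Toy unit-norm embeddings (illustrative values, not model output).
image_emb = [0.6, 0.8]
text_embs = [[0.6, 0.8], [0.8, 0.6], [1.0, 0.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = prompts[max(range(len(prompts)), key=probs.__getitem__)]
print(best, [round(p, 3) for p in probs])
```

The scale matters: the cat and dog prompts differ by only 0.04 in cosine similarity, yet after multiplying by 100 the softmax assigns nearly all probability to the cat prompt.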
### Variants
#### OpenCLIP (LAION)
- Open-source reproduction.
- Many model sizes and training datasets.
#### EVA-CLIP (BAAI)
- Initialized from masked image modeling weights.
- Reported state-of-the-art results at scale.
#### SigLIP (Google 2023)
- Sigmoid loss (vs softmax).
- Robust to batch size.
- Better at smaller batch sizes.
#### MetaCLIP (Meta)
- Specifies the data curation recipe.
- Quality over quantity.
#### CLIP-LoRA fine-tune
- Domain adaptation.
- Few-shot.
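
A minimal sketch of a pairwise sigmoid loss in the spirit of SigLIP, assuming a precomputed cosine-similarity matrix. The scale `t=10.0` and bias `b=-10.0` are fixed here for illustration (they are learnable in practice, and these values are an assumption about typical initialization); `siglip_loss` is a hypothetical helper name.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(sim, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over an N x N cosine-similarity matrix.

    Each (i, j) cell is an independent binary problem: label +1 on the
    diagonal (matched pair), -1 elsewhere. No softmax across the batch.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total += log_sigmoid(z * (t * sim[i][j] + b))
    return -total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]   # every pair mismatched
loss_aligned = siglip_loss(aligned)
loss_swapped = siglip_loss(swapped)
print(loss_aligned, loss_swapped)
```

Since every cell is scored on its own, no normalization over the whole batch is needed, which is one way to read the "robust to batch size" point above.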
### Applications
#### Zero-shot classification
- ImageNet, medical, satellite imagery.
- No fine-tuning required.
#### Image search
- Text → image retrieval (e.g., Pinterest).
- Ranked by embedding similarity.
#### Stable Diffusion / DALL-E
- CLIP serves as the text encoder.
- Conditions generation through cross-attention.
#### Multi-modal LLM (LLaVA, GPT-4V)
- CLIP serves as the vision encoder.
- Its output is fed to the LLM as input.
#### CLIP score (eval)
- Measures how well a generated image aligns with its prompt.
#### OWL-ViT (open-vocab detection)
- Extends CLIP to object detection.
#### Retrieval (CLIP4Clip, ViCLIP)
- Text-to-video retrieval.
### Limitations
1. **Compositional**: weak on prompts like "red cube on blue ball".
2. **Counting**: gets "3 dogs" wrong.
3. **OCR**: fails on small text (in some cases).
4. **Spatial**: confuses left/right.
5. **Fine-grained**: struggles with e.g. bird species.
6. **Bias**: inherits the biases of web data.
7. **Adversarial**: vulnerable to typographic attacks.
## 💻 Patterns
### Zero-shot classification (open_clip)
```python
import torch
import open_clip
from PIL import Image

# Load an OpenCLIP ViT-L/14 checkpoint and its matching preprocessing.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # add batch dimension
classes = ['a photo of a cat', 'a photo of a dog', 'a photo of a fish']
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product is cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scale by 100, then softmax over the candidate classes.
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f'{cls}: {prob.item():.3f}')
```
### Image-text retrieval
```python
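def search_images(query, image_index, top_k=5):
    """Return indices of the top_k images most similar to a text query.

    Reuses `model` and `tokenizer` from the zero-shot example.
    image_index: (N, D) tensor of L2-normalized image embeddings.
    """
    text = tokenizer([query])
    with torch.no_grad():
        text_emb = model.encode_text(text)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = text_emb @ image_index.T  # (1, N) cosine similarities
    top_k_idx = similarity[0].topk(top_k).indices
    return top_k_idx.tolist()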
```
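
`search_images` expects `image_index` to be L2-normalized already. The toy sketch below shows why in plain Python: normalizing each row once at build time makes a plain dot product equal to cosine similarity at query time. The helper names (`build_index`, `top_k`) and the vectors are hypothetical; a real index would hold image-encoder outputs.

```python
import heapq
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def build_index(raw_embeddings):
    """Normalize every row once so that dot product equals cosine similarity."""
    return [l2_normalize(row) for row in raw_embeddings]

def top_k(query_emb, index, k=2):
    """Return indices of the k most similar rows, best first."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, row)) for row in index]
    return heapq.nlargest(k, range(len(index)), key=scores.__getitem__)

# Toy image embeddings (illustrative values, not model output).
index = build_index([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
hits = top_k([2.0, 0.1], index, k=2)
print(hits)
```

Row 0 has the largest raw norm, but after normalization it ranks first only because it points in nearly the same direction as the query.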
### Fine-tune (CLIP + LoRA)
```python
import torch
import torch.nn.functional as F
import open_clip
from peft import LoraConfig, get_peft_model

model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

# PEFT matches module names, so 'attn.in_proj_weight' (a parameter of
# nn.MultiheadAttention, not a module) cannot be targeted. Assumption: adapt
# the MLP linears (c_fc / c_proj in open_clip's residual blocks) instead.
lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=['c_fc', 'c_proj'],
    lora_dropout=0.1,
)
model.visual = get_peft_model(model.visual, lora_config)

# Contrastive training on (image, text) pairs: symmetric cross-entropy.
def contrastive_loss(image_emb, text_emb, temp=0.07):
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    logits = (image_emb @ text_emb.T) / temp
    labels = torch.arange(len(image_emb), device=image_emb.device)  # diagonal = matched
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```
### CLIP score (eval generated image)
```python
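def clip_score(image, prompt):
    """Cosine similarity between a PIL image and a text prompt.

    Reuses `model`, `preprocess`, and `tokenizer` from the zero-shot example.
    """
    img = preprocess(image).unsqueeze(0)
    txt = tokenizer([prompt])
    with torch.no_grad():
        img_emb = model.encode_image(img)
        txt_emb = model.encode_text(txt)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()  # cosine similarity in [-1, 1], not [0, 1]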
```
### SigLIP (Google)
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained('google/siglip-large-patch16-384')
model = AutoModel.from_pretrained('google/siglip-large-patch16-384')

image = Image.open('cat.jpg')  # placeholder path
inputs = processor(text=['a cat', 'a dog'], images=image, return_tensors='pt', padding='max_length')
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid gives an independent probability per (image, text) pair; no softmax.
probs = torch.sigmoid(outputs.logits_per_image)
```
### LLaVA-style (CLIP + LLM)
```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM

class LLaVA(nn.Module):
    """Simplified LLaVA-style wiring: CLIP features -> linear projector -> LLM."""

    def __init__(self):
        super().__init__()  # required before registering submodules
        self.vision_encoder = CLIPVisionModel.from_pretrained('openai/clip-vit-large-patch14-336')
        self.projector = nn.Linear(1024, 4096)  # CLIP ViT-L hidden dim -> LLM hidden dim
        self.llm = LlamaForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

    def forward(self, image, text_input_ids):
        image_features = self.vision_encoder(image).last_hidden_state  # (B, tokens, 1024)
        image_tokens = self.projector(image_features)
        # Prepend the projected image tokens to the text embeddings.
        text_embeds = self.llm.model.embed_tokens(text_input_ids)
        full_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=full_embeds)
```
### OWL-ViT (open-vocab detection)
```python
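import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')

image = Image.open('cat.jpg')  # placeholder path
texts = [['a cat', 'a dog', 'a remote']]
inputs = processor(text=texts, images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# PIL size is (W, H); post-processing expects (H, W).
# The method name depends on the transformers version: older releases expose
# post_process, newer ones post_process_grounded_object_detection.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.5)
for box, score, label in zip(results[0]['boxes'], results[0]['scores'], results[0]['labels']):
    print(f'{texts[0][label]}: {score.item():.2f} {box.tolist()}')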
```
## 🤔 Decision criteria
| Application | Model |
|---|---|
| Zero-shot class | OpenCLIP / SigLIP |
| Image retrieval | CLIP + FAISS |
| Generation conditioning | CLIP / T5 (newer) |
| Multi-modal LLM | CLIP encoder + LLM |
| Open-vocab detection | OWL-ViT / Grounding DINO |
| Eval generated image | CLIP score |
| Fine-grained | DINOv2 / SigLIP |
| Domain adapt | CLIP + LoRA |
**Default**: SigLIP (modern) > OpenCLIP > original CLIP. For generation, rely on the text encoders built into SDXL / Flux.
## 🔗 Graph
- Parent: [[Contrastive-Learning]] · [[Foundation-Model]]
- Variants: [[OpenCLIP]] · [[SigLIP]] · [[EVA-CLIP]]
- Applications: [[Stable-Diffusion]] · [[DALL-E]] · [[GPT-4V]]
- Adjacent: [[Transformer_Architecture_and_LLM_Foundations|BERT]] · [[Sentence-Transformers]]
## 🤖 LLM usage
**When**: multimodal tasks, image search, zero-shot classification, generation conditioning, giving an LLM vision.
**When not**: fine-grained classification (specialist domains), OCR-heavy tasks.
## ❌ Anti-patterns
- **CLIP for every task**: weak on fine-grained and OCR-heavy tasks.
- **No domain adaptation**: weak on medical / satellite imagery.
- **Expecting compositional reasoning**: fails on "red on blue".
- **Expecting counting**: unreliable.
- **Trusting adversarial input**: typographic attacks.
- **Single prompt template**: an ensemble of templates usually works better.
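
Prompt ensembling, as mentioned in the last item, is usually done by averaging the text embeddings of several templates per class and renormalizing. A minimal sketch with stand-in vectors follows; the templates, the embedding values, and the `ensemble_embedding` helper are illustrative assumptions, and a real pipeline would obtain the embeddings from the text encoder.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def ensemble_embedding(template_embs):
    """Average unit-norm prompt embeddings for one class, then renormalize."""
    normed = [l2_normalize(e) for e in template_embs]
    dim = len(normed[0])
    mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
    return l2_normalize(mean)

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]
prompts = [t.format("cat") for t in templates]

# Stand-in embeddings for the three prompts (illustrative values only).
prompt_embs = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
class_emb = ensemble_embedding(prompt_embs)
print(prompts, class_emb)
```

The resulting `class_emb` replaces the single-prompt text embedding in zero-shot classification, so the per-image cost at inference is unchanged.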
## 🧪 Verification / duplicates
- Verified (Radford 2021 CLIP, Zhai 2023 SigLIP, OpenCLIP).
- Trust level A.
- Related: [[Stable-Diffusion]] · [[Foundation-Model]] · [[Multimodal-Learning]] · [[Vision-Transformer]] · [[Sentence-Transformers]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: variants + InfoNCE + OpenCLIP / SigLIP / LLaVA / OWL-ViT code |