---
id: wiki-2026-0508-clip
title: CLIP (Contrastive Language-Image Pre-training)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [CLIP, OpenCLIP, contrastive vision-language, zero-shot image, EVA-CLIP, SigLIP, image-text embedding]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [clip, vision-language, multimodal, contrastive-learning, zero-shot, foundation-model, embedding, openai]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: open_clip / Transformers / CLIP
---

# CLIP

## 📌 One-line insight

> **"A shared embedding space for images and text."** Contrastive learning on internet-scale (image, caption) pairs yields zero-shot vision. Modern variants include SigLIP, EVA-CLIP, and OpenCLIP. CLIP-style encoders underlie Stable Diffusion, DALL-E, and the vision encoders of multimodal LLMs.

## 📖 Core

### Architecture

- **Image encoder**: ViT or CNN (ResNet).
- **Text encoder**: Transformer.
- **Projection**: both encoders are projected to the same dimension.
- **Similarity**: cosine similarity between the projected embeddings.

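
To make the shared-space idea concrete, here is a minimal sketch of the last two steps: L2-normalizing projected embeddings and taking dot products to get cosine similarities. The vectors are made-up 3-dimensional values, not model outputs, and the helper names `l2_normalize` and `similarity_matrix` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    """Return the N x M matrix of cosine similarities between two embedding sets."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]

# Toy "projected" embeddings (illustrative numbers only).
image_embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
text_embs = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
sims = similarity_matrix(image_embs, text_embs)
```

Because both sides are normalized, vector length carries no information and only direction in the shared space matters.
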
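### Training

- Each batch holds N (image, text) pairs.
- Cross-entropy is applied over the N×N similarity matrix.
- Diagonal (matched) entries are maximized; off-diagonal entries are minimized.
- Data are internet-scale pairs: the original CLIP used about 400M pairs, and OpenCLIP models train on LAION datasets such as LAION-5B.

### InfoNCE loss

$$L_i = -\log\frac{e^{\mathrm{sim}(I_i, T_i) / \tau}}{\sum_j e^{\mathrm{sim}(I_i, T_j) / \tau}}$$

- τ is the temperature (learnable).
- This is the image-to-text term for pair i; CLIP averages it with the symmetric text-to-image term.
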
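
The loss can be checked by hand on a tiny similarity matrix. The sketch below is a dependency-free illustration of the symmetric form (image-to-text plus text-to-image, averaged); the function names and the fixed `tau=0.07` are assumptions for the example, whereas in training τ is learned.

```python
import math

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_infonce(sim, tau=0.07):
    """Symmetric InfoNCE over an N x N cosine-similarity matrix."""
    logits = [[s / tau for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

aligned = [[1.0, 0.0], [0.0, 1.0]]        # matched pairs dominate the diagonal
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # every pair looks the same
loss_aligned = clip_infonce(aligned)
loss_uniform = clip_infonce(uninformative)
```

With an uninformative matrix the loss equals log N (here log 2), which is the chance-level value; a well-aligned batch drives it toward zero.
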
### Zero-shot classification

1. Build candidate texts from a prompt template: "a photo of a {class}".
2. Compute the image embedding.
3. Select the text with the highest similarity.

- About 76% zero-shot top-1 accuracy on ImageNet (CLIP ViT-L/14 at 336 px), comparable to a supervised ResNet-50.

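
The three steps reduce to a softmax over scaled cosine similarities. The following sketch uses made-up similarity values in place of real encoder outputs; `zero_shot_probs` is a hypothetical helper, and the scale of 100 is an assumption borrowed from common CLIP inference code.

```python
import math

def zero_shot_probs(cosine_sims, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each prompt."""
    logits = [logit_scale * s for s in cosine_sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "fish"]
prompts = [f"a photo of a {c}" for c in classes]  # step 1: candidate texts
cosine_sims = [0.28, 0.22, 0.18]  # steps 2-3 stand-in: illustrative values only
probs = zero_shot_probs(cosine_sims)
predicted = classes[probs.index(max(probs))]
```

Raw cosine similarities between images and captions tend to sit in a narrow band, so the large scale factor is what turns small gaps into a confident distribution.
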
### Variants

#### OpenCLIP (LAION)

- Open-source reproduction.
- Many model sizes and training datasets.

#### EVA-CLIP (BAAI)

- Initialized from masked-image-modeling pre-training.
- Reported state-of-the-art results at scale.

#### SigLIP (Google 2023)

- Sigmoid loss instead of softmax.
- Robust to batch size.
- Performs better at smaller batch sizes.

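
A rough sketch of the pairwise sigmoid loss may help show why it does not depend on batch-wide normalization: every (image, text) pair is scored as an independent binary decision. The defaults `t=10.0` and `b=-10.0` are assumptions based on the initialization described in the SigLIP paper, and the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def siglip_loss(sim, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over an N x N cosine-similarity matrix."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            total += log_sigmoid(z * (t * sim[i][j] + b))
    return -total / n

matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = siglip_loss(matched)
loss_swapped = siglip_loss(swapped)
```

Each term involves a single pair, so there is no softmax denominator over the whole batch, unlike the InfoNCE formulation.
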
#### MetaCLIP (Meta)

- Specifies the data curation process.
- Quality over quantity.

#### CLIP-LoRA fine-tuning

- Adapts CLIP to a target domain.
- Suited to few-shot settings.

### Applications

#### Zero-shot classification

- ImageNet, medical, and satellite imagery.
- No fine-tuning required.

#### Image search

- Text-to-image search (e.g., Pinterest).
- Ranks results by embedding similarity.

#### Stable Diffusion / DALL-E

- Text encoder.
- Conditioning via cross-attention.

#### Multi-modal LLM (LLaVA, GPT-4V)

- Vision encoder.
- Its output features are fed to the LLM as input.

#### CLIP score (eval)

- Measures how well a generated image aligns with its prompt.

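
One widely used definition, the CLIPScore metric of Hessel et al. (2021), clips negative cosine similarities to zero and rescales by a constant w = 2.5. The sketch below applies that rescaling to a precomputed cosine similarity; the input values are illustrative.

```python
def clip_score(cosine_similarity, weight=2.5):
    """Rescaled CLIPScore: weight * max(cos, 0)."""
    return weight * max(cosine_similarity, 0.0)

score_good = clip_score(0.3)    # image and prompt roughly agree
score_bad = clip_score(-0.2)    # negative similarity is clipped to zero
```

Scores are only comparable when computed with the same CLIP checkpoint, since similarity ranges differ between models.
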
#### OWL-ViT (open-vocab detection)

- Extends CLIP to object detection.

#### Retrieval (CLIP4Clip, ViCLIP)

- Video-text retrieval.

### Limitations

1. **Compositional**: weak on prompts such as "red cube on blue ball".
2. **Counting**: often wrong on prompts such as "3 dogs".
3. **OCR**: fails on small text in some cases.
4. **Spatial**: struggles with left/right relations.
5. **Fine-grained**: struggles with categories such as bird species.
6. **Bias**: inherits the biases of web data.
7. **Adversarial**: vulnerable to typographic attacks.

## 💻 Patterns

### Zero-shot classification (open_clip)

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
model.eval()  # inference mode
tokenizer = open_clip.get_tokenizer('ViT-L-14')

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # (1, 3, H, W)
classes = ['a photo of a cat', 'a photo of a dog', 'a photo of a fish']
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product is cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f'{cls}: {prob.item():.3f}')
```

### Image-text retrieval

```python
# Reuses `model` and `tokenizer` from the zero-shot example.
def search_images(query, image_index, top_k=5):
    """Return indices of the top_k images most similar to the text query."""
    text = tokenizer([query])
    with torch.no_grad():
        text_emb = model.encode_text(text)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # image_index: (N, D) tensor of L2-normalized image embeddings
    similarity = text_emb @ image_index.T
    top_k_idx = similarity[0].topk(top_k).indices
    return top_k_idx.tolist()
```

### Fine-tune (CLIP + LoRA)

```python
import torch
import torch.nn.functional as F
import open_clip
from peft import LoraConfig, get_peft_model

model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

lora_config = LoraConfig(
    r=16, lora_alpha=32,
    # LoRA wraps modules, not parameters: `attn.in_proj_weight` is a parameter of
    # nn.MultiheadAttention, so this sketch targets the MLP linear layers instead.
    target_modules=['c_fc', 'c_proj'],
    lora_dropout=0.1,
)
model.visual = get_peft_model(model.visual, lora_config)

# Contrastive training on (image, text) pairs
def contrastive_loss(image_emb, text_emb, temp=0.07):
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    logits = (image_emb @ text_emb.T) / temp
    # Matched pairs sit on the diagonal, so the target for row i is column i
    labels = torch.arange(len(image_emb), device=image_emb.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

### CLIP score (eval generated image)

```python
# Reuses `model`, `preprocess`, and `tokenizer` from the zero-shot example.
def clip_score(image, prompt):
    """Cosine similarity between a PIL image and a text prompt."""
    img = preprocess(image).unsqueeze(0)
    txt = tokenizer([prompt])
    with torch.no_grad():
        img_emb = model.encode_image(img)
        txt_emb = model.encode_text(txt)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()  # cosine similarity in [-1, 1]
```

### SigLIP (Google)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained('google/siglip-large-patch16-384')
model = AutoModel.from_pretrained('google/siglip-large-patch16-384')

image = Image.open('cat.jpg')  # example path
inputs = processor(text=['a cat', 'a dog'], images=image, return_tensors='pt', padding='max_length')
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid gives an independent probability per text, not a softmax over texts
probs = torch.sigmoid(outputs.logits_per_image)
```

### LLaVA-style (CLIP + LLM)

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM

class LLaVA(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = CLIPVisionModel.from_pretrained('openai/clip-vit-large-patch14-336')
        self.projector = nn.Linear(1024, 4096)  # image hidden dim -> LLM hidden dim
        self.llm = LlamaForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

    def forward(self, image, text_input_ids):
        # (B, 577, 1024): one CLS token plus 24x24 patch tokens
        image_features = self.vision_encoder(image).last_hidden_state
        image_tokens = self.projector(image_features)

        # Concatenate projected image tokens with the text embeddings
        text_embeds = self.llm.model.embed_tokens(text_input_ids)
        full_embeds = torch.cat([image_tokens, text_embeds], dim=1)

        return self.llm(inputs_embeds=full_embeds)
```

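
The number of image tokens handed to the LLM follows directly from the ViT patch grid, which matters for the context budget. A small arithmetic sketch, assuming a square input and a single CLS token (`vit_token_count` is a hypothetical helper):

```python
def vit_token_count(image_size, patch_size, include_cls=True):
    """Number of tokens a ViT emits for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if include_cls else 0)

tokens_336 = vit_token_count(336, 14)  # ViT-L/14 at 336 px
tokens_224 = vit_token_count(224, 14)  # ViT-L/14 at 224 px
```

Raising the resolution from 224 to 336 px more than doubles the number of image tokens, so each image takes a correspondingly larger share of the LLM context.
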
### OWL-ViT (open-vocab detection)

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')

image = Image.open('cat.jpg')  # example path
texts = [['a cat', 'a dog', 'a remote']]
inputs = processor(text=texts, images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size; target_sizes is (height, width).
# The post-processing method name differs across transformers versions.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)
for box, score, label in zip(results[0]['boxes'], results[0]['scores'], results[0]['labels']):
    if score > 0.5:
        print(f'{texts[0][label]}: {box}')
```

## 🤔 Decision criteria

| Application | Model |
|---|---|
| Zero-shot classification | OpenCLIP / SigLIP |
| Image retrieval | CLIP + FAISS |
| Generation conditioning | CLIP / T5 (newer) |
| Multi-modal LLM | CLIP encoder + LLM |
| Open-vocab detection | OWL-ViT / Grounding DINO |
| Evaluating generated images | CLIP score |
| Fine-grained | DINOv2 / SigLIP |
| Domain adaptation | CLIP + LoRA |

**Default**: SigLIP (modern) > OpenCLIP > original CLIP. For generation, rely on the text encoders built into SDXL / Flux.

## 🔗 Graph

- Parents: [[Contrastive-Learning]] · [[Foundation-Model]]
- Variants: [[OpenCLIP]] · [[SigLIP]] · [[EVA-CLIP]]
- Applications: [[Stable-Diffusion]] · [[DALL-E]] · [[GPT-4V]]
- Adjacent: [[Transformer_Architecture_and_LLM_Foundations|BERT]] · [[Sentence-Transformers]]

## 🤖 LLM usage

**When**: multimodal tasks, image search, zero-shot classification, generation conditioning, and vision input for LLMs.

**When not**: fine-grained classification (use a specialist model) and OCR-heavy tasks.

## ❌ Anti-patterns

- **Using CLIP for every task**: it is weak on fine-grained and OCR tasks.
- **No domain adaptation**: out-of-the-box CLIP is weak on medical and satellite imagery.
- **Expecting compositional reasoning**: prompts such as "red on blue" fail.
- **Expecting counting**: CLIP does not count reliably.
- **Trusting adversarial input**: typographic attacks can fool it.
- **Single prompt template**: an ensemble of templates usually works better.

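
For the last point, a common way to ensemble templates is to embed every template for a class, average the normalized embeddings, and renormalize the mean into one class vector. The sketch below shows only that averaging step; the template strings and the placeholder vectors are illustrative, and in practice the vectors would come from the text encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_text_embedding(template_embeddings):
    """Average L2-normalized per-template embeddings, then renormalize."""
    normalized = [l2_normalize(v) for v in template_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[d] for v in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(mean)

templates = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]
prompts = [t.format("cat") for t in templates]
# Placeholder vectors standing in for the text encoder's output on `prompts`.
fake_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
class_embedding = ensemble_text_embedding(fake_embeddings)
```

The resulting vector replaces the single-prompt text embedding when scoring images, so inference cost per image is unchanged.
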
## 🧪 Verification / duplicates

- Verified (Radford 2021 CLIP, Zhai 2023 SigLIP, OpenCLIP).
- Trust level A.
- Related: [[Stable-Diffusion]] · [[Foundation-Model]] · [[Multimodal-Learning]] · [[Vision-Transformer]] · [[Sentence-Transformers]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: variants, InfoNCE, and OpenCLIP / SigLIP / LLaVA / OWL-ViT code |