---
id: wiki-2026-0508-clip
title: CLIP (Contrastive Language-Image Pre-training)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [CLIP, OpenCLIP, contrastive vision-language, zero-shot image, EVA-CLIP, SigLIP, image-text embedding]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [clip, vision-language, multimodal, contrastive-learning, zero-shot, foundation-model, embedding, openai]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: open_clip / Transformers / CLIP
---

# CLIP

## 📌 One-line insight

> **"A shared embedding space for images and text."** Contrastive learning on internet-scale (image, caption) pairs yields zero-shot vision. Modern variants: SigLIP, EVA-CLIP, OpenCLIP. CLIP encoders are the base for Stable Diffusion, DALL-E, and multimodal LLMs.

## 📖 Core

### Architecture

- **Image encoder**: ViT or CNN (ResNet).
- **Text encoder**: Transformer.
- **Projection**: both encoders project into the same embedding dimension.
- **Similarity**: cosine similarity between the projected embeddings.

### Training

- Each batch holds N (image, text) pairs.
- Cross-entropy over the N×N similarity matrix.
- Maximize the diagonal (matched pairs), minimize the off-diagonal (mismatched pairs).
- Data: internet-scale pairs such as LAION-5B (the original CLIP used a private set of about 400M pairs).

### InfoNCE loss

$$L_i = -\log\frac{e^{\mathrm{sim}(I_i, T_i) / \tau}}{\sum_j e^{\mathrm{sim}(I_i, T_j) / \tau}}$$

- τ = temperature (learnable).
- This is the image-to-text term for pair i. CLIP averages it with the symmetric text-to-image term over the batch.

### Zero-shot classification

1. Build and embed candidate texts from a prompt template: "a photo of a {class}".
2. Embed the image.
3. Select the most similar text.

- About 76% zero-shot top-1 on ImageNet (CLIP ViT-L/14 at 336 px), comparable to a supervised ResNet-50.

### Variants

#### OpenCLIP (LAION)

- Open-source reproduction.
- Many model sizes and training datasets.

#### EVA-CLIP (BAAI)

- Vision tower initialized from masked image modeling (EVA) weights.
- Strong results at large model scale.

#### SigLIP (Google 2023)

- Sigmoid loss (instead of softmax).
- Robust to batch size.
- Better at smaller batch sizes.

#### MetaCLIP (Meta)

- Specifies the data curation recipe.
- Quality > quantity.

#### CLIP-LoRA fine-tune

- Domain adaptation.
- Few-shot.
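
#### Softmax vs. sigmoid loss (sketch)

The InfoNCE loss above and SigLIP's sigmoid loss differ mainly in how they treat the N×N similarity matrix: softmax normalizes each row and column, while sigmoid scores every (image, text) pair as an independent binary decision. The following is a minimal dependency-free sketch of that difference, written in plain Python so the arithmetic stays visible. The function names are hypothetical, and the `scale=10.0` / `bias=-10.0` defaults are assumptions modeled on the initialization described for SigLIP, not values taken from this note.

```python
import math


def softmax_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N cosine-similarity matrix (CLIP-style)."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def mean_row_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]  # the matched pair sits on the diagonal
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (mean_row_cross_entropy(logits) + mean_row_cross_entropy(transposed))


def sigmoid_pairwise_loss(sim, scale=10.0, bias=-10.0):
    """SigLIP-style loss: each (i, j) pair is an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 matched, -1 mismatched
            z = label * (scale * sim[i][j] + bias)
            # -log(sigmoid(z)) == softplus(-z), computed without overflow
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]   # matched pairs are the most similar
swapped = [[0.0, 1.0], [1.0, 0.0]]   # every image prefers the wrong caption
print(softmax_contrastive_loss(aligned), softmax_contrastive_loss(swapped))
print(sigmoid_pairwise_loss(aligned), sigmoid_pairwise_loss(swapped))
```

Both losses come out lower for `aligned` than for `swapped`. Because the sigmoid version never normalizes across the batch, each pair's term does not depend on the other pairs, which is one way to read the "robust to batch size" bullet above.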

### Applications

#### Zero-shot classification

- ImageNet, medical, satellite imagery.
- No fine-tuning required.

#### Image search

- Text → image (e.g., Pinterest).
- Rank by embedding similarity.

#### Stable Diffusion / DALL-E

- CLIP serves as the text encoder.
- Conditioning through cross-attention.

#### Multi-modal LLM (LLaVA, GPT-4V)

- CLIP serves as the vision encoder.
- Its features become input to the LLM.

#### CLIP score (eval)

- Measures how well a generated image aligns with its prompt.

#### OWL-ViT (open-vocab detection)

- Extends CLIP to detection.

#### Retrieval (CLIP4Clip, ViCLIP)

- Video-text retrieval.

### Limitations

1. **Compositional**: weak on "red cube on blue ball".
2. **Counting**: gets "3 dogs" wrong.
3. **OCR**: fails on small text (in some cases).
4. **Spatial**: left/right relations.
5. **Fine-grained**: e.g., bird species.
6. **Bias**: inherits the bias of web data.
7. **Adversarial**: typographic attacks.

## 💻 Patterns

### Zero-shot classification (open_clip)

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k')
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # add a batch dimension
classes = ['a photo of a cat', 'a photo of a dog', 'a photo of a fish']
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product equals cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f'{cls}: {prob.item():.3f}')
```

### Image-text retrieval

```python
# Reuses `model` and `tokenizer` from the zero-shot example above.
def search_images(query, image_index, top_k=5):
    text = tokenizer([query])
    with torch.no_grad():
        text_emb = model.encode_text(text)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # image_index: (N, D) tensor of L2-normalized image embeddings
    similarity = text_emb @ image_index.T
    top_k_idx = similarity[0].topk(top_k).indices
    return top_k_idx.tolist()
```

### Fine-tune (CLIP + LoRA)

```python
import torch
import torch.nn.functional as F
import open_clip
from peft import LoraConfig, get_peft_model

model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # PEFT matches module names, and 'attn.in_proj_weight' is a parameter of
    # nn.MultiheadAttention rather than a module. This targets the MLP linears
    # instead (assumes open_clip's 'c_fc' / 'c_proj' naming; check with
    # model.visual.named_modules()).
    target_modules=['c_fc', 'c_proj'],
    lora_dropout=0.1,
)
model.visual = get_peft_model(model.visual, lora_config)

# Contrastive training on (image, text) pairs
def contrastive_loss(image_emb, text_emb, temp=0.07):
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    logits = (image_emb @ text_emb.T) / temp
    labels = torch.arange(len(image_emb), device=image_emb.device)
    # Symmetric loss: image-to-text plus text-to-image
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

### CLIP score (eval generated image)

```python
# Reuses `model`, `preprocess`, and `tokenizer` from the zero-shot example above.
def clip_score(image, prompt):
    img = preprocess(image).unsqueeze(0)
    txt = tokenizer([prompt])
    with torch.no_grad():
        img_emb = model.encode_image(img)
        txt_emb = model.encode_text(txt)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    # Cosine similarity, so the range is [-1, 1] (not 0-1); higher = better aligned
    return (img_emb @ txt_emb.T).item()
```

### SigLIP (Google)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained('google/siglip-large-patch16-384')
model = AutoModel.from_pretrained('google/siglip-large-patch16-384')

image = Image.open('cat.jpg')
inputs = processor(text=['a cat', 'a dog'], images=image,
                   return_tensors='pt', padding='max_length')
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid gives an independent probability per text; do not apply softmax
probs = torch.sigmoid(outputs.logits_per_image)
```

### LLaVA-style (CLIP + LLM)

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM

class LLaVA(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.vision_encoder = CLIPVisionModel.from_pretrained('openai/clip-vit-large-patch14-336')
        self.projector = nn.Linear(1024, 4096)  # image dim → LLM dim
        self.llm = LlamaForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

    def forward(self, image, text_input_ids):
        # (B, num_patches + 1, 1024) patch features from CLIP
        image_features = self.vision_encoder(image).last_hidden_state
        image_tokens = self.projector(image_features)

        # Concatenate the image tokens in front of the text embeddings
        text_embeds = self.llm.model.embed_tokens(text_input_ids)
        full_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=full_embeds)
```

### OWL-ViT (open-vocab detection)

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')

image = Image.open('cat.jpg')
texts = [['a cat', 'a dog', 'a remote']]
inputs = processor(text=texts, images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# target_sizes is (height, width); PIL's image.size is (width, height).
# The original `post_process` call is deprecated; the method name varies across
# transformers releases, so check the installed version.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes)

for box, score, label in zip(results[0]['boxes'], results[0]['scores'], results[0]['labels']):
    if score > 0.5:
        print(f'{texts[0][label.item()]}: {box.tolist()}')
```

## 🤔 Decision criteria

| Application | Model |
|---|---|
| Zero-shot classification | OpenCLIP / SigLIP |
| Image retrieval | CLIP + FAISS |
| Generation conditioning | CLIP / T5 (newer) |
| Multi-modal LLM | CLIP encoder + LLM |
| Open-vocab detection | OWL-ViT / Grounding DINO |
| Eval generated image | CLIP score |
| Fine-grained | DINOv2 / SigLIP |
| Domain adaptation | CLIP + LoRA |

**Default**: SigLIP (modern) > OpenCLIP > original CLIP. For generation, the text encoder is built into SDXL / Flux.

## 🔗 Graph

- Parent: [[Contrastive-Learning]] · [[Foundation-Model]]
- Variants: [[OpenCLIP]] · [[SigLIP]] · [[EVA-CLIP]]
- Applications: [[Stable-Diffusion]] · [[DALL-E]] · [[GPT-4V]]
- Adjacent: [[Transformer_Architecture_and_LLM_Foundations|BERT]] · [[Sentence-Transformers]]

## 🤖 LLM usage

**When**: multimodal tasks, image search, zero-shot classification, generation conditioning, giving an LLM vision.

**When not**: fine-grained classification (use a specialist model), OCR-heavy tasks.

## ❌ Anti-patterns

- **CLIP for every task**: weak on fine-grained and OCR tasks.
- **No domain adaptation**: weak on medical and satellite imagery.
- **Expecting compositional reasoning**: fails on "red on blue".
- **Expecting counting**: unreliable.
- **Trusting adversarial input**: vulnerable to typographic attacks.
- **Single prompt template**: an ensemble of templates usually works better.

## 🧪 Verification / Duplicates

- Verified (Radford 2021 CLIP, Zhai 2023 SigLIP, OpenCLIP).
- Trust level A.
- Related: [[Stable-Diffusion]] · [[Foundation-Model]] · [[Multimodal-Learning]] · [[Vision-Transformer]] · [[Sentence-Transformers]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: variants + InfoNCE + OpenCLIP / SigLIP / LLaVA / OWL-ViT code |