Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

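10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections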

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00



CLIP

📌 One-line insight

"A shared embedding for images and text." Contrastive learning on internet-scale (image, caption) pairs yields zero-shot vision. Modern variants: SigLIP, EVA-CLIP, OpenCLIP. CLIP-style encoders are the base for Stable Diffusion / DALL-E (text encoder) and multi-modal LLMs (vision encoder).

📖 Core

Architecture

Image encoder: ViT or CNN (ResNet).
Text encoder: Transformer.
Projection: both encoders project into the same dimension.
Similarity: cosine similarity between the projected embeddings.
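
The dual-encoder layout above can be sketched with plain NumPy. This is a toy illustration only: the feature sizes (768, 512), the shared dimension (256), and the random projection matrices are all assumptions standing in for real encoder outputs and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical encoder outputs: 4 images (768-d) and 4 captions (512-d).
image_feats = rng.normal(size=(4, 768))
text_feats = rng.normal(size=(4, 512))

# Hypothetical projection matrices into a shared 256-d space.
W_image = rng.normal(size=(768, 256)) / np.sqrt(768)
W_text = rng.normal(size=(512, 256)) / np.sqrt(512)

image_emb = l2_normalize(image_feats @ W_image)
text_emb = l2_normalize(text_feats @ W_text)

# (4, 4) matrix: entry [i, j] is the cosine similarity of image i and caption j.
similarity = image_emb @ text_emb.T
print(similarity.shape)
```

Because both sides are normalized after projection, the two encoders can have different internal widths and still be compared with a single matrix product.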

Training

Take a batch of N (image, text) pairs.
Apply cross-entropy over the N×N similarity matrix.
Maximize the diagonal (matched pairs) and minimize the off-diagonal (mismatched pairs).
Train on internet-scale pairs such as LAION-5B.

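InfoNCE loss

\mathcal{L}_i = -\log\frac{\exp(\mathrm{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j) / \tau)}

Here sim is cosine similarity and τ is the temperature (learnable). This is the image-to-text term for pair i; the training loss averages it over the batch together with the symmetric text-to-image term.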
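
A minimal NumPy sketch of the symmetric form of this loss, assuming a fixed temperature of 0.07 rather than a learned one. The function names `log_softmax` and `info_nce` are illustrative, not taken from any library.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(image_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a matched pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / tau          # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each caption picks its image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = info_nce(emb, emb)                        # matched pairs on the diagonal
shuffled = info_nce(emb, np.roll(emb, 1, axis=0))   # every pair mismatched
print(aligned < shuffled)
```

The loss is low when the diagonal dominates each row and column, and high when the matching entry sits off the diagonal.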

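Zero-shot classification

Build candidate texts from a prompt template: "a photo of a {class}".
Embed the image.
Select the class whose text embedding is most similar.

Roughly 76% top-1 on ImageNet (CLIP ViT-L) without training on ImageNet labels, comparable to a supervised ResNet-50.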

Variants

OpenCLIP (LAION)

Open-source reproduction.
Covers a range of model sizes and training datasets.

EVA-CLIP (BAAI)

Initialized from masked image modeling weights.
Reported state-of-the-art results at scale.

SigLIP (Google 2023)

Sigmoid loss (instead of softmax).
Robust to batch size.
Works better at smaller batch sizes.
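
To make the sigmoid-versus-softmax distinction concrete, here is a NumPy sketch of a pairwise sigmoid loss. The scale `t=10.0` and bias `b=-10.0` are assumed constants chosen for illustration (in training they would be learned), and `siglip_loss` is a hypothetical name.

```python
import numpy as np

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (image, text) cell is its own binary problem."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = t * (image_emb @ text_emb.T) + b
    n = len(logits)
    labels = 2.0 * np.eye(n) - 1.0   # +1 on the diagonal (matched), -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x)
    return np.logaddexp(0.0, -labels * logits).sum() / n

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = siglip_loss(emb, emb)
shuffled = siglip_loss(emb, np.roll(emb, 1, axis=0))
print(aligned < shuffled)
```

No row or column normalization appears in this loss, which is one way to see why it depends less on batch size than a softmax over the whole batch.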

MetaCLIP (Meta)

Specifies the data curation recipe.
Quality over quantity.

CLIP-LoRA fine-tune

Adapts CLIP to a target domain.
Works in few-shot settings.

Applications

Zero-shot classification

ImageNet, medical, satellite imagery.
No fine-tuning required.

Image search

Text → image (e.g., Pinterest).
Rank by embedding similarity.

Stable Diffusion / DALL-E

Text encoder.
Cross-attention conditioning.

Multi-modal LLM (LLaVA, GPT-4V)

Vision encoder.
Its output features are the LLM's visual input.

CLIP score (eval)

Measures how well a generated image aligns with its prompt.

OWL-ViT (open-vocab detection)

Extends CLIP to detection.

Retrieval (CLIP4Clip, ViCLIP)

Video-text retrieval.

Limitations

Compositional: weak on prompts like "red cube on blue ball".
Counting: often wrong on "3 dogs".
OCR: fails on small text (in some cases).
Spatial: confuses left/right.
Fine-grained: struggles with, e.g., bird species.
Bias: inherits bias from web data.
Adversarial: vulnerable to typographic attacks.

💻 Patterns

Zero-shot classification (open_clip)

import torch
import open_clip
from PIL import Image

# Load an OpenCLIP ViT-L/14 checkpoint together with its preprocessing transform
model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
model.eval()  # inference mode
tokenizer = open_clip.get_tokenizer('ViT-L-14')

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # add batch dimension
classes = ['a photo of a cat', 'a photo of a dog', 'a photo of a fish']
text = tokenizer(classes)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product is cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # 100.0 plays the role of the logit scale (inverse temperature)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f'{cls}: {prob.item():.3f}')

Image-text retrieval

def search_images(query, image_index, top_k=5):
    text = tokenizer([query])
    with torch.no_grad():
        text_emb = model.encode_text(text)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    
    # image_index: (N, D) tensor of L2-normalized image embeddings
    similarity = text_emb @ image_index.T
    top_k_idx = similarity[0].topk(top_k).indices
    return top_k_idx.tolist()

Fine-tune (CLIP + LoRA)

import torch
import torch.nn.functional as F
import open_clip
from peft import LoraConfig, get_peft_model

model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

lora_config = LoraConfig(
    r=16, lora_alpha=32,
    # PEFT matches module names, so a parameter name like 'attn.in_proj_weight' matches nothing.
    # Assumption: target the MLP linear layers (c_fc, c_proj) of each residual block instead.
    target_modules=['c_fc', 'c_proj'],
    lora_dropout=0.1,
)
model.visual = get_peft_model(model.visual, lora_config)

# Symmetric contrastive loss over a batch of (image, text) pairs
def contrastive_loss(image_emb, text_emb, temp=0.07):
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    logits = (image_emb @ text_emb.T) / temp
    labels = torch.arange(len(image_emb), device=image_emb.device)  # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

CLIP score (eval generated image)

def clip_score(image, prompt):
    img = preprocess(image).unsqueeze(0)
    txt = tokenizer([prompt])
    with torch.no_grad():
        img_emb = model.encode_image(img)
        txt_emb = model.encode_text(txt)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()  # cosine similarity, in [-1, 1]
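
Raw cosine similarity can be negative, so reported CLIP scores are often rescaled. The sketch below assumes the common CLIPScore convention of clipping at zero and multiplying by a weight of 2.5; the function name `clip_score_from_embeddings` is hypothetical and it works on precomputed embeddings rather than on a model.

```python
import numpy as np

def clip_score_from_embeddings(img_emb, txt_emb, w=2.5):
    """Rescaled score: w * max(cosine similarity, 0)."""
    img_emb = np.asarray(img_emb, dtype=float)
    txt_emb = np.asarray(txt_emb, dtype=float)
    cos = float(img_emb @ txt_emb / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)))
    return w * max(cos, 0.0)

same = clip_score_from_embeddings([1.0, 0.0], [1.0, 0.0])       # identical direction
opposite = clip_score_from_embeddings([1.0, 0.0], [-1.0, 0.0])  # negative cosine is clipped
print(same, opposite)
```

Scores are only comparable when they come from the same checkpoint and the same scaling convention.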

SigLIP (Google)

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained('google/siglip-large-patch16-384')
model = AutoModel.from_pretrained('google/siglip-large-patch16-384')

image = Image.open('cat.jpg')  # any PIL image
inputs = processor(text=['a cat', 'a dog'], images=image, return_tensors='pt', padding='max_length')
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid gives an independent probability per (image, text) pair, unlike softmax
probs = torch.sigmoid(outputs.logits_per_image)
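
The practical difference between the two read-outs shows up when several prompts match, or when none do. A small NumPy illustration with made-up logits (the values are assumptions, not model outputs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for one image against three prompts.
two_match = np.array([4.0, 3.5, -6.0])     # two prompts fit the image
none_match = np.array([-5.0, -6.0, -7.0])  # no prompt fits the image

print(softmax(two_match), sigmoid(two_match))
print(softmax(none_match), sigmoid(none_match))
```

Softmax always distributes a total probability of 1 across the prompts, so it must pick a winner even in the second case. Sigmoid scores each prompt on its own, so it can rate two prompts as likely or all of them as unlikely.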

LLaVA-style (CLIP + LLM)

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM

class LLaVA(nn.Module):
    def __init__(self):
        super().__init__()  # must run before submodules are assigned
        self.vision_encoder = CLIPVisionModel.from_pretrained('openai/clip-vit-large-patch14-336')
        self.projector = nn.Linear(1024, 4096)  # image hidden dim → LLM hidden dim
        self.llm = LlamaForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

    def forward(self, image, text_input_ids):
        # image: preprocessed pixel tensor of shape (B, 3, 336, 336)
        image_features = self.vision_encoder(image).last_hidden_state
        image_tokens = self.projector(image_features)

        # Concatenate the projected image tokens in front of the text embeddings
        text_embeds = self.llm.model.embed_tokens(text_input_ids)
        full_embeds = torch.cat([image_tokens, text_embeds], dim=1)

        return self.llm(inputs_embeds=full_embeds)
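
Each image token produced this way occupies one position in the LLM's context, so it helps to know how many there are. A short arithmetic sketch, assuming a standard ViT that splits the image into square patches and prepends one class token (`num_image_tokens` is a hypothetical helper):

```python
def num_image_tokens(image_size, patch_size, include_cls=True):
    """Sequence length of a ViT's last_hidden_state for a square input image."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if include_cls else 0)

tokens_336 = num_image_tokens(336, 14)  # 24 * 24 patches + 1 class token
tokens_224 = num_image_tokens(224, 14)  # 16 * 16 patches + 1 class token
print(tokens_336, tokens_224)
```

For the patch-14, 336-pixel encoder used above this gives 577 positions per image, which is context the text prompt can no longer use.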

OWL-ViT (open-vocab detection)

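import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')

image = Image.open('cat.jpg')  # any PIL image
texts = [['a cat', 'a dog', 'a remote']]
inputs = processor(text=texts, images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# target_sizes is (height, width); PIL's image.size is (width, height), hence the reversal.
# post_process_object_detection replaces the deprecated post_process and filters by threshold.
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)
for box, score, label in zip(results[0]['boxes'], results[0]['scores'], results[0]['labels']):
    print(f'{texts[0][label]}: {score.item():.2f} {box.tolist()}')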

🤔 Decision criteria

Application	Model
Zero-shot classification	OpenCLIP / SigLIP
Image retrieval	CLIP + FAISS
Generation conditioning	CLIP / T5 (newer)
Multi-modal LLM	CLIP encoder + LLM
Open-vocab detection	OWL-ViT / Grounding DINO
Evaluating generated images	CLIP score
Fine-grained	DINOv2 / SigLIP
Domain adaptation	CLIP + LoRA

Default: SigLIP (modern) > OpenCLIP > original CLIP. For generation, rely on the text encoders built into SDXL / Flux.

🔗 Graph

Parent: Contrastive-Learning · Foundation-Model
Variants: OpenCLIP · SigLIP · EVA-CLIP
Applications: Stable-Diffusion · DALL-E · GPT-4V
Adjacent: Transformer_Architecture_and_LLM_Foundations · Sentence-Transformers

🤖 LLM usage

When to use: multimodal tasks, image search, zero-shot classification, generation conditioning, giving an LLM vision.
When not to use: specialized fine-grained classification, OCR-heavy tasks.

❌ Anti-patterns

Using CLIP for every task: it is weak on fine-grained and OCR tasks.
No domain adaptation: weak on medical / satellite imagery out of the box.
Expecting compositional reasoning: fails on "red on blue" style prompts.
Expecting counting: unreliable.
Trusting adversarial input: vulnerable to typographic attacks.
Single prompt template: an ensemble of templates usually does better.
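
The last point can be sketched as follows: embed each class under several templates, normalize, average, and renormalize to get one class vector. The templates, `build_class_embeddings`, and the stand-in `toy_encode_text` are all assumptions for illustration; in practice `encode_text` would wrap a real CLIP text encoder.

```python
import numpy as np

# Hypothetical prompt templates; real ensembles often use many more.
TEMPLATES = ["a photo of a {}.", "a close-up photo of a {}.", "a drawing of a {}."]

def build_class_embeddings(class_names, encode_text, templates=TEMPLATES):
    """Return one unit-length embedding per class, averaged over all templates."""
    class_embs = []
    for name in class_names:
        embs = np.stack([encode_text(t.format(name)) for t in templates])
        embs = embs / np.linalg.norm(embs, axis=-1, keepdims=True)
        mean = embs.mean(axis=0)
        class_embs.append(mean / np.linalg.norm(mean))
    return np.stack(class_embs)

def toy_encode_text(prompt, dim=16):
    """Stand-in encoder: a deterministic pseudo-embedding derived from the prompt text."""
    seed = sum(ord(c) for c in prompt)
    return np.random.default_rng(seed).normal(size=dim)

class_matrix = build_class_embeddings(["cat", "dog"], toy_encode_text)
print(class_matrix.shape)
```

The resulting matrix replaces the single-template text features in the zero-shot pattern, with no change to the image side.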

🧪 Verification / duplicates

Verified (Radford 2021 CLIP, Zhai 2023 SigLIP, OpenCLIP).
Confidence: A.
Related: Stable-Diffusion · Foundation-Model · Multimodal-Learning · Vision-Transformer · Sentence-Transformers.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: variants + InfoNCE + OpenCLIP / SigLIP / LLaVA / OWL-ViT code
