id: wiki-2026-0508-bert-language-model
title: BERT (Bidirectional Encoder Representations from Transformers)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: BERT, RoBERTa, DeBERTa, ModernBERT, encoder model, MLM, sentence embedding
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: bert, transformer, encoder, mlm, pretraining, fine-tuning, embedding, classification, nlp
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: Transformers (HuggingFace) / PyTorch

BERT

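📌 One-line insight

"The bidirectional genius." BERT broke with left-to-right language models by conditioning every token on both its left and right context, and it went on to dominate NLU benchmarks. In the GPT era it has lost generation to decoder models, but it remains a strong default for classification, retrieval, and embedding. ModernBERT (2024) revived the line.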

📖 Core

Architecture

  • Transformer encoder only (no decoder).
  • 12 layers (base) / 24 layers (large).
  • Hidden size 768 (base) / 1024 (large).
  • About 110M (base) / 340M (large) parameters.
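
The parameter counts above can be reproduced with simple arithmetic. The sketch below assumes values that are not stated in this note: the 30,522-entry uncased WordPiece vocabulary, 512 learned positions, two segment types, a feed-forward size of 4x the hidden size, and a pooler layer on top of [CLS]. The helper name `bert_param_count` is made up for illustration.

```python
def bert_param_count(hidden, layers, intermediate,
                     vocab=30522, max_pos=512, type_vocab=2):
    """Count parameters of a BERT encoder: embeddings + layers + pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V, and output projections (weights + biases), then a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then a LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden  # dense layer applied to [CLS]
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(hidden=768, layers=12, intermediate=3072)
large = bert_param_count(hidden=1024, layers=24, intermediate=4096)
print(f"base:  {base:,}")   # 109,482,240
print(f"large: {large:,}")  # 335,141,888
```

Under these assumptions the totals land at roughly 109M and 335M, which are the figures usually rounded to 110M and 340M.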

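Training objectives

MLM (Masked Language Model)

  • 15% of the input tokens are selected for prediction.
  • Of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
  • The model predicts the original token from bidirectional context.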
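
The 15% / 80-10-10 rule can be sketched in plain Python as follows. This is a minimal illustration, not the original training code: it selects each token independently with probability 0.15 rather than picking exactly 15% of positions, and the function name `mask_tokens`, the `-100` ignore label, and the token ids in the usage lines are assumptions made for the example.

```python
import random

IGNORE = -100  # label value conventionally skipped by the loss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), rate=0.15, rng=None):
    """Return (corrupted inputs, labels) following the 80/10/10 MLM rule."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= rate:
            continue  # not selected: input unchanged, no loss at this position
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy ids: 101 and 102 stand in for [CLS] and [SEP], 103 for [MASK].
ids = [101] + list(range(1000, 1200)) + [102]
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                             special_ids={101, 102}, rng=random.Random(0))
```

Leaving 10% of the selected tokens unchanged, and swapping another 10% for random tokens, keeps the model from relying on the presence of [MASK], which never appears at fine-tuning time.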

NSP (Next Sentence Prediction)

  • Predicts whether sentence B actually follows sentence A.
  • → RoBERTa dropped it after finding that it did not help downstream performance.

Input format

  • [CLS] sentence A [SEP] sentence B [SEP]
  • The final hidden state of [CLS] serves as the sequence representation for classification.
  • Segment embeddings distinguish sentence A from sentence B.
  • Position embeddings are learned (absolute positions).
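
A minimal sketch of how a sentence pair is laid out before the embedding lookup. It works on already-tokenized word pieces and skips the vocabulary lookup; the helper name `build_pair_input` is hypothetical.

```python
def build_pair_input(tokens_a, tokens_b=None):
    """Lay out [CLS] A [SEP] (B [SEP]) with segment and position ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment A covers [CLS] and the first [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B covers the final [SEP]
    position_ids = list(range(len(tokens)))  # indices into the learned position table
    return tokens, segment_ids, position_ids


tokens, segments, positions = build_pair_input(["how", "are", "you"], ["fine", "thanks"])
print(tokens)    # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The input embedding at each position is the sum of the token, segment, and position embeddings. In practice the HuggingFace tokenizer produces the same layout as `input_ids` and `token_type_ids` when it is called with two texts.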

Fine-tuning tasks

  1. Classification: [CLS] → linear layer → label.
  2. NER (token classification): one label per token.
  3. QA (extractive): predict the start and end tokens of the answer span.
  4. Sentence pair: NLI, STS.
  5. Embedding: [CLS] or mean pooling.
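
For the embedding case, the difference between [CLS] pooling and mean pooling comes down to a few lines. The sketch below uses NumPy on a toy array in place of real model output; the function names are illustrative, and the key point is that padded positions are excluded from the mean via the attention mask.

```python
import numpy as np


def cls_pool(hidden_states):
    """Take the first token ([CLS]) of each sequence: (batch, seq, hidden) -> (batch, hidden)."""
    return hidden_states[:, 0]


def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring positions where attention_mask is 0."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# One sequence of three positions; the last position is padding.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 2.]]
print(cls_pool(hidden))         # [[1. 1.]]
```

Without the mask the padded vector would be averaged in, which is the "attention pollution" style of bug that shows up as unstable embeddings across batch sizes.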

Variants

| Model | Change |
|---|---|
| RoBERTa | No NSP, more training data, dynamic masking |
| ALBERT | Cross-layer parameter sharing, fewer parameters |
| DistilBERT | Distilled; about 60% of BERT-base's size |
| DeBERTa (v3) | Disentangled attention |
| ELECTRA | Replaced token detection |
| ModernBERT (2024) | 8K context, GeGLU, fast |

Modern relevance

  • Embedding: the base of sentence-transformers.
  • Classification: fast and cheap.
  • Retrieval: dense retrievers.
  • Cross-encoder reranker: reranks the candidates returned by a bi-encoder.
  • Token-level tasks: NER, POS tagging.

→ Not a substitute for GPT; it fills a different niche.

BERT vs GPT

| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder | Decoder |
| Direction | Bidirectional | Causal |
| Task | NLU, embedding | NLG |
| Cost | Low | High |
| Latency | Low | High |
| Typical size | 100M-1B | 1B-1T |
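
The "Direction" row is the core difference and can be shown as an attention mask, where row i lists the positions that token i is allowed to attend to. This is a plain-Python illustration with a hypothetical helper name, not code from either model.

```python
def attention_mask(seq_len, causal):
    """Row i marks with 1 the positions token i may attend to."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(3, causal=False)  # BERT-style encoder
causal = attention_mask(3, causal=True)          # GPT-style decoder
print(bidirectional)  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(causal)         # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

Because every BERT token sees the whole sequence, its representations suit understanding tasks, while the causal mask is what lets a decoder generate text one token at a time.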

💻 Patterns

Classification (HuggingFace)

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2,  # IMDB: negative / positive
)

dataset = load_dataset('imdb')

def tokenize(examples):
    # Truncate to the model's 512-token limit; padding is applied per batch by the collator.
    return tokenizer(examples['text'], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir='./out', num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding instead of padding='max_length'
)
trainer.train()

NER (token classification)

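from transformers import AutoModelForTokenClassification, AutoTokenizer

# label_list is assumed to come from the dataset (BIO tagging: O, B-PER, I-PER, B-LOC, ...).
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')  # must match the model checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-cased', num_labels=len(label_list),
)

def tokenize_align_labels(examples):
    tokenized = tokenizer(examples['tokens'], is_split_into_words=True, truncation=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized.word_ids(i)  # maps each sub-token to its source word
        # Special tokens get -100 so the loss ignores them; sub-tokens inherit their word's label.
        aligned = [-100 if w is None else label[w] for w in word_ids]
        labels.append(aligned)
    tokenized['labels'] = labels
    return tokenized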

QA (extractive)

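import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# The QA head of plain 'bert-base-uncased' is randomly initialized,
# so use a checkpoint that has been fine-tuned on SQuAD.
name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

# question and context are plain strings supplied by the caller.
inputs = tokenizer(question, context, return_tensors='pt', truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()  # independent argmax: end can precede start
answer = tokenizer.decode(inputs['input_ids'][0][start:end + 1])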
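
Taking the argmax of the start and end logits independently can produce an invalid span where the end comes before the start. A common remedy, sketched here in plain Python over lists of logits, is to score candidate (start, end) pairs jointly and keep the best valid one. The function name `best_span` and the `max_answer_len` and `top_k` defaults are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30, top_k=20):
    """Return (score, start, end) for the best span with start <= end, or None."""
    top_starts = sorted(range(len(start_logits)), key=lambda i: -start_logits[i])[:top_k]
    top_ends = sorted(range(len(end_logits)), key=lambda i: -end_logits[i])[:top_k]
    best = None
    for s in top_starts:
        for e in top_ends:
            if e < s or e - s + 1 > max_answer_len:
                continue  # skip reversed or overly long spans
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


# Independent argmax would give start=2, end=0, which is not a valid span.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))  # (8.0, 2, 3)
```

A fuller version would also restrict candidates to context tokens, so the answer cannot start inside the question.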

Sentence embedding (sentence-transformers)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # a small BERT-family encoder
embeddings = model.encode(['hello world', 'foo bar'])
# embeddings.shape == (2, 384)

# Cosine similarity between the two sentences
sim = cosine_similarity([embeddings[0]], [embeddings[1]])

Cross-encoder reranker

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'What is BERT?'
# retriever is a placeholder for any first-stage bi-encoder search
# that returns objects with a .text attribute.
candidates = retriever.search(query, top_k=100)
pairs = [[query, c.text] for c in candidates]
scores = reranker.predict(pairs)  # one relevance score per (query, candidate) pair
top10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]

ModernBERT (2024)

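from transformers import AutoModel, AutoTokenizer

# 8K context, GeGLU, RoPE (needs a recent transformers release)
name = 'answerdotai/ModernBERT-base'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# long_doc is a plain string; truncation=True is required for max_length to take effect.
inputs = tokenizer(long_doc, return_tensors='pt', truncation=True, max_length=8192)
outputs = model(**inputs)

→ A modern revival of BERT with long-context support.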

LoRA fine-tune (efficient)

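from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

lora = LoraConfig(
    r=8, lora_alpha=16, target_modules=['query', 'value'],  # BERT attention projections
    lora_dropout=0.1, bias='none', task_type='SEQ_CLS',
)
model = get_peft_model(base_model, lora)
model.print_trainable_parameters()
# Only a small fraction of the parameters is trained (roughly 0.3% for BERT-base with these settings).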
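
The trainable fraction can be estimated by hand. The sketch below assumes BERT-base dimensions (hidden size 768, 12 layers, about 109.5M encoder parameters) and counts only the adapter matrices on the `query` and `value` projections; the helper name is made up, and the classification head that also stays trainable for a sequence-classification task is left out.

```python
def lora_added_params(hidden=768, layers=12, r=8, targets_per_layer=2):
    """Parameters added by LoRA on square (hidden x hidden) projections."""
    # Each adapted Linear gains A with shape (r, hidden) and B with shape (hidden, r).
    per_module = r * hidden + hidden * r
    return layers * targets_per_layer * per_module


added = lora_added_params()
encoder_params = 109_482_240  # BERT-base encoder size (assumed)
print(f"{added:,} adapter parameters = {added / encoder_params:.2%} of the encoder")
# 294,912 adapter parameters = 0.27% of the encoder
```

Doubling `r` doubles the adapter size, so the rank is the main knob for trading capacity against memory.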

🤔 Decision criteria

| Task | Model |
|---|---|
| Classification | BERT / RoBERTa / DeBERTa |
| NER / token-level | BERT / DeBERTa |
| QA (extractive) | BERT / RoBERTa |
| Sentence similarity | Sentence-BERT (MiniLM, MPNet) |
| Retrieval (dense) | DPR / E5 / BGE |
| Reranker | Cross-encoder |
| Long documents | ModernBERT / Longformer |
| Edge / fast | DistilBERT / ALBERT |
| Generation | GPT (NOT BERT) |

Default: BERT-base as the classification baseline. For retrieval, use sentence-transformers.

🔗 Graph

🤖 LLM usage

When to use: classification, NER, dense retrieval, sentence similarity, fast NLU. When not to use: generation, chat, long-context creative writing.

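Anti-patterns

  • Using GPT for every task: gives up BERT's speed and low cost.
  • No truncation handling: inputs overflow max_length.
  • Using the raw [CLS] vector of a model that was not trained for sentence embeddings, without mean pooling: weak embeddings.
  • Keeping NSP: it adds little (the RoBERTa lesson).
  • No padding mask: padding tokens pollute attention.
  • Cross-encoder as the retriever over a large corpus: nothing can be precomputed, so each query needs one forward pass per document (and all-pairs comparison grows quadratically).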
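
For the truncation item above, one common way to handle inputs longer than the model limit is to split them into overlapping windows and run the model on each. This is a minimal sketch over a list of token ids; the function name and the defaults (512-token limit, 128-token overlap, 2 reserved special tokens) are assumptions for illustration.

```python
def sliding_windows(token_ids, max_len=512, stride=128, num_special=2):
    """Split token_ids into overlapping windows that fit within max_len."""
    body = max_len - num_special  # room left after [CLS] and [SEP]
    if body <= stride:
        raise ValueError("stride must be smaller than the usable window size")
    step = body - stride  # consecutive windows share `stride` tokens
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


ids = list(range(1000))
windows = sliding_windows(ids)
print([len(w) for w in windows])  # [510, 510, 236]
```

HuggingFace tokenizers offer the same behavior through `return_overflowing_tokens=True` together with `stride`, which is usually preferable to hand-rolled chunking.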

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: variants, ModernBERT, HF code (classification, NER, QA, embedding, reranker) |