---
id: wiki-2026-0508-bert-language-model
title: BERT (Bidirectional Encoder Representations from Transformers)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [BERT, RoBERTa, DeBERTa, ModernBERT, encoder model, MLM, sentence embedding]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [bert, transformer, encoder, mlm, pretraining, fine-tuning, embedding, classification, nlp]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Transformers (HuggingFace) / PyTorch
---

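# BERT

## 📌 One-line insight
> **"The bidirectional genius."** BERT broke with left-to-right-only language models by conditioning every token on context from both directions, and it went on to dominate NLU benchmarks. In the GPT era it has lost out on generation, but it remains a go-to choice for classification, retrieval, and embeddings. ModernBERT (2024) marks a revival of the encoder line.
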
## 📖 Core

### Architecture
- Transformer encoder only (no decoder).
- 12 layers (base) / 24 layers (large).
- Hidden size 768 (base) / 1024 (large).
- About 110M parameters (base) / 340M (large).

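
The parameter counts above can be roughly reproduced from the layer sizes. The sketch below is only an approximation under stated assumptions: a 30,522-token vocabulary, 512 positions, 2 segment types, a feed-forward width of 4× the hidden size, and a pooler layer, as in the `bert-base-uncased` release. None of these figures appear in this note, and the function name `bert_param_count` is illustrative.

```python
def bert_param_count(hidden, layers, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT encoder (no task head)."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale + bias

    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm

    # Q, K, V and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden) with biases, plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    pooler = hidden * hidden + hidden  # dense layer applied to [CLS]
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(hidden=768, layers=12)
large = bert_param_count(hidden=1024, layers=24)
print(f"base:  {base / 1e6:.1f}M")
print(f"large: {large / 1e6:.1f}M")
```

Under these assumptions the totals come to about 109.5M and 335.1M, which are usually rounded to 110M and 340M.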
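### Training objectives

#### MLM (Masked Language Model)
- 15% of input tokens are selected as prediction targets.
- Of the selected tokens, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged.
- The model predicts the original token from bidirectional context.
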
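
The 15% / 80-10-10 rule can be sketched in plain Python as follows. This is a minimal illustration, not the original training code: the function name, the `-100` ignore label, and the `[MASK]` id of 103 (as used by `bert-base-uncased`) are assumptions, and real pipelines also exclude special tokens such as `[CLS]` and `[SEP]` from selection.

```python
import random

IGNORE = -100  # label value conventionally skipped by the loss


def mask_tokens(token_ids, mask_id, vocab_size, rng, select_prob=0.15):
    """Return (inputs, labels) following the 80/10/10 MLM corruption rule."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue  # not selected: no prediction target here
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in place
    return inputs, labels


rng = random.Random(0)
ids = list(range(1000, 1100))  # hypothetical token ids
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30522, rng=rng)
n_targets = sum(label != IGNORE for label in labels)
print(f"{n_targets} of {len(ids)} tokens selected as targets")
```

Leaving 10% of the targets unchanged means the model cannot assume that a visible token is always correct, which narrows the gap between pretraining (with `[MASK]`) and fine-tuning (without it).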
#### NSP (Next Sentence Prediction)
- Binary task: does sentence B actually follow sentence A?
- → RoBERTa dropped it after finding that removing NSP matched or slightly improved downstream results.

### Input format
- `[CLS] sentence A [SEP] sentence B [SEP]`
- The final-layer representation of `[CLS]` feeds the classification head.
- Segment embeddings distinguish sentence A from sentence B.
- Position embeddings are learned.

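
The layout above can be made concrete with a small sketch. It assumes the sentences are already tokenized (whitespace-split words stand in for WordPiece output here), and `build_pair_input` is a hypothetical helper, not a library function. The model sums the token, segment, and position embeddings looked up from these three id sequences.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))  # indexes the learned position table
    return tokens, segment_ids, position_ids


tokens, segments, positions = build_pair_input(
    ["the", "cat", "sat"], ["it", "purred"]
)
for tok, seg, pos in zip(tokens, segments, positions):
    print(f"{pos:>2}  {seg}  {tok}")
```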
### Fine-tuning tasks
1. **Classification**: `[CLS]` → linear layer → label.
2. **NER** (token classification): one label per token.
3. **QA** (extractive): predict the start and end tokens of the answer span.
4. **Sentence pair**: NLI, STS.
5. **Embedding**: the `[CLS]` vector or mean pooling over tokens.

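
For the embedding case, mean pooling should average only over real tokens, not padding. The following is a minimal sketch in plain Python for clarity; in practice the same computation would be done with tensor operations on the model's last hidden state. The function name and the toy vectors are illustrative.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if not keep:
            continue  # skip padding positions
        count += 1
        for j, value in enumerate(vector):
            total[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Hypothetical 2-dimensional hidden states: 3 real tokens + 1 padding slot.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(hidden, mask))  # [3.0, 4.0]
```

Without the mask, the padding vector would pull the result to `[4.5, 5.25]`, so identical sentences padded to different lengths would get different embeddings.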
### Variants
| Model | Changes |
|---|---|
| RoBERTa | No NSP, more data, dynamic masking |
| ALBERT | Cross-layer parameter sharing, smaller |
| DistilBERT | Knowledge distillation; about 60% of BERT-base's size |
| DeBERTa (v3) | Disentangled attention (v3 adds ELECTRA-style pretraining) |
| ELECTRA | Replaced token detection |
| ModernBERT (2024) | 8K context, GeGLU, fast |

### Modern relevance
- **Embedding**: the base of sentence-transformers models.
- **Classification**: fast and cheap.
- **Retrieval**: dense retrievers.
- **Cross-encoder reranker**: reranks the candidates returned by a bi-encoder.
- **Token-level tasks**: NER, POS tagging.

→ Not a substitute for GPT; it fills a different niche.

### BERT vs GPT
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder | Decoder |
| Direction | Bidirectional | Causal (left-to-right) |
| Task | NLU, embeddings | NLG |
| Cost | Low | High |
| Latency | Low | High |
| Size (typical) | 100M-1B | 1B-1T |

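
The "Direction" row comes down to which positions each token is allowed to attend to. As a rough illustration (the helper name is hypothetical and padding is ignored), the two attention patterns can be written as:

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(4, causal=False)  # encoder-style (BERT)
left_to_right = attention_mask(4, causal=True)   # decoder-style (GPT)

for name, mask in [("bidirectional", bidirectional), ("causal", left_to_right)]:
    print(name)
    for row in mask:
        print(" ", row)
```

The full matrix lets every token see its right-hand context, which helps understanding tasks but rules out left-to-right generation; the lower-triangular matrix does the opposite.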
## 💻 Patterns

### Classification (HuggingFace)
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2,
)

dataset = load_dataset('imdb')

def tokenize(examples):
    # Truncate to the model's maximum length; padding happens per batch below.
    return tokenizer(examples['text'], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir='./out', num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad each batch to its longest example
)
trainer.train()
```

### NER (token classification)
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumption: `label_list` holds the dataset's BIO tag names,
# e.g. ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', ...].
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-cased', num_labels=len(label_list),
)

# BIO tagging: B-PER, I-PER, B-LOC, ...
def tokenize_align_labels(examples):
    tokenized = tokenizer(examples['tokens'], is_split_into_words=True, truncation=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized.word_ids(i)
        # Special tokens have no word id and get -100, which the loss ignores;
        # every subword inherits the label of the word it came from.
        aligned = [-100 if w is None else label[w] for w in word_ids]
        labels.append(aligned)
    tokenized['labels'] = labels
    return tokenized
```

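
After prediction, the per-token BIO tags usually need to be merged back into entity spans. The following is one possible decoder, assuming word-level tags (subword predictions already collapsed to one tag per word); `bio_to_spans` is a hypothetical helper, and treating a stray `I-` tag as the start of a span is a design choice rather than a standard.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # Stray I- tag without a matching B-: treat it as a new span.
            start, current = i, tag[2:]
    return spans


words = ["Ada", "Lovelace", "visited", "London", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(words[start:end]))
```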
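### QA (extractive)
```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Note: 'bert-base-uncased' ships without a trained QA head, so the answers
# are meaningless until the model is fine-tuned (for example on SQuAD).
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')

# Assumption: `question` and `context` are strings defined by the caller.
inputs = tokenizer(question, context, return_tensors='pt', truncation='only_second')
with torch.no_grad():
    outputs = model(**inputs)
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer = tokenizer.decode(inputs['input_ids'][0][start:end + 1])
```
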
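
Taking the two argmax values independently can yield an end position before the start position. A common remedy is to search for the best valid span instead. The sketch below works on plain lists of logits; the function name, the 30-token cap, and the example logits are illustrative assumptions.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start + end logit with start <= end."""
    best = None
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical logits where the independent argmax puts end (1) before start (3).
start_logits = [0.1, 0.2, 0.3, 2.0, 0.1]
end_logits = [0.0, 2.5, 0.1, 0.4, 1.8]
start, end, score = best_span(start_logits, end_logits)
print(start, end)  # 3 4
```

In a real pipeline the search would also be restricted to context tokens, so that the span cannot fall inside the question.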
### Sentence embedding (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # a distilled BERT-style encoder
embeddings = model.encode(['hello world', 'foo bar'])
# embeddings.shape == (2, 384)

# Cosine similarity between the two sentences
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
```

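
Once sentences are embedded, dense retrieval reduces to ranking documents by similarity to the query vector. A dependency-free sketch follows, with toy 3-dimensional vectors standing in for real `model.encode(...)` output; the helper names are hypothetical, and a production system would use a vector index rather than a full scan.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings; document vectors can be computed once and cached.
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
print(top_k(query, docs))
```

Because document vectors do not depend on the query, they can be precomputed, which is what makes the bi-encoder setup cheap at query time.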
### Cross-encoder reranker
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'What is BERT?'
# Assumption: `retriever` is a hypothetical bi-encoder search object whose
# results expose a `.text` attribute; substitute your own first-stage retriever.
candidates = retriever.search(query, top_k=100)
pairs = [[query, c.text] for c in candidates]
scores = reranker.predict(pairs)
top10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]
```

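### ModernBERT (2024)
```python
from transformers import AutoModel, AutoTokenizer

# 8K context, GeGLU, RoPE (needs a transformers release with ModernBERT support)
tokenizer = AutoTokenizer.from_pretrained('answerdotai/ModernBERT-base')
model = AutoModel.from_pretrained('answerdotai/ModernBERT-base')
# Assumption: `long_doc` is a string defined by the caller.
inputs = tokenizer(long_doc, return_tensors='pt', truncation=True, max_length=8192)
outputs = model(**inputs)
```

→ A modern revival of BERT with long-context support.
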
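### LoRA fine-tune (efficient)
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base_model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2,
)
lora = LoraConfig(
    r=8, lora_alpha=16, target_modules=['query', 'value'],
    lora_dropout=0.1, bias='none', task_type='SEQ_CLS',
)
model = get_peft_model(base_model, lora)
# Only a small fraction of the parameters is trained (roughly 0.3% for BERT-base here)
model.print_trainable_parameters()
```
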
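
The size of that trainable fraction follows from simple arithmetic: each adapted weight matrix gains two low-rank factors. The sketch below counts only the LoRA matrices for the configuration above; the backbone size of 109,482,240 is an assumed figure for the `bert-base-uncased` encoder, and PEFT typically keeps the small classification head trainable as well, which this count leaves out.

```python
def lora_param_count(hidden, layers, rank, modules_per_layer=2):
    """Parameters added by LoRA on square hidden x hidden projections."""
    # Each adapted weight W gets two low-rank factors:
    # A with shape (rank, hidden) and B with shape (hidden, rank).
    per_module = rank * hidden + hidden * rank
    return layers * modules_per_layer * per_module


added = lora_param_count(hidden=768, layers=12, rank=8)  # query + value
backbone = 109_482_240  # assumed size of the bert-base-uncased encoder
print(f"LoRA parameters: {added:,}")
print(f"share of backbone: {added / backbone:.2%}")
```

With these settings the adapters add 294,912 parameters, about 0.27% of the backbone. Doubling `r` doubles the count.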
## 🤔 Decision criteria
| Task | Model |
|---|---|
| Classification | BERT / RoBERTa / DeBERTa |
| NER / token | BERT / DeBERTa |
| QA (extractive) | BERT / RoBERTa |
| Sentence similarity | Sentence-BERT (MiniLM, MPNet) |
| Retrieval (dense) | DPR / E5 / BGE |
| Reranker | Cross-encoder |
| Long doc | ModernBERT / Longformer |
| Edge / fast | DistilBERT / ALBERT |
| Generation | GPT (NOT BERT) |

**Default**: BERT-base as the classification baseline; sentence-transformers for retrieval.

## 🔗 Graph
- Parent: [[Transformer]]
- Variants: [[RoBERTa]] · [[DeBERTa]] · [[ModernBERT]]
- Applications: [[NER]] · [[Sentence-Embedding]]
- Adjacent: [[Sentence-Transformers]] · [[LoRA]] · [[BPE]]

## 🤖 LLM usage
**When**: classification, NER, dense retrieval, sentence similarity, fast NLU.
**When not**: generation, chat, long-context creative writing.

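## ❌ Anti-patterns
- **GPT for every task**: gives up BERT's speed and low cost.
- **No truncation handling**: inputs overflow `max_length`.
- **Raw `[CLS]` with no mean pooling or fine-tuning**: the pretrained-only `[CLS]` vector makes a weak sentence embedding.
- **Keeping NSP**: little or no benefit (the RoBERTa lesson).
- **No padding mask**: padding tokens pollute attention.
- **Cross-encoder as the retriever over a large corpus**: every (query, document) pair needs its own forward pass and nothing can be precomputed, so cost scales with queries × documents.
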
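
For the truncation anti-pattern, one common alternative to silently cutting a long input is to split it into overlapping windows and run the model on each. The following is a minimal sketch on raw token ids; the function name and defaults are illustrative, and `stride` here means the number of overlapping tokens, matching the `stride` option that HuggingFace tokenizers accept together with `return_overflowing_tokens=True`. Special tokens such as `[CLS]` and `[SEP]` would also have to fit inside each window's budget.

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into overlapping windows of at most max_len tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride  # how far each window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end
    return windows


ids = list(range(1200))  # hypothetical token ids for one long document
chunks = sliding_windows(ids)
print([(c[0], c[-1], len(c)) for c in chunks])
```

For 1,200 tokens this produces three windows of 512, 512, and 432 tokens, each sharing 128 tokens with its neighbour, so an answer or entity that straddles a boundary still appears whole in at least one window.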
## 🧪 Verification / duplicates
- Verified (Devlin et al. 2018, RoBERTa, ModernBERT 2024).
- Trust level A.
- Related: [[Transformer]] · [[Sentence-Transformers]] · [[ModernBERT]] · [[Fine-Tuning]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: variants + ModernBERT + HF code (classification, NER, QA, embedding, reranker) |