---
id: wiki-2026-0508-bert-language-model
title: BERT (Bidirectional Encoder Representations from Transformers)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [BERT, RoBERTa, DeBERTa, ModernBERT, encoder model, MLM, sentence embedding]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [bert, transformer, encoder, mlm, pretraining, fine-tuning, embedding, classification, nlp]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Transformers (HuggingFace) / PyTorch
---

# BERT

## 📌 One-line insight
> **"The bidirectional genius."** BERT broke with left-to-right-only language models by conditioning on context from both directions, and it came to dominate NLU. In the GPT era it has lost generation to decoder models, but it remains a gold standard for classification, retrieval, and embeddings. ModernBERT (2024) marks a revival.

## 📖 Core

### Architecture
- Transformer encoder only (no decoder).
- 12 layers (base) / 24 (large).
- Hidden size 768 (base) / 1024 (large).
- 110M (base) / 340M (large) parameters.

### Training objectives

#### MLM (Masked Language Model)
- 15% of tokens are selected for prediction.
- Of those: 80% are replaced with `[MASK]`, 10% with a random token, 10% are left unchanged.
- The model predicts the original token from context on both sides.

#### NSP (Next Sentence Prediction)
- Binary task: does sentence B follow sentence A?
- → RoBERTa dropped it after finding it did not help downstream performance.

### Input format
- `[CLS] sentence A [SEP] sentence B [SEP]`
- The final-layer `[CLS]` representation feeds the classification head.
- Segment embeddings (A vs B).
- Position embeddings (learned).

### Fine-tuning tasks
1. **Classification**: `[CLS]` → linear layer → label.
2. **NER** (token classification): one label per token.
3. **QA** (extractive): predict the start and end tokens of the answer span.
4. **Sentence pair**: NLI, STS.
5. **Embedding**: `[CLS]` or mean pooling.
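
### MLM masking (sketch)

The 15% / 80-10-10 masking rule from the MLM objective can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the original BERT preprocessing code: the toy vocabulary `VOCAB`, the `SPECIAL` set, and the function name `mlm_mask` are hypothetical, and each token is selected independently with probability 0.15 instead of sampling exactly 15% of positions.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "tree", "runs", "blue"]  # toy vocabulary (assumption)
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}           # never selected for prediction


def mlm_mask(tokens, mask_prob=0.15, seed=None):
    """Return (corrupted, labels); labels[i] is None where no loss is computed."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and the ~85% of positions that are not selected.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


corrupted, labels = mlm_mask(["[CLS]", "the", "cat", "sat", "[SEP]"], seed=0)
```

Keeping 10% of the selected tokens unchanged and replacing 10% at random means the model cannot assume that only `[MASK]` positions need predicting, which matters because `[MASK]` never appears at fine-tuning time.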

### Variants
| Model | Change |
|---|---|
| RoBERTa | No NSP, more data, dynamic masking |
| ALBERT | Cross-layer parameter sharing; small parameter count |
| DistilBERT | Distilled; about 60% of BERT-base's size |
| DeBERTa (v3) | Disentangled attention |
| ELECTRA | Replaced token detection |
| ModernBERT (2024) | 8K context, GeGLU, fast |

### Modern relevance
- **Embedding**: the base of sentence-transformers.
- **Classification**: fast and cheap.
- **Retrieval**: dense retrievers.
- **Cross-encoder reranker**: reranks the candidates returned by a bi-encoder.
- **Token-level tasks**: NER, POS tagging.

→ Not a substitute for GPT; it occupies a different niche.

### BERT vs GPT
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder | Decoder |
| Direction | Bidirectional | Causal |
| Task | NLU, embedding | NLG |
| Cost | Low | High |
| Latency | Low | High |
| Size | 100M-1B | 1B-1T |

## 💻 Patterns

### Classification (HuggingFace)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2,
)

dataset = load_dataset('imdb')

def tokenize(examples):
    # Truncate to the model's 512-token limit and pad every example to that length
    return tokenizer(examples['text'], truncation=True, padding='max_length')

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir='./out', num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=tokenized['train'], eval_dataset=tokenized['test'])
trainer.train()
```

### NER (token classification)
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# label_list is assumed to come from the dataset (BIO tagging: O, B-PER, I-PER, B-LOC, ...)
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')  # must match the cased model
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-cased', num_labels=len(label_list),
)

def tokenize_align_labels(examples):
    tokenized = tokenizer(examples['tokens'], is_split_into_words=True, truncation=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized.word_ids(i)  # source word index per subword; None for special tokens
        # -100 is ignored by the loss; every subword inherits its word's label
        aligned = [-100 if w is None else label[w] for w in word_ids]
        labels.append(aligned)
    tokenized['labels'] = labels
    return tokenized
```

### QA (extractive)
```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# The QA head must already be fine-tuned (here on SQuAD); plain 'bert-base-uncased' has a randomly initialized head
name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

# question and context are plain strings supplied by the caller
inputs = tokenizer(question, context, return_tensors='pt', truncation='only_second')
with torch.no_grad():
    outputs = model(**inputs)
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer = tokenizer.decode(inputs['input_ids'][0][start:end + 1])
```

### Sentence embedding (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # a BERT variant
embeddings = model.encode(['hello world', 'foo bar'])  # shape (2, 384)

# Cosine similarity between the two sentences
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
```

### Cross-encoder reranker
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'What is BERT?'
# retriever is a placeholder for a first-stage bi-encoder search returning objects with a .text field
candidates = retriever.search(query, top_k=100)
pairs = [[query, c.text] for c in candidates]
scores = reranker.predict(pairs)  # one relevance score per (query, candidate) pair
top10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]
```

### ModernBERT (2024)
```python
from transformers import AutoModel, AutoTokenizer

# 8K context, GeGLU, RoPE; needs a transformers release that includes ModernBERT
name = 'answerdotai/ModernBERT-base'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
# long_doc is a string; max_length only takes effect together with truncation=True
inputs = tokenizer(long_doc, return_tensors='pt', truncation=True, max_length=8192)
outputs = model(**inputs)
```

→ A modern revival of BERT with long-context support.
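
### Masked mean pooling (sketch)

An encoder call such as `model(**inputs)` returns one hidden vector per token, so a single embedding per text still requires pooling. The sketch below illustrates mean pooling that respects the padding mask. It is written on plain nested lists so the arithmetic is visible; the function name `masked_mean_pool` and the toy numbers are assumptions for illustration, and in practice the same computation would usually be done on tensors.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  [batch][seq_len][hidden] nested lists of floats
    attention_mask: [batch][seq_len] with 1 for real tokens, 0 for padding
    """
    pooled = []
    for vectors, mask in zip(hidden_states, attention_mask):
        # Keep only the vectors of real (unpadded) tokens.
        kept = [vec for vec, m in zip(vectors, mask) if m == 1]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


hidden = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
print(masked_mean_pool(hidden, mask))  # [[2.0, 3.0], [4.0, 2.0]]
```

Without the mask, the padded vector `[100.0, 100.0]` would be averaged into the first embedding, so the result would depend on how much padding the batch happened to contain.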

### LoRA fine-tune (efficient)
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# base_model: any BERT-style classifier; bert-base-uncased is used here as an example
base_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
lora = LoraConfig(
    r=8, lora_alpha=16, target_modules=['query', 'value'],
    lora_dropout=0.1, bias='none', task_type='SEQ_CLS',
)
model = get_peft_model(base_model, lora)
model.print_trainable_parameters()  # well under 1% of parameters are trained
```

## 🤔 Decision criteria
| Task | Model |
|---|---|
| Classification | BERT / RoBERTa / DeBERTa |
| NER / token | BERT / DeBERTa |
| QA (extractive) | BERT / RoBERTa |
| Sentence similarity | Sentence-BERT (MiniLM, MPNet) |
| Retrieval (dense) | DPR / E5 / BGE |
| Reranker | Cross-encoder |
| Long doc | ModernBERT / Longformer |
| Edge / fast | DistilBERT / ALBERT |
| Generation | GPT (NOT BERT) |

**Default**: BERT-base as the classification baseline; sentence-transformers for retrieval.

## 🔗 Graph
- Parent: [[Transformer]]
- Variants: [[RoBERTa]] · [[DeBERTa]] · [[ModernBERT]]
- Applications: [[NER]] · [[Sentence-Embedding]]
- Adjacent: [[Sentence-Transformers]] · [[LoRA]] · [[BPE]]

## 🤖 LLM usage
**When**: classification, NER, dense retrieval, sentence similarity, fast NLU.
**When not**: generation, chat, long-context creative writing.

## ❌ Anti-patterns
- **GPT for every task**: gives up BERT's speed and low cost.
- **No truncation handling**: inputs overflow `max_length`.
- **Raw `[CLS]` from a model that was not fine-tuned, without mean pooling**: weak embeddings.
- **Keeping NSP**: adds little (the RoBERTa lesson).
- **No padding mask**: pad tokens pollute attention.
- **Cross-encoder as the retriever over a large corpus**: every (query, document) pair needs its own forward pass and nothing can be precomputed, so per-query cost grows with corpus size.

## 🧪 Verification / duplicates
- Verified (Devlin et al. 2018, RoBERTa, ModernBERT 2024).
- Trust level A.
- Related: [[Transformer]] · [[Sentence-Transformers]] · [[ModernBERT]] · [[Fine-Tuning]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: variants + ModernBERT + HF code (classification, NER, QA, embedding, reranker) |