BERT (Bidirectional Encoder Representations from Transformers)
2026-05-10
Language: Python
Framework: Transformers (HuggingFace) / PyTorch
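📌 One-line insight
"The bidirectional genius." BERT breaks from left-to-right-only language models by conditioning on context from both directions, and it came to dominate NLU. In the GPT era it lost generation to decoder models, but it remains a standard choice for classification, retrieval, and embedding. ModernBERT (2024) marks a revival.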
📖 Core
Architecture
Transformer encoder only (no decoder).
12 layers (base) / 24 (large).
Hidden size 768 (base) / 1024 (large).
110M (base) / 340M (large) parameters.
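The parameter counts above can be roughly reproduced from the layer sizes. The sketch below is a back-of-the-envelope count, assuming the bert-base-uncased / bert-large-uncased setup: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward width of 4x the hidden size, and a pooler layer on top. The function name `bert_param_count` is illustrative, not a library API.

```python
def bert_param_count(hidden, layers, vocab=30522, max_pos=512, segments=2):
    """Rough parameter count for a BERT encoder (embeddings + layers + pooler)."""
    ffn = 4 * hidden                      # feed-forward inner width (assumed 4x hidden)
    layer_norm = 2 * hidden               # scale + bias
    # Token, position and segment embedding tables, followed by one LayerNorm
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn -> hidden, each with a bias
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden     # dense layer applied to the [CLS] vector
    return embeddings + layers * per_layer + pooler


base = bert_param_count(hidden=768, layers=12)
large = bert_param_count(hidden=1024, layers=24)
print(f"base:  {base / 1e6:.1f}M")
print(f"large: {large / 1e6:.1f}M")
```

Under these assumptions the counts land close to the commonly quoted 110M and 340M figures; the exact number shifts with vocabulary size and whether task heads are included.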
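Training objective
MLM (Masked Language Model)
15% of tokens are selected for prediction.
Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
The model predicts the original token from bidirectional context.
NSP (Next Sentence Prediction)
Does sentence B actually follow sentence A?
→ RoBERTa dropped NSP after finding it did not help downstream performance.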
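The 15% / 80-10-10 corruption rule for MLM can be sketched in a few lines. This is a simplified illustration, not the original pretraining code: it selects each token independently with probability 0.15 (the original selects a fixed share of positions per sequence), and the function name `mlm_corrupt`, the toy vocabulary, and the `SPECIAL_TOKENS` set are assumptions made for the example.

```python
import random

SPECIAL_TOKENS = {'[CLS]', '[SEP]', '[PAD]'}


def mlm_corrupt(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where no prediction is made."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with mask_prob
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append('[MASK]')           # 80%: mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random replacement
        else:
            corrupted.append(tok)                # 10%: unchanged, still predicted
    return corrupted, labels


tokens = ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
vocab = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'ran']
print(mlm_corrupt(tokens, vocab, rng=random.Random(0)))
```

Calling this function anew on every epoch gives a different mask each time, which is the idea behind the dynamic masking used by RoBERTa; calling it once and caching the result corresponds to static masking.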
Input format
[CLS] sentence A [SEP] sentence B [SEP]
The final-layer representation of [CLS] is used for classification.
Segment embeddings distinguish sentence A from sentence B.
Position embeddings are learned.
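As a minimal sketch of the input layout described above, the helper below assembles the token sequence together with the segment and position indices that select the corresponding embeddings. It works on already-tokenized word lists; `build_pair_input` is a hypothetical name used for illustration, and a real tokenizer would also map tokens to vocabulary ids.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Positions simply count from 0; each index looks up a learned position embedding
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_input(
    ['the', 'cat', 'sat'], ['it', 'purred']
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:>2}  seg={seg}  {tok}")
```

The input embedding at each position is the sum of the token, segment, and position embeddings for that row.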
Fine-tuning tasks
Classification: [CLS] → linear layer → label.
NER (token classification): one label per token.
QA (extractive): predict the start and end tokens of the answer span.
Sentence pair: NLI, STS.
Embedding: [CLS] vector or mean pooling.
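For extractive QA, the model emits one start logit and one end logit per token, and the answer is the span with the highest combined score. The sketch below shows one common way to decode that span, assuming plain Python lists of logits; the function name `best_span` and the `max_answer_len` cap are illustrative choices rather than a fixed part of BERT.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximizing start_logit + end_logit with start <= end."""
    best = (0, 0)
    best_score = float('-inf')
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best, best_score


start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.0, 0.5, 3.0, 0.1]
span, score = best_span(start_logits, end_logits)
print(span, score)
```

Taking the argmax of each logit vector independently can produce an end before the start, which is why the search is restricted to valid spans.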
Variants

| Model | Change |
| --- | --- |
| RoBERTa | No NSP, more data, dynamic masking |
| ALBERT | Cross-layer parameter sharing, smaller model |
| DistilBERT | Distillation, about 60% of BERT-base's size |
| DeBERTa (v3) | Disentangled attention |
| ELECTRA | Replaced token detection |
| ModernBERT (2024) | 8K context, GeGLU, fast |
Modern relevance
Embedding: the base for sentence-transformers models.
Classification: fast and cheap.
Retrieval: dense retrievers.
Cross-encoder reranker: reranks the candidates returned by a bi-encoder.
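Dense retrieval reduces to nearest-neighbor search over embedding vectors: documents are encoded once, the query is encoded at search time, and candidates are ranked by similarity. The sketch below illustrates this with toy 2-dimensional vectors standing in for encoder outputs; `cosine` and `top_k` are illustrative helpers, and a production system would typically use an approximate nearest-neighbor index instead of a full scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return [(doc_index, score)] for the k documents most similar to the query."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, s) for s, i in scored[:k]]


# Toy vectors standing in for bi-encoder embeddings (assumption for illustration)
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.0]
hits = top_k(query_vec, doc_vecs, k=2)
print(hits)
```

The top hits from this first stage are what a cross-encoder reranker would then rescore pair by pair.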
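
    # NER fine-tuning (token classification)
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    # label_list is assumed to be defined elsewhere, e.g. ['O', 'B-PER', 'I-PER', ...]
    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    model = AutoModelForTokenClassification.from_pretrained(
        'bert-base-cased',
        num_labels=len(label_list),
    )

    # BIO tagging: B-PER, I-PER, B-LOC, ...
    def tokenize_align_labels(examples):
        tokenized = tokenizer(examples['tokens'], is_split_into_words=True, truncation=True)
        labels = []
        for i, label in enumerate(examples['ner_tags']):
            word_ids = tokenized.word_ids(i)
            # Special tokens have no word id and get -100, which the loss ignores;
            # every subword piece inherits the label of its source word
            aligned = [-100 if w is None else label[w] for w in word_ids]
            labels.append(aligned)
        tokenized['labels'] = labels
        return tokenized
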
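Once a token classifier predicts BIO tags, the tags still have to be merged into entity spans. The following is a minimal sketch of that decoding step, assuming word-level tag strings; `bio_to_spans` is a hypothetical helper, and treating a stray `I-` tag as the start of a new entity is a lenient design choice rather than a rule of the scheme.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans, with end exclusive."""
    spans = []
    start, label = None, None
    # A trailing 'O' sentinel flushes an entity that runs to the end of the sequence
    for i, tag in enumerate(list(tags) + ['O']):
        continues = tag.startswith('I-') and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, start, i))  # close the open entity
        if tag.startswith('B-') or tag.startswith('I-'):
            start, label = i, tag[2:]        # open a new entity
        else:
            start, label = None, None
    return spans


tags = ['B-PER', 'I-PER', 'O', 'B-LOC']
print(bio_to_spans(tags))
```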

    # Sentence embeddings with sentence-transformers
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer('all-MiniLM-L6-v2')  # a BERT variant
    embeddings = model.encode(['hello world', 'foo bar'])  # shape (2, 384)

    # Similarity
    sim = cosine_similarity([embeddings[0]], [embeddings[1]])

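Sentence embedding models of this kind usually turn per-token vectors into a single sentence vector by mean pooling over the real (non-padding) tokens. The sketch below shows that pooling step on plain Python lists; `mean_pool` is an illustrative name, and the token vectors are made-up numbers rather than model outputs.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, x in enumerate(vec):
                total[j] += x
    # max(count, 1) avoids division by zero for an all-padding input
    return [t / max(count, 1) for t in total]


token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
attention_mask = [1, 1, 0]
print(mean_pool(token_embeddings, attention_mask))
```

Without the mask, padding vectors would be averaged in and sentences of different lengths in the same batch would get distorted embeddings.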
Cross-encoder reranker

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    query = 'What is BERT?'
    # retriever is a placeholder for a first-stage bi-encoder retriever defined elsewhere
    candidates = retriever.search(query, top_k=100)
    pairs = [[query, c.text] for c in candidates]
    scores = reranker.predict(pairs)
    top10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]

ModernBERT (2024)
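
    from transformers import AutoModel, AutoTokenizer

    # 8K context, GeGLU, RoPE
    tokenizer = AutoTokenizer.from_pretrained('answerdotai/ModernBERT-base')
    model = AutoModel.from_pretrained('answerdotai/ModernBERT-base')
    # long_doc is a long input string defined elsewhere;
    # truncation=True is needed for max_length to take effect
    inputs = tokenizer(long_doc, return_tensors='pt', truncation=True, max_length=8192)
    outputs = model(**inputs)
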
→ A modern revival of BERT with long-context support.
LoRA fine-tune (efficient)
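
    from peft import LoraConfig, get_peft_model

    lora = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=['query', 'value'],  # attention projections in BERT layers
        lora_dropout=0.1,
        bias='none',
        task_type='SEQ_CLS',
    )
    # base_model is a BERT sequence-classification model loaded elsewhere
    model = get_peft_model(base_model, lora)
    # Only a small fraction of parameters is trained (roughly 0.3% for BERT-base at r=8)
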
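The share of trainable parameters follows directly from the LoRA shapes: each adapted weight matrix gains a rank-r pair A (r x hidden) and B (hidden x r). The sketch below works through that arithmetic for BERT-base with r=8 on the query and value projections, assuming 12 layers, hidden size 768, and a 110M-parameter base; `lora_param_count` is an illustrative helper, and the classification head, which is typically also trained, is not counted.

```python
def lora_param_count(hidden, layers, r, targets_per_layer=2):
    """Number of LoRA weights when adapting square (hidden x hidden) projections."""
    per_module = r * hidden + hidden * r  # A is (r x hidden), B is (hidden x r)
    return layers * targets_per_layer * per_module


lora_params = lora_param_count(hidden=768, layers=12, r=8)
fraction = lora_params / 110e6  # relative to the ~110M base parameters
print(lora_params, f"{fraction:.2%}")
```

Doubling the rank or adding the key and output projections as targets doubles the count, so the budget stays easy to reason about.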
🤔 Decision criteria

| Task | Model |
| --- | --- |
| Classification | BERT / RoBERTa / DeBERTa |
| NER / token | BERT / DeBERTa |
| QA (extractive) | BERT / RoBERTa |
| Sentence similarity | Sentence-BERT (MiniLM, MPNet) |
| Retrieval (dense) | DPR / E5 / BGE |
| Reranker | Cross-encoder |
| Long doc | ModernBERT / Longformer |
| Edge / fast | DistilBERT / ALBERT |
| Generation | GPT (NOT BERT) |
Default: BERT-base as the classification baseline. For retrieval, use sentence-transformers.