"매 LLM output 에서 harm, stereotype, factual bias 을 제거하면서 helpfulness 를 유지하는 alignment stack". 매 2017 RLHF (Christiano) → 2022 Constitutional AI (Anthropic) → 2024 deliberative alignment (OpenAI o1) → 2026 multi-stage post-training (helpfulness + harmlessness + honesty + sycophancy reduction). 매 모든 frontier model 의 production deployment 의 prerequisite.
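Pattern 1: DPO preference fine-tuning
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Hypothetical checkpoint name; substitute your own SFT model.
BASE = "your-org/your-sft-model"
model = AutoModelForCausalLM.from_pretrained(BASE)
ref_model = AutoModelForCausalLM.from_pretrained(BASE)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(BASE)

# preference data: chosen vs rejected
ds = load_dataset("Anthropic/hh-rlhf")

config = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    output_dir="./dpo-out",
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=ds["train"],
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)
trainer.train()
```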
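For intuition about what `beta` controls, the DPO objective for a single preference pair can be worked out by hand. The sketch below is a minimal illustration, assuming the summed log-probabilities of each response under the policy and the reference model are already available. The function name `dpo_loss` and the numeric inputs are hypothetical, and this is not TRL's implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probabilities."""
    # Implicit rewards: how much more the policy favors each response than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops below log(2).
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

A larger `beta` scales the margin, so the same shift in log-probabilities moves the loss further; a small `beta` keeps the policy closer to the reference.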
Pattern 2: Constitutional AI critique loop
```python
CONSTITUTION = [
    "Avoid suggesting illegal or dangerous activities.",
    "Be honest, even when the truth is uncomfortable.",
    "Avoid stereotyping based on demographic attributes.",
]

def constitutional_critique(prompt, response, principle):
    # `llm` is assumed to be an already-configured client exposing complete(str) -> str.
    critique_prompt = f"""
Response: {response}
Principle: {principle}
Critique any violation, then rewrite to comply.
"""
    return llm.complete(critique_prompt)

# Iterate over response → critique → revision → train on revisions.
```
Pattern 3: Toxicity classifier filter (Detoxify)
```python
from detoxify import Detoxify

clf = Detoxify('unbiased')
scores = clf.predict("user-generated text here")
# {'toxicity': 0.02, 'severe_toxicity': 0.01, 'identity_attack': ...}
if scores['toxicity'] > 0.7:
    block()  # block() stands in for the application's own rejection handler
```
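A single `toxicity` threshold can miss text that scores high on a narrower category. The sketch below checks several categories against separate thresholds. It operates on a plain score dict so it runs without the classifier; the threshold values and the name `violated_categories` are hypothetical and would need tuning on labeled data.

```python
# Hypothetical per-category thresholds; tune them on labeled data from your own traffic.
THRESHOLDS = {
    "toxicity": 0.7,
    "severe_toxicity": 0.3,
    "identity_attack": 0.5,
}

def violated_categories(scores, thresholds=THRESHOLDS):
    """Return the sorted list of categories whose score exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if scores.get(name, 0.0) > limit
    )

# Scores shaped like a classifier's output dict (values here are made up).
benign = {"toxicity": 0.02, "severe_toxicity": 0.01, "identity_attack": 0.01}
flagged = {"toxicity": 0.55, "severe_toxicity": 0.05, "identity_attack": 0.81}

benign_hits = violated_categories(benign)
flagged_hits = violated_categories(flagged)
```

In the second example the overall `toxicity` score stays under 0.7, yet the text is still flagged through `identity_attack`.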
Pattern 5: Inference-time system prompt scaffolding
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = """You are a helpful assistant. Follow these principles:
1. Decline requests for self-harm guidance; offer crisis resources.
2. Decline weapons / CBRN uplift requests.
3. Note uncertainty when factual claims are not verified.
4. Avoid demographic stereotyping in examples and reasoning.
"""

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=SYSTEM,
    messages=[...],
)
```
Pattern 7: Debiasing word embeddings (legacy but illustrative)
```python
import numpy as np

def neutralize(word_vec, bias_direction):
    # project out gender direction (bias_direction must be unit-length)
    return word_vec - np.dot(word_vec, bias_direction) * bias_direction

# Bolukbasi 2016: he-she axis as bias direction
```
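A small worked example shows what the projection does. The sketch below uses made-up 3-dimensional vectors (real embeddings have hundreds of dimensions) and, unlike a bare projection, normalizes the bias direction inside the function so the caller does not have to.

```python
import numpy as np

def neutralize(word_vec, bias_direction):
    """Remove the component of word_vec that lies along bias_direction."""
    unit = bias_direction / np.linalg.norm(bias_direction)  # projection needs a unit vector
    return word_vec - np.dot(word_vec, unit) * unit

# Toy vectors, made up for illustration only.
he = np.array([1.0, 0.2, 0.0])
she = np.array([-1.0, 0.2, 0.0])
engineer = np.array([0.4, 0.5, 0.7])

bias_direction = he - she                      # [2.0, 0.0, 0.0]
debiased = neutralize(engineer, bias_direction)
residual = float(np.dot(debiased, bias_direction))
```

After neutralizing, `residual` is zero: the vector keeps its other components but has no remaining projection on the he-she axis.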
Decision criteria
| Situation | Approach |
| --- | --- |
| Frontier model post-training | RLHF + Constitutional AI + red-team |
| Fine-tune small model | DPO with curated preferences |
| Production filter | Detoxify + custom classifier |
| Audit existing model | BBQ + RealToxicityPrompts + TruthfulQA |
| User-facing app | system prompt + classifier + refusal |
Default: DPO + Constitutional principles for fine-tuning; system prompt + classifier for apps.
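For the user-facing app case, the three layers (system prompt, classifier, refusal) can be composed in one request handler. This is a minimal sketch under stated assumptions: `guarded_reply`, `score_toxicity`, and `generate` are hypothetical names, the 0.7 threshold is illustrative, and the stand-in functions replace a real classifier and model client.

```python
REFUSAL = "Sorry, I can't help with that request."

def guarded_reply(user_text, score_toxicity, generate, threshold=0.7):
    """Classifier on input, model call with system prompt, classifier on output."""
    if score_toxicity(user_text) > threshold:
        return REFUSAL                      # block before spending a model call
    reply = generate(user_text)             # `generate` is expected to apply the system prompt
    if score_toxicity(reply) > threshold:
        return REFUSAL                      # catch unsafe generations as a last check
    return reply

# Stand-ins for a real classifier and model client, for demonstration only.
def fake_score(text):
    return 0.9 if "insult" in text else 0.1

def fake_generate(text):
    return f"echo: {text}"

ok = guarded_reply("hello", fake_score, fake_generate)
blocked = guarded_reply("an insult", fake_score, fake_generate)
```

Checking both the input and the output costs one extra classifier call per request but covers cases where a benign-looking prompt yields an unsafe reply.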