---
id: wiki-2026-0508-toxicity-and-bias-mitigation
title: Toxicity and Bias Mitigation
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LLM Safety, Bias Mitigation, Constitutional AI, RLHF, RLAIF]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [safety, alignment, bias, rlhf, constitutional-ai]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: trl-anthropic-openai
---

# Toxicity and Bias Mitigation

## One-Line Summary

> **"An alignment stack that reduces harm, stereotypes, and factual bias in LLM output while preserving helpfulness."**
> Lineage: 2017 RLHF (Christiano et al.) → 2022 Constitutional AI (Anthropic) → 2024 deliberative alignment (OpenAI o1) → 2026 multi-stage post-training (helpfulness + harmlessness + honesty + sycophancy reduction).
> A prerequisite for the production deployment of every frontier model.

## Core Concepts

### Taxonomy of harms

1. **Toxicity**: hate speech, harassment, slurs.
2. **Bias**: demographic stereotypes (gender, race, religion).
3. **Misinformation**: false or misleading factual claims.
4. **Manipulation**: persuasion, deception, sycophancy.
5. **Dual-use**: bioweapon / cyber / CBRN uplift.
6. **Privacy**: PII leakage, training data extraction.

### Mitigation pipeline (modern)

1. **Pretraining filter**: C4-style heuristic filters plus classifiers, and deduplication of Common Crawl data.
2. **SFT** (supervised fine-tuning): train on safe demonstrations.
3. **RLHF / DPO** (Direct Preference Optimization, 2023+): optimize against human preference data.
4. **Constitutional AI / RLAIF** (Anthropic): AI feedback judged against written principles.
5. **Red-teaming**: human and automated adversarial probing.
6. **Inference-time**: classifier filters, refusal training, system prompts.
7. **Deliberative / chain-of-thought safety** (o1, Claude 3.7+): the model reasons explicitly about the safety policy before answering.

### Bias measurement benchmarks

- **BBQ** (Bias Benchmark for QA; nine social dimensions plus two intersectional categories, 11 categories in total).
- **StereoSet** (intrasentence and intersentence stereotype association).
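- **WinoGender / WinoBias** (gender bias in coreference resolution).
- **RealToxicityPrompts** (Gehman et al. 2020).
- **TruthfulQA** (Lin et al. 2021, common misconceptions).
- **AILuminate** (MLCommons 2024+, hazard taxonomy).

### Applications

1. Production LLM safety (Claude, GPT, Gemini).
2. Content moderation (post-training classifier).
3. Fairness audit (HR, lending, criminal justice ML).
4. Domain-specific safety (medical advice, legal disclaimers).

## 💻 Patterns

### Pattern 1: DPO (Direct Preference Optimization, 2023+)

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Hypothetical placeholder path; substitute your own SFT checkpoint.
model_name = "path/to/sft-checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data: each row pairs a "chosen" and a "rejected" conversation.
ds = load_dataset("Anthropic/hh-rlhf")

config = DPOConfig(
    beta=0.1,  # how strongly the policy is held close to ref_model
    learning_rate=5e-7,
    output_dir="./dpo-out",
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=ds["train"],
    tokenizer=tokenizer,  # newer TRL releases rename this argument to processing_class
)
trainer.train()
```

### Pattern 2: Constitutional AI critique loop

```python
CONSTITUTION = [
    "Avoid suggesting illegal or dangerous activities.",
    "Be honest, even when the truth is uncomfortable.",
    "Avoid stereotyping based on demographic attributes.",
]

# `llm` is a stand-in for any client exposing complete(prompt) -> str.
def constitutional_critique(prompt, response, principle):
    # Include the original prompt so the critique can judge the response in context.
    critique_prompt = f"""
    Prompt: {prompt}
    Response: {response}
    Principle: {principle}
    Critique any violation, then rewrite to comply.
    """
    return llm.complete(critique_prompt)

# Iterate over response → critique → revision → train on revisions.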
```

### Pattern 3: Toxicity classifier filter (Detoxify)

```python
from detoxify import Detoxify

clf = Detoxify("unbiased")

def should_block(text, threshold=0.7):
    # predict() returns a dict of per-label scores in [0, 1], e.g.
    # {'toxicity': ..., 'severe_toxicity': ..., 'identity_attack': ..., ...}
    scores = clf.predict(text)
    return scores["toxicity"] > threshold

if should_block("user-generated text here"):
    print("blocked")  # replace with your application's block/flag action
```

### Pattern 4: BBQ-style bias eval

```python
from datasets import load_dataset

# Field names follow the BBQ schema (ans0..ans2, label, target_loc). Check the
# dataset card: some mirrors need a config name or do not ship target_loc.
bbq = load_dataset("heegyu/bbq")
test = bbq["test"]

correct = 0
biased = 0
for item in test:
    choices = "\n".join(f"{i}. {item[f'ans{i}']}" for i in range(3))
    prompt = f"{item['context']}\n{item['question']}\n{choices}\nAnswer with 0, 1, or 2."
    # `model.generate` is a stand-in for your inference call; it must return text.
    reply = model.generate(prompt).strip()
    answer = int(reply[0]) if reply[:1] in ("0", "1", "2") else -1
    if answer == item["label"]:
        correct += 1
    elif answer == item["target_loc"]:  # index of the stereotype-consistent answer
        biased += 1

# Divide by the number of examples, not len(bbq), which is the number of splits.
print(f"Accuracy: {correct / len(test):.3f}, Bias rate: {biased / len(test):.3f}")
```

### Pattern 5: Inference-time system prompt scaffolding

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = """You are a helpful assistant. Follow these principles:
1. Decline requests for self-harm guidance; offer crisis resources.
2. Decline weapons / CBRN uplift requests.
3. Note uncertainty when factual claims are not verified.
4. Avoid demographic stereotyping in examples and reasoning.
"""

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=SYSTEM,
    messages=[...],
)
```

### Pattern 6: Red-team probing (PAIR-style automated)

```python
# PAIR: Prompt Automatic Iterative Refinement.
# The three model arguments are stand-ins for your own wrappers:
# generate(...) returns text and judge_model.score(...) returns a float in [0, 1].
def red_team_pair(target_model, attacker_model, judge_model, harmful_goal, rounds=10):
    attacker_history = [
        {"role": "system", "content": f"Find a prompt that elicits: {harmful_goal}"}
    ]
    for _ in range(rounds):
        prompt = attacker_model.generate(attacker_history)
        response = target_model.generate(prompt)
        score = judge_model.score(response, harmful_goal)
        if score > 0.8:
            return prompt, response  # jailbreak found
        # Record the failed attempt so the attacker can refine it next round.
        attacker_history.append({"role": "assistant", "content": prompt})
        attacker_history.append(
            {"role": "user", "content": f"Failed. Score {score}. Try again."}
        )
    return None  # no jailbreak found within the round budget
```

### Pattern 7: Debiasing word embeddings (legacy but illustrative)

```python
import numpy as np

def neutralize(word_vec, bias_direction):
    # Normalize first: the projection formula is only correct for a unit vector.
    unit = bias_direction / np.linalg.norm(bias_direction)
    # Project out the component along the gender direction.
    return word_vec - np.dot(word_vec, unit) * unit

# Bolukbasi et al. 2016: the he-she axis serves as the bias direction.
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Frontier model post-training | RLHF + Constitutional AI + red-team |
| Fine-tune small model | DPO with curated preferences |
| Production filter | Detoxify + custom classifier |
| Audit existing model | BBQ + RealToxicityPrompts + TruthfulQA |
| User-facing app | system prompt + classifier + refusal |

**Default**: DPO + Constitutional principles for fine-tuning; system prompt + classifier for apps.

## 🔗 Graph

- Parent: [[AI Alignment]] · [[AI_Safety_and_Alignment|AI Safety]]
- Variants: [[RLHF]] · [[Constitutional AI]] · [[DPO]] · [[RLAIF]]
- Applications: [[Content Moderation]]
- Adjacent: [[Jailbreak]] · [[Adversarial Robustness]] · [[Mechanistic Interpretability]]

## 🤖 LLM Usage

**When to use**: model deployment, safety eval, bias audit, alignment research.
**When not to use**: pure capability eval (use a separate benchmark).

## ❌ Anti-patterns

- **Filter-only safety**: relying on a classifier alone is easily bypassed. The base model itself must be aligned.
- **Over-refusal**: an overly restrictive model becomes useless (helpfulness collapse).
- **Single benchmark eval**: BBQ alone misses other kinds of bias. Evaluate on multiple benchmarks.
- **Ignoring sycophancy**: RLHF preferences can collapse into agreeing with the user.
- **Anglo-centric eval**: English-only benchmarks miss harms in other languages.
- **Static red-team**: a one-time adversarial test loses its value once the model or the attacks drift. Red-team continuously.

## 🧪 Verification / Duplicates

- Verified (Bai et al. Constitutional AI 2022; Rafailov DPO 2023; OpenAI o1 system card 2024; Anthropic Claude 3 model card; MLCommons AILuminate 2024).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full mitigation pipeline (RLHF → CAI → deliberative) |
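
## 🧮 Supplement: DPO loss for one preference pair

Pattern 1 delegates the objective to `DPOTrainer`. As a rough illustration of what `beta` controls there, the DPO loss from Rafailov et al. 2023 can be sketched for a single (chosen, rejected) pair in plain Python. The function name `dpo_loss` and the example log-probabilities are assumptions made for this sketch; it is not TRL's implementation, and it takes already-summed sequence log-probabilities as inputs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs."""
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin, so the loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy shifted toward the chosen response: the loss drops below log(2).
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

A larger `beta` amplifies the same log-probability shift into a larger margin, so the policy is rewarded sooner for small deviations from the reference but also penalized more sharply when it favors the rejected response.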