---
id: wiki-2026-0508-data-ethics-privacy
title: Data Ethics and Privacy
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [data ethics, privacy, GDPR, CCPA, differential privacy, federated learning, k-anonymity, PII]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ethics, privacy, gdpr, ccpa, differential-privacy, federated-learning, ai-governance, pii]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: ethics / law / cryptography
  applicable_to: [Data Governance, ML Privacy, Compliance]
---

# Data Ethics and Privacy

## In one line

> **"Not 'can we' but 'should we'."** The regulatory baseline is GDPR (EU, 2018), CCPA (California), and the EU AI Act (2024). Modern techniques: differential privacy, federated learning, homomorphic encryption, zero-knowledge proofs. In the LLM era, training data and memorization pose a new challenge.

## Core principles

### Fair Information Practice Principles (FIPPs)
1. **Notice / Transparency**.
2. **Consent / Choice**.
3. **Access / Participation**.
4. **Integrity / Security**.
5. **Enforcement / Redress**.

### GDPR (EU, 2018)
- **Lawful basis**: six options (consent, contract, legal obligation, vital interests, public task, legitimate interests).
- **Data minimization**.
- **Purpose limitation**.
- **Rights of access, rectification, erasure, portability, and objection**.
- **DPIA** (Data Protection Impact Assessment).
- **DPO** (Data Protection Officer).
- Fines: up to EUR 20 million or 4% of global annual turnover, whichever is higher.

### CCPA / CPRA (California)
- Right to opt out of the sale of personal information.
- Right to limit the use of sensitive personal information.

### EU AI Act (2024)
- Risk tiers: unacceptable / high / limited / minimal.
- High-risk systems: audits and human oversight.

### Privacy techniques

#### Anonymization (irreversible)
- Remove PII and generalize the remaining attributes.
- k-anonymity (Sweeney 2002).
- l-diversity, t-closeness.

#### Pseudonymization (reversible with key)
- Replace PII with a hash or token.
- Still personal data under GDPR, but recognized as a safeguard that can ease some obligations.

#### Differential Privacy (Dwork 2006)
- Noise injection.
- Mathematical guarantee, parameterized by ε.
- Deployed by Apple, Google, and the US Census Bureau.

#### Federated Learning
- Raw data never leaves the device or site.
- Only model updates are shared.
- Used in mobile and healthcare settings.

#### Homomorphic Encryption
- Compute directly on encrypted data.
- Still expensive.

#### Secure Multi-party Computation (MPC)
- Multiple parties compute a joint result without sharing their inputs.

#### Zero-Knowledge Proof
- Prove a statement without revealing the underlying data.
- ZK-ML is emerging.

### LLM-specific privacy
- **Training data leak**: verbatim memorization.
- **Membership inference**: was record X in the training data?
- **Model inversion**: reconstructing inputs from the model.
- **Mitigation**: dedup, DP-SGD, scrubbing, prompt safety.

### Design patterns
1. **Privacy by Design** (Cavoukian).
2. **Data minimization**.
3. **Encryption at rest + in transit**.
4. **Access control + audit log**.
5. **Retention policy**.
6. **Right to erasure pipeline**.

## 💻 Patterns

### PII detection + redaction
```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize(text, language='en'):
    # Detect PII spans, then replace each span with its entity placeholder
    results = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Example
print(anonymize('John Smith, ssn 123-45-6789, lives at 123 Main St.'))
# Detected entities are replaced by placeholders such as <PERSON>;
# which spans are caught depends on the recognizers that are loaded.
```

### k-Anonymity
```python
def k_anonymize(df, quasi_identifiers, k=5):
    """Keep only rows whose quasi-identifier group has at least k members."""
    # quasi_identifiers is a list of column names
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform('size')
    return df[group_sizes >= k]

# Generalize age and zip first (age bands, zip prefixes) so that groups reach size >= 5;
# this function only suppresses the rows left in smaller groups.
```

### Differential Privacy (Laplace mechanism)
```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon):
    """ε-differential privacy via Laplace noise with scale sensitivity / ε."""
    noise = np.random.laplace(0, sensitivity / epsilon)
    return query_result + noise

# Example: count of users with attribute X (a counting query has sensitivity 1)
count_true = df['has_x'].sum()
count_dp = laplace_mechanism(count_true, sensitivity=1, epsilon=1.0)
```

### DP-SGD (training)
```python
import torch.nn.functional as F
from opacus import PrivacyEngine

# Assumes model, optimizer, and train_loader are already defined.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# Training now carries a DP guarantee
for x, y in train_loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f'(ε={epsilon}, δ={1e-5})-DP')
```

### Federated Learning (Flower)
```python
import flwr as fl

# Assumes model, local_dataset, set_parameters, train, and test are defined elsewhere.
class Client(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [val.cpu().numpy() for val in model.state_dict().values()]

    def fit(self, parameters, config):
        # Local training: raw data stays on the client
        set_parameters(model, parameters)
        train(model, local_dataset)
        return self.get_parameters(config), len(local_dataset), {}

    def evaluate(self, parameters, config):
        # Evaluate the global parameters, not stale local ones
        set_parameters(model, parameters)
        loss, accuracy = test(model, local_dataset)
        return float(loss), len(local_dataset), {'accuracy': float(accuracy)}

# Recent Flower releases expect a Client here, hence to_client();
# check the start-up API against the installed version.
fl.client.start_client(server_address='central:8080', client=Client().to_client())
```

### Right to Erasure (GDPR)
```python
def erase_user(user_id):
    """GDPR Art. 17 (right to erasure)."""
    # 1. Primary store
    db.users.delete(user_id)
    # 2. Derived stores (cache, analytics, ML training)
    cache.invalidate(f'user:{user_id}')
    analytics_db.delete_user_events(user_id)
    # 3. Backups (delayed purge)
    schedule_backup_purge(user_id, after_days=30)
    # 4. ML model: unlearn if supported, otherwise schedule a retrain
    if user_in_training_data(user_id):
        if not machine_unlearning(user_id):
            schedule_retrain()
    # 5. Log + audit
    audit_log.write({'action': 'erase', 'user_id': user_id, 'date': now()})
```

### Access control + audit
```python
from datetime import datetime, timezone

def access_pii(user_id, requester, reason):
    if not requester.has_role('data_steward'):
        raise PermissionError('data_steward role required')
    # Record who accessed which fields, and why, before returning data
    audit_log.write({
        'requester': requester.id,
        'subject': user_id,
        'reason': reason,
        'timestamp': datetime.now(timezone.utc),
        'data_accessed': ['email', 'phone'],
    })
    return db.users.get(user_id, fields=['email', 'phone'])
```

### LLM verbatim leak detection
```python
import random

def check_training_data_leak(model, training_corpus, n_samples=100):
    """Does the model reproduce training text verbatim?"""
    # Skip short chunks: an empty continuation would match any completion
    candidates = [c for c in training_corpus if len(c) >= 300]
    leaks = []
    for chunk in random.sample(candidates, min(n_samples, len(candidates))):
        prefix = chunk[:100]
        completion = model.generate(prefix, max_tokens=200)
        if chunk[100:300] in completion:
            leaks.append({'chunk': chunk[:50] + '...', 'leaked': True})
    return leaks
```

### Membership inference attack (defense check)
```python
import numpy as np

def membership_inference_test(model, member_data, non_member_data):
    """Can an attacker tell whether a specific record was in the training data?"""
    # loss(model, x, y) is assumed to return a per-example scalar loss
    member_loss = [loss(model, x, y) for x, y in member_data]
    non_member_loss = [loss(model, x, y) for x, y in non_member_data]
    # Simple threshold attack: low loss suggests a training member
    threshold = np.median(member_loss + non_member_loss)
    correct = sum(1 for l in member_loss if l < threshold) + \
              sum(1 for l in non_member_loss if l > threshold)
    accuracy = correct / (len(member_loss) + len(non_member_loss))
    # Noticeably better than chance (0.5) indicates leakage
    if accuracy > 0.55:
        return 'WARN: membership inference vulnerability'
    return 'OK'
```

### Data retention policy
```python
def retention_purge():
    """Run daily from cron."""
    # Retention windows driven by GDPR and business requirements.
    db.events.delete(where="created_at < NOW() - INTERVAL '7 years'")
    db.user_logs.delete(where="created_at < NOW() - INTERVAL '90 days'")
    db.analytics_raw.delete(where="created_at < NOW() - INTERVAL '13 months'")
```

## Decision criteria

| Situation | Approach |
|---|---|
| EU users | GDPR compliance (DPO + DPIA + erasure) |
| US users | CCPA + sector law (HIPAA, FERPA) |
| Sensitive aggregate stats | Differential Privacy |
| Cross-org training | Federated Learning |
| Encrypted compute | Homomorphic / MPC |
| LLM training | DP-SGD + dedup + scrub |
| Identity-needed | Pseudonymize + access control |
| Public data | Anonymize (k-anonymity) |

**Default**: Privacy by Design + access control + retention + erasure pipeline.

## 🔗 Graph

- Parent: [[AI-Ethics]] · [[Privacy]]
- Variants: [[GDPR]] · [[CCPA]] · [[Differential-Privacy]] · [[Federated-Learning]] · [[Homomorphic-Encryption]]
- Applications: [[k-Anonymity]]
- Adjacent: [[Algorithmic-Fairness]] · [[Authenticity]] · [[AI-Sovereignty]] · [[Data-Flywheel-Effect]] · [[Anthropomorphism]]

## 🤖 LLM usage

**When**: product privacy review, GDPR audit, ML privacy, cross-border data transfer.

**When not**: specific legal advice (consult a lawyer), clinical medical questions (consult a HIPAA expert).

## ❌ Anti-patterns

- **False sense of security from anonymization** (linkage attacks remain possible): combine k-anonymity with l-diversity.
- **Long retention without business need**: a GDPR violation.
- **No erasure pipeline**: the right to erasure cannot be fulfilled.
- **PII in logs**: an invisible leak.
- **Raw data leaking from federated learning** (model inversion): DP-SGD is needed as well.
- **Invalid consent** (forced, vague).

## 🧪 Verification / Duplicates

- Verified (GDPR text, Apple DP, Dwork DP paper, Sweeney k-anonymity).
- Trust level A.
- Related: [[Algorithmic-Fairness]] · [[Authenticity]] · [[Bias-Correction-Algorithm]] · [[Anthropomorphism]] · [[Atmospheric-Intelligence]] (privacy challenges).

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GDPR + DP + FL + Presidio / k-anon / Opacus / Flower / erasure code |
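
## Supplementary pattern: MPC secure sum

Secure Multi-party Computation is listed among the privacy techniques but has no code pattern. The sketch below illustrates one common building block, additive secret sharing, used to compute a sum without any party revealing its input. It is a single-process simulation under stated assumptions: the names `PRIME`, `share`, `reconstruct`, and `secure_sum` are hypothetical, parties are assumed honest, and no network transport is modeled.

```python
import secrets

# Assumed field modulus (a Mersenne prime); the true sum must be smaller than this.
PRIME = 2**61 - 1

def share(value, n_parties):
    """Split value into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares add up to the value
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares; any strict subset looks uniformly random."""
    return sum(shares) % PRIME

def secure_sum(private_inputs):
    """Simulate n parties summing their inputs via additive sharing."""
    n = len(private_inputs)
    # Party i splits its input and sends share j to party j
    all_shares = [share(v, n) for v in private_inputs]
    # Party j adds the shares it received and publishes only that partial sum
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                    for j in range(n)]
    return reconstruct(partial_sums)

total = secure_sum([10, 20, 30])
```

Only the partial sums are published, so each input stays hidden unless all other parties collude. Real deployments would add authenticated channels and protection against malicious parties, which this sketch omits.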