---
id: wiki-2026-0508-data-ethics-privacy
title: Data Ethics and Privacy
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [data ethics, privacy, GDPR, CCPA, differential privacy, federated learning, k-anonymity, PII]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ethics, privacy, gdpr, ccpa, differential-privacy, federated-learning, ai-governance, pii]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: ethics / law / cryptography
  applicable_to: [Data Governance, ML Privacy, Compliance]
---

# Data Ethics and Privacy

## One-line summary

> **"Not 'can we' but 'should we'."** The regulatory baseline is GDPR (EU, 2018), CCPA (California), and the EU AI Act (2024). Modern techniques: differential privacy, federated learning, homomorphic encryption, zero-knowledge proofs. In the LLM era, training data and memorization are the new challenge.

## Core principles

### Fair Information Practice Principles (FIPPs)

1. **Notice / Transparency**.
2. **Consent / Choice**.
3. **Access / Participation**.
4. **Integrity / Security**.
5. **Enforcement / Redress**.

### GDPR (EU, 2018)

- **Lawful basis**: six bases (consent, contract, legal obligation, vital interests, public task, legitimate interests).
- **Data minimization**.
- **Purpose limitation**.
- **Rights of access, rectification, erasure, portability, and objection**.
- **DPIA** (Data Protection Impact Assessment).
- **DPO** (Data Protection Officer).
- Fines: up to €20 million or 4% of global annual turnover, whichever is higher.

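
The upper fine tier in GDPR Art. 83(5) is the higher of €20 million and 4% of worldwide annual turnover. A minimal sketch of that arithmetic follows; the function name is hypothetical, and actual fines are set case by case below this cap.

```python
def gdpr_max_fine_eur(annual_turnover_eur: int) -> float:
    """Cap of an Art. 83(5) fine: EUR 20M or 4% of turnover, whichever is higher."""
    return max(20_000_000.0, annual_turnover_eur * 4 / 100)


print(gdpr_max_fine_eur(100_000_000))    # 20000000.0 (4% would be only 4M)
print(gdpr_max_fine_eur(2_000_000_000))  # 80000000.0
```
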
### CCPA / CPRA (California)

- Right to opt out of the sale of personal information.
- Right to limit the use of sensitive personal information.

### EU AI Act (2024)

- Risk tiers: unacceptable / high / limited / minimal.
- High-risk systems: audits and human oversight.

### Privacy techniques

#### Anonymization (irreversible)

- Remove PII and generalize the remaining attributes.
- k-anonymity (Sweeney 2002).
- l-diversity, t-closeness.

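
k-anonymity alone does not protect a group whose members all share the same sensitive value. The sketch below checks the simplest form of l-diversity (the number of distinct sensitive values per quasi-identifier group). The record layout and the function name are assumptions for illustration.

```python
from collections import defaultdict


def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    groups = defaultdict(set)
    for row in records:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())


# Toy table: every group has 2 rows (2-anonymous) ...
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
]
# ... but the first group reveals its diagnosis to anyone who can place a person in it.
print(l_diversity(records, ["age", "zip"], "diagnosis"))  # 1
```
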
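#### Pseudonymization (reversible with key)

- Replace PII with a hash or token.
- GDPR still treats pseudonymized data as personal data, but recognizes pseudonymization as a safeguard, which lightens some requirements.
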
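
One way to produce such a token is a keyed hash (HMAC), sketched below with the standard library only. The key value and function name are hypothetical. An HMAC cannot be inverted directly: re-linking a token to a person requires the key (to recompute tokens for candidate values) or a separately stored mapping table. An unkeyed hash of low-entropy data such as e-mail addresses is open to dictionary attacks.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed, deterministic token: the same input and key give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"demo-key-load-from-a-secrets-manager"  # hypothetical; never hard-code in production
token = pseudonymize("alice@example.com", key)
print(token[:16], "...")
```

Because the token is deterministic, records for the same person can still be joined across tables without exposing the underlying identifier.
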
#### Differential Privacy (Dwork 2006)

- Inject calibrated random noise into outputs.
- Mathematical guarantee, parameterized by ε.
- Deployed by Apple, Google, and the US Census Bureau.

#### Federated Learning

- Raw data never leaves the device or site.
- Only model updates are shared.
- Used in mobile and healthcare settings.

#### Homomorphic Encryption

- Compute directly on encrypted data.
- Still computationally expensive.

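
To make "compute on encrypted data" concrete, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic only (not fully homomorphic): multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The primes are deliberately tiny and the function names are illustrative; this is not a secure implementation.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Return (public modulus n, private key) for primes p and q, using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; this shortcut holds because g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    n2 = n * n
    while True:  # pick random r that is invertible mod n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n


n, priv = keygen(1009, 1013)  # toy primes, far too small for real use
c_sum = (encrypt(n, 1200) * encrypt(n, 345)) % (n * n)  # add without decrypting
print(decrypt(n, priv, c_sum))  # 1545
```
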
#### Secure Multi-party Computation (MPC)

- Multiple parties jointly compute a function without sharing their inputs.

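
A common building block is additive secret sharing. The sketch below computes a joint sum where no single party sees another party's input. The modulus, the salary figures, and the function names are assumptions for illustration, and the sketch ignores networking and malicious parties.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; all arithmetic is done mod this value


def share(secret: int, n_parties: int):
    """Split a secret into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    return sum(shares) % MODULUS


salaries = [52_000, 61_000, 75_000]
# Each party splits its input and sends one share to every party.
all_shares = [share(s, len(salaries)) for s in salaries]
# Party j adds the j-th share of every input; each share alone looks random.
partial_sums = [sum(column) % MODULUS for column in zip(*all_shares)]
print(reconstruct(partial_sums))  # 188000
```
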
#### Zero-Knowledge Proof

- Prove a statement without revealing the underlying data.
- ZK-ML is emerging.

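
As a small illustration of "prove without reveal", the sketch below runs one round of the Schnorr identification protocol: the prover shows knowledge of a secret exponent `x` without sending it. The group parameters are toy values chosen for readability, and `challenge_fn` is a hypothetical stand-in for the verifier's random challenge.

```python
import secrets

P, Q, G = 23, 11, 2  # toy group: G has order Q modulo P (2**11 % 23 == 1)


def prove(x: int, challenge_fn):
    """Prover knows x with y = G**x mod P; returns (commitment, challenge, response)."""
    r = secrets.randbelow(Q)   # one-time random nonce
    t = pow(G, r, P)           # commitment sent to the verifier
    c = challenge_fn(t)        # verifier replies with a random challenge
    s = (r + c * x) % Q        # response; r masks x
    return t, c, s


def verify(y: int, t: int, c: int, s: int) -> bool:
    return pow(G, s, P) == (t * pow(y, c, P)) % P


x = 7                  # secret
y = pow(G, x, P)       # public key
t, c, s = prove(x, lambda _t: secrets.randbelow(Q))
print(verify(y, t, c, s))  # True
```
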
### LLM-specific privacy

- **Training data leak**: verbatim memorization.
- **Membership inference**: was a given record in the training data?
- **Model inversion**: reconstructing inputs from the model.
- **Mitigation**: deduplication, DP-SGD, PII scrubbing, prompt safety.

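
Deduplication is the cheapest of these mitigations, since repeated sequences are more likely to be memorized. A minimal sketch of exact deduplication after light normalization follows; the normalization rules are an assumption, and near-duplicate detection (for example MinHash) is out of scope here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Call me at 555-0100.", "call me  at 555-0100.", "Unrelated text."]
print(dedup(docs))  # ['Call me at 555-0100.', 'Unrelated text.']
```
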
### Design patterns

1. **Privacy by Design** (Cavoukian).
2. **Data minimization**.
3. **Encryption at rest + in transit**.
4. **Access control + audit log**.
5. **Retention policy**.
6. **Right to erasure pipeline**.

## 💻 Patterns

### PII detection + redaction

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Build the engines once; loading the underlying NLP model is expensive.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()


def anonymize(text, language="en"):
    """Detect PII entities and replace each with its entity-type placeholder."""
    results = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=results).text


# Example
print(anonymize("John Smith, ssn 123-45-6789, lives at 123 Main St."))
# Illustrative output (actual detections depend on the NLP model and recognizers):
# <PERSON>, ssn <US_SSN>, lives at <LOCATION>.
```

### k-Anonymity

```python
def k_anonymize(df, quasi_identifiers, k=5):
    """Suppress rows whose quasi-identifier group has fewer than k members.

    df is a pandas DataFrame; quasi_identifiers is a list of column names.
    """
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[group_sizes >= k]


# Generalize age and zip first (e.g. age bands, zip prefixes) so that
# more groups reach size >= 5 and fewer rows are suppressed.
```

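
Generalizing age and ZIP code is what makes groups large enough in practice. A minimal pure-Python sketch follows, assuming integer ages and 5-digit ZIP strings; the band width and prefix length are arbitrary choices for illustration.

```python
from collections import Counter


def generalize(row):
    """Coarsen quasi-identifiers: 10-year age band, 3-digit ZIP prefix."""
    low = (row["age"] // 10) * 10
    return {**row, "age": f"{low}-{low + 9}", "zip": row["zip"][:3] + "**"}


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest quasi-identifier group, i.e. the k actually achieved."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())


rows = [
    {"age": 31, "zip": "02139"},
    {"age": 34, "zip": "02141"},
    {"age": 38, "zip": "02142"},
]
print(smallest_group(rows, ["age", "zip"]))                           # 1
print(smallest_group([generalize(r) for r in rows], ["age", "zip"]))  # 3
```

The trade-off is utility: wider bands raise k but blur the data, so the band width is usually tuned per dataset.
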
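### Differential Privacy (Laplace mechanism)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()


def laplace_mechanism(query_result, sensitivity, epsilon):
    """Add Laplace(0, sensitivity / epsilon) noise for ε-differential privacy."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return query_result + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


# Example: count of users with attribute X (toy data).
# A counting query has sensitivity 1: one person changes the count by at most 1.
df = pd.DataFrame({"has_x": [True, False, True, True]})
count_true = df["has_x"].sum()
count_dp = laplace_mechanism(count_true, sensitivity=1, epsilon=1.0)
```
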
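
Each noisy answer spends part of the privacy budget: under basic sequential composition, the ε values of successive queries on the same data add up. The sketch below tracks that total and refuses queries once it is spent. The class and method names are hypothetical, and tighter accounting methods exist.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def laplace_count(self, true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        scale = sensitivity / epsilon
        # The difference of two iid exponentials with mean `scale` is Laplace(0, scale).
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return true_count + noise


budget = PrivacyBudget(total_epsilon=1.0)
budget.laplace_count(120, epsilon=0.5)
budget.laplace_count(87, epsilon=0.5)
# A third query would raise RuntimeError: the total epsilon of 1.0 is spent.
```
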
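### DP-SGD (training)

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data so the snippet is self-contained; swap in your own.
model = torch.nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,  # Gaussian noise std relative to the clipping norm
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

# Train as usual; the wrapped optimizer clips per-sample gradients and adds noise.
for x, y in train_loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

delta = 1e-5
epsilon = privacy_engine.get_epsilon(delta=delta)
print(f"(ε={epsilon:.2f}, δ={delta})-DP")
```
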
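### Federated Learning (Flower)

```python
from collections import OrderedDict

import flwr as fl
import torch

# model, train, test and local_dataset are assumed to be defined elsewhere.


def set_parameters(model, parameters):
    """Load a list of NumPy arrays into the model's state dict."""
    keys = model.state_dict().keys()
    state = OrderedDict((k, torch.tensor(v)) for k, v in zip(keys, parameters))
    model.load_state_dict(state, strict=True)


class Client(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [val.cpu().numpy() for val in model.state_dict().values()]

    def fit(self, parameters, config):
        # Local training: raw data stays on the client.
        set_parameters(model, parameters)
        train(model, local_dataset)
        return self.get_parameters(config), len(local_dataset), {}

    def evaluate(self, parameters, config):
        set_parameters(model, parameters)
        loss, accuracy = test(model, local_dataset)
        return float(loss), len(local_dataset), {"accuracy": float(accuracy)}


# The entry point differs between Flower releases; recent ones expect .to_client().
fl.client.start_client(server_address="central:8080", client=Client().to_client())
```
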
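
On the server side, the client updates have to be combined. A common choice is federated averaging (FedAvg), which weights each client's parameters by its number of training examples. The sketch below uses plain Python lists rather than Flower's own strategy classes, and the function name is illustrative.

```python
def fed_avg(client_updates):
    """Weighted average of client parameter vectors.

    client_updates: list of (parameters, num_examples) pairs,
    where parameters is a list of floats of equal length.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(params[i] * n for params, n in client_updates) / total
        for i in range(dim)
    ]


updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
print(fed_avg(updates))  # [2.5, 3.5]
```

The client with 30 examples pulls the average three times as hard as the client with 10, which is why the second return value of `fit` is the local dataset size.
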
### Right to Erasure (GDPR)

```python
from datetime import datetime, timezone

# db, cache, analytics_db, audit_log and the helper functions below are
# placeholders for your own storage and ML infrastructure.


def erase_user(user_id):
    """Right to erasure (GDPR Art. 17)."""
    # 1. Primary store
    db.users.delete(user_id)

    # 2. Derived data (cache, analytics, ML training sets)
    cache.invalidate(f"user:{user_id}")
    analytics_db.delete_user_events(user_id)

    # 3. Backups (purged on a delay)
    schedule_backup_purge(user_id, after_days=30)

    # 4. ML models: try unlearning first, fall back to a scheduled retrain
    if user_in_training_data(user_id) and not machine_unlearning(user_id):
        schedule_retrain()

    # 5. Audit trail of the erasure itself
    audit_log.write({
        "action": "erase",
        "user_id": user_id,
        "date": datetime.now(timezone.utc).isoformat(),
    })
```

### Access control + audit

```python
from datetime import datetime, timezone

# db and audit_log are placeholders for your own data and logging layers.
PII_FIELDS = ["email", "phone"]


def access_pii(user_id, requester, reason):
    """Return PII only to data stewards, and log every access."""
    if not requester.has_role("data_steward"):
        raise PermissionError("data_steward role required")

    audit_log.write({
        "requester": requester.id,
        "subject": user_id,
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_accessed": PII_FIELDS,
    })

    return db.users.get(user_id, fields=PII_FIELDS)
```

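### LLM verbatim leak detection

```python
import random


def check_training_data_leak(model, training_corpus, n_samples=100):
    """Does the model reproduce training text verbatim when given its prefix?

    Assumes model.generate(prompt, max_tokens=...) returns a string;
    prefix and continuation lengths are measured in characters.
    """
    leaks = []
    sample_size = min(n_samples, len(training_corpus))
    for chunk in random.sample(training_corpus, sample_size):
        prefix, continuation = chunk[:100], chunk[100:300]
        if not continuation:
            continue  # too short to test; an empty string would always "match"
        completion = model.generate(prefix, max_tokens=200)
        if continuation in completion:
            leaks.append({"chunk": chunk[:50] + "...", "leaked": True})
    return leaks
```
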
### Membership inference attack (defense check)

```python
import numpy as np


def membership_inference_test(model, member_data, non_member_data, loss_fn):
    """Can a loss threshold tell training members from non-members?

    loss_fn(model, x, y) must return a scalar loss for one example.
    """
    member_loss = [loss_fn(model, x, y) for x, y in member_data]
    non_member_loss = [loss_fn(model, x, y) for x, y in non_member_data]

    # Simple threshold attack: members tend to have lower loss.
    threshold = np.median(member_loss + non_member_loss)

    correct = sum(v < threshold for v in member_loss) + sum(
        v > threshold for v in non_member_loss
    )
    accuracy = correct / (len(member_loss) + len(non_member_loss))

    # 0.5 is chance level for balanced sets; 0.55 is a heuristic alarm level.
    if accuracy > 0.55:
        return "WARN: membership inference vulnerability"
    return "OK"
```

### Data retention policy

```python
def retention_purge():
    """Daily cron job: delete records past their retention period."""
    # Periods combine GDPR storage limitation with business retention needs;
    # db.<table>.delete(where=...) is a placeholder query interface.
    db.events.delete(where="created_at < NOW() - INTERVAL '7 years'")
    db.user_logs.delete(where="created_at < NOW() - INTERVAL '90 days'")
    db.analytics_raw.delete(where="created_at < NOW() - INTERVAL '13 months'")
```

## Decision criteria

| Situation | Approach |
|---|---|
| EU users | GDPR compliance (DPO + DPIA + erasure) |
| US users | CCPA/CPRA (California) + sector law (HIPAA, FERPA) |
| Sensitive aggregate stats | Differential Privacy |
| Cross-org training | Federated Learning |
| Encrypted compute | Homomorphic / MPC |
| LLM training | DP-SGD + dedup + scrub |
| Identity-needed | Pseudonymize + access control |
| Public data | Anonymize (k-anonymity) |

**Default**: Privacy by Design + access control + retention policy + erasure pipeline.

## 🔗 Graph

- Parent: [[AI-Ethics]] · [[Privacy]]
- Variants: [[GDPR]] · [[CCPA]] · [[Differential-Privacy]] · [[Federated-Learning]] · [[Homomorphic-Encryption]]
- Applications: [[k-Anonymity]]
- Adjacent: [[Algorithmic-Fairness]] · [[Authenticity]] · [[AI-Sovereignty]] · [[Data-Flywheel-Effect]] · [[Anthropomorphism]]

## 🤖 LLM usage

**When to use**: product privacy reviews, GDPR audits, ML privacy, cross-border data transfers.

**When not to use**: specific legal advice (consult a lawyer); clinical medical questions (consult a HIPAA expert).

## ❌ Anti-patterns

- **False sense of security from anonymization** (linkage attacks remain possible): combine k-anonymity with l-diversity.
- **Long retention without business need**: a GDPR violation.
- **No erasure pipeline**: the right to erasure cannot be fulfilled.
- **PII in logs**: an invisible leak.
- **Raw data leaking from federated learning** (via model inversion): DP-SGD is also needed.
- **Invalid consent** (forced or vague).

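
The "PII in logs" case can be reduced with a redaction filter on the logger. The sketch below uses only the standard `logging` module; the two regular expressions are simplified assumptions and will miss many PII formats, so treat it as a safety net rather than a guarantee.

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class RedactPII(logging.Filter):
    """Mask e-mail addresses and US SSNs before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()          # format msg % args first
        message = EMAIL.sub("<EMAIL>", message)
        message = SSN.sub("<SSN>", message)
        record.msg, record.args = message, None
        return True                            # keep the (now redacted) record


logger = logging.getLogger("app")
logger.addFilter(RedactPII())
logger.warning("login failed for %s (ssn %s)", "bob@example.com", "123-45-6789")
# emits: login failed for <EMAIL> (ssn <SSN>)
```
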
## 🧪 Verification / duplicates

- Verified (GDPR text, Apple DP, Dwork DP paper, Sweeney k-anonymity).
- Trust level: A.
- Related: [[Algorithmic-Fairness]] · [[Authenticity]] · [[Bias-Correction-Algorithm]] · [[Anthropomorphism]] · [[Atmospheric-Intelligence]] (privacy challenges).

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GDPR + DP + FL + Presidio / k-anon / Opacus / Flower / erasure code |