id: wiki-2026-0508-data-ethics-privacy
title: Data Ethics and Privacy
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: data ethics, privacy, GDPR, CCPA, differential privacy, federated learning, k-anonymity, PII
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: ethics, privacy, gdpr, ccpa, differential-privacy, federated-learning, ai-governance, pii
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: ethics / law / cryptography
  applicable_to: Data Governance, ML Privacy, Compliance

Data Ethics and Privacy

One-line summary

The question is not "can we?" but "should we?". Key regulations: GDPR (EU, 2018), CCPA (California), and the EU AI Act (2024). Modern techniques: differential privacy, federated learning, homomorphic encryption, and zero-knowledge proofs. In the LLM era, training data and memorization pose new challenges.

Core principles

Fair Information Practice Principles (FIPPs)

  1. Notice / Transparency.
  2. Consent / Choice.
  3. Access / Participation.
  4. Integrity / Security.
  5. Enforcement / Redress.

GDPR (EU, 2018)

  • Lawful basis: six options (consent, contract, legal obligation, vital interests, public task, legitimate interests).
  • Data minimization.
  • Purpose limitation.
  • Rights of access, rectification, erasure, portability, and objection.
  • DPIA (Data Protection Impact Assessment).
  • DPO (Data Protection Officer).
  • Fines: up to EUR 20 million or 4% of global annual turnover, whichever is higher.

CCPA / CPRA (California)

  • Right to opt out of the sale of personal information.
  • Right to limit the use of sensitive personal information.

EU AI Act (2024)

  • Risk tiers (unacceptable / high / limited / minimal).
  • High-risk systems: audits and human oversight.

Privacy techniques

Anonymization (irreversible)

  • Remove and generalize PII.
  • k-anonymity (Sweeney 2002).
  • l-diversity, t-closeness.
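
k-anonymity alone does not prevent attribute disclosure when every record in a group shares the same sensitive value. The following is a minimal sketch of a distinct l-diversity check, assuming a pandas DataFrame. The helper name `is_l_diverse` and the column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

def is_l_diverse(df, quasi_identifiers, sensitive, l=2):
    """Return True if every quasi-identifier group holds >= l distinct sensitive values."""
    distinct = df.groupby(quasi_identifiers)[sensitive].nunique()
    return bool((distinct >= l).all())

# Hypothetical toy data: both groups are 2-anonymous,
# but the second group contains only one diagnosis.
records = pd.DataFrame({
    "age_band": ["30-39", "30-39", "40-49", "40-49"],
    "zip3": ["941", "941", "100", "100"],
    "diagnosis": ["flu", "asthma", "flu", "flu"],
})

print(is_l_diverse(records, ["age_band", "zip3"], "diagnosis", l=2))
```

Here the check fails because anyone known to be in the second group can be inferred to have the same diagnosis, even though the group satisfies k = 2.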

Pseudonymization (reversible with key)

  • Hash or tokenize PII.
  • Remains personal data under GDPR, though it is recognized as a safeguard that can lighten some requirements.
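
One common way to tokenize identifiers is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library. The function name `pseudonymize` and the key value are assumptions for illustration, and a real key would be stored separately from the data.

```python
import hashlib
import hmac

def pseudonymize(value: str, key: bytes) -> str:
    """Map an identifier to a stable token using HMAC-SHA256 under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key for illustration only; keep real keys in a secrets manager.
SECRET_KEY = b"example-key-do-not-use"

token_a = pseudonymize("alice@example.com", SECRET_KEY)
token_b = pseudonymize("alice@example.com", SECRET_KEY)
other = pseudonymize("alice@example.com", b"a-different-key")

print(token_a == token_b)  # same input and key give the same token
print(token_a == other)    # a different key gives an unrelated token
```

The same input always maps to the same token, so joins across tables still work. Whoever holds the key can re-link tokens to people by recomputing them, which is why the key needs its own access control. A plain unkeyed hash of a guessable value such as an email address is much easier to reverse by brute force.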

Differential Privacy (Dwork 2006)

  • Noise injection.
  • Mathematical guarantee, parameterized by ε.
  • Deployed by Apple, Google, and the US Census Bureau.

Federated Learning

  • Raw data never leaves the device or site.
  • Only model updates are shared.
  • Used in mobile and healthcare settings.
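
On the server side, the shared model updates are typically combined by a weighted average. The following is a minimal sketch of that aggregation step in the style of federated averaging, assuming each client sends a list of NumPy arrays (one per layer) plus its local sample count. The function name and the toy values are hypothetical.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average per-layer weights across clients, weighted by local dataset size."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Two hypothetical clients, each holding a single weight vector.
client_a = [np.array([1.0, 2.0])]
client_b = [np.array([3.0, 4.0])]

global_weights = federated_average([client_a, client_b], client_sizes=[100, 300])
print(global_weights[0])
```

The client with 300 samples contributes three quarters of the average, so the result sits closer to its weights. The server only ever sees weight arrays, never the underlying records.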

Homomorphic Encryption

  • Compute directly on encrypted data.
  • Still computationally expensive.
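
To make "compute on encrypted data" concrete, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. Paillier is chosen here only as an illustration and is not named in the notes above. The primes are deliberately tiny and insecure, and all function names are hypothetical.

```python
import math
import secrets

def keygen(p=1009, q=1013):
    """Build a toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid shortcut because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n

def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)

n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
print(decrypt(n, private_key, c_sum))
```

The party doing the addition never sees 17 or 25. Only the private-key holder can decrypt the total.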

Secure Multi-party Computation (MPC)

  • Multiple parties jointly compute a result without sharing their inputs.
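
A simple building block for MPC is additive secret sharing. The sketch below shows three parties computing a sum without any of them seeing another party's input. The scenario, modulus, and function names are assumptions for illustration, and the sketch ignores networking and malicious participants.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; all share arithmetic is done mod this value

def share(secret, n_parties):
    """Split secret into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recombine shares into the value they encode."""
    return sum(shares) % MODULUS

# Three hypothetical hospitals want their combined patient count.
inputs = [120, 75, 310]
all_shares = [share(x, 3) for x in inputs]

# Party j receives the j-th share of every input and publishes only the sum of those.
partial_sums = [sum(s[j] for s in all_shares) % MODULUS for j in range(3)]

total = reconstruct(partial_sums)
print(total)
```

Each individual share is a uniformly random number, so a single party learns nothing about the other inputs. Only the published partial sums, combined, reveal the total.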

Zero-Knowledge Proof

  • Prove a statement without revealing the underlying data.
  • ZK-ML is an emerging area.
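
As a small illustration of proving without revealing, the sketch below runs one round of the interactive Schnorr identification protocol: the prover convinces the verifier that it knows the secret exponent behind a public value. The group parameters are toy-sized and insecure, and the function names are hypothetical. The zero-knowledge property of this protocol holds for an honest verifier.

```python
import secrets

P, Q, G = 23, 11, 2  # toy group: G has order Q modulo P (far too small for real use)

def commit():
    """Prover picks a random nonce r and sends the commitment G^r mod P."""
    r = secrets.randbelow(Q)
    return r, pow(G, r, P)

def respond(r, challenge, secret_x):
    """Prover answers the challenge using the nonce and the secret."""
    return (r + challenge * secret_x) % Q

def verify(public_y, commitment, challenge, response):
    """Verifier checks G^response == commitment * public_y^challenge (mod P)."""
    return pow(G, response, P) == (commitment * pow(public_y, challenge, P)) % P

secret_x = 7                    # known only to the prover
public_y = pow(G, secret_x, P)  # published

r, t = commit()
c = secrets.randbelow(Q)        # verifier's random challenge
s = respond(r, c, secret_x)

print(verify(public_y, t, c, s))
```

The verifier sees only the commitment, the challenge, and the response. Because the nonce is fresh and random each round, the response does not expose the secret exponent.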

LLM-specific privacy

  • Training data leak: verbatim memorization.
  • Membership inference: was record X part of the training data?
  • Model inversion: reconstruct inputs from the model.
  • Mitigation: deduplication, DP-SGD, scrubbing, prompt safety.
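
Of these mitigations, deduplication is the simplest to sketch, and text that appears many times in a corpus tends to be memorized more readily. The following minimal example removes exact duplicates after light normalization. It is an assumption-laden sketch: it does not catch near-duplicates, and the function name and sample corpus are hypothetical.

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates after case and whitespace normalization, keeping first seen."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "Call me at 555-0100.",
    "call me  at 555-0100.",  # same text, different case and spacing
    "Unrelated text.",
]

print(deduplicate(corpus))
```

Storing digests rather than full text keeps memory use bounded for large corpora.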

Design patterns

  1. Privacy by Design (Cavoukian).
  2. Data minimization.
  3. Encryption at rest + in transit.
  4. Access control + audit log.
  5. Retention policy.
  6. Right to erasure pipeline.

💻 Patterns

PII detection + redaction

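from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Both engines are costly to build (the analyzer loads an NLP model), so create them once.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize(text, language='en'):
    """Detect PII entities in text and replace each with its entity-type placeholder."""
    results = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Example
print(anonymize('John Smith, ssn 123-45-6789, lives at 123 Main St.'))
# Illustrative output: <PERSON>, ssn <US_SSN>, lives at <LOCATION>.
# Actual placeholders depend on the installed NLP model and recognizers;
# Presidio may, for instance, reject well-known sample SSNs.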

k-Anonymity

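def k_anonymize(df, quasi_identifiers, k=5):
    """Keep only rows whose quasi-identifier group has at least k members (suppression)."""
    # Size of each distinct combination of quasi-identifier values
    grouped = df.groupby(quasi_identifiers).size()
    valid_groups = grouped[grouped >= k].index
    # Boolean mask: True where the row's combination belongs to a large-enough group
    return df[df.set_index(quasi_identifiers).index.isin(valid_groups)]

# Generalize age and ZIP code first (e.g., into bands) so that more groups reach size ≥ 5.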

Differential Privacy (Laplace mechanism)

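import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon):
    """Return query_result plus Laplace noise of scale sensitivity / epsilon (ε-DP)."""
    noise = np.random.laplace(0, sensitivity / epsilon)
    return query_result + noise

# Example: count of users with attribute X.
# df is assumed to be a pandas DataFrame with a boolean has_x column.
# A counting query has sensitivity 1: adding or removing one user changes it by at most 1.
count_true = df['has_x'].sum()
count_dp = laplace_mechanism(count_true, sensitivity=1, epsilon=1.0)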
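
Each noisy release consumes part of the privacy budget. Under basic sequential composition, answering several queries on the same data costs the sum of their ε values. The sketch below tracks that running total. The class name `PrivacyBudget` is hypothetical and not part of any library.

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Reserve epsilon for one query; refuse if the total would be exceeded."""
        # Small tolerance so floating-point rounding does not reject an exact fit
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total_epsilon - self.spent

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first query
budget.spend(0.4)  # second query
# A third query at epsilon=0.4 would raise RuntimeError: only about 0.2 remains.
```

A caller would invoke `spend` before each Laplace release and skip the query if it raises. Tighter accounting methods exist, so this additive bound is conservative.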

DP-SGD (training)

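import torch.nn.functional as F
from opacus import PrivacyEngine

# model, optimizer, and train_loader are assumed to be an existing
# torch.nn.Module, torch.optim optimizer, and DataLoader.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,  # Gaussian noise std, relative to max_grad_norm
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

# One epoch of training with per-sample clipping and noise (the DP guarantee)
for x, y in train_loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

# Privacy spent so far, for the chosen δ
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f'(ε={epsilon:.2f}, δ={1e-5})-DP')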

Federated Learning (Flower)

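import flwr as fl

# model, local_dataset, set_parameters, train, and test are assumed to be defined elsewhere.
class Client(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [val.cpu().numpy() for val in model.state_dict().values()]

    def fit(self, parameters, config):
        # Local training: raw data stays on the client
        set_parameters(model, parameters)
        train(model, local_dataset)
        return self.get_parameters(config), len(local_dataset), {}

    def evaluate(self, parameters, config):
        # Load the received global parameters before evaluating
        set_parameters(model, parameters)
        loss, accuracy = test(model, local_dataset)
        return float(loss), len(local_dataset), {'accuracy': float(accuracy)}

# Recent Flower releases expect a Client object here, hence to_client()
fl.client.start_client(server_address='central:8080', client=Client().to_client())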

Right to Erasure (GDPR)

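from datetime import datetime, timezone

# db, cache, analytics_db, audit_log, and the helpers below are assumed application services.
def erase_user(user_id):
    """Erase a user's data across all systems (GDPR Art. 17)."""
    # 1. Primary store
    db.users.delete(user_id)

    # 2. Derived stores (cache, analytics)
    cache.invalidate(f'user:{user_id}')
    analytics_db.delete_user_events(user_id)

    # 3. Backups (purged on a delay)
    schedule_backup_purge(user_id, after_days=30)

    # 4. ML models: try an unlearning request, fall back to a scheduled retrain
    if user_in_training_data(user_id):
        if not machine_unlearning(user_id):
            schedule_retrain()

    # 5. Audit trail of the erasure itself
    audit_log.write({'action': 'erase', 'user_id': user_id,
                     'date': datetime.now(timezone.utc)})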

Access control + audit

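from datetime import datetime, timezone

def access_pii(user_id, requester, reason):
    """Return selected PII fields after a role check and an audit-log entry."""
    fields = ['email', 'phone']
    if not requester.has_role('data_steward'):
        raise PermissionError('data_steward role required')

    # Log before returning data so that every access leaves a trace
    audit_log.write({
        'requester': requester.id,
        'subject': user_id,
        'reason': reason,
        'timestamp': datetime.now(timezone.utc),
        'data_accessed': fields,
    })

    return db.users.get(user_id, fields=fields)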

LLM verbatim leak detection

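import random

def check_training_data_leak(model, training_corpus, n_samples=100):
    """Check whether the model reproduces training text verbatim when given its prefix."""
    # Only chunks longer than the 100-character prefix have a continuation to test;
    # an empty continuation would otherwise match every completion.
    candidates = [c for c in training_corpus if len(c) > 100]
    leaks = []
    for chunk in random.sample(candidates, min(n_samples, len(candidates))):
        prefix = chunk[:100]
        # model.generate is assumed to return the completion as a string
        completion = model.generate(prefix, max_tokens=200)
        if chunk[100:300] in completion:
            leaks.append({'chunk': chunk[:50] + '...', 'leaked': True})
    return leaks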

Membership inference attack (defense check)

import numpy as np

def membership_inference_test(model, member_data, non_member_data):
    """Check whether per-example loss separates training members from non-members."""
    # loss(model, x, y) is assumed to return a scalar per-example loss
    member_loss = [loss(model, x, y) for x, y in member_data]
    non_member_loss = [loss(model, x, y) for x, y in non_member_data]

    # Simple threshold attack: predict "member" when the loss is below the pooled median
    threshold = np.median(member_loss + non_member_loss)

    correct = sum(1 for l in member_loss if l < threshold) + \
              sum(1 for l in non_member_loss if l > threshold)
    accuracy = correct / (len(member_loss) + len(non_member_loss))

    # Roughly 0.5 is chance level for balanced sets; 0.55 is a heuristic alarm threshold
    if accuracy > 0.55:
        return 'WARN: membership inference vulnerability'
    return 'OK'

Data retention policy

def retention_purge():
    """Run daily from cron."""
    # Retention windows driven by GDPR and business needs
    db.events.delete(where="created_at < NOW() - INTERVAL '7 years'")
    db.user_logs.delete(where="created_at < NOW() - INTERVAL '90 days'")
    db.analytics_raw.delete(where="created_at < NOW() - INTERVAL '13 months'")

Decision criteria

| Situation | Approach |
|---|---|
| EU users | GDPR compliance (DPO + DPIA + erasure) |
| US users | CCPA + sector law (HIPAA, FERPA) |
| Sensitive aggregate stats | Differential Privacy |
| Cross-org training | Federated Learning |
| Encrypted compute | Homomorphic / MPC |
| LLM training | DP-SGD + dedup + scrub |
| Identity-needed | Pseudonymize + access control |
| Public data | Anonymize (k-anonymity) |

Default: Privacy by Design + access control + retention policy + erasure pipeline.

🔗 Graph

🤖 LLM usage

When to use: product privacy reviews, GDPR audits, ML privacy, cross-border data transfers. When not to use: specific legal advice (consult a lawyer), clinical medical questions (consult a HIPAA expert).

Anti-patterns

  • False sense of security from anonymization (linkage attacks remain possible): combine k-anonymity with l-diversity.
  • Long retention without a business need: a GDPR violation.
  • No erasure pipeline: the right to erasure cannot be fulfilled.
  • PII in logs: an invisible leak.
  • Raw-data leakage in federated learning (via model inversion): DP-SGD is also needed.
  • Invalid consent (forced or vague).

🧪 Verification / Duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GDPR + DP + FL + Presidio / k-anon / Opacus / Flower / erasure code |