[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,89 +2,173 @@
 id: wiki-2026-0508-privacy-preserving-ai
 title: Privacy Preserving AI
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [SEC-PRIV-TECH-001]
+aliases: [Privacy-Preserving Machine Learning, PPML, Confidential AI]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [ai, security, privacy-preserving-ai, differential-privacy, Homomorphic-Encryption, Federated-Learning, smpc]
+confidence_score: 0.9
+verification_status: applied
+tags: [privacy, security, differential-privacy, federated-learning, cryptography]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: Opacus / TF-Federated / TenSEAL / PySyft
 ---

-# Privacy-Preserving AI
+# Privacy Preserving AI

-## 📌 One-Line Insight (The Karpathy Summary)
-> "Build cryptographic intelligence that learns from data without seeing it and shares wisdom without sharing information." A body of cryptographic and statistical techniques designed so that AI models can be trained and served while the data stays confidential.
+## One-line summary
+> **"Train and infer on data without exposing it. Four pillars: DP, FL, HE, MPC."** Driven by GDPR (2018) and healthcare/finance regulation, then pushed into the mainstream by the 2024 EU AI Act and US executive orders. As of 2026, confidential computing (TEEs: Intel TDX, NVIDIA H100 CC, Apple PCC) is the default for production deployments.

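-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Confidential Computation and Decoupled Learning": instead of accessing the raw data, either compute on it while it stays encrypted (homomorphic) or exchange only model weights while the data never leaves its local boundary (federated), cutting off information exposure at the source.
- **Four core techniques:**
-    - **Differential Privacy:** adds mathematical noise to the data to prevent inference about individual samples.
-    - **Homomorphic Encryption:** performs computation directly on encrypted data.
-    - **Federated Learning:** trains on distributed devices and aggregates only the results.
-    - **SMPC (Secure Multi-party Computation):** multiple participants secret-share their data and compute jointly.
- **Significance:** addresses security, the biggest barrier to AI adoption in industries that handle sensitive data such as healthcare and finance, and is a core technology for "trustworthy AI" that uses only the value of the data.
+## Core concepts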

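-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** beyond the early limitation that stronger security made computation hundreds of times slower, dedicated accelerators (TEE) and optimized cryptographic protocols have recently brought performance to a practical level.
- **Policy change:** the Antigravity project aims for a knowledge-indexing architecture that applies differential-privacy principles, so that per-user sensitive data does not mix when agents share knowledge.
+### Four pillars
+1. **Differential Privacy (DP)**: adds calibrated noise to bound how much any single record can influence the output. The privacy budget is epsilon (ε); smaller ε means stronger privacy.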
+2. **Federated Learning (FL)**: model goes to data, not data to model.
+3. **Homomorphic Encryption (HE)**: compute on ciphertext directly.
+4. **Secure Multi-Party Computation (MPC)**: parties jointly compute without revealing inputs.
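
To make the role of ε in pillar 1 concrete, here is a minimal sketch of the Laplace mechanism for a counting query, using only the standard library. The names `laplace_noise` and `laplace_count`, the toy data, and the sensitivity of 1 are illustrative assumptions, not part of any library.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_count(records, predicate, epsilon, rng=None):
    """Return a noisy count satisfying epsilon-DP.

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 41, 29, 62, 57, 33, 48]   # toy data (hypothetical)
rng = random.Random(0)
noisy = laplace_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"true=4 noisy={noisy:.2f}")
```

Halving ε doubles the noise scale, which is the privacy/utility trade-off in its simplest form.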

-## 🔗 Knowledge Links (Graph)
- [[Personal-Information-Security|Personal-Information-Security]], [[Trustworthy-AI|Trustworthy-AI]], [[Local-Brain-Management|Local-Brain-Management]], Cloud-Computing-Foundations
- **Raw Source:** 10_Wiki/Topics/AI/Privacy-Preserving-AI.md
+### Production additions (2024-2026)
+- **TEE / Confidential computing**: Intel TDX, AMD SEV-SNP, NVIDIA H100 confidential GPU, Apple Private Cloud Compute.
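+- **Synthetic data**: GAN- or diffusion-generated; re-identification risk is low only when the generator itself is trained with DP, because generative models can memorize training records.
+- **Machine unlearning**: removes a record's influence from a trained model, e.g. to honor GDPR right-to-be-forgotten requests.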

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Trade-offs
+| Method | Privacy | Utility impact | Overhead | Deployed? |
+|---|---|---|---|---|
+| DP-SGD (ε≈1) | High | -2 to -5% accuracy | 2-5x compute | Yes (Apple, Google) |
+| Federated | Medium | ~none | High communication | Yes (Gboard, healthcare) |
+| HE (CKKS) | Very high | ~none (CKKS arithmetic is approximate, not exact) | 1000-10000x compute | Niche |
+| MPC | Very high | None (exact) | 100-1000x compute | Niche |
+| TEE | High (hardware trust) | ~none | ~1.1x compute | Rapidly growing |

-**When to use this knowledge:**
- *(TODO)*
+## 💻 Patterns

-**When not to use it:**
- *(TODO)*
+### DP-SGD with Opacus (PyTorch)
+```python
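+from opacus import PrivacyEngine
+import torch.nn as nn
+import torch.optim as optim

-## 🧪 Validation
+# build_model() and build_loader() are placeholders for your own nn.Module and DataLoader.
+model = build_model()
+optimizer = optim.SGD(model.parameters(), lr=0.1)
+loader = build_loader()
+criterion = nn.CrossEntropyLoss()  # assumes a classification task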

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+privacy_engine = PrivacyEngine()
+model, optimizer, loader = privacy_engine.make_private_with_epsilon(
+    module=model, optimizer=optimizer, data_loader=loader,
+    target_epsilon=1.0, target_delta=1e-5, epochs=10,
+    max_grad_norm=1.0,
+)

-## 🧬 Duplicate Check
-
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: fill in old template / missing fields.
-
-## 🕓 Changelog
-
-| Date | Change | Handling | Trust level |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+for epoch in range(10):
+    for x, y in loader:
+        optimizer.zero_grad()
+        loss = criterion(model(x), y)
+        loss.backward()
+        optimizer.step()
+print(f"ε={privacy_engine.get_epsilon(delta=1e-5):.2f}")
 ```
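
Conceptually, each DP-SGD step clips every sample's gradient to a fixed norm and then adds Gaussian noise before averaging. A dependency-free sketch of one such aggregation step follows, assuming gradients are plain lists of floats; `clip`, `dp_sgd_aggregate`, and `noise_multiplier` are illustrative names here, not Opacus API.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_aggregate(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm   # noise std applied to the summed gradient
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]

grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]   # toy per-sample gradients
noisy_grad = dp_sgd_aggregate(grads, max_norm=1.0, noise_multiplier=1.1,
                              rng=random.Random(0))
print(noisy_grad)
```

Clipping bounds any one sample's contribution, which is what lets a fixed amount of noise translate into a stated (ε, δ) guarantee.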

-## 🤔 Decision Criteria
+### Federated averaging (FedAvg)
+```python
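+def fed_avg(global_model, client_updates, client_weights):
+    """Apply the weighted average of client deltas to the global model.
+
+    client_updates: list of dicts mapping state_dict keys to delta tensors.
+    client_weights: per-client weights, typically local sample counts.
+    """
+    total = sum(client_weights)
+    new_state = {}
+    for k, v in global_model.state_dict().items():
+        avg_delta = sum(
+            (w / total) * upd[k] for upd, w in zip(client_updates, client_weights)
+        )
+        new_state[k] = v + avg_delta  # deltas are added to, not substituted for, the global weights
+    global_model.load_state_dict(new_state)
+    return global_model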

-**When to choose option A:**
- *(TODO)*
+# Each round:
+# 1. broadcast global model
+# 2. clients train locally (with DP optionally)
+# 3. clients send model deltas (encrypted)
+# 4. server aggregates via secure aggregation
+```

-**When to choose option B:**
- *(TODO)*
+### Homomorphic encryption inference (TenSEAL CKKS)
+```python
+import tenseal as ts
+ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
+                 coeff_mod_bit_sizes=[60, 40, 40, 60])
+ctx.global_scale = 2**40
+ctx.generate_galois_keys()

-**Default:**
-> *(TODO)*
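+x = ts.ckks_vector(ctx, [0.1, 0.5, -0.3, 0.7])  # encrypted input
+W = [[0.2, -0.1, 0.4, 0.05]]  # plaintext weights, one output neuron
+b = [0.1]
+# Encrypted inference: y = W·x + b, computed without decrypting x
+y_enc = x.dot(W[0]) + b
+y_plain = y_enc.decrypt()  # approximately [-0.015]; CKKS arithmetic is approximate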
+```

-## ❌ Anti-Patterns
+### Secure aggregation (cross-device FL)
+```python
+# Sketch of the Bonawitz et al. protocol:
+# 1. Every pair of the N clients agrees on a shared key via Diffie-Hellman and expands it into a mask m_ij = m_ji.
+# 2. Client i sends update_i + sum_{j>i} m_ij - sum_{j<i} m_ij.
+# 3. The server sums all masked updates; each mask is added once and subtracted once, so only the aggregate is revealed.
+# Dropouts are tolerated by Shamir secret sharing of the mask seeds.
+```
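
The cancellation in step 3 can be checked with a toy simulation. This is a sketch under simplifying assumptions: a single seeded PRNG stands in for the Diffie-Hellman key agreement, updates are already integer-encoded, and no client drops out. The helper names `pairwise_masks` and `mask_update` are hypothetical.

```python
import random

MOD = 2**32  # all arithmetic is done modulo a fixed integer

def pairwise_masks(n_clients, dim, seed=0):
    """Return one random mask vector per client pair (i, j) with i < j.

    A real protocol derives each mask from a Diffie-Hellman shared key;
    here one seeded PRNG stands in for that key agreement.
    """
    rng = random.Random(seed)
    return {(i, j): [rng.randrange(MOD) for _ in range(dim)]
            for i in range(n_clients) for j in range(i + 1, n_clients)}

def mask_update(i, update, masks, n_clients):
    """Client i adds masks shared with higher ids and subtracts the rest."""
    out = list(update)
    for j in range(n_clients):
        if j == i:
            continue
        m = masks[(min(i, j), max(i, j))]
        sign = 1 if i < j else -1
        out = [(o + sign * v) % MOD for o, v in zip(out, m)]
    return out

updates = [[1, 2], [10, 20], [100, 200]]   # toy integer-encoded updates
masks = pairwise_masks(len(updates), dim=2)
masked = [mask_update(i, u, masks, len(updates)) for i, u in enumerate(updates)]
aggregate = [sum(col) % MOD for col in zip(*masked)]
print(aggregate)  # [111, 222]
```

Each individual entry of `masked` looks random to the server, yet the column sums equal the true aggregate because every pairwise mask appears once with each sign.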

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+### Confidential GPU inference (NVIDIA H100 CC)
+```bash
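+# Set the GPU ready state so it accepts workloads (CC mode itself must already be enabled on the GPU)
+nvidia-smi conf-compute -srs 1
+# Verify attestation (flags vary by driver version; confirm with `nvidia-smi conf-compute -h`)
+nvidia-smi conf-compute -gar
+# Result: CPU-GPU transfers are encrypted and the application runs attested code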
+```
+
+### Machine unlearning (SISA)
+```python
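+# SISA = Sharded, Isolated, Sliced, Aggregated:
+# 1. Shard the data into K disjoint parts and train one model per shard, in isolation.
+#    (Each shard is further cut into slices with a checkpoint after each, so retraining can resume mid-shard.)
+# 2. Aggregate the K models (vote or average) at inference time.
+# 3. To unlearn user u, retrain only the shard that contains u.
+# Cost: roughly 1/K of a full retrain per deletion request.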
+```
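
A toy version of the sharding and unlearning steps is sketched below. It assumes integer user ids, uses a per-shard mean as a stand-in for a real model, and omits slicing and checkpoints; the class `ShardedEnsemble` and its methods are hypothetical.

```python
class ShardedEnsemble:
    """Toy SISA-style ensemble: one mean-predictor per shard."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]  # user_id -> value
        self.models = [None] * n_shards
        self.retrain_count = 0

    def _shard_of(self, user_id):
        return user_id % len(self.shards)   # deterministic shard assignment

    def _train(self, k):
        values = list(self.shards[k].values())
        self.models[k] = sum(values) / len(values) if values else None
        self.retrain_count += 1

    def fit(self, data):
        for user_id, value in data.items():
            self.shards[self._shard_of(user_id)][user_id] = value
        for k in range(len(self.shards)):
            self._train(k)

    def predict(self):
        """Aggregate by averaging the shard models that have data."""
        trained = [m for m in self.models if m is not None]
        return sum(trained) / len(trained)

    def unlearn(self, user_id):
        """Delete one user and retrain only the shard that held them."""
        k = self._shard_of(user_id)
        del self.shards[k][user_id]
        self._train(k)

ens = ShardedEnsemble(n_shards=4)
ens.fit({uid: float(uid) for uid in range(12)})   # toy data: value == id
before = ens.retrain_count                        # 4 initial shard trainings
ens.unlearn(5)
print(ens.retrain_count - before)  # 1
```

Only one of the four shard models is retrained, and the deleted user's value no longer contributes to any model in the ensemble.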
+
+## Decision criteria
+| Situation | Approach |
+|---|---|
+| Single org, sensitive labels | DP-SGD |
+| Many phones / hospitals | Federated + secure agg + DP |
+| Cloud inference, untrusted server | TEE (H100 CC) or HE |
+| Two parties, joint model | MPC (CrypTen, MP-SPDZ) |
+| GDPR right-to-be-forgotten | SISA / approximate unlearning |
+| Need to share data externally | DP synthetic data |
+
+**Default**: TEE (confidential computing) for inference; DP-SGD + federated learning for training across orgs.
+
+## 🔗 Graph
+- Parents: [[Privacy]] · [[Cryptography]] · [[Machine-Learning]]
+- Variants: [[Differential-Privacy]] · [[Federated-Learning]] · [[Homomorphic-Encryption]] · [[Secure-Multi-Party-Computation]]
+- Applications: [[Healthcare-AI]] · [[FinTech-AI]] · [[On-Device-ML]]
+- Adjacent: [[Trusted-Execution-Environment]] · [[Machine-Unlearning]] · [[Synthetic-Data]]
+
+## 🤖 LLM usage
+**When to use**: regulated data (HIPAA, GDPR, PCI), cross-org training, on-device personalization, untrusted-cloud inference.
+**When not to use**: public data with no privacy requirement; the overhead is not worth it.
+
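+## ❌ Anti-patterns
+- **Big epsilon (ε>10)**: effectively no privacy, since the guarantee only bounds output probabilities to within a factor of e^ε.
+- **Federated without DP or secure aggregation**: raw gradients can leak training data.
+- **HE for entire training**: a 1000x or larger slowdown; in practice HE is feasible only for inference on small models.
+- **Anonymization theater**: removing names is not privacy; re-identification from the remaining attributes is often trivial.
+- **"Trust me bro" confidential computing**: deploying a TEE without verifying remote attestation.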
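
The first anti-pattern follows from the definition of ε-DP: the probability of any output may differ by at most a factor of e^ε between two datasets that differ in one record. The short calculation below is offered as rough intuition rather than a formal analysis; `max_likelihood_ratio` is an illustrative helper name.

```python
import math

def max_likelihood_ratio(epsilon):
    """Upper bound e^epsilon on how much one record can shift any output probability."""
    return math.exp(epsilon)

for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: ratio <= {max_likelihood_ratio(eps):,.2f}")
```

At ε = 10 the bound exceeds 22,000, so it no longer meaningfully limits what an observer can infer about a single record.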
+
+## 🧪 Validation / duplicates
+- Verified against Apple PCC (2024), Google FL papers, NIST DP guidance, and NVIDIA H100 CC docs (2024).
+- Trust level: A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: 4 pillars + TEE / unlearning 2026 update |