---
id: wiki-2026-0508-privacy-preserving-ai
title: Privacy Preserving AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Privacy-Preserving Machine Learning, PPML, Confidential AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [privacy, security, differential-privacy, federated-learning, cryptography]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Opacus / TF-Federated / TenSEAL / PySyft
---

# Privacy Preserving AI

## One-line summary

> **"Train and infer on data without exposing it. 4 pillars: DP, FL, HE, MPC."** Driven by GDPR (2018) and healthcare/finance regulation, and brought into the mainstream by the 2024 EU AI Act and US executive orders. As of 2026, confidential computing (TEE: Intel TDX, NVIDIA H100 CC, Apple PCC) is the default for production deployment.

## Core

### 4 pillars

1. **Differential Privacy (DP)**: add calibrated noise to bound information leakage. The privacy budget is epsilon (ε); smaller ε means stronger privacy.
2. **Federated Learning (FL)**: the model goes to the data, not the data to the model.
3. **Homomorphic Encryption (HE)**: compute directly on ciphertext.
4. **Secure Multi-Party Computation (MPC)**: parties jointly compute a result without revealing their inputs.

### Production additions (2024-2026)

- **TEE / Confidential computing**: Intel TDX, AMD SEV-SNP, NVIDIA H100 confidential GPU, Apple Private Cloud Compute.
- **Synthetic data**: GAN/diffusion-generated; re-identification risk is near zero only if done right (for example, when the generator itself is trained with DP).
- **Machine unlearning**: removing a user's influence from a trained model, for GDPR right-to-be-forgotten compliance.

### Trade-offs

| Method | Privacy | Utility | Compute | Deployed? |
|---|---|---|---|---|
| DP-SGD (ε≈1) | High | -2 to -5% accuracy | 2-5x | Yes (Apple, Google) |
| Federated | Medium | ~same | High communication | Yes (Gboard, healthcare) |
| HE (CKKS) | Very high | ~same (CKKS arithmetic is approximate) | 1000-10000x | Niche |
| MPC | Very high | exact | 100-1000x | Niche |
| TEE | High (HW trust) | ~same | ~1.1x | Rapidly growing |

## 💻 Patterns

### DP-SGD with Opacus (PyTorch)

```python
import torch.nn as nn
import torch.optim as optim
from opacus import PrivacyEngine

# build_model() and build_loader() are placeholders for your own code.
model = build_model()
optimizer = optim.SGD(model.parameters(), lr=0.1)
loader = build_loader()
criterion = nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=1.0,
    target_delta=1e-5,
    epochs=10,
    max_grad_norm=1.0,  # per-sample gradient clipping bound
)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # Privacy budget spent so far
    print(f"ε={privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

### Federated averaging (FedAvg)

```python
def fed_avg(global_model, client_updates, client_weights):
    """Apply the weighted average of client deltas to the global model."""
    total = sum(client_weights)
    new_state = {}
    for k, v in global_model.state_dict().items():
        avg_delta = sum(
            (w / total) * upd[k]
            for upd, w in zip(client_updates, client_weights)
        )
        # Updates are deltas, so add them to the current global weights.
        new_state[k] = v + avg_delta
    global_model.load_state_dict(new_state)
    return global_model

# Each round:
# 1. broadcast global model
# 2. clients train locally (optionally with DP)
# 3. clients send model deltas (masked or encrypted)
# 4. server aggregates via secure aggregation
```

### Homomorphic encryption inference (TenSEAL CKKS)

```python
import tenseal as ts

ctx = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
ctx.global_scale = 2**40
ctx.generate_galois_keys()  # needed for the rotations used by dot products

x = ts.ckks_vector(ctx, [0.1, 0.5, -0.3, 0.7])  # encrypted input
W = [[0.2, -0.1, 0.4, 0.05]]  # plaintext weights
b = [0.1]                     # plaintext bias

# Encrypted inference: y = W·x + b (single output neuron).
# dot() takes a plain vector; matmul() would need a (4 x n) matrix.
y_enc = x.dot(W[0]) + b[0]
y_plain = y_enc.decrypt()  # approximate result, as CKKS is approximate
```

### Secure aggregation (cross-device FL)

```python
# Bonawitz et al. protocol sketch:
# 1. Each pair of clients (i, j) agrees on a shared mask m_ij via Diffie-Hellman.
# 2. Client i sends update_i + sum_{j > i} m_ij - sum_{j < i} m_ij.
# 3. Server sums all messages -> masks cancel -> only the aggregate is revealed.
# Tolerates dropouts via Shamir secret sharing of the mask seeds.
```

### Confidential GPU inference (NVIDIA H100 CC)

```bash
# Set the GPU ready state so it accepts work (CC mode must already be enabled)
nvidia-smi conf-compute -srs 1
# Verify attestation (flag as given in the source; confirm with
# `nvidia-smi conf-compute -h` for your driver version)
nvidia-smi conf-compute -gar
# Application gets an encrypted GPU-CPU bus + attested code
```

### Machine unlearning (SISA)

```python
# Sharded, Isolated, Sliced, Aggregated:
# 1. Shard data into K disjoint parts; train K models.
# 2. Aggregate (vote/avg) for inference.
# 3. To unlearn user u: retrain only the shard containing u.
# Cost: roughly 1/K of a full retrain per deletion.
```

## Decision criteria

| Situation | Approach |
|---|---|
| Single org, sensitive labels | DP-SGD |
| Many phones / hospitals | Federated + secure agg + DP |
| Cloud inference, untrusted server | TEE (H100 CC) or HE |
| Two parties, joint model | MPC (CrypTen, MP-SPDZ) |
| GDPR right-to-be-forgotten | SISA / approximate unlearning |
| Need to share data externally | DP synthetic data |

**Default**: TEE (confidential computing) for inference; DP-SGD + federated for training across orgs.

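
The "Federated + secure agg + DP" recommendation relies on pairwise masks that cancel in the server's sum. The following is a minimal sketch of that cancellation only, not the full Bonawitz et al. protocol: the helper names (`pairwise_mask`, `mask_update`, `aggregate`) are hypothetical, the pair seeds come from a local RNG where a real deployment would use Diffie-Hellman key agreement, updates are assumed to be already quantized to integers, and dropout recovery is omitted.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo a fixed integer


def pairwise_mask(seed, dim, modulus):
    """Expand a shared seed into a pseudorandom mask vector."""
    rng = random.Random(seed)
    return [rng.randrange(modulus) for _ in range(dim)]


def mask_update(i, update, pair_seeds, modulus):
    """Client i adds masks shared with j > i and subtracts those shared with j < i."""
    masked = list(update)
    for (a, b), seed in pair_seeds.items():
        if i not in (a, b):
            continue
        mask = pairwise_mask(seed, len(update), modulus)
        sign = 1 if i == a else -1  # keys are stored with a < b
        masked = [(m + sign * r) % modulus for m, r in zip(masked, mask)]
    return masked


def aggregate(masked_updates, modulus):
    """Server-side sum; every pairwise mask appears once with + and once with -."""
    dim = len(masked_updates[0])
    return [sum(u[d] for u in masked_updates) % modulus for d in range(dim)]


# Toy integer updates from three clients (assumed already quantized).
updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

# Assumption: seeds are drawn locally here; in practice each pair derives
# its seed from a key exchange so the server never sees it.
seed_rng = random.Random(0)
pair_seeds = {
    (a, b): seed_rng.getrandbits(64)
    for a in range(len(updates))
    for b in range(a + 1, len(updates))
}

masked = [mask_update(i, u, pair_seeds, MODULUS) for i, u in enumerate(updates)]
total = aggregate(masked, MODULUS)
print(total)  # [111, 222, 333]
```

The server only ever sees the entries of `masked`, which look random individually; the true sum is recovered because each mask is added by one client and subtracted by its partner.
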
## 🔗 Graph

- Parent: [[Privacy]] · [[Practical-Cryptography|Cryptography]] · [[Machine-Learning]]
- Variants: [[Differential-Privacy]] · [[Federated-Learning]] · [[Homomorphic-Encryption]] · [[Secure-Multi-Party-Computation]]
- Applications: [[On-Device-ML]]
- Adjacent: [[Synthetic-Data]]

## 🤖 LLM usage

**When**: regulated data (HIPAA, GDPR, PCI), cross-org training, on-device personalization, untrusted-cloud inference.

**When not**: public data with no privacy requirement; the overhead is not worth it.

## ❌ Anti-patterns

- **Big epsilon (ε>10)**: effectively no privacy.
- **Federated without DP or secure agg**: gradients leak training data.
- **HE for entire training**: 1000x or greater slowdown; only feasible for inference of small models.
- **Anonymization theater**: removing names is not privacy (re-identification attacks are trivial).
- **"Trust me bro" confidential computing**: deploying without remote attestation.

## 🧪 Verification / duplicates

- Verified (Apple PCC 2024, Google FL papers, NIST DP guidance, NVIDIA H100 CC docs 2024).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: 4 pillars + TEE / unlearning 2026 update |