[G1-Sync] Manual knowledge update

2026-05-09 22:47:42 +09:00
parent 93ec7e9056
commit 21ac3ed255
56 changed files with 22043 additions and 43 deletions
@@ -0,0 +1,354 @@
+---
+id: mlops-model-registry
+title: MLOps — Model registry / MLflow / W&B / artifact
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [mlops, ml, vibe-coding]
+tech_stack: { language: "Python", applicable_to: ["AI", "Backend"] }
+applied_in: []
+aliases: [MLOps, MLflow, W&B, Weights and Biases, model registry, model versioning, artifact]
+---
+
+# MLOps Model Registry
+
+> ML models need versioning and deployment just like code. Common tools: **MLflow / W&B / DVC / Vertex AI**. Lifecycle: train → register → stage → deploy → monitor.
+
+## 📖 Core concepts
+- Model = code + data + hyperparameters + weights.
+- Registry: manages model versions.
+- Stage: dev / staging / prod.
+- Lineage: which dataset a model was trained on.
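
The four concepts above can be pictured as one small record per model version. The sketch below is a minimal in-memory illustration, not how MLflow or any real registry is implemented; `ModelVersion`, `InMemoryRegistry`, and all field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    version: int
    dataset_sha256: str          # lineage: fingerprint of the training data
    hyperparams: dict
    stage: str = "dev"           # dev / staging / prod


@dataclass
class InMemoryRegistry:
    versions: dict = field(default_factory=dict)   # name -> list of ModelVersion

    def register(self, name, dataset_bytes, hyperparams):
        """Assign the next version number and record which data produced it."""
        history = self.versions.setdefault(name, [])
        mv = ModelVersion(
            name=name,
            version=len(history) + 1,
            dataset_sha256=hashlib.sha256(dataset_bytes).hexdigest(),
            hyperparams=dict(hyperparams),
        )
        history.append(mv)
        return mv

    def promote(self, name, version, stage):
        """Move one version to a stage; the previous holder falls back to dev."""
        for mv in self.versions[name]:
            if mv.stage == stage:
                mv.stage = "dev"
        target = self.versions[name][version - 1]
        target.stage = stage
        return target

    def current(self, name, stage="prod"):
        """Return the version currently in a stage, or None."""
        return next((mv for mv in self.versions.get(name, []) if mv.stage == stage), None)


registry = InMemoryRegistry()
v1 = registry.register("ChurnModel", b"user_id,churned\n1,0\n", {"lr": 0.001})
v2 = registry.register("ChurnModel", b"user_id,churned\n1,0\n2,1\n", {"lr": 0.001})
registry.promote("ChurnModel", 2, "prod")
```

Storing a hash of the training data next to the version number is one simple way to answer the lineage question later; real registries usually link to a run or dataset URI instead.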
+
+## 💻 Code patterns
+
+### MLflow
+```python
+import mlflow
+
+mlflow.set_tracking_uri('http://mlflow:5000')
+mlflow.set_experiment('user-churn')
+
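+with mlflow.start_run() as run:
+    # Hyperparameters for this run
+    mlflow.log_param('lr', 0.001)
+    mlflow.log_param('batch_size', 32)
+
+    model = train(...)  # placeholder for the project's training function
+
+    # Validation metrics
+    mlflow.log_metric('val_loss', 0.12)
+    mlflow.log_metric('val_acc', 0.87)
+
+    # Log the model and register it as a new version of 'ChurnModel'
+    mlflow.sklearn.log_model(model, 'model', registered_model_name='ChurnModel')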
+```
+
+### Model registry (MLflow)
+```python
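+import mlflow
+from mlflow.tracking import MlflowClient
+
+client = MlflowClient()
+run_id = run.info.run_id  # `run` comes from the mlflow.start_run() block above
+
+# Register (not needed if log_model(..., registered_model_name=...) already did it)
+mv = client.create_model_version(
+    name='ChurnModel',
+    source=f'runs:/{run_id}/model',
+    run_id=run_id,
+)
+
+# Promote
+# Note: stages are deprecated since MLflow 2.9; newer code uses aliases,
+# e.g. client.set_registered_model_alias('ChurnModel', 'champion', mv.version)
+client.transition_model_version_stage(
+    name='ChurnModel',
+    version=mv.version,
+    stage='Production',
+)
+
+# Load
+model = mlflow.sklearn.load_model('models:/ChurnModel/Production')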
+```
+
+### W&B
+```python
+import wandb
+
+wandb.init(project='churn', config={'lr': 0.001})
+for epoch in range(100):
+    loss = train_step()
+    wandb.log({'loss': loss, 'epoch': epoch})
+
+# Save artifact
+art = wandb.Artifact('model', type='model')
+art.add_file('model.pkl')
+wandb.log_artifact(art)
+```
+
+→ W&B is strongest at hyperparameter sweeps and charts.
+
+### DVC (Data Version Control)
+```bash
+# Code in git, data in DVC
+dvc init
+dvc remote add -d s3 s3://bucket/dvc
+
+dvc add data/train.csv
+git add data/train.csv.dvc .gitignore
+git commit -m 'add dataset'
+
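+# Pipeline: define the stage, then run it
+# (`dvc run` was removed in DVC 3.0; `dvc stage add` replaces it)
+dvc stage add -n train \
+    -d data/train.csv \
+    -d train.py \
+    -o model.pkl \
+    python train.py
+dvc repro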
+```
+
+→ Git tracks only the small `.dvc` pointer files; the large files live in S3, so the repository stays light.
+
+### Reproducibility
+```python
+# Seed every RNG in use
+import random
+
+import numpy as np
+import torch
+
+random.seed(42)
+np.random.seed(42)
+torch.manual_seed(42)
+
+# Lock: pin exact versions in requirements.txt
+#   torch==2.4.0
+#   transformers==4.45.0
+
+# Docker for the environment (Dockerfile; verify the exact tag on Docker Hub)
+#   FROM pytorch/pytorch:2.4.0-cuda12-runtime
+```
+
+### Experiment compare
+```python
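+import mlflow
+import pandas as pd
+import wandb
+
+# MLflow: top 10 runs by validation accuracy
+runs = mlflow.search_runs(experiment_ids=['1'], max_results=10, order_by=['metrics.val_acc DESC'])
+
+# W&B: pull config and summary metrics through the public API
+api = wandb.Api()
+wandb_runs = api.runs('user/churn')
+df = pd.DataFrame([{'lr': r.config['lr'], 'acc': r.summary['val_acc']} for r in wandb_runs])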
+```
+
+### Model serving (MLflow)
+```bash
+mlflow models serve -m models:/ChurnModel/Production --port 5001
+
+# REST
+curl http://localhost:5001/invocations \
+  -H 'Content-Type: application/json' \
+  -d '{"inputs": [[1,2,3]]}'
+```
+
+### BentoML (production serving)
+```python
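+import bentoml
+
+@bentoml.service
+class ChurnPredictor:
+    # Reference to the stored model (not yet a loaded estimator)
+    bento_model = bentoml.models.get('churn:latest')
+
+    def __init__(self) -> None:
+        # Assumes the model was saved with bentoml.sklearn.save_model
+        self.model = bentoml.sklearn.load_model(self.bento_model)
+
+    @bentoml.api
+    def predict(self, features: list[float]) -> dict:
+        pred = self.model.predict([features])[0]
+        return {'pred': float(pred)}  # cast the NumPy scalar for JSON output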
+```
+
+```bash
+bentoml build
+# containerize takes the Bento tag printed by `bentoml build`
+bentoml containerize churn:latest
+```
+
+→ Docker packaging plus REST and gRPC endpoints come out of the box.
+
+### Triton (NVIDIA inference)
+```
+- Multiple models
+- Multiple frameworks (TF, PyTorch, ONNX)
+- Dynamic batching
+- GPU-friendly
+```
+
+### TorchServe
+```bash
+# model_store/ is the directory that holds model.mar
+torchserve --start --model-store model_store --models my_model=model.mar
+curl http://localhost:8080/predictions/my_model -d @input.json
+```
+
+### Vertex AI / SageMaker
+```python
+# Vertex AI
+from google.cloud import aiplatform
+
+aiplatform.init(project='my-project')
+model = aiplatform.Model.upload(
+    display_name='churn',
+    artifact_uri='gs://bucket/model',
+    serving_container_image_uri='gcr.io/.../tf-serving',
+)
+endpoint = model.deploy(machine_type='n1-standard-4', min_replica_count=1)
+```
+
+→ Managed. Auto-scale + monitoring.
+
+### Feature store
+```python
+# Feast
+from feast import FeatureStore
+store = FeatureStore(repo_path='.')
+
+# Online (low latency)
+features = store.get_online_features(
+    features=['user:age', 'user:total_spent'],
+    entity_rows=[{'user_id': 123}],
+).to_dict()
+
+# Offline (training)
+df = store.get_historical_features(
+    entity_df=entity_df,
+    features=[...],
+).to_df()
+```
+
+→ Train / serve consistency.
+
+### Data validation (Great Expectations / Deequ)
+```python
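+# Legacy pandas API (great_expectations releases before 1.0)
+import great_expectations as ge
+
+df = ge.from_pandas(train_df)
+df.expect_column_values_to_be_between('age', 0, 120)
+df.expect_column_to_exist('user_id')
+result = df.validate()
+if not result.success:
+    raise ValueError('training data failed validation')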
+```
+
+→ Run schema checks before training and before inference.
+
+### Schema (Pydantic / Feast)
+```python
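+from fastapi import FastAPI  # assumes the API is built with FastAPI
+from pydantic import BaseModel
+
+app = FastAPI()
+
+class Features(BaseModel):
+    age: int
+    income: float
+    region: str
+
+# API input → validated against the schema before it reaches the model
+@app.post('/predict')
+def predict(features: Features):
+    # `model` is loaded elsewhere and is assumed to encode `region` itself
+    row = list(features.model_dump().values())  # Pydantic v2; use .dict() on v1
+    return {'pred': model.predict([row])[0]}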
+```
+
+### CI / CD for ML
+```yaml
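+# .github/workflows/train.yml
+on: [push]
+jobs:
+  train:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: pip install -r requirements.txt  # assumes dvc is listed here
+      - run: dvc pull
+      - run: python train.py
+      - run: dvc push  # save artifacts
+      - run: |
+          if python compare.py; then
+            # Placeholder: the MLflow CLI has no `promote` subcommand,
+            # so in practice this step calls the registry API from a script.
+            mlflow promote ...
+          fi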
+```
+
+→ Continuous training.
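
The workflow above gates promotion on the exit code of `compare.py`, which the note does not show. Below is a minimal sketch of what such a gate might look like, assuming training writes its metrics to JSON files; the file names, the `val_acc` key, and the `min_gain` margin are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def should_promote(candidate: dict, production: dict,
                   metric: str = "val_acc", min_gain: float = 0.0) -> bool:
    """True if the candidate beats production on `metric` by more than `min_gain`."""
    return candidate[metric] > production[metric] + min_gain


def main(candidate_path: str, production_path: str) -> int:
    """Return a shell exit code: 0 means promote, 1 means keep the current model."""
    candidate = json.loads(Path(candidate_path).read_text())
    production = json.loads(Path(production_path).read_text())
    return 0 if should_promote(candidate, production) else 1


# Demo with temporary files standing in for real metric dumps
with tempfile.TemporaryDirectory() as tmp:
    cand = Path(tmp) / "candidate.json"
    prod = Path(tmp) / "production.json"
    cand.write_text(json.dumps({"val_acc": 0.89}))
    prod.write_text(json.dumps({"val_acc": 0.87}))
    exit_code = main(str(cand), str(prod))

# In a real compare.py the last line would instead be (with `import sys`):
#   sys.exit(main(sys.argv[1], sys.argv[2]))
```

Returning the decision as an exit code keeps the shell `if` in the workflow unchanged, and a non-zero `min_gain` avoids promoting on noise-level improvements.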
+
+### Model card (documentation)
+```markdown
+# Model Card: Churn Predictor v3.1
+
+## Intended use
+Predict user churn for SaaS billing dashboard.
+
+## Training data
+- Source: 2025-01-01 - 2026-04-30
+- Size: 1.2M users
+- Features: 23
+
+## Performance
+- Val accuracy: 0.87
+- Val AUC: 0.91
+- F1: 0.83
+
+## Limitations
+- Trained on US-only data
+- Accuracy drops for cold-start users (< 30 days of history)
+- 30%+ class imbalance
+
+## Bias
+- ...
+```
+
+→ Trust + governance.
+
+### Prompt versioning (LLM as model)
+```python
+# Promptfoo / LangSmith / Helicone
+prompts = {
+  'v1': 'Summarize: {text}',
+  'v2': 'Provide a 3-sentence summary: {text}',
+}
+
+# A/B test in prod
+prompt = prompts[user.bucket]
+```
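
The `user.bucket` lookup above presupposes a stable way of assigning each user to a prompt version. One common approach, sketched here as an assumption rather than what Promptfoo, LangSmith, or Helicone do internally, is to hash the user ID; `assign_bucket`, `render_prompt`, and the 50/50 split are hypothetical.

```python
import hashlib

PROMPTS = {
    "v1": "Summarize: {text}",
    "v2": "Provide a 3-sentence summary: {text}",
}


def assign_bucket(user_id: str, v2_share: float = 0.5) -> str:
    """Map a user to 'v1' or 'v2' deterministically, so a user always sees the same prompt."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a slot in 0..9999
    slot = int.from_bytes(digest[:8], "big") % 10_000
    return "v2" if slot < int(v2_share * 10_000) else "v1"


def render_prompt(user_id: str, text: str) -> tuple[str, str]:
    """Return (version, prompt) so the version can be logged next to the response."""
    version = assign_bucket(user_id)
    return version, PROMPTS[version].format(text=text)


version, prompt = render_prompt("user-123", "MLflow tracks experiments.")
```

Logging the returned version with every response is what later lets the two prompts be compared on online metrics.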
+
+### Golden dataset
+```python
+import pandas as pd
+
+# The golden test set must never change
+test_df = pd.read_parquet('s3://bucket/golden_test.parquet')
+acc = evaluate(model, test_df)  # evaluate / model are defined elsewhere
+assert acc > 0.85, 'regression'
+```
+
+→ Regression check.
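
The regression check only means something if the golden file really is frozen. Below is a minimal sketch of enforcing that, assuming the expected digest is committed alongside the code; `file_sha256`, `check_golden`, and the file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_golden(path: Path, expected_sha256: str) -> None:
    """Raise if the golden test set differs from the pinned fingerprint."""
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise ValueError(f"golden dataset changed: {actual} != {expected_sha256}")


# Demo with a temporary file standing in for the real golden test set
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden_test.csv"
    golden.write_bytes(b"user_id,churned\n1,0\n2,1\n")
    pinned = file_sha256(golden)          # commit this value with the code
    check_golden(golden, pinned)          # passes: the file is unchanged

    golden.write_bytes(b"user_id,churned\n1,0\n2,0\n")   # someone edits a label
    try:
        check_golden(golden, pinned)
        tampering_detected = False
    except ValueError:
        tampering_detected = True
```

Running `check_golden` before `evaluate` turns an accidental edit of the test set into a loud failure instead of a quietly shifted accuracy number.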
+
+### Online + offline metrics
+```
+Offline (train): accuracy, AUC, F1
+Online (prod): user-clicked, dwell time, conversion
+
+→ Offline metrics almost never match online metrics.
+The A/B test is the source of truth.
+```
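
Since the A/B test is what settles the question, a small worked example of reading one may help. This is a minimal sketch of a two-sided two-proportion z-test using only the standard library; the conversion counts are made-up illustration data and `two_proportion_z_test` is a hypothetical helper.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rate B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: current model (A) vs. new model (B)
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts B converts at 5.6% against 5.0% for A, yet the p-value lands slightly above 0.05, which illustrates why a better-looking model still needs enough traffic before it is declared the winner.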
+
+## 🤔 Decision criteria
+| Situation | Recommendation |
+|---|---|
+| Single team / experiment | MLflow |
+| Hyperparam sweep | W&B |
+| Data versioning | DVC |
+| Production serving | BentoML / Triton |
+| Cloud managed | Vertex / SageMaker |
+| Feature store | Feast / Tecton |
+| Validation | Great Expectations |
+| Docs | Model card |
+
+## ❌ Anti-patterns
+- **No version**: nobody knows which model is in prod.
+- **Train / serve drift**: predictions break when features differ between training and serving.
+- **No monitoring**: regressions go unnoticed.
+- **Hyperparam in script**: values cannot be tracked.
+- **Big artifact in git**: clone size explodes.
+- **No reproducibility**: no seed is set.
+- **Direct prod deploy**: no staging step.
+
+## 🤖 Hints for LLM use
+- MLflow / W&B are the baseline.
+- A feature store provides train/serve consistency.
+- BentoML / Triton handle production serving.
+- Model card = governance + trust.
+
+## 🔗 Related documents
+- [[AI_Local_LLM_Inference]]
+- [[Data_Eng_dbt]]
+- [[DevOps_CI_CD_Pipeline_Patterns]]