[G1-Sync] Manual knowledge update
---
id: mlops-model-registry
title: MLOps — Model registry / MLflow / W&B / artifact
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [mlops, ml, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI", "Backend"] }
applied_in: []
aliases: [MLOps, MLflow, W&B, Weights and Biases, model registry, model versioning, artifact]
---

# MLOps Model Registry

> ML models need versioning and deployment too. Tools: **MLflow / W&B / DVC / Vertex AI**. Lifecycle: train → register → stage → deploy → monitor.

## 📖 Core concepts
- Model = code + data + hyperparameters + weights.
- Registry: manages model versions.
- Stage: dev / staging / prod.
- Lineage: which dataset a model was trained on.

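To make the four concepts above concrete, here is a minimal in-memory sketch. It is not how MLflow or W&B implement a registry. The names `ModelVersion`, `ToyRegistry`, `register`, `promote`, and `get` are hypothetical, and the extra `archived` stage is an assumption added so that only one version holds `prod` at a time.

```python
from dataclasses import dataclass, field

# "archived" is an assumption of this sketch; the note itself lists dev / staging / prod.
STAGES = ("dev", "staging", "prod", "archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    weights_uri: str    # where the weights artifact lives
    code_rev: str       # git commit of the training code
    dataset_hash: str   # lineage: which dataset produced this version
    params: dict        # hyperparameters
    stage: str = "dev"


@dataclass
class ToyRegistry:
    _versions: dict = field(default_factory=dict)

    def register(self, name, weights_uri, code_rev, dataset_hash, params):
        """Append a new version; version numbers start at 1 and only grow."""
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, weights_uri, code_rev,
                          dataset_hash, dict(params))
        versions.append(mv)
        return mv

    def promote(self, name, version, stage):
        """Move a version to a stage; the previous prod version is archived."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._versions[name][version - 1]
        if stage == "prod":
            for mv in self._versions[name]:
                if mv.stage == "prod":
                    mv.stage = "archived"
        target.stage = stage
        return target

    def get(self, name, stage):
        """Return the newest version in a stage, or None."""
        matches = [mv for mv in self._versions[name] if mv.stage == stage]
        return matches[-1] if matches else None


registry = ToyRegistry()
v1 = registry.register("ChurnModel", "s3://bucket/churn/1", "abc123", "sha256:aaa", {"lr": 0.001})
v2 = registry.register("ChurnModel", "s3://bucket/churn/2", "def456", "sha256:bbb", {"lr": 0.0005})
registry.promote("ChurnModel", 1, "prod")
registry.promote("ChurnModel", 2, "prod")   # v1 is archived automatically
current = registry.get("ChurnModel", "prod")
print(current.version, current.dataset_hash)  # 2 sha256:bbb
```

Answering "which model is in prod, and what data trained it?" is then a single lookup, which is the point of a registry.
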
## 💻 Code patterns

### MLflow
```python
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri('http://mlflow:5000')
mlflow.set_experiment('user-churn')

with mlflow.start_run() as run:
    # Hyperparameters
    mlflow.log_param('lr', 0.001)
    mlflow.log_param('batch_size', 32)

    model = train(...)  # placeholder: your own training function

    # Validation metrics
    mlflow.log_metric('val_loss', 0.12)
    mlflow.log_metric('val_acc', 0.87)

    # Log the model and register it as a new version of 'ChurnModel'
    mlflow.sklearn.log_model(model, 'model', registered_model_name='ChurnModel')
```

### Model registry (MLflow)
```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = run.info.run_id  # `run` comes from the mlflow.start_run() block above

# Register (explicit alternative to registered_model_name=...; 'ChurnModel' must already exist)
mv = client.create_model_version(
    name='ChurnModel',
    source=f'runs:/{run_id}/model',
    run_id=run_id,
)

# Promote
# Stages are deprecated since MLflow 2.9; newer code uses aliases instead,
# e.g. client.set_registered_model_alias('ChurnModel', 'champion', mv.version).
client.transition_model_version_stage(
    name='ChurnModel',
    version=mv.version,
    stage='Production',
)

# Load
model = mlflow.sklearn.load_model('models:/ChurnModel/Production')
```

### W&B
```python
import wandb

wandb.init(project='churn', config={'lr': 0.001})
for epoch in range(100):
    loss = train_step()  # placeholder: your own training step
    wandb.log({'loss': loss, 'epoch': epoch})

# Save artifact
art = wandb.Artifact('model', type='model')
art.add_file('model.pkl')
wandb.log_artifact(art)
wandb.finish()
```

→ Strong at hyperparameter sweeps and charts.

### DVC (Data Version Control)
```bash
# Code in git, data in DVC
dvc init
dvc remote add -d s3 s3://bucket/dvc

dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m 'add dataset'
dvc push  # upload the data to the S3 remote

# Pipeline (`dvc run` was removed in DVC 3; `dvc stage add` + `dvc repro` replace it)
dvc stage add -n train \
  -d data/train.csv \
  -d train.py \
  -o model.pkl \
  python train.py
dvc repro
```

→ Large files go to S3; Git tracks only small pointer files.

### Reproducibility
```python
# Seed
import random

import numpy as np
import torch

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Lock
# Pin exact versions in requirements.txt:
#   torch==2.4.0
#   transformers==4.45.0

# Docker for env (Dockerfile line, not Python). Published tags spell out the
# CUDA and cuDNN versions; verify the exact tag on Docker Hub:
#   FROM pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
```

### Experiment compare
```python
import mlflow
import pandas as pd
import wandb

# MLflow: top 10 runs by validation accuracy (returns a pandas DataFrame)
runs = mlflow.search_runs(experiment_ids=['1'], max_results=10, order_by=['metrics.val_acc DESC'])

# W&B
api = wandb.Api()
runs = api.runs('user/churn')  # '<entity>/<project>'
# .get() avoids a KeyError for runs that never logged the value
df = pd.DataFrame([{'lr': r.config.get('lr'), 'acc': r.summary.get('val_acc')} for r in runs])
```

### Model serving (MLflow)
```bash
# The models:/ URI is resolved against the tracking server
export MLFLOW_TRACKING_URI=http://mlflow:5000
mlflow models serve -m models:/ChurnModel/Production --port 5001

# REST
curl http://localhost:5001/invocations \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [[1,2,3]]}'
```

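### BentoML (production serving)
```python
import bentoml

@bentoml.service
class ChurnPredictor:
    def __init__(self):
        # bentoml.models.get() returns only a reference, so load the estimator itself.
        # Assumes the model was saved earlier with bentoml.sklearn.save_model('churn', ...).
        self.model = bentoml.sklearn.load_model('churn:latest')

    @bentoml.api
    def predict(self, features: list[float]) -> dict:
        # .tolist() turns NumPy scalars into JSON-serializable Python values
        return {'pred': self.model.predict([features]).tolist()[0]}
```

```bash
bentoml build
bentoml containerize churn:latest  # tag of the built bento; assumes the bento is named `churn`
```

→ Docker image plus REST and gRPC endpoints are generated automatically.
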
### Triton (NVIDIA inference)
```
- Multiple models
- Multiple frameworks (TF, PyTorch, ONNX)
- Dynamic batching
- GPU-friendly
```

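Dynamic batching is the least self-explanatory item in the list above. The sketch below only simulates the idea on a list of arrival times: requests are grouped until the batch is full or the oldest request has waited too long. Triton's own scheduler is configured in the model configuration rather than written by hand; `dynamic_batches`, `max_batch_size`, and `max_delay_ms` are hypothetical names for this illustration.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_delay_ms):
    """Group sorted request arrival times (ms) into batches.

    A batch is dispatched when it is full, or when a new request arrives
    after the oldest queued request has already waited max_delay_ms.
    """
    batches, current = [], []
    for t in arrivals_ms:
        # The oldest queued request has waited too long: flush before queueing.
        if current and t - current[0] > max_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: flush immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


result = dynamic_batches([0, 1, 2, 3, 50, 51, 120], max_batch_size=3, max_delay_ms=10)
print(result)  # [[0, 1, 2], [3], [50, 51], [120]]
```

Bursts are served as one GPU call, while isolated requests are not held back longer than the delay budget.
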
### TorchServe
```bash
# --model-store points at the directory holding model.mar (`model_store` is an assumed name)
torchserve --start --model-store model_store --models my_model=model.mar
curl http://localhost:8080/predictions/my_model -d @input.json
```

### Vertex AI / SageMaker
```python
# Vertex AI
from google.cloud import aiplatform

aiplatform.init(project='my-project')
model = aiplatform.Model.upload(
    display_name='churn',
    artifact_uri='gs://bucket/model',
    serving_container_image_uri='gcr.io/.../tf-serving',
)
endpoint = model.deploy(machine_type='n1-standard-4', min_replica_count=1)
```

→ Managed. Auto-scale + monitoring.

### Feature store
```python
# Feast
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path='.')

# Online (low latency)
features = store.get_online_features(
    features=['user:age', 'user:total_spent'],
    entity_rows=[{'user_id': 123}],
).to_dict()

# Offline (training): entity_df needs the entity key plus an event_timestamp column.
# The single row below is illustrative only.
entity_df = pd.DataFrame({'user_id': [123], 'event_timestamp': [pd.Timestamp.now(tz='UTC')]})
df = store.get_historical_features(
    entity_df=entity_df,
    features=[...],  # feature references, elided in the source
).to_df()
```

→ Train / serve consistency.

### Data validation (Great Expectations / Deequ)
```python
import great_expectations as ge

# Legacy Pandas API (great_expectations < 1.0); `train_df` is an existing pandas DataFrame
df = ge.from_pandas(train_df)
df.expect_column_values_to_be_between('age', 0, 120)
df.expect_column_to_exist('user_id')
result = df.validate()
assert result.success, 'training data failed validation'
```

→ Check the schema before training and before inference.

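### Schema (Pydantic / Feast)
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# `model` is assumed to be loaded at startup, e.g. from the registry.

class Features(BaseModel):
    age: int
    income: float
    region: str

# API input → validate
@app.post('/predict')
def predict(features: Features):
    # model_dump() is the Pydantic v2 name; use .dict() on v1.
    row = list(features.model_dump().values())
    # `region` is a string, so the model must bring its own encoder (e.g. a pipeline).
    return {'pred': model.predict([row]).tolist()[0]}
```
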
### CI / CD for ML
```yaml
# .github/workflows/train.yml
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt  # must include dvc, so install before `dvc pull`
      - run: dvc pull
      - run: python train.py
      - run: dvc push # save artifacts
      - run: |
          if python compare.py; then
            # Placeholder: the MLflow CLI has no `promote` command; call MlflowClient from a script.
            mlflow promote ...
          fi
```

→ Continuous training.

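The workflow above gates promotion on the exit code of `compare.py`, which the note does not show. A minimal sketch of such a gate follows, assuming the training step writes metrics as JSON. The file names, the `val_acc` key, and the functions `should_promote` and `main` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def should_promote(candidate: dict, baseline: dict,
                   metric: str = "val_acc", min_gain: float = 0.0) -> bool:
    """True when the candidate beats the baseline on `metric` by more than `min_gain`."""
    return candidate[metric] - baseline[metric] > min_gain


def main(candidate_path: str, baseline_path: str) -> int:
    """Return a process exit code: 0 (promote) or 1 (keep the current model)."""
    candidate = json.loads(Path(candidate_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    # Shell convention: exit code 0 means success, so `if python compare.py; then` promotes.
    return 0 if should_promote(candidate, baseline) else 1


# Demo with throwaway files standing in for real metric dumps.
with tempfile.TemporaryDirectory() as tmp:
    cand = Path(tmp, "candidate.json")
    base = Path(tmp, "production.json")
    cand.write_text(json.dumps({"val_acc": 0.89}))
    base.write_text(json.dumps({"val_acc": 0.87}))
    exit_code = main(str(cand), str(base))

print(exit_code)  # 0
```

In a real script the last line would be `sys.exit(main(...))`, so that a worse candidate, or a missing metrics file, stops the promotion step.
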
### Model card (documentation)
```markdown
# Model Card: Churn Predictor v3.1

## Intended use
Predict user churn for SaaS billing dashboard.

## Training data
- Source: data from 2025-01-01 to 2026-04-30
- Size: 1.2M users
- Features: 23

## Performance
- Val accuracy: 0.87
- Val AUC: 0.91
- F1: 0.83

## Limitations
- Trained on US-only data
- Lower accuracy for cold-start users (< 30 days)
- 30%+ class imbalance

## Bias
- ...
```

→ Trust + governance.

### Prompt versioning (LLM as model)
```python
# Promptfoo / LangSmith / Helicone
prompts = {
    'v1': 'Summarize: {text}',
    'v2': 'Provide a 3-sentence summary: {text}',
}

# A/B test in prod: `user.bucket` is assumed to hold 'v1' or 'v2'
prompt = prompts[user.bucket]
```

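The snippet above reads a bucket from the user but does not say how users get one. One common approach, sketched here as an assumption rather than as what Promptfoo, LangSmith, or Helicone do, is to hash the user ID so the assignment is stable across requests. `assign_bucket`, the experiment name, and `v2_share` are hypothetical.

```python
import hashlib

PROMPTS = {
    'v1': 'Summarize: {text}',
    'v2': 'Provide a 3-sentence summary: {text}',
}


def assign_bucket(user_id: str, experiment: str = 'summary-prompt',
                  v2_share: float = 0.5) -> str:
    """Map a user to 'v1' or 'v2' deterministically.

    Hashing experiment + user ID keeps a user in the same bucket on every
    request and decorrelates bucket assignments between experiments.
    """
    digest = hashlib.sha256(f'{experiment}:{user_id}'.encode()).digest()
    # First 8 bytes as an integer, scaled to the range [0, 1).
    fraction = int.from_bytes(digest[:8], 'big') / 2**64
    return 'v2' if fraction < v2_share else 'v1'


bucket = assign_bucket('user-123')
prompt = PROMPTS[bucket].format(text='...')

# Rough check that the split is close to the requested share.
v2_count = sum(assign_bucket(f'user-{i}') == 'v2' for i in range(1000))
```

Logging the bucket together with the prompt version lets online metrics be attributed to the right prompt afterwards.
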
### Golden dataset
```python
import pandas as pd

# The golden test set never changes between runs
test_df = pd.read_parquet('s3://bucket/golden_test.parquet')
acc = evaluate(model, test_df)  # placeholder: your own evaluation function
assert acc > 0.85, 'regression'
```

→ Regression check.

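A regression check is only meaningful if the golden set really stays fixed. One way to enforce that, sketched here under the assumption that the file is available locally, is to pin its SHA-256 and refuse to evaluate when it differs. `sha256_of` and `assert_unchanged` are hypothetical helpers, and the CSV content is a stand-in.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def assert_unchanged(path, expected_sha256):
    """Raise if the file no longer matches the pinned hash."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(f'golden dataset changed: {actual} != {expected_sha256}')


# Demo with a throwaway file standing in for the golden test set.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp, 'golden_test.csv')
    golden.write_bytes(b'user_id,churned\n1,0\n2,1\n')
    pinned = sha256_of(golden)          # commit this value next to the eval code
    assert_unchanged(golden, pinned)    # passes: file is untouched

    golden.write_bytes(b'user_id,churned\n1,0\n2,0\n')  # someone edits a label
    try:
        assert_unchanged(golden, pinned)
        tamper_detected = False
    except RuntimeError:
        tamper_detected = True

print(tamper_detected)  # True
```

Without such a guard, an accuracy change can come from the data instead of the model, and the assert above would hide that.
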
### Online + offline metrics
```
Offline (train): accuracy, AUC, F1
Online (prod): clicks, dwell time, conversion

→ Offline metrics almost never match online metrics.
A/B test results are the source of truth.
```

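If the A/B test is the source of truth, its result still needs a significance check before a model is declared the winner. A minimal sketch for a conversion metric, using a standard two-proportion z-test; the function name and the traffic numbers are illustrative assumptions, not measurements.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and candidate (B).

    Returns (z, two-sided p-value) under the pooled-variance normal approximation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers: 10.0% vs 11.0% conversion on 10,000 users each.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f'z={z:.2f}, p={p_value:.3f}')
```

A small p-value here says the online difference is unlikely to be noise; it says nothing about whether the offline metrics agreed.
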
## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Single team / experiment | MLflow |
| Hyperparam sweep | W&B |
| Data versioning | DVC |
| Production serving | BentoML / Triton |
| Cloud managed | Vertex / SageMaker |
| Feature store | Feast / Tecton |
| Validation | Great Expectations |
| Docs | Model card |

## ❌ Anti-patterns
- **No version**: nobody knows which model is in prod.
- **Train / serve drift**: predictions break when features differ between training and serving.
- **No monitoring**: silent regression.
- **Hyperparam in script**: hard-coded values are not tracked.
- **Big artifact in git**: clone size explodes.
- **No reproducibility**: no seed set.
- **Direct prod deploy**: no staging step.

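Train / serve drift and missing monitoring can both be caught with a simple distribution check on each feature. The sketch below uses the Population Stability Index (PSI), one common choice among several. The function name, bin count, and synthetic data are assumptions, and the 0.1 / 0.25 cut-offs are only a widely used rule of thumb.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a serving sample.

    Bins are equal-width over the training range; serving values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps log() finite for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(42)
train_age = [rng.gauss(40, 10) for _ in range(5000)]
serve_same = [rng.gauss(40, 10) for _ in range(5000)]     # same distribution
serve_shifted = [rng.gauss(50, 10) for _ in range(5000)]  # mean moved by 10

psi_same = psi(train_age, serve_same)
psi_shifted = psi(train_age, serve_shifted)
print(f'same={psi_same:.3f} shifted={psi_shifted:.3f}')
```

Rule of thumb: below 0.1 is usually treated as stable and above 0.25 as drift worth an alert; running this per feature on a schedule turns a silent regression into a visible one.
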
## 🤖 Hints for LLM use
- MLflow / W&B are the baseline.
- A feature store provides train/serve consistency.
- BentoML / Triton handle production serving.
- Model card = governance + trust.

## 🔗 Related documents
- [[AI_Local_LLM_Inference]]
- [[Data_Eng_dbt]]
- [[DevOps_CI_CD_Pipeline_Patterns]]