---
id: wiki-2026-0508-다수-팀-협업-환경
title: Multi-Team Collaboration Environment
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Multi-team Collaboration, Multi-Team AI Workflow, Cross-team AI Coordination]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [collaboration, multi-team, ai-workflow, ml-platform]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: mlflow-langfuse-argo
---

# Multi-Team Collaboration Environment

## In One Line

> **"A multi-team AI environment is a triad: shared model registry + isolated namespaces + cross-team observability."** The single-ML-team era ended in 2024; by 2026 every product team in an enterprise owns its own LLM or agent. The pattern combines shared infrastructure (model registry, eval harness, prompt store) with team-isolated workspaces (Argo namespaces, Langfuse projects).

## Core Ideas

### Conflict Areas

- **Model versioning**: team A fine-tunes Llama-3.3 70B and team B consumes the same model, so an update by team A can cause a regression for team B.
- **Prompt drift**: one base prompt is forked by each team until seven variants coexist.
- **Eval inconsistency**: every team uses a different eval set, so results cannot be compared.
- **GPU contention**: without fair-share scheduling on the H200 cluster, teams suffer from noisy neighbors.

### Governance Layer

- **Model Registry (MLflow / Weights & Biases)**: canonical model card, semver tag, deprecation policy.
- **Prompt Store (Langfuse / Humanloop)**: versioned prompts, A/B winner promotion.
- **Eval Harness (Inspect AI / promptfoo)**: shared regression suite, triggered automatically on every model bump.
- **Observability (Langfuse + OpenTelemetry)**: one project per team, plus a cross-team dashboard for leadership.

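The registry and eval-harness bullets together imply a small decision: given the old and new semver tag of a model, what kind of bump is it, and what has to run? The sketch below is one way to express that in plain Python. The function names and the rule that a major bump also requires notifying downstream teams are assumptions for illustration; only "every bump triggers the shared regression suite" comes from the list above.

```python
from typing import NamedTuple


class SemVer(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_semver(tag: str) -> SemVer:
    """Parse a tag such as 'v4.1.0' or '4.1.0' into a SemVer tuple."""
    major, minor, patch = tag.lstrip("v").split(".")
    return SemVer(int(major), int(minor), int(patch))


def bump_kind(old_tag: str, new_tag: str) -> str:
    """Classify the change between two tags as major, minor, patch, or none."""
    old, new = parse_semver(old_tag), parse_semver(new_tag)
    if new <= old:
        return "none"
    if new.major != old.major:
        return "major"
    if new.minor != old.minor:
        return "minor"
    return "patch"


def required_actions(old_tag: str, new_tag: str) -> list[str]:
    """Hypothetical policy: every bump runs the shared suite; major bumps also notify consumers."""
    kind = bump_kind(old_tag, new_tag)
    if kind == "none":
        return []
    actions = ["run_shared_regression_suite"]
    if kind == "major":
        actions.append("notify_downstream_teams")  # assumed rule, not from the source
    return actions
```

A CI hook could call `required_actions` when a new model version is tagged and fail the pipeline until each returned action has completed.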
### Applications

1. The platform team provides the base infrastructure; product teams build the application layer on top.
2. AI Center of Excellence: quarterly review of cross-team incidents.
3. RACI matrix: name the model owner, prompt owner, and eval owner explicitly.

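The RACI item is easy to state and easy to let rot, so it can help to keep ownership as data and check it mechanically. A minimal sketch, assuming one record per shared asset; the `Ownership` class and `missing_roles` helper are hypothetical names, not part of any tool named in this note.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Ownership:
    """Owners for one shared asset; None means the role is unassigned."""
    asset: str
    model_owner: Optional[str] = None
    prompt_owner: Optional[str] = None
    eval_owner: Optional[str] = None


def missing_roles(entry: Ownership) -> list[str]:
    """Return the names of the owner roles that are still unassigned."""
    return [
        f.name
        for f in fields(entry)
        if f.name != "asset" and not getattr(entry, f.name)
    ]


entry = Ownership(
    asset="search/intent-classifier",
    model_owner="team-search",
    prompt_owner="team-search",
)
unassigned = missing_roles(entry)  # the eval owner has not been named yet
```

Running such a check in review makes an unowned eval suite visible before an incident does.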
## 💻 Patterns

### MLflow shared registry: team isolation via aliases

```python
from mlflow import MlflowClient

MODEL_NAME = "llama-3.3-70b-instruct-finetuned"

client = MlflowClient(tracking_uri="https://mlflow.corp/")

# Platform team registers the canonical model.
# Assumes the registered model already exists; if it does not, call
# client.create_registered_model(MODEL_NAME) once beforehand.
mv = client.create_model_version(
    name=MODEL_NAME,
    source="s3://corp-models/llama33-v4/",
    description="Q2 2026 finetune; eval set v3.2",
)

# One alias per consuming team, both initially pointing at the same version.
for alias in ("prod-team-search", "prod-team-support"):
    client.set_registered_model_alias(name=MODEL_NAME, alias=alias, version=mv.version)

# Each team pins its own alias → independent rollout cadence
```

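On the consuming side, each team resolves the model through its own alias rather than a version number. MLflow model URIs accept the form `models:/<name>@<alias>`, which a loader such as `mlflow.pyfunc.load_model` can take. The sketch below only builds that URI and models the alias table as a plain dict to show why the rollout cadences stay independent. The helper names and the version numbers 4 and 5 are hypothetical.

```python
def alias_uri(model_name: str, team: str, stage: str = "prod") -> str:
    """Build an MLflow model URI of the form models:/<name>@<alias>."""
    return f"models:/{model_name}@{stage}-team-{team}"


def move_alias(aliases: dict[str, int], alias: str, version: int) -> dict[str, int]:
    """Return a copy of the alias table with one alias repointed."""
    return {**aliases, alias: version}


search_uri = alias_uri("llama-3.3-70b-instruct-finetuned", "search")

# Both teams start on version 4; only team-search moves to version 5.
aliases = {"prod-team-search": 4, "prod-team-support": 4}
aliases = move_alias(aliases, "prod-team-search", 5)
```

Because team-support's alias still points at version 4, its serving path is untouched until that team repoints its alias itself.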
### Langfuse multi-project prompt versioning

```python
import os

from langfuse import Langfuse

# Credentials are read from the environment (the Langfuse SDK's standard
# variable names) instead of undefined PK/SK globals.
lf = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://langfuse.corp",
)

# Team A creates a prompt
lf.create_prompt(
    name="search/intent-classifier",
    prompt="You classify user search intent. Categories: {{categories}}.",
    labels=["production"],  # auto-promoted version label
    config={"model": "claude-opus-4-7", "temperature": 0.0},
)

# Team B compiles the same prompt (linked, not copied).
# This assumes team B's client holds keys for the project that owns the prompt.
prompt = lf.get_prompt("search/intent-classifier", label="production")
compiled = prompt.compile(categories="navigational, informational, transactional")
```

### Argo Workflows: team-namespaced GPU jobs with priority class

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: finetune-team-search-
  namespace: ai-team-search
spec:
  entrypoint: train
  podGC: { strategy: OnWorkflowSuccess }
  templates:
    - name: train
      priorityClassName: team-search-prod  # fair-share scheduling; the PriorityClass must already exist
      container:
        image: corp-registry/finetune:cuda12.6
        resources:
          limits: { nvidia.com/gpu: 8, memory: 1Ti }
        env:
          - name: MLFLOW_TRACKING_URI
            value: https://mlflow.corp
          - name: WANDB_PROJECT
            value: team-search
```

### Cross-team eval harness with Inspect AI

```python
import asyncio

from inspect_ai import Task, eval_async, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa


@task
def shared_safety_suite():
    return Task(
        dataset=json_dataset("s3://corp-evals/safety-v3.2.jsonl"),
        # Inspect AI model names take the form "<provider>/<model>";
        # the "anthropic/" prefix is an assumption about how the grader is served.
        scorer=model_graded_qa(model="anthropic/claude-opus-4-7"),
    )


# Run across all team-owned models nightly.
# These identifiers are placeholders; each must resolve to a model
# that Inspect AI can load.
models = [
    "team-search/llama-3.3-70b@prod",
    "team-support/llama-3.3-70b@prod",
    "team-rec/llama-3.3-70b@prod",
]


async def main():
    # Top-level `await` is not valid in a script, so the call lives in a coroutine.
    results = await eval_async(shared_safety_suite(), model=models)
    # Publish to shared dashboard; alert if any team regresses >2% vs last week
    return results


if __name__ == "__main__":
    asyncio.run(main())
```

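The alerting rule mentioned in the last comment ("regresses >2% vs last week") can be sketched independently of Inspect AI. The version below assumes scores are pass rates in the range 0 to 1 and reads ">2%" as an absolute drop of more than 0.02; both the function name and that interpretation are assumptions, and the sample scores are made up for illustration.

```python
def regressions(
    last_week: dict[str, float],
    this_week: dict[str, float],
    threshold: float = 0.02,
) -> dict[str, float]:
    """Return {model: drop} for models whose score fell by more than `threshold`."""
    drops = {}
    for model, current in this_week.items():
        previous = last_week.get(model)
        if previous is None:
            continue  # new model with no baseline yet; nothing to compare
        drop = previous - current
        if drop > threshold:
            drops[model] = round(drop, 4)
    return drops


# Illustrative scores only.
last_week = {"team-search": 0.97, "team-support": 0.96, "team-rec": 0.95}
this_week = {"team-search": 0.96, "team-support": 0.93, "team-rec": 0.95}
alerts = regressions(last_week, this_week)
```

Here only team-support crosses the threshold, so only that team would be paged; a relative threshold (drop divided by last week's score) is an equally plausible reading and a one-line change.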
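### OPA policy gate for model promotion

```rego
package modelregistry.promotion

import rego.v1

# Block promotion to prod when the safety pass rate is below 0.95.
deny contains msg if {
    input.action == "promote"
    input.target_alias == "prod"
    input.eval_results.safety_pass_rate < 0.95
    msg := sprintf("safety_pass_rate=%.3f below 0.95", [input.eval_results.safety_pass_rate])
}

# Fail closed: a missing pass rate would otherwise skip the rule above silently.
deny contains msg if {
    input.action == "promote"
    input.target_alias == "prod"
    not input.eval_results.safety_pass_rate
    msg := "missing safety_pass_rate"
}

# Every promotion needs the model owner's approval.
deny contains msg if {
    input.action == "promote"
    not input.has_owner_approval
    msg := "missing model_owner approval"
}
```
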
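For unit-testing promotion requests before they reach OPA, the same checks can be mirrored in a few lines of Python. This is a sketch for local testing, not a replacement for the policy engine. The function name is hypothetical, and it makes one explicit assumption: a request with no `safety_pass_rate` is denied (fail closed).

```python
def deny_reasons(request: dict) -> list[str]:
    """Return the reasons a promotion request would be denied; empty means allowed."""
    reasons: list[str] = []
    if request.get("action") != "promote":
        return reasons  # the policy only gates promotions

    if request.get("target_alias") == "prod":
        rate = request.get("eval_results", {}).get("safety_pass_rate")
        if rate is None:
            reasons.append("missing safety_pass_rate")  # assumed fail-closed behavior
        elif rate < 0.95:
            reasons.append(f"safety_pass_rate={rate:.3f} below 0.95")

    if not request.get("has_owner_approval"):
        reasons.append("missing model_owner approval")
    return reasons


bad_request = {
    "action": "promote",
    "target_alias": "prod",
    "eval_results": {"safety_pass_rate": 0.91},
    "has_owner_approval": False,
}
bad_reasons = deny_reasons(bad_request)
```

Keeping the message strings identical to the Rego rules lets the same test fixtures be replayed against both implementations.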
### Cross-team incident postmortem template (YAML, version-controlled)

```yaml
incident_id: 2026-Q2-013
date: 2026-05-08
owning_team: team-search
affected_teams: [team-support, team-rec]
root_cause: |
  team-search rolled new finetune to alias=prod without notifying
  downstream consumers; intent-classifier prompt assumed older format.
detection: Langfuse anomaly (latency p95 spike), 14 min
resolution: rolled back model alias; published deprecation policy
action_items:
  - owner: platform-team
    due: 2026-05-22
    task: enforce 7-day deprecation notice via OPA
```

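The action item in the template (a 7-day deprecation notice) reduces to a date comparison that can be computed before a request is handed to the policy gate. A minimal sketch, assuming the notice is counted in calendar days from the announcement date; the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

NOTICE_PERIOD = timedelta(days=7)  # the 7-day window from the action item


def earliest_change_date(announced_on: date) -> date:
    """First day on which the announced breaking change may ship."""
    return announced_on + NOTICE_PERIOD


def notice_satisfied(announced_on: Optional[date], change_on: date) -> bool:
    """True if the change ships at least NOTICE_PERIOD after its announcement."""
    if announced_on is None:
        return False  # no announcement at all never satisfies the policy
    return change_on >= earliest_change_date(announced_on)


# The incident above: the alias moved on 2026-05-08 with no announcement.
incident_ok = notice_satisfied(None, date(2026, 5, 8))
```

The boolean result can be passed to the policy gate as an input field, so the policy itself does no date arithmetic.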
## Decision Criteria

| Situation | Approach |
|---|---|
| 2-3 teams, single product | shared monorepo + single MLflow project |
| 5-15 teams, mixed maturity | platform team + per-team namespace |
| 15+ teams, enterprise | full governance layer + AI CoE + OPA gates |
| Regulated (finance/health) | add audit log + immutable model lineage |

**Default**: MLflow registry + Langfuse prompt store + Argo namespace per team + shared Inspect AI eval suite.

## 🔗 Graph

- Parents: [[ML_Platform]] · [[AI_Governance]]
- Variants: [[Single_Team_Workflow]] · [[Federated_Learning_Org]]
- Applications: [[Model_Registry]] · [[Prompt_Engineering_at_Scale]] · [[Large_Frontend_Projects]]
- Adjacent: [[Iterative Prompting]] · [[Parameter]]

## 🤖 LLM Usage

**When**: an enterprise has 5+ teams shipping LLM/agent products and no shared eval suite or registry exists yet.

**When not**: a single team with a single model. This setup is over-engineering there; Notion + GitHub is enough.

## ❌ Anti-patterns

- **Shadow IT model**: a team bypasses the platform and serves models with a personal HF token, creating a security and cost blind spot.
- **Prompt copy-paste**: prompts shared over Slack drift and have no versioning.
- **Eval set fragmentation**: each team keeps its own eval set, so cross-team comparison is impossible.
- **No deprecation policy**: silent breaking changes behind alias=prod.
- **Single GPU pool, no priority class**: a noisy neighbor starves production inference.

## 🧪 Verification / Duplicates

- Verified (MLflow 2.x docs, Langfuse v3 multi-project, Argo Workflows fair-share scheduling, Inspect AI 0.3+).
- Trust level A: production-grade enterprise pattern.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: multi-team AI governance triad + 6 patterns |