id: wiki-2026-0508-다수-팀-협업-환경
title: 다수 팀 협업 환경 (Multi-Team Collaboration Environment)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Multi-team Collaboration, Multi-Team AI Workflow, Cross-team AI Coordination
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: collaboration, multi-team, ai-workflow, ml-platform
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: mlflow-langfuse-argo

Multi-Team Collaboration Environment

In one line

"A multi-team AI environment is a triad of shared model registry + isolated namespaces + cross-team observability." The single-ML-team era of 2024 is over: by 2026, each product team in an enterprise owns its own LLM or agent. The pattern layers team-isolated workspaces (Argo namespace, Langfuse project) on top of shared infrastructure (model registry, eval harness, prompt store).

Key points

Conflict areas

  • Model versioning: team A's fine-tuned Llama-3.3 70B is also used by team B, so an update by team A can cause a regression for team B.
  • Prompt drift: the same base prompt is forked by each team until 7 variants coexist.
  • Eval inconsistency: each team uses a different eval set, so results cannot be compared.
  • GPU contention: without fair-share scheduling on the H200 cluster, teams suffer from noisy neighbors.
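
The prompt-drift item above can be made measurable. The following is a minimal sketch, assuming each team's current prompt text can be exported as a string; the helper names (`normalize`, `find_variants`) and the sample prompts are hypothetical and not part of any prompt store API. It groups teams by a hash of the normalized prompt so that cosmetic edits do not count as forks.

```python
import hashlib
from collections import defaultdict


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so cosmetic edits do not count as forks."""
    return " ".join(prompt.lower().split())


def find_variants(prompts_by_team: dict) -> dict:
    """Group team names by the hash of their normalized prompt text."""
    groups = defaultdict(list)
    for team, text in prompts_by_team.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()[:12]
        groups[digest].append(team)
    return dict(groups)


# Hypothetical sample data: two teams share a prompt, one has forked it.
prompts = {
    "team-search": "You classify user search intent.",
    "team-support": "You classify  user search intent.",  # whitespace-only difference
    "team-rec": "You classify user search intent. Answer in JSON.",
}
variants = find_variants(prompts)
print(f"{len(variants)} distinct variants across {len(prompts)} teams")
```

A count that keeps growing between reviews is a signal to move the prompt into the shared prompt store rather than letting each team keep a private copy.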

Governance layer

  • Model Registry (MLflow / Weights & Biases): canonical model card, semver tag, deprecation policy.
  • Prompt Store (Langfuse / Humanloop): versioned prompts, A/B winner promotion.
  • Eval Harness (Inspect AI / promptfoo): shared regression suite, triggered automatically on every model bump.
  • Observability (Langfuse + OpenTelemetry): a separate project per team, plus a cross-team dashboard for leadership.

Applications

  1. The platform team provides the base infrastructure; product teams build the application layer.
  2. AI Center of Excellence: quarterly review of cross-team incidents.
  3. RACI matrix: names the model owner, prompt owner, and eval owner explicitly.
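
The RACI item can be kept as version-controlled data and checked automatically. This is a rough sketch under stated assumptions: the `Ownership` record, the role field names, and the `support/triage` asset are hypothetical illustrations, not an existing schema. It reports assets that lack one of the three owners listed above.

```python
from dataclasses import dataclass
from typing import Optional

ROLES = ("model_owner", "prompt_owner", "eval_owner")


@dataclass(frozen=True)
class Ownership:
    """One row of the ownership matrix for a shared asset."""
    asset: str
    model_owner: Optional[str] = None
    prompt_owner: Optional[str] = None
    eval_owner: Optional[str] = None


def missing_roles(entry: Ownership) -> list:
    """Return the role names that have no owner assigned."""
    return [role for role in ROLES if not getattr(entry, role)]


# Hypothetical matrix: the second asset has no eval owner yet.
registry = [
    Ownership("search/intent-classifier", "platform-team", "team-search", "team-search"),
    Ownership("support/triage", model_owner="platform-team", prompt_owner="team-support"),
]
gaps = {e.asset: missing_roles(e) for e in registry if missing_roles(e)}
print(gaps)
```

Running a check like this in CI would make an unowned eval set visible before an incident does.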

💻 Patterns

MLflow shared registry — team isolation via aliases

from mlflow import MlflowClient
from mlflow.exceptions import MlflowException

MODEL_NAME = "llama-3.3-70b-instruct-finetuned"
client = MlflowClient(tracking_uri="https://mlflow.corp/")

# create_model_version needs the registered model to exist already
try:
    client.create_registered_model(MODEL_NAME)
except MlflowException as exc:
    if exc.error_code != "RESOURCE_ALREADY_EXISTS":
        raise

# Platform team registers the canonical model version
mv = client.create_model_version(
    name=MODEL_NAME,
    source="s3://corp-models/llama33-v4/",
    description="Q2 2026 finetune; eval set v3.2",
)

# One alias per consuming team, all starting at the same version
for alias in ("prod-team-search", "prod-team-support"):
    client.set_registered_model_alias(name=MODEL_NAME, alias=alias, version=mv.version)
# Each team later moves only its own alias → independent rollout cadence

Langfuse multi-project prompt versioning

import os

from langfuse import Langfuse

# Credentials come from the environment; each key pair is scoped to one Langfuse project
lf = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://langfuse.corp",
)

# Team A creates a prompt
lf.create_prompt(
    name="search/intent-classifier",
    prompt="You classify user search intent. Categories: {{categories}}.",
    labels=["production"],  # this version is served for label="production"
    config={"model": "claude-opus-4-7", "temperature": 0.0},
)

# Team B fetches the same prompt by name (linked, not copied);
# this needs credentials for the project that holds the prompt
prompt = lf.get_prompt("search/intent-classifier", label="production")
compiled = prompt.compile(categories="navigational, informational, transactional")

Argo Workflows — team-namespaced GPU jobs with priority class

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: finetune-team-search-
  namespace: ai-team-search
spec:
  entrypoint: train
  podGC: { strategy: OnWorkflowSuccess }
  templates:
    - name: train
      priorityClassName: team-search-prod  # scheduling priority and preemption; pair with per-namespace quotas for fair share
      container:
        image: corp-registry/finetune:cuda12.6
        resources:
          limits: { nvidia.com/gpu: 8, memory: 1Ti }
        env:
          - name: MLFLOW_TRACKING_URI
            value: https://mlflow.corp
          - name: WANDB_PROJECT
            value: team-search

Cross-team eval harness with Inspect AI

from inspect_ai import Task, eval, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa

@task
def shared_safety_suite():
    return Task(
        dataset=json_dataset("s3://corp-evals/safety-v3.2.jsonl"),
        # Inspect model names take the form "provider/model"
        scorer=model_graded_qa(model="anthropic/claude-opus-4-7"),
    )

# Run across all team-owned models nightly.
# These identifiers are placeholders: map each one to a "provider/model"
# name that Inspect can resolve for your serving endpoint.
models = [
    "team-search/llama-3.3-70b@prod",
    "team-support/llama-3.3-70b@prod",
    "team-rec/llama-3.3-70b@prod",
]
# eval() is the synchronous entry point; inside a running event loop,
# use `await eval_async(...)` instead
results = eval(shared_safety_suite(), model=models)
# Publish to shared dashboard; alert if any team regresses >2% vs last week
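
The ">2% vs last week" alert mentioned in the last comment can be sketched independently of the eval framework. This example assumes each run has already been reduced to one pass rate per team (the score dictionaries below are invented sample data) and reads "2%" as an absolute drop of 0.02 in pass rate; a relative threshold would need a different comparison.

```python
def regressions(last_week: dict, this_week: dict, threshold: float = 0.02) -> dict:
    """Return teams whose score dropped by more than `threshold` (absolute)."""
    flagged = {}
    for team, previous in last_week.items():
        current = this_week.get(team)
        if current is None:
            continue  # no result this week; handle missing runs separately
        drop = previous - current
        if drop > threshold:
            flagged[team] = round(drop, 4)
    return flagged


# Hypothetical pass rates from two consecutive weekly runs.
last_week = {"team-search": 0.970, "team-support": 0.955, "team-rec": 0.960}
this_week = {"team-search": 0.972, "team-support": 0.920, "team-rec": 0.950}
alerts = regressions(last_week, this_week)
print(alerts)
```

Here only team-support crosses the threshold; team-rec's smaller drop stays below it and team-search improved.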

OPA policy gate for model promotion

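package modelregistry.promotion

import rego.v1  # needs a recent OPA release; syntax is the default from OPA 1.0

# Block promotion to prod when the safety pass rate is below the threshold
deny contains msg if {
    input.action == "promote"
    input.target_alias == "prod"
    input.eval_results.safety_pass_rate < 0.95
    msg := sprintf("safety_pass_rate=%.3f below 0.95", [input.eval_results.safety_pass_rate])
}

# Block promotion when no eval result is attached at all
deny contains msg if {
    input.action == "promote"
    input.target_alias == "prod"
    not input.eval_results.safety_pass_rate
    msg := "missing eval_results.safety_pass_rate"
}

deny contains msg if {
    input.action == "promote"
    not input.has_owner_approval
    msg := "missing model_owner approval"
}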

Cross-team incident postmortem template (YAML, version-controlled)

incident_id: 2026-Q2-013
date: 2026-05-08
owning_team: team-search
affected_teams: [team-support, team-rec]
root_cause: |
  team-search rolled new finetune to alias=prod without notifying
  downstream consumers; intent-classifier prompt assumed older format.
detection: Langfuse anomaly (latency p95 spike) — 14 min
resolution: rolled back model alias; published deprecation policy
action_items:
  - owner: platform-team
    due: 2026-05-22
    task: enforce 7-day deprecation notice via OPA
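
The date logic behind the 7-day deprecation notice in the action item is simple enough to sketch on its own. The function below is an illustration only, not the OPA rule itself: `notice_satisfied` and its parameters are hypothetical names, and the result could be passed to the policy as an input field.

```python
from datetime import date, timedelta
from typing import Optional

NOTICE_PERIOD = timedelta(days=7)


def notice_satisfied(
    announced_on: Optional[date],
    rollout_on: date,
    period: timedelta = NOTICE_PERIOD,
) -> bool:
    """True when the change was announced at least `period` before rollout."""
    if announced_on is None:
        return False  # no announcement at all never satisfies the policy
    return rollout_on - announced_on >= period


rollout = date(2026, 5, 8)
print(notice_satisfied(date(2026, 5, 1), rollout))  # announced 7 days ahead
print(notice_satisfied(date(2026, 5, 5), rollout))  # announced 3 days ahead
print(notice_satisfied(None, rollout))              # never announced
```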

Decision criteria

| Situation | Approach |
|---|---|
| 2-3 teams, single product | shared monorepo + single MLflow project |
| 5-15 teams, mixed maturity | platform team + per-team namespace |
| 15+ teams, enterprise | full governance layer + AI CoE + OPA gates |
| Regulated (finance/health) | add audit log + immutable model lineage |

Default: MLflow registry + Langfuse prompt store + Argo namespace per team + shared Inspect AI eval suite.
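
The decision table can also be expressed as a small lookup, sketched here under assumptions the table leaves open: 4 teams are grouped with the 2-3 row, exactly 15 teams fall in the 5-15 row, and fewer than 2 teams get no governance layer at all. The function name `recommend_approach` is hypothetical.

```python
def recommend_approach(num_teams: int, regulated: bool = False) -> list:
    """Map team count and regulatory status to the approaches in the table."""
    if num_teams < 2:
        # Assumption: a single team does not need the governance layer.
        approaches = ["no governance layer needed"]
    elif num_teams <= 4:
        # Assumption: 4 teams are treated like the 2-3 row.
        approaches = ["shared monorepo + single MLflow project"]
    elif num_teams <= 15:
        # Assumption: exactly 15 teams fall in the 5-15 row.
        approaches = ["platform team + per-team namespace"]
    else:
        approaches = ["full governance layer + AI CoE + OPA gates"]
    if regulated:
        # The regulated row adds to the size-based choice rather than replacing it.
        approaches.append("audit log + immutable model lineage")
    return approaches


print(recommend_approach(3))
print(recommend_approach(20, regulated=True))
```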

🔗 Graph

🤖 LLM usage

When to use: when 5+ teams in an enterprise ship LLM/agent products and no shared eval harness or registry exists yet. When not to use: a single team with a single model, where this is over-engineering; Notion + GitHub is enough.

Anti-patterns

  • Shadow IT model: a team bypasses the platform and serves models with a personal HF token, creating a security and cost blind spot.
  • Prompt copy-paste: prompts shared over Slack drift and have no versioning.
  • Eval set fragmentation: each team keeps its own eval set, so cross-team comparison is impossible.
  • No deprecation policy: silent breaking changes behind alias=prod.
  • Single GPU pool, no priority class: a noisy neighbor starves production inference.

🧪 Verification / duplicates

  • Verified (MLflow 2.x docs, Langfuse v3 multi-project, Argo Workflows fair-share scheduling, Inspect AI 0.3+).
  • Trust level A: production-grade enterprise pattern.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: multi-team AI governance triad + 6 patterns |