---
id: wiki-2026-0508-자연어-아티팩트-natural-language-artifa
title: 자연어 아티팩트 (Natural Language Artifacts)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Natural Language Artifacts, NL Artifacts, 자연어 산출물, Prompt Artifacts]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [llm, prompt-engineering, artifacts, knowledge-management, ai]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: markdown
  framework: llm-agnostic
---

# Natural Language Artifacts

## One-line summary

> **"Every prompt is source code: a first-class artifact that needs versioning, review, and testing."** In 2026 production AI, prompts, system instructions, eval sets, and persona specs are all treated as git-tracked, schema-validated, CI-tested artifacts. The NL artifact lifecycle (author → review → eval → deploy → monitor) should be run with the same rigor as code.

## Core concepts

### Types of NL artifact

- **System prompt**: agent persona, behavior contract.
- **Few-shot examples**: in-context demonstration.
- **Eval set**: input/expected pairs, rubric.
- **Tool spec**: function description, parameter schema.
- **Memory document**: long-term context, RAG-ingested.
- **Output template**: structured response scaffold.

### Lifecycle

- **Author**: structured markdown, frontmatter metadata.
- **Review**: PR review, lint (length, banned terms).
- **Eval**: regression suite with a golden set and LLM-as-judge.
- **Deploy**: versioned, A/B routed.
- **Monitor**: drift, refusal rate, latency.

### Applications

1. Versioned management of agent system prompts.
2. Chunk metadata for RAG knowledge bases.
3. Test artifacts for LLM eval frameworks.

## 💻 Patterns

### Prompt as artifact (frontmatter + body)

````markdown
---
id: prompt-customer-support-v3
version: 3.2.1
model: claude-opus-4-7
owner: support-team
eval_set: ./evals/customer-support.jsonl
last_reviewed: 2026-05-09
---

# Customer Support Agent

You are a senior support agent for {{product}}. Tone: empathetic, concise.

## Rules

- Never make refund decisions over $500; escalate instead.
- Cite KB article IDs in [KB-1234] format.

## Output format

Return JSON:

```json
{ "reply": "...", "escalate": false, "kb_refs": [] }
```
````

### Versioned prompt registry (Python)

```python
from pathlib import Path

import frontmatter  # pip install python-frontmatter


class PromptRegistry:
    def __init__(self, root="prompts/"):
        self.root = Path(root)
        self._cache = {}  # (prompt_id, version) -> (content, metadata)

    def get(self, prompt_id: str, version: str = "latest") -> tuple[str, dict]:
        """Return (prompt body, frontmatter metadata) for the requested version."""
        key = (prompt_id, version)
        if key not in self._cache:
            path = self.root / f"{prompt_id}.md"
            if version != "latest":
                # Pinned versions live under history/ as <id>@<version>.md
                path = self.root / "history" / f"{prompt_id}@{version}.md"
            post = frontmatter.load(str(path))
            self._cache[key] = (post.content, post.metadata)
        return self._cache[key]


text, meta = PromptRegistry().get("customer-support", "3.2.1")
```

### Eval set format

```jsonl
{"id":"refund-large","input":"I want $800 refund","expected":{"escalate":true,"reply_contains":"escalate"}}
{"id":"polite-greet","input":"hi","expected":{"reply_contains":"hello","escalate":false}}
{"id":"kb-cite","input":"how to reset password","expected":{"kb_refs_min":1}}
```

### CI: prompt regression test

```python
# tests/test_prompts.py
import json
from pathlib import Path

import pytest
from anthropic import Anthropic

from prompts import PromptRegistry  # assumes the registry above is importable as `prompts`

client = Anthropic()
registry = PromptRegistry()


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


@pytest.mark.parametrize(
    "case", load_jsonl("evals/customer-support.jsonl"), ids=lambda c: c["id"]
)
def test_customer_support(case):
    system_prompt, _ = registry.get("customer-support")
    resp = client.messages.create(
        model="claude-opus-4-7",
        system=system_prompt,
        max_tokens=512,
        messages=[{"role": "user", "content": case["input"]}],
    )
    # json.loads raises (and fails the test) if the reply is not pure JSON
    out = json.loads(resp.content[0].text)
    expected = case["expected"]
    if "escalate" in expected:
        assert out["escalate"] == expected["escalate"]
    if "reply_contains" in expected:
        assert expected["reply_contains"].lower() in out["reply"].lower()
    if "kb_refs_min" in expected:
        assert len(out["kb_refs"]) >= expected["kb_refs_min"]
```

### LLM-as-judge eval

```python
import json

from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """Rate the response 1-5 on:
- helpfulness
- tone (empathetic)
- rule adherence (no refund > $500)
Return JSON: {"helpfulness":N,"tone":N,"rule":N,"reasoning":"..."}"""


def judge(input_text, response):
    """Score one (input, response) pair and return the parsed rubric JSON."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        system=JUDGE_PROMPT,
        max_tokens=400,
        messages=[{"role": "user",
                   "content": f"INPUT:{input_text}\nRESPONSE:{response}"}],
    )
    return json.loads(resp.content[0].text)
```

### Prompt diff for review

```bash
# Plain git diff works out of the box; --word-diff makes small rule edits easier to spot
git diff --word-diff prompts/customer-support.md
# Surfacing semantic changes (section moved, rule changed, version bumped)
# requires a custom diff driver registered for prompt files in .gitattributes
```

### RAG ingest of memory documents

```python
import frontmatter
from langchain_text_splitters import MarkdownHeaderTextSplitter


def ingest(path, vector_store):
    """Split a memory document by heading and index it with its frontmatter.

    Assumes `vector_store` follows the LangChain VectorStore interface.
    """
    post = frontmatter.load(str(path))
    splitter = MarkdownHeaderTextSplitter([("#", "h1"), ("##", "h2")])
    chunks = splitter.split_text(post.content)
    for c in chunks:
        c.metadata.update(post.metadata)  # propagate id, version, owner
    vector_store.add_documents(chunks)
```

## Decision criteria

| Situation | Approach |
|---|---|
| 1-shot script | inline string OK |
| production agent | versioned registry |
| frequent changes | feature flag + A/B |
| critical correctness | eval set + CI gate |
| domain knowledge | memory doc + RAG |
| structured output | JSON schema in prompt |

**Default**: markdown + frontmatter + git + eval CI.

## 🔗 Graph

- Parent: [[Prompt Engineering]]
- Applications: [[Agent Architecture]]

## 🤖 Using LLMs

**When**: prompt scaffold authoring, eval set bootstrapping, judge rubric design, memory doc summarization.

**When not**: final approval of legal or safety-critical prompts; human review is required.

## ❌ Anti-patterns

- **String literal in code**: untracked, untestable, untraceable.
- **No version pinning**: silent prompt drift breaks production.
- **Eval set as afterthought**: evals written only after a bug appears are incomplete by definition.
- **Mixed concerns**: persona, tool spec, and RAG context all dumped into a single prompt.

## 🧪 Verification / duplicates

- Verified (Anthropic prompt engineering docs 2025, OpenAI evals repo, LangSmith patterns).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: NL artifact lifecycle + eval CI patterns. |
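
The Review step of the lifecycle lists lint checks (length, banned terms) but the patterns do not show one. The following is a minimal sketch of such a check using only the standard library. The length budget, the banned terms, the required frontmatter keys, and the `lint_prompt` name are all hypothetical choices for illustration, not values taken from any tool or from this page's sources.

```python
import re

# Hypothetical policy values; tune these per team.
MAX_CHARS = 4000
BANNED_TERMS = ("guarantee", "always refund")
REQUIRED_KEYS = ("id", "version", "owner", "eval_set")


def lint_prompt(body: str, metadata: dict,
                max_chars: int = MAX_CHARS,
                banned: tuple = BANNED_TERMS,
                required: tuple = REQUIRED_KEYS) -> list:
    """Return human-readable lint errors; an empty list means the prompt passes."""
    errors = []

    # Frontmatter check: every required key must be present.
    for key in required:
        if key not in metadata:
            errors.append(f"missing frontmatter key: {key}")

    # Length check on the prompt body.
    if len(body) > max_chars:
        errors.append(f"too long: {len(body)} > {max_chars} chars")

    # Banned-term check: case-insensitive, whole words only.
    lowered = body.lower()
    for term in banned:
        if re.search(rf"\b{re.escape(term.lower())}\b", lowered):
            errors.append(f"banned term: {term!r}")

    return errors
```

In a PR pipeline this could run over each changed prompt file, passing the body and metadata returned by a frontmatter parser, and fail the build when the returned list is non-empty.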