[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,89 +2,231 @@
 id: wiki-2026-0508-nosql-databases-in-ai
 title: NoSQL Databases in AI
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [SYS-NOSQL-001]
+aliases: [NoSQL for AI, Vector DB, Document Store AI]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [infrastructure, database, nosql, mongodb, cassandra, Big-Data, ai]
+confidence_score: 0.9
+verification_status: applied
+tags: [nosql, ai, vector-db, mongodb, redis, rag]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: python
+  framework: mongodb-redis-pinecone
 ---

-# NoSQL Databases in AI (AI에서의 NoSQL 데이터베이스)
+# NoSQL Databases in AI

-## 📌 One-Line Insight (The Karpathy Summary)
-> "Break the rigid table format and accept data as it flows, storing the vast unstructured knowledge AI needs without preprocessing." A data storage technology that escapes the strict schema constraints of relational databases (RDBMS) and uses a flexible structure to process and scale large volumes of data at high speed.
+## One-Line Summary
+> **"AI workload = embedding + metadata + cache; each NoSQL family fits one of those layers."** The standard 2026 RAG / agent stack: vector DB (Pinecone/Qdrant/pgvector) + document store (MongoDB) + KV cache (Redis). Schema flexibility and horizontal scaling make NoSQL a natural fit for the LLM era.

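-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "[[Schema|Schema]]-less [[Scalability|Scalability]] and Document-oriented [[Storage|Storage]]": a pattern that stores data with no fixed shape (web logs, social media text, sensor data) as JSON-like documents or manages it as key-value pairs, so that it adapts flexibly to data model changes and scales out horizontally with ease.
- **Main types:**
-    - **Document-oriented (MongoDB):** Flexible storage of JSON-shaped data. Suited to agent conversation logs and configuration management.
-    - **Key-Value (Redis):** Very fast in-memory store. Used for real-time feedback and caching.
-    - **Column-family (Cassandra, HBase):** Large-scale distributed data processing. Best suited to log analysis and time-series data.
-    - **Graph Database (Neo4j):** Represents complex relationships between entities (Knowledge Graph).
- **Significance:** In modern AI training environments, where unstructured data far outweighs structured data, it is the infrastructure foundation that keeps data pipelines flexible and relieves bottlenecks.
+## Core Concepts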

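-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** NoSQL was once criticized for weaker data accuracy and transactional safety (ACID), but with the rise of NewSQL and feature improvements in individual DB engines it has evolved toward providing strong consistency and scalability together.
- **Policy change:** The Antigravity project gives priority to a MongoDB- and Redis-based NoSQL architecture for temporary storage of the raw knowledge (Raw Data) that agents collect and for recording reasoning logs.
+### NoSQL families and their AI roles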
+- **Vector**: embedding similarity (RAG, recommendation). Pinecone, Qdrant, Weaviate, Milvus.
+- **Document**: chat history, agent state, structured output. MongoDB, CouchDB.
+- **KV / Cache**: prompt cache, semantic cache, session. Redis, DragonflyDB.
+- **Graph**: knowledge graph, entity link. Neo4j, ArangoDB.
+- **Wide-column**: time-series telemetry, traces. Cassandra, ScyllaDB.

-## 🔗 Knowledge Links (Graph)
- [[Indexing-Strategies|Indexing-Strategies]], [[Knowledge-Graph-Foundations|Knowledge-Graph-Foundations]], [[High-Availability-Systems|High-Availability-Systems]],[[_system|system]]-Design-for-AI-Scale
- **Raw Source:** 10_Wiki/Topics/AI/NoSQL-Databases-in-AI.md
+### Access patterns
+- **Embed + ANN search**: HNSW / IVF index, top-k cosine.
+- **Metadata filter + vector**: hybrid search.
+- **TTL cache**: cache prompt → response pairs for 24h.
+- **Append-only chat log**: doc store + per-user shard.
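
The "metadata filter + vector" pattern above can be sketched without any database. This is a minimal brute-force illustration, assuming toy 2-dimensional vectors and a hypothetical `filtered_top_k` helper; a real vector DB replaces the linear scan with an ANN index such as HNSW or IVF.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_top_k(query, records, k=5, **where):
    """Keep records whose metadata matches `where`, then rank by cosine similarity."""
    candidates = [r for r in records
                  if all(r["metadata"].get(f) == v for f, v in where.items())]
    ranked = sorted(candidates, key=lambda r: cosine(query, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]

# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"source": "intro.md"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"source": "faq.md"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"source": "intro.md"}},
]
top = filtered_top_k([1.0, 0.0], records, k=2, source="intro.md")
```

Filtering before ranking means record `b` never competes, even though it is closer to the query than `c`.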

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Applications
+1. RAG: vector + document hybrid.
+2. Agent memory: document (short) + vector (long-term).
+3. Personalization: KV (recent) + graph (relations).

-**When to use this knowledge:**
- *(TODO)*
+## 💻 Patterns

-**When not to use it:**
- *(TODO)*
+### Pinecone (managed vector)
+```python
+import os
+from pinecone import Pinecone, ServerlessSpec
+# embed() is assumed to be a user-supplied function returning a 1536-dim vector

-## 🧪 Validation Status (Validation)
+pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
+pc.create_index(
+    name="docs",
+    dimension=1536,
+    metric="cosine",
+    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
+)
+idx = pc.Index("docs")

- **Status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 auto-normalization. Body needs verification.)*
+idx.upsert(vectors=[
+    {"id": "doc1", "values": embed("Hello world"),
+     "metadata": {"source": "intro.md", "section": "overview"}},
+])

-## 🧬 Duplicate Check
-
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (auto-normalization)
- **Reason:** Phase 1 normalization: fill in the old template and missing fields.
-
-## 🕓 Changelog
-
-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+res = idx.query(
+    vector=embed("greeting example"),
+    top_k=5,
+    filter={"source": {"$eq": "intro.md"}},
+    include_metadata=True,
+)
 ```

-## 🤔 Decision Criteria
+### Qdrant (self-host)
+```python
+from qdrant_client import QdrantClient
+from qdrant_client.models import VectorParams, Distance, PointStruct

-**When to choose option A:**
- *(TODO)*
+client = QdrantClient("localhost", port=6333)
+client.recreate_collection(
+    "docs",
+    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
+)
+client.upsert("docs", points=[
+    PointStruct(id=1, vector=embed(text), payload={"text": text, "source": "x.md"}),
+])

-**When to choose option B:**
- *(TODO)*
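+from qdrant_client.models import Filter, FieldCondition, MatchValue
+# query_points supersedes the deprecated search() in recent qdrant-client releases
+hits = client.query_points(
+    "docs", query=embed(q), limit=5,
+    query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="x.md"))]),
+).points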
+```

-**Default:**
-> *(TODO)*
+### pgvector (Postgres + vector)
+```sql
+CREATE EXTENSION vector;
+CREATE TABLE docs (
+  id BIGSERIAL PRIMARY KEY,
+  text TEXT,
+  source TEXT,
+  embedding vector(1536)
+);
+CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

-## ❌ Anti-Patterns
+-- hybrid query
+SELECT id, text FROM docs
+WHERE source = 'intro.md'
+ORDER BY embedding <=> $1::vector
+LIMIT 5;
+```

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+### MongoDB Atlas Vector Search
+```javascript
+import { MongoClient } from "mongodb";
+const col = new MongoClient(uri).db("ai").collection("docs");
+
+await col.aggregate([
+  { $vectorSearch: {
+      index: "doc_embedding",
+      path: "embedding",
+      queryVector: await embed(q),
+      numCandidates: 100,
+      limit: 5,
+      filter: { source: "intro.md" },
+  }},
+  { $project: { text: 1, score: { $meta: "vectorSearchScore" } } },
+]).toArray();
+```
+
+### Redis exact-match prompt cache
+```python
+import anthropic, hashlib, json, redis
+r = redis.Redis()
+client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
+
+def cached_completion(prompt: str, ttl: int = 3600) -> dict:
+    # Exact-match cache: the key is a hash of the verbatim prompt.
+    key = "ai:" + hashlib.sha256(prompt.encode()).hexdigest()
+    if v := r.get(key):
+        return json.loads(v)
+    out = client.messages.create(
+        model="claude-opus-4-7",
+        messages=[{"role":"user","content":prompt}],
+        max_tokens=1024,
+    )
+    result = {"text": out.content[0].text}
+    r.setex(key, ttl, json.dumps(result))  # expires after ttl seconds
+    return result
+```
+
+### Redis Vector (semantic cache, near-match)
+```python
+# RediSearch + HNSW
+from redis.commands.search.field import VectorField, TextField
+from redis.commands.search.indexDefinition import IndexDefinition
+
+r.ft("ai_cache").create_index(
+    [TextField("prompt"),
+     VectorField("v", "HNSW", {"TYPE":"FLOAT32","DIM":1536,"DISTANCE_METRIC":"COSINE"})],
+    definition=IndexDefinition(prefix=["cache:"]),
+)
+```
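
The index above only declares the schema; the near-match lookup itself can be illustrated in memory. A minimal sketch, assuming a hypothetical `SemanticCache` class, a toy letter-count embedding, and a linear scan in place of the HNSW index (no TTL handling):

```python
import math

class SemanticCache:
    """In-memory stand-in for a vector-indexed cache (linear scan, no TTL)."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # caller-supplied embedding function (assumed)
        self.threshold = threshold  # minimum cosine similarity that counts as a hit
        self.entries = []           # list of (vector, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, prompt):
        """Return the cached response of the most similar prompt, or None."""
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: self._cosine(v, e[0]), default=None)
        if best is not None and self._cosine(v, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

# Toy embedding: letter counts, so case and punctuation changes do not move the vector.
def toy_embed(text):
    text = text.lower()
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("What is a vector database?", "A store for embeddings.")
near = cache.get("what is a vector database")  # differs in case and punctuation
miss = cache.get("zzz qqq xxx")                # shares no letters with the cached prompt
```

The threshold is the main tuning knob: too low and unrelated prompts share answers, too high and the cache degrades to exact matching.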
+
+### Agent state (MongoDB)
+```python
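+from datetime import datetime, timezone
+from pymongo import MongoClient
+c = MongoClient(uri).agents.runs  # uri: your MongoDB connection string

+run_id = c.insert_one({
+    "user_id": "u1",
+    "messages": [{"role":"system","content":"..."}],
+    "tool_calls": [],
+    "status": "running",
+    "created_at": datetime.now(timezone.utc),  # timezone-aware; utcnow() is deprecated
+}).inserted_id

+# new_msg: the next {"role": ..., "content": ...} message to append
+c.update_one({"_id": run_id}, {"$push": {"messages": new_msg}})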
+```
+
+### Knowledge graph (Neo4j)
+```cypher
+MERGE (a:Person {name: 'Alice'})
+MERGE (c:Company {name: 'Acme'})
+MERGE (a)-[:WORKS_AT {since: 2020}]->(c)
+```
+
+```python
+# answer "who works at the same company as Alice?" (session: an open Neo4j driver session)
+session.run("""
+MATCH (a:Person {name: $n})-[:WORKS_AT]->(c)<-[:WORKS_AT]-(p)
+RETURN p.name LIMIT 10
+""", n="Alice")
+```
+
+### Hybrid retrieval (BM25 + vector)
+```python
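+# es: an Elasticsearch client; idx: the Pinecone index from above
+keyword_ids = [h["_id"] for h in es.search(index="docs", query={"match": {"text": q}})["hits"]["hits"]]
+vector_ids = [m.id for m in idx.query(vector=embed(q), top_k=10).matches]
+# reciprocal_rank_fusion is a user-defined helper, not a library call
+fused = reciprocal_rank_fusion([keyword_ids, vector_ids], k=60)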
+```
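
The snippet above calls `reciprocal_rank_fusion` without defining it. One possible definition, sketched under the assumption that each input is a list of document ids ordered best-first, scores every id as the sum of `1 / (k + rank)` over the lists it appears in:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each id scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_ids = ["d1", "d2", "d3"]  # e.g. BM25 order
vector_ids = ["d3", "d1", "d4"]   # e.g. ANN order
fused = reciprocal_rank_fusion([keyword_ids, vector_ids], k=60)
# fused == ["d1", "d3", "d2", "d4"]
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale.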
+
+## Decision Criteria
+| Situation | Approach |
+|---|---|
+| Small RAG (<1M docs) | pgvector (single Postgres) |
+| Medium RAG (1M-100M) | Qdrant / Weaviate self-host |
+| Large / managed | Pinecone / MongoDB Atlas |
+| Agent state + chat | MongoDB document store |
+| Prompt cache | Redis (exact + semantic) |
+| Entity reasoning | Neo4j |
+
+**Default**: pgvector (start) → Qdrant (scale) + Redis cache + MongoDB for agent state.
+
+## 🔗 Graph
+- Parents: [[Database-Systems]] · [[AI-Infrastructure]]
+- Variants: [[Vector-Database]] · [[Document-Database]] · [[Graph-Database]]
+- Applications: [[RAG]] · [[Semantic-Search]] · [[Agent-Memory]]
+- Adjacent: [[Embeddings]] · [[Hybrid-Search]] · [[pgvector]]
+
+## 🤖 LLM Usage
+**When to use**: proposing schema designs, generating query-construction boilerplate, tuning the hybrid-search blend.
+**When not to use**: capacity / cost projections and ANN index parameter (M, efConstruction) tuning; measure these on the real workload.
+
+## ❌ Anti-Patterns
+- **Vector DB only**: ignoring metadata filters yields irrelevant top-k results.
+- **No re-ranker**: feeding the top-50 vector hits straight into the LLM adds noise. Use Cohere Rerank or a cross-encoder.
+- **Cache prompt verbatim**: a 1-character difference causes a cache miss. Use a semantic cache.
+- **Mixing OLTP + vector**: serving both from a single Postgres causes index bloat. Separate them.
+
+## 🧪 Verification / Duplicates
+- Verified (Pinecone docs, Qdrant docs, MongoDB Atlas Vector Search 2025, "Designing Data-Intensive Applications", LangChain RAG cookbook).
+- Trust level: A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — NoSQL families mapped to AI/RAG/agent workloads |