---
id: wiki-2026-0508-데이터-파싱-data-parsing
title: 데이터 파싱(Data Parsing)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Data Parsing, 파싱, Parsing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parsing, data, architecture]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pydantic
---
# Data Parsing
## One-line summary
> **"Raw bytes → structured value."** Parsing converts an input string or byte sequence into a typed tree according to a grammar. Modern stacks are dominated by schema-first tools (Pydantic v2, Zod, Protobuf) and zero-copy parsers (Arrow, simdjson).
## Core concepts
### Parsing vs Validation
- **Parsing**: bytes → typed structure (lossless, total).
- **Validation**: typed structure → invariant check (predicate).
- **Parse, don't validate** (Alexis King, 2019): encode invariants in the type system so that downstream code never has to re-check or unwrap them.
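
The difference between the two can be sketched with the standard library alone. `Port`, `validate_port`, `parse_port`, and `connect` are hypothetical names chosen for illustration; the point is only that the parser returns a type that carries the invariant, while the validator returns a boolean and leaves the caller holding a raw string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A TCP port already known to lie in 1..65535 (hypothetical type)."""
    value: int


def validate_port(raw: str) -> bool:
    # Validation: answers yes/no, but the caller still holds an untyped str.
    return raw.isascii() and raw.isdigit() and 1 <= int(raw) <= 65535


def parse_port(raw: str) -> Port:
    # Parsing: returns a Port or raises, so the check never has to be repeated.
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"not a decimal integer: {raw!r}")
    number = int(raw)
    if not 1 <= number <= 65535:
        raise ValueError(f"port out of range: {number}")
    return Port(number)


def connect(port: Port) -> str:
    # Downstream code accepts Port, so it needs no defensive re-check.
    return f"connecting to :{port.value}"
```

A caller would write `connect(parse_port("8080"))`; passing an unchecked string to `connect` is then a type error that a checker such as mypy can flag, rather than a runtime surprise.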
### Parser types
- **Recursive descent**: hand-written, LL(k); readable.
- **Parser combinator**: functional composition (parsec, nom).
- **PEG**: ordered choice, so the grammar has no ambiguity.
- **LR/LALR**: bottom-up; yacc/bison.
- **Schema-driven**: Pydantic, Zod; the parser is derived from a declarative type.
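
As a minimal sketch of the first entry in the list, here is a hand-written recursive descent evaluator for integer arithmetic. The grammar, the `tokenize` helper, and the `Parser` and `evaluate` names are assumptions made for this example; each grammar rule maps to one method, which is what makes the style readable.

```python
import re

# One token is a run of digits or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(text: str) -> list[str]:
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Illustrative grammar:
        expr   := term (("+" | "-") term)*
        term   := factor ("*" factor)*
        factor := NUMBER | "(" expr ")"
    """

    def __init__(self, tokens: list[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> str:
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expr(self) -> int:
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self) -> int:
        value = self.factor()
        while self.peek() == "*":
            self.take()
            value *= self.factor()
        return value

    def factor(self) -> int:
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text: str) -> int:
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:  # leftover tokens mean the input was not a full expr
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

Operator precedence falls out of the call structure: `expr` calls `term`, which calls `factor`, so `*` binds tighter than `+` without any precedence table.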
### Applications
1. Loading JSON/YAML/TOML configuration.
2. Validating API request bodies (FastAPI + Pydantic).
3. Log/CSV ingest pipelines.
4. Compiling DSLs (SQL, GraphQL).
5. Decoding protocols (HTTP, Protobuf).
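
To make the ingest case concrete, the following sketch parses CSV rows into typed records and collects rejected rows instead of aborting. The `Event` fields, the `parse_events` helper, and the sample data are all invented for illustration; a real pipeline would likely use a schema library rather than hand-written conversions.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    level: str
    latency_ms: int


def parse_events(stream):
    """Return (events, rejects); rejects are (line_number, reason) pairs."""
    good, bad = [], []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(csv.DictReader(stream), start=2):
        try:
            good.append(Event(
                timestamp=datetime.fromisoformat(row["timestamp"]),
                level=row["level"].upper(),
                latency_ms=int(row["latency_ms"]),
            ))
        except (AttributeError, KeyError, TypeError, ValueError) as exc:
            bad.append((line_no, str(exc)))
    return good, bad


# Made-up sample input: one valid row and two malformed rows.
SAMPLE = (
    "timestamp,level,latency_ms\n"
    "2026-05-08T12:00:00,info,12\n"
    "not-a-date,warn,7\n"
    "2026-05-08T12:00:02,error,abc\n"
)

events, rejects = parse_events(io.StringIO(SAMPLE))
```

Keeping rejects separate means everything downstream only ever sees `Event` values, while the bad rows can be logged or replayed with their line numbers.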
## 💻 Patterns
### Pydantic v2 (schema-first, Rust core)
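```python
from datetime import datetime

# EmailStr requires the optional email-validator package.
from pydantic import BaseModel, EmailStr, Field, ValidationError


class User(BaseModel):
    id: int
    email: EmailStr
    created_at: datetime
    tags: list[str] = Field(default_factory=list, max_length=10)  # at most 10 tags


# Illustrative payload; in practice this arrives from a request body or a file.
raw_bytes = b'{"id": 1, "email": "a@example.com", "created_at": "2026-05-08T00:00:00Z"}'

try:
    user = User.model_validate_json(raw_bytes)  # bytes → typed model
except ValidationError as e:
    print(e.errors())  # structured errors, each with a field path
```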
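### Grammar-based parser (Python, Lark)
```python
from lark import Lark, Transformer

# "?start" inlines the start rule, so parse() returns the value computed by expr
# instead of a Tree wrapping it.
grammar = r"""
?start: expr
expr: NUMBER ("+" NUMBER)*
%import common.NUMBER
%ignore " "
"""


class Eval(Transformer):
    def expr(self, items):
        # The anonymous "+" tokens are filtered out; items holds only NUMBER tokens.
        return sum(int(t) for t in items)


parser = Lark(grammar, parser="lalr", transformer=Eval())
print(parser.parse("1 + 2 + 3"))  # 6
```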
### Streaming JSON (ijson, large files)
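```python
import ijson

# Assumes huge.json holds one top-level JSON array; the "item" prefix yields its
# elements one at a time. For JSON Lines input, ijson's multiple_values=True
# option with an empty prefix is the usual form.
with open("huge.json", "rb") as f:
    for record in ijson.items(f, "item"):
        process(record)  # placeholder handler; only one record is held at a time
```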
### simdjson (zero-copy, SIMD)
```python
import simdjson
parser = simdjson.Parser()
doc = parser.parse(raw_bytes)  # SIMD-accelerated; throughput is hardware-dependent (the simdjson paper reports GB/s)
name = doc["user"]["name"] # lazy access
```
### Zod (TypeScript)
```typescript
import { z } from "zod";
const User = z.object({
  id: z.number().int().positive(),
  email: z.string().email(),
  tags: z.array(z.string()).max(10).default([]),
});
const user = User.parse(JSON.parse(raw)); // throws on invalid
type User = z.infer<typeof User>; // static type
```
### Protobuf (binary, schema)
```python
import user_pb2  # module generated by protoc from the .proto schema

user = user_pb2.User()
user.ParseFromString(raw_bytes)  # decodes the binary wire format into the message in place
```
### Apache Arrow (columnar, zero-copy)
```python
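import pyarrow.csv as pv

table = pv.read_csv("data.csv")  # multi-threaded read into an Arrow Table
# to_pandas(zero_copy_only=True) raises when a column needs a copy
# (for example strings, nulls, or multiple chunks), so the default is used here.
df = table.to_pandas()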
```
## Decision criteria
| Situation | Approach |
|---|---|
| API boundary, dynamic | Pydantic v2 / Zod |
| Large JSON throughput | simdjson |
| Streaming large file | ijson / SAX |
| Inter-service binary | Protobuf / Cap'n Proto |
| Custom DSL | Lark / nom / chumsky |
| Tabular bulk | Arrow / Polars |
**Defaults**: Pydantic v2 (Python), Zod (TS), Serde (Rust).
## 🔗 Graph
- Parent: [[Software Architecture]] · [[TypeScript 타입 시스템 (TypeScript Type System)|Type Systems]]
- Variants: [[Protobuf]]
- Adjacent: [[Validation]] · [[Schema Design]]
## 🤖 LLM usage
**When to use**: structured output enforcement (JSON schema → Pydantic) and parsing tool-call arguments.
**When not to use**: free-form natural language. Parsing is the wrong tool there; use embeddings or an LLM.
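
Tool-call argument parsing can be sketched without any framework. The `SearchArgs` schema and `parse_search_args` function are hypothetical; in a Pydantic-based stack the same role would be played by a model and `model_validate_json`. The idea is that model output is treated as untrusted input and turned into a typed value at the boundary, or rejected with a message that can be fed back to the model.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArgs:
    """Arguments for a hypothetical `search` tool."""
    query: str
    limit: int = 10


def parse_search_args(raw: str) -> SearchArgs:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")

    unknown = set(data) - {"query", "limit"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    query = data.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")

    limit = data.get("limit", 10)
    # bool is a subclass of int in Python, so it is excluded explicitly;
    # the string "5" is also rejected rather than silently coerced.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 100:
        raise ValueError("limit must be an integer in 1..100")

    return SearchArgs(query=query.strip(), limit=limit)
```

For example, `parse_search_args('{"query": "cats", "limit": 5}')` yields `SearchArgs(query="cats", limit=5)`, while a payload with `"limit": "5"` raises `ValueError`.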
## ❌ Anti-patterns
- **Parsing HTML/JSON with regex**: these grammars are not regular; use a proper parser.
- **Stringly-typed data**: `dict[str, Any]` propagates through the code; parse into a typed model at the boundary instead.
- **Validate-only**: keeping the raw dict and adding ad-hoc checks; parse and commit to a structure instead.
- **Eager full-load**: calling `json.load` on a multi-GB JSON file; use streaming instead.
- **Silent coercion**: implicit `"1"` → `1`; use strict mode (Pydantic `strict=True`).
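
As one way to avoid the eager full-load anti-pattern when the input is JSON Lines (one JSON value per line), a generator can decode records lazily using only the standard library. `iter_jsonl` is a hypothetical helper name; it assumes each record fits on a single line.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[object]:
    """Yield one decoded value per non-empty line, without loading the whole file."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the line number so a bad record can be located in a large file.
            raise ValueError(f"line {line_no}: {exc.msg}") from exc


# Usage with an in-memory stream; a real caller would pass an open file object.
records = list(iter_jsonl(io.StringIO('{"a": 1}\n\n{"a": 2}\n')))
```

Because the function is a generator, memory use is bounded by the largest single line rather than by the file size, which is the property the streaming recommendation is after.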
## 🧪 Verification / duplicates
- Verified (Pydantic v2 docs, Alexis King "Parse, don't validate" 2019, simdjson paper).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — full data parsing entry, 7 patterns + decision matrix |