---
id: wiki-2026-0508-데이터-파싱-data-parsing
title: 데이터 파싱(Data Parsing)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Data Parsing, 파싱, Parsing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parsing, data, architecture]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pydantic
---

# Data Parsing

## In one line

> **"raw bytes → structured value"**. Parsing converts an input string or byte sequence into a typed tree according to a grammar. Modern stacks lean on schema-first tools (Pydantic v2, Zod, Protobuf) and zero-copy or SIMD-accelerated ones (Arrow, simdjson).

## Core

### Parsing vs Validation

- **Parsing**: bytes → typed structure, or an explicit error. On success, the resulting type preserves what was learned from the check.
- **Validation**: typed structure → invariant check (a predicate). The yes/no answer is not carried forward in the type.
- **Parse, don't validate** (Alexis King, 2019): encode invariants in the type system so downstream code does not need to re-check or unwrap.

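
To make the distinction concrete, here is a minimal sketch using only the standard library. It assumes the invariant of interest is "at least one tag"; the names `validate_tags`, `NonEmptyTags`, `parse_tags`, and `primary_tag` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


def validate_tags(tags: list[str]) -> bool:
    """Validation: answers yes/no, but the caller still holds a plain list."""
    return len(tags) > 0


@dataclass(frozen=True)
class NonEmptyTags:
    """By convention, built only through parse_tags()."""
    first: str
    rest: tuple[str, ...]


def parse_tags(tags: list[str]) -> NonEmptyTags:
    """Parsing: check once, then return a type that records the result."""
    if not tags:
        raise ValueError("expected at least one tag")
    return NonEmptyTags(first=tags[0], rest=tuple(tags[1:]))


def primary_tag(tags: NonEmptyTags) -> str:
    # No emptiness check needed here: the type already carries that guarantee.
    return tags.first


parsed = parse_tags(["python", "parsing"])
print(primary_tag(parsed))
```

Python does not enforce the "only through `parse_tags()`" convention, so this is weaker than the same pattern in a language with private constructors, but downstream functions can still state their requirement in the signature.
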
### Parser types

- **Recursive descent**: hand-written, top-down, typically LL(k). Readable.
- **Parser combinator**: parsers built by functional composition (parsec, nom).
- **PEG**: ordered choice, so the grammar is unambiguous by construction.
- **LR/LALR**: bottom-up, table-driven (yacc/bison).
- **Schema-driven**: Pydantic, Zod. The parser is derived from a declarative type.

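
As a rough illustration of the recursive-descent entry, a hand-written parser has one method per grammar rule. The sketch below assumes a tiny hypothetical grammar (`expr: term ('+' term)*`, `term: INT ('*' INT)*`) and evaluates while parsing; `tokenize`, `Parser`, and `evaluate` are illustrative names.

```python
import re
from typing import Optional

TOKEN = re.compile(r"\s*(\d+|[+*])")


def tokenize(text: str) -> list[str]:
    """Split the input into integer and operator tokens, rejecting anything else."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input near offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: list[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> str:
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expr(self) -> int:
        # expr: term ('+' term)*
        value = self.term()
        while self.peek() == "+":
            self.take()
            value += self.term()
        return value

    def term(self) -> int:
        # term: INT ('*' INT)*
        value = self.integer()
        while self.peek() == "*":
            self.take()
            value *= self.integer()
        return value

    def integer(self) -> int:
        token = self.take()
        if not token.isdigit():
            raise SyntaxError(f"expected an integer, got {token!r}")
        return int(token)


def evaluate(text: str) -> int:
    parser = Parser(tokenize(text))
    result = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return result


print(evaluate("1 + 2 * 3"))  # 7
```

Operator precedence falls out of the call structure: `expr` calls `term`, so multiplication binds tighter than addition without any precedence table.
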
### Applications

1. Loading JSON/YAML/TOML config.
2. Validating API request bodies (FastAPI + Pydantic).
3. Log/CSV ingest pipelines.
4. Compiling DSLs (SQL, GraphQL).
5. Decoding protocols (HTTP, Protobuf).

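
As one sketch of the log/CSV ingest case, the following stdlib-only example parses each row into a typed record and collects row-level errors instead of aborting the whole run. The column names (`id`, `email`, `created_at`), the `UserRow` type, and the helper names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class UserRow:
    id: int
    email: str
    created_at: datetime


def parse_row(row: dict) -> UserRow:
    """Convert one raw CSV row (all strings) into a typed record."""
    email = (row["email"] or "").strip()
    if "@" not in email:
        raise ValueError(f"invalid email: {email!r}")
    return UserRow(
        id=int(row["id"]),
        email=email,
        created_at=datetime.fromisoformat(row["created_at"]),
    )


def ingest(source: Iterable[str]):
    """Parse every row; return typed rows plus (line number, message) errors."""
    rows, errors = [], []
    reader = csv.DictReader(source)
    # Line 1 is the header; this assumes one physical line per record.
    for line_no, raw in enumerate(reader, start=2):
        try:
            rows.append(parse_row(raw))
        except (KeyError, TypeError, ValueError) as exc:
            errors.append((line_no, str(exc)))
    return rows, errors


sample = io.StringIO(
    "id,email,created_at\n"
    "1,a@example.com,2026-05-08T10:00:00\n"
    "x,b@example.com,2026-05-08T11:00:00\n"
)
rows, errors = ingest(sample)
print(len(rows), "parsed;", len(errors), "rejected")
```

Keeping the rejected rows with their line numbers makes the pipeline's failures reviewable, which a bare `try/except: continue` would hide.
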
## 💻 Patterns

### Pydantic v2 (schema-first, Rust core)

```python
from datetime import datetime

from pydantic import BaseModel, EmailStr, Field, ValidationError  # EmailStr needs pydantic[email]


class User(BaseModel):
    id: int
    email: EmailStr
    created_at: datetime
    tags: list[str] = Field(default_factory=list, max_length=10)  # at most 10 tags


# Sample payload for illustration; in practice this arrives from a request or file.
raw_bytes = b'{"id": 1, "email": "a@example.com", "created_at": "2026-05-08T10:00:00Z"}'

try:
    user = User.model_validate_json(raw_bytes)  # bytes → typed model
except ValidationError as e:
    print(e.errors())  # structured errors, each with a field path
```

### Grammar-based parser (Python, Lark)

```python
from lark import Lark, Transformer

# "?start" inlines its single child, so parse() returns the value from expr().
grammar = r"""
    ?start: expr
    expr: INT ("+" INT)*
    %import common.INT
    %ignore " "
"""


class Eval(Transformer):
    def expr(self, items):
        # Anonymous "+" tokens are filtered out; only the INT tokens remain.
        return sum(int(t) for t in items)


# With parser="lalr", the transformer runs during parsing (no intermediate tree).
print(Lark(grammar, parser="lalr", transformer=Eval()).parse("1 + 2 + 3"))  # 6
```

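
Lark is driven by a grammar string. For contrast, a parser combinator builds the same kind of parser by composing small functions. The following is a minimal stdlib-only sketch, not how parsec or nom are implemented; every name in it (`Parser`, `regex`, `seq`, `many`, `fmap`) is hypothetical.

```python
import re
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]


def regex(pattern: str) -> Parser:
    """Match a regular expression at the current position, skipping leading spaces."""
    compiled = re.compile(r"\s*(" + pattern + r")")

    def parse(text: str, pos: int):
        match = compiled.match(text, pos)
        return (match.group(1), match.end()) if match else None

    return parse


def seq(*parsers: Parser) -> Parser:
    """Run parsers one after another; fail if any of them fails."""
    def parse(text: str, pos: int):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos

    return parse


def many(parser: Parser) -> Parser:
    """Apply a parser zero or more times and collect the results."""
    def parse(text: str, pos: int):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)

    return parse


def fmap(func: Callable[[Any], Any], parser: Parser) -> Parser:
    """Transform the value produced by a successful parse."""
    def parse(text: str, pos: int):
        result = parser(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return func(value), new_pos

    return parse


# expr: INT ("+" INT)*
integer = fmap(int, regex(r"\d+"))
plus_integer = fmap(lambda pair: pair[1], seq(regex(r"\+"), integer))
expr = fmap(lambda parts: parts[0] + sum(parts[1]), seq(integer, many(plus_integer)))

print(expr("1 + 2 + 3", 0))  # (6, 9)
```

The grammar lives in ordinary values (`integer`, `plus_integer`, `expr`), so rules can be reused, parameterized, and unit-tested like any other function.
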
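### Streaming JSON (ijson, large files)

```python
import ijson

# "item" selects each element of a top-level JSON array.
# For JSON Lines, use prefix "" with multiple_values=True instead.
# process() stands for the caller's own per-record handler.
with open("huge.json", "rb") as f:
    for record in ijson.items(f, "item"):
        process(record)  # memory stays bounded by one record, not the whole file
```
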
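### simdjson (zero-copy, SIMD)

```python
import simdjson  # pysimdjson bindings

raw_bytes = b'{"user": {"name": "Ada"}}'  # sample payload for illustration

parser = simdjson.Parser()
doc = parser.parse(raw_bytes)  # the C++ core reports GB/s-scale throughput; binding overhead varies
name = doc["user"]["name"]  # lazy access: only the values touched become Python objects
```
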
### Zod (TypeScript)

```typescript
import { z } from "zod";

const User = z.object({
  id: z.number().int().positive(),
  email: z.string().email(),
  tags: z.array(z.string()).max(10).default([]),
});

const user = User.parse(JSON.parse(raw)); // throws on invalid
type User = z.infer<typeof User>; // static type
```

### Protobuf (binary, schema)

```python
import user_pb2  # module generated by protoc from the .proto schema

user = user_pb2.User()
user.ParseFromString(raw_bytes)  # decodes the binary wire format into the message
```

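
To show what decoding a binary protocol involves underneath generated code, here is a stdlib-only sketch of base-128 varints, the variable-length integer encoding used by the Protobuf wire format. It is an illustration, not a replacement for the generated classes; `encode_varint` and `decode_varint` are hypothetical helper names, and the sketch handles non-negative integers only.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # least-significant group comes first
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


print(decode_varint(encode_varint(300)))  # (300, 2)
```

Returning the next position alongside the value lets a caller decode a sequence of fields from one buffer without slicing or copying it.
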
### Apache Arrow (columnar, zero-copy)

```python
import pyarrow.csv as pv

table = pv.read_csv("data.csv")  # multi-threaded CSV reader
df = table.to_pandas()  # copies where needed; zero_copy_only=True raises instead of copying
```

## Decision criteria

| Situation | Approach |
|---|---|
| API boundary, dynamic input | Pydantic v2 / Zod |
| Large JSON throughput | simdjson |
| Streaming large file | ijson / SAX |
| Inter-service binary | Protobuf / Cap'n Proto |
| Custom DSL | Lark / nom / chumsky |
| Tabular bulk | Arrow / Polars |

**Defaults**: Pydantic v2 (Python), Zod (TS), Serde (Rust).

## 🔗 Graph

- Parent: [[Software Architecture]] · [[Type Systems]]
- Variants: [[JSON Parsing]] · [[Protobuf]] · [[Parser Combinators]]
- Applications: [[FastAPI]] · [[GraphQL]] · [[Compilers]]
- Adjacent: [[Validation]] · [[Schema Design]] · [[Serialization]]

## 🤖 LLM usage

**When**: structured output enforcement (JSON schema → Pydantic), parsing tool-call arguments.
**When not**: free-form natural language. Parsing is the wrong tool there; use embeddings or an LLM.

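
As a sketch of the tool-call case, model output is treated as untrusted text and parsed into a typed value before any code acts on it. The example uses only the standard library; the `SearchArgs` shape, its field names, and the 1 to 100 limit are hypothetical. In a Pydantic codebase a model class would play the same role.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArgs:
    query: str
    limit: int = 10


def parse_search_args(raw: str) -> SearchArgs:
    """Parse a JSON argument string emitted by a model into SearchArgs."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")

    unknown = set(data) - {"query", "limit"}
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")

    query = data.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")

    limit = data.get("limit", 10)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 100:
        raise ValueError("'limit' must be an integer between 1 and 100")

    return SearchArgs(query=query.strip(), limit=limit)


print(parse_search_args('{"query": "zero-copy parsing", "limit": 5}'))
```

Rejecting unknown fields and string-typed numbers is a design choice here: the error message can be fed back to the model for a retry instead of silently guessing what it meant.
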
## ❌ Anti-patterns

- **Regex for HTML/JSON parsing**: the grammar is non-regular; use a proper parser.
- **Stringly-typed**: propagating `dict[str, Any]` through the codebase; parse into a typed model at the boundary.
- **Validate-only**: keeping the raw dict plus ad-hoc checks; parse once and commit to a structure.
- **Eager full-load**: calling `json.load` on GB-scale JSON; use streaming.
- **Silent coercion**: implicit "1" → 1; use strict mode (Pydantic `strict=True`).

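
For the eager full-load case, line-delimited JSON can be streamed with the standard library alone, since each line is a complete document. A minimal sketch, assuming one JSON object per line; `iter_jsonl` is a hypothetical helper name.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per line, holding only the current line in memory."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: {exc.msg}") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        yield record


sample = io.StringIO('{"id": 1}\n\n{"id": 2}\n')
print(sum(record["id"] for record in iter_jsonl(sample)))  # 3
```

This only works when the file is line-delimited. A single large JSON array still needs an incremental parser such as ijson.
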
## 🧪 Verification / Duplicates

- Verified (Pydantic v2 docs, Alexis King "Parse, don't validate" 2019, simdjson paper).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full data parsing entry, 7 patterns + decision matrix |