---
id: wiki-2026-0508-데이터-파싱-data-parsing
title: 데이터 파싱(Data Parsing)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Data Parsing, 파싱, Parsing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parsing, data, architecture]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pydantic
---

# Data Parsing

## In one line

> **"raw bytes → structured value"**. Parsing converts an input string or byte sequence into a typed tree according to a grammar. Modern stacks are dominated by schema-first tools (Pydantic v2, Zod, Protobuf) and zero-copy tools (Arrow, simdjson).

## Core concepts

### Parsing vs Validation

- **Parsing**: bytes → typed structure. The result is either a typed value or an error; malformed input never yields a half-built value.
- **Validation**: typed structure → invariant check (a predicate that returns pass/fail and leaves the type unchanged).
- **Parse, don't validate** (Alexis King 2019): encode invariants in the type system so that downstream code does not have to re-check or unwrap them.

### Parser types

- **Recursive descent**: hand-written, LL(k). Readable.
- **Parser combinator**: functional composition of small parsers (parsec, nom).
- **PEG**: ordered choice, so the grammar is never ambiguous.
- **LR/LALR**: bottom-up, as generated by yacc/bison.
- **Schema-driven**: Pydantic, Zod. The parser is derived from a declarative type.

### Applications

1. Loading JSON/YAML/TOML config.
2. Validating API request bodies (FastAPI + Pydantic).
3. Log/CSV ingest pipelines.
4. Compiling DSLs (SQL, GraphQL).
5. Decoding protocols (HTTP, Protobuf).
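The "Parse, don't validate" idea from the core concepts above can be sketched with the standard library alone. This is a minimal illustration, not a canonical implementation: the `Port` type and the `parse_port` and `connect_address` helpers are hypothetical names introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A TCP port known to be in 1..65535 (hypothetical example type)."""
    value: int


def parse_port(raw: str) -> Port:
    """Parse untrusted text into a Port, or raise ValueError."""
    try:
        number = int(raw.strip())
    except ValueError:
        raise ValueError(f"not an integer: {raw!r}") from None
    if not 1 <= number <= 65535:
        raise ValueError(f"port out of range: {number}")
    return Port(number)


def connect_address(host: str, port: Port) -> str:
    # No range check here: callers can only obtain a Port through parse_port,
    # so the type itself carries the invariant.
    return f"{host}:{port.value}"
```

The check happens once, at the boundary. A validation-only version would pass a bare `int` around and every consumer would have to repeat (or forget) the range check.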
## 💻 Patterns

### Pydantic v2 (schema-first, Rust core)

```python
from datetime import datetime

from pydantic import BaseModel, EmailStr, Field, ValidationError


class User(BaseModel):
    id: int
    email: EmailStr  # requires the optional email-validator package
    created_at: datetime
    tags: list[str] = Field(default_factory=list, max_length=10)


# Sample payload for illustration
raw_bytes = b'{"id": 1, "email": "a@example.com", "created_at": "2026-05-08T00:00:00Z"}'

try:
    user = User.model_validate_json(raw_bytes)  # bytes → typed model
except ValidationError as e:
    print(e.errors())  # structured errors, each with a field path
```

### Grammar-based parser (Python, Lark)

Lark is a grammar-driven parser generator (Earley/LALR) rather than a parser combinator library.

```python
from lark import Lark, Transformer

grammar = r"""
    start: expr
    expr: INT ("+" INT)*
    %import common.INT
    %ignore " "
"""


class Eval(Transformer):
    def expr(self, items):
        # Anonymous "+" tokens are filtered out, so items holds only INT tokens
        return sum(int(t) for t in items)

    def start(self, items):
        # Unwrap the single child so parse() returns the number, not a Tree
        return items[0]


parser = Lark(grammar, parser="lalr", transformer=Eval())
print(parser.parse("1 + 2 + 3"))  # 6
```

### Streaming JSON (ijson, large files)

```python
import ijson

# "item" selects each element of a top-level JSON array.
# For JSON Lines input, use ijson.items(f, "", multiple_values=True) instead.
with open("huge.json", "rb") as f:
    for record in ijson.items(f, "item"):
        process(record)  # process() is a placeholder; memory stays bounded per record
```

### simdjson (zero-copy, SIMD)

```python
import simdjson  # pysimdjson bindings; the C++ core is reported to parse at GB/s rates

parser = simdjson.Parser()
doc = parser.parse(raw_bytes)  # returns a lazy proxy over the parsed document
name = doc["user"]["name"]     # values are materialized on access
```

### Zod (TypeScript)

```typescript
import { z } from "zod";

const User = z.object({
  id: z.number().int().positive(),
  email: z.string().email(),
  tags: z.array(z.string()).max(10).default([]),
});

const user = User.parse(JSON.parse(raw)); // throws ZodError on invalid input
type User = z.infer<typeof User>;         // static type derived from the schema
```

### Protobuf (binary, schema)

```python
import user_pb2  # module generated by protoc from the .proto schema

user = user_pb2.User()
user.ParseFromString(raw_bytes)  # decodes the binary wire format into the message
```

### Apache Arrow (columnar, zero-copy)

```python
import pyarrow.csv as pv

table = pv.read_csv("data.csv")  # multi-threaded CSV reader
# zero_copy_only=True raises if any column would need a copy
# (e.g. strings, nulls, or multiple chunks), so the default conversion is used here.
df = table.to_pandas()
```

## Decision criteria

| Situation | Approach |
|---|---|
| API boundary, dynamic input | Pydantic v2 / Zod |
| Large JSON throughput | simdjson |
| Streaming large file | ijson / SAX |
| Inter-service binary | Protobuf / Cap'n Proto |
| Custom DSL | Lark / nom / chumsky |
| Tabular bulk | Arrow / Polars |

**Defaults**: Pydantic v2 (Python), Zod (TS), Serde (Rust).

## 🔗 Graph

- Parent: [[Software Architecture]] · [[TypeScript 타입 시스템 (TypeScript Type System)|Type Systems]]
- Variants: [[Protobuf]]
- Adjacent: [[Validation]] · [[Schema Design]]

## 🤖 LLM usage

**When**: structured output enforcement (JSON schema → Pydantic), parsing tool call arguments.

**When not**: free-form natural language. Parsing is the wrong tool there; use embeddings or an LLM.

## ❌ Anti-patterns

- **Regex for HTML/JSON parsing**: these grammars are not regular; use a proper parser.
- **Stringly-typed**: propagating `dict[str, Any]` through the codebase; parse into a typed model at the boundary instead.
- **Validate-only**: keeping the raw dict and adding ad-hoc checks; parse once and commit to a structure instead.
- **Eager full-load**: calling `json.load` on a multi-GB JSON file; use streaming instead.
- **Silent coercion**: implicit `"1"` → `1`; use strict mode instead (Pydantic `strict=True`).

## 🧪 Verification / duplicates

- Verified (Pydantic v2 docs, Alexis King "Parse, don't validate" 2019, simdjson paper).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full data parsing entry, 7 patterns + decision matrix |