id: wiki-2026-0508-데이터-파싱-data-parsing
title: 데이터 파싱(Data Parsing)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Data Parsing, 파싱, Parsing
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: parsing, data, architecture
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pydantic

Data Parsing

In one line

"Raw bytes → structured value." Parsing converts an input string or byte sequence into a typed tree according to a grammar. In modern stacks, two approaches dominate: schema-first libraries (Pydantic v2, Zod, Protobuf) and zero-copy or SIMD-accelerated parsers (Arrow, simdjson).

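Key points

Parsing vs Validation

  • Parsing: bytes → typed structure. A successful parse keeps what was learned about the input in the result type; malformed input is rejected explicitly.
  • Validation: typed structure → invariant check (a predicate that returns pass or fail and nothing else).
  • Parse, don't validate (Alexis King 2019): encode invariants in the type system so that downstream code never has to re-check or unwrap values.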
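
The difference between the two styles can be sketched with the standard library alone. This is a minimal illustration, not code from the cited essay: `NonEmptyList`, `validate_non_empty`, `parse_non_empty`, and `first` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonEmptyList:
    """A list proven non-empty at construction time (hypothetical example type)."""
    head: str
    tail: tuple[str, ...]


def validate_non_empty(items: list[str]) -> None:
    # Validation: checks the invariant, then discards what it learned.
    if not items:
        raise ValueError("list must not be empty")


def parse_non_empty(items: list[str]) -> NonEmptyList:
    # Parsing: checks the same invariant but returns a type that records it.
    if not items:
        raise ValueError("list must not be empty")
    return NonEmptyList(head=items[0], tail=tuple(items[1:]))


def first(items: NonEmptyList) -> str:
    # No emptiness check here: the argument type guarantees a head exists.
    return items.head


parsed = parse_non_empty(["a", "b", "c"])
print(first(parsed))
```

After `validate_non_empty`, callers still hold a plain `list[str]` and must trust that the check ran. After `parse_non_empty`, the check is carried by the type, so `first` cannot be called with unchecked data.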

Parser types

  • Recursive descent: hand-written, LL(k); easy to read and debug.
  • Parser combinator: parsers built by functional composition (parsec, nom).
  • PEG: ordered choice, so the grammar is unambiguous by construction.
  • LR/LALR: bottom-up, table-driven (yacc/bison).
  • Schema-driven: Pydantic, Zod; the parser is derived from a declarative type.
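
To make the first entry concrete, here is a small hand-written recursive descent parser for sums and differences with parentheses. It is a sketch under assumptions: the grammar, the `Parser` class, and the `evaluate` helper are invented for illustration, and the parser evaluates directly instead of building a tree to keep the example short.

```python
import re

# One token per match: an integer or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[+\-()])")


def tokenize(text: str) -> list[str]:
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    """One method per grammar rule; each consumes tokens left to right."""

    def __init__(self, tokens: list[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> str:
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expr(self) -> int:
        # expr := term (("+" | "-") term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self) -> int:
        # term := NUMBER | "(" expr ")"
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text: str) -> int:
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return value


print(evaluate("1 + (2 - 3) + 10"))
```

Each grammar rule maps to one method, which is why this style is usually the easiest to read. The other styles in the list express the same rules as composed functions or generated tables.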

Applications

  1. Loading JSON/YAML/TOML configuration.
  2. Validating API request bodies (FastAPI + Pydantic).
  3. Log/CSV ingest pipelines.
  4. Compiling DSLs (SQL, GraphQL).
  5. Decoding protocols (HTTP, Protobuf).
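
As an illustration of the log/CSV ingest case, the sketch below parses each CSV row into a typed record and collects bad rows with their line numbers instead of aborting. The column names, the `LogEntry` type, and the allowed levels are assumptions made up for this example.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    level: str
    latency_ms: int


def parse_row(row: dict) -> LogEntry:
    # Convert every field to its target type; any failure raises.
    level = row["level"].upper()
    if level not in {"INFO", "WARN", "ERROR"}:
        raise ValueError(f"unknown level {row['level']!r}")
    return LogEntry(
        timestamp=datetime.fromisoformat(row["timestamp"]),
        level=level,
        latency_ms=int(row["latency_ms"]),
    )


def ingest(stream):
    """Return (entries, errors); errors hold (line number, message) pairs."""
    entries, errors = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(stream), start=2):
        try:
            entries.append(parse_row(row))
        except (KeyError, TypeError, ValueError, AttributeError) as exc:
            errors.append((line_no, str(exc)))
    return entries, errors


SAMPLE = """timestamp,level,latency_ms
2026-05-10T12:00:00,info,12
2026-05-10T12:00:01,error,not-a-number
"""

entries, errors = ingest(io.StringIO(SAMPLE))
print(len(entries), "parsed;", len(errors), "rejected")
```

Downstream code only ever sees `LogEntry` values, so the type conversions happen once, at the ingest boundary.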

💻 Patterns

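Pydantic v2 (schema-first, Rust core)

from datetime import datetime

# EmailStr needs the optional email-validator package (pip install "pydantic[email]").
from pydantic import BaseModel, EmailStr, Field, ValidationError

class User(BaseModel):
    id: int
    email: EmailStr
    created_at: datetime
    tags: list[str] = Field(default_factory=list, max_length=10)  # at most 10 tags

# Example payload; in practice this is the raw request body.
raw_bytes = b'{"id": 1, "email": "a@example.com", "created_at": "2026-05-10T12:00:00Z"}'

try:
    user = User.model_validate_json(raw_bytes)  # bytes → typed model in one step
    print(user.created_at.year)
except ValidationError as e:
    print(e.errors())  # one dict per failure, with "loc", "msg", and "type"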

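Grammar-based parser (Python, Lark, LALR)

from lark import Lark, Transformer

# Lark generates a parser from an EBNF grammar; it is not a combinator library.
# "?start" inlines its single child, so parse() returns the transformed value
# instead of a Tree wrapping it.
grammar = r"""
    ?start: expr
    expr: NUMBER ("+" NUMBER)*
    %import common.NUMBER
    %ignore " "
"""

class Eval(Transformer):
    # Anonymous "+" tokens are filtered out, so items holds only NUMBER tokens.
    def expr(self, items):
        return sum(int(t) for t in items)

parser = Lark(grammar, parser="lalr", transformer=Eval())
print(parser.parse("1 + 2 + 3"))  # 6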

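Streaming JSON (ijson, large files)

import ijson

def process(record):
    print(record)  # placeholder for real per-record handling

# The prefix "item" selects each element of a top-level JSON array.
with open("huge.json", "rb") as f:
    for record in ijson.items(f, "item"):
        process(record)  # only one record is materialized at a time

# For JSON Lines (one document per line), use ijson.items(f, "", multiple_values=True).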

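simdjson (SIMD, lazy access)

import simdjson  # pysimdjson bindings

raw_bytes = b'{"user": {"name": "Ada"}}'  # example payload

parser = simdjson.Parser()
doc = parser.parse(raw_bytes)  # the simdjson paper reports gigabytes per second for the C++ core
name = doc["user"]["name"]     # objects and arrays are lazy proxies; only accessed values become Python objects
print(name)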

Zod (TypeScript)

import { z } from "zod";

const User = z.object({
  id: z.number().int().positive(),
  email: z.string().email(),
  tags: z.array(z.string()).max(10).default([]),
});
type User = z.infer<typeof User>;  // static type derived from the schema

const raw = '{"id": 1, "email": "a@example.com"}';  // example payload
const user: User = User.parse(JSON.parse(raw));     // throws ZodError on invalid input

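Protobuf (binary, schema)

import user_pb2  # module generated by protoc from the .proto schema

# raw_bytes: a serialized User message, for example received from another service.
user = user_pb2.User()
user.ParseFromString(raw_bytes)  # decodes the binary wire format into the message fields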

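Apache Arrow (columnar, zero-copy)

import pyarrow.csv as pv

table = pv.read_csv("data.csv")  # multi-threaded CSV reader, returns a pyarrow.Table
# zero_copy_only=True raises ArrowInvalid when a column needs a copy
# (for example strings or columns with nulls), so use the default for general CSV data.
df = table.to_pandas()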

Decision criteria

| Situation | Approach |
| --- | --- |
| API boundary, dynamic input | Pydantic v2 / Zod |
| Large JSON throughput | simdjson |
| Streaming large file | ijson / SAX |
| Inter-service binary | Protobuf / Cap'n Proto |
| Custom DSL | Lark / nom / chumsky |
| Tabular bulk | Arrow / Polars |

Defaults: Pydantic v2 (Python), Zod (TS), Serde (Rust).

🔗 Graph

🤖 LLM usage

  • When to use: enforcing structured output (JSON schema → Pydantic) and parsing tool-call arguments.
  • When not to use: free-form natural language. Parsing is the wrong tool there; use embeddings or an LLM instead.
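
Parsing tool-call arguments can be sketched as follows. To stay dependency-free, this uses only the standard library; a Pydantic model would replace the manual checks in practice. The `search` tool, its `query` and `limit` fields, and the 1 to 50 range are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchArgs:
    """Arguments for a hypothetical `search` tool."""
    query: str
    limit: int = 10


def parse_search_args(raw: str) -> SearchArgs:
    # Step 1: the model's output must be syntactically valid JSON.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")

    # Step 2: reject fields the tool does not define.
    unknown = set(data) - {"query", "limit"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    # Step 3: check each field's type and range without coercion.
    query = data.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    limit = data.get("limit", 10)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 50:
        raise ValueError("limit must be an integer between 1 and 50")

    return SearchArgs(query=query, limit=limit)


print(parse_search_args('{"query": "parser combinators", "limit": 5}'))
```

The error message from a failed parse can be sent back to the model as feedback so that it can retry with corrected arguments.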

Anti-patterns

  • Regex for HTML/JSON parsing: these grammars are not regular; use a proper parser.
  • Stringly-typed: passing dict[str, Any] through the codebase; parse into a typed model at the boundary instead.
  • Validate-only: keeping the raw dict and adding ad-hoc checks; parse once and commit to the resulting structure.
  • Eager full-load: calling json.load on a multi-GB JSON file; use a streaming parser.
  • Silent coercion: implicit "1" → 1 conversion; enable strict mode (Pydantic strict=True).
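
The first anti-pattern is easy to reproduce. In this small sketch, a regex that looks reasonable extracts the wrong value as soon as the string contains an escaped quote, while the JSON parser follows the grammar. The payload and the pattern are invented for illustration.

```python
import json
import re

# The name contains escaped quotes, which is valid JSON.
raw = '{"name": "Ann \\"The Parser\\" Lee", "tags": ["a", "b"]}'

# Anti-pattern: assumes a string value never contains an escaped quote,
# so the match stops at the first inner quote.
match = re.search(r'"name":\s*"([^"]*)"', raw)
regex_name = match.group(1) if match else None

# A real JSON parser handles escapes and nesting according to the grammar.
parsed_name = json.loads(raw)["name"]

print(repr(regex_name))
print(repr(parsed_name))
```

The regex returns a truncated value without raising, which is what makes this failure hard to notice in production.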

🧪 Verification / duplicates

  • Verified (Pydantic v2 docs, Alexis King "Parse, don't validate" 2019, simdjson paper).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full data parsing entry, 7 patterns + decision matrix |