id: wiki-2026-0508-structured-data
title: Structured Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Tabular-Data, Schema-Data
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: data, schema, tabular, structured, formats
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: pandas-polars-arrow

Structured Data

One-line summary

"Data with a schema: a table of rows × columns." Structured data has a strict schema (RDB, Parquet), semi-structured data has a flexible one (JSON, XML), and unstructured data is free-form (text, image). The idea dates back to Codd's relational model (1970); as of 2026, Parquet/Arrow, JSON Schema, and Pydantic are the standard tools.

Key points

Spectrum

  • Fully structured: RDB, Parquet, CSV (with schema). Each row is a record, each column is a field, and types are fixed.
  • Semi-structured: JSON, XML, YAML, Avro. The schema is optional or flexible (Avro always carries a schema, but allows it to evolve).
  • Unstructured: text, image, audio, video. No schema.
  • In the LLM era, unstructured data can be given structure by converting it to embeddings.
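
To make the step from semi-structured to fully structured concrete, here is a minimal sketch using only the standard library. The input records, the `COLUMNS` schema, and the `to_row` helper are all hypothetical, chosen for illustration; real pipelines would more likely use pandas, Polars, or Pydantic for this.

```python
import json

# Hypothetical semi-structured input: fields vary from record to record.
raw = '[{"id": 1, "name": "Alice", "address": {"city": "Seoul"}}, {"id": 2, "name": "Bob"}]'

# Hypothetical target schema: fixed column names and types.
COLUMNS = {"id": int, "name": str, "city": str}


def to_row(record: dict) -> tuple:
    """Project one JSON object onto the fixed columns, filling gaps with None."""
    flat = {
        "id": record.get("id"),
        "name": record.get("name"),
        "city": record.get("address", {}).get("city"),
    }
    row = []
    for column, expected_type in COLUMNS.items():
        value = flat[column]
        # Reject values whose type does not match the declared column type.
        if value is not None and not isinstance(value, expected_type):
            raise TypeError(
                f"{column}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
        row.append(value)
    return tuple(row)


rows = [to_row(r) for r in json.loads(raw)]
print(rows)
```

Every output row has the same columns in the same order, which is the property that row × column formats such as Parquet rely on.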

Representation

  • Row-oriented: CSV, JSON Lines, RDB OLTP. Record-at-a-time access.
  • Column-oriented: Parquet, ORC, Arrow. Better for analytic scans and compression.
  • Document: MongoDB, Elasticsearch. Nested, schema-on-read.
  • Graph: Neo4j, JanusGraph. Edge-first.
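
The row versus column distinction can be sketched in plain Python. This is only an illustration with made-up data: real columnar formats add typed buffers, encodings, and statistics that a dict of lists does not have, and `run_length_encode` is a hypothetical helper.

```python
# The same three records in two layouts (illustrative data).
rows = [
    {"id": 1, "category": "a", "value": 10.0},
    {"id": 2, "category": "b", "value": 20.0},
    {"id": 3, "category": "a", "value": 30.0},
]

# Column-oriented: one list per field instead of one dict per record.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Record-at-a-time access is natural in the row layout.
second_record = rows[1]

# An analytic scan reads a single column in the columnar layout.
mean_value = sum(columns["value"]) / len(columns["value"])


def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


# Values of one field sit together, so repeated values compress well.
rle = run_length_encode(sorted(columns["category"]))
print(mean_value, rle)
```

The aggregate never touches `id` or `category`, which is the intuition behind column projection in Parquet readers.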

Schema

  • Schema-on-write: RDB, Parquet. Validated at write time; strict, fast reads.
  • Schema-on-read: JSON, MongoDB. Interpreted at read time; flexible, slower reads.
  • Schema evolution: Avro, Protobuf. Backward/forward compatibility.
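
The difference between the first two strategies comes down to when a bad record is detected. A minimal standard-library sketch follows; the `SCHEMA` dict and both helper functions are hypothetical and stand in for what a database or a reader library would do.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical schema for illustration


def write_validated(store: list, record: dict) -> None:
    """Schema-on-write: reject bad records before they are stored."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    store.append(record)


def read_interpreted(raw_lines: list) -> list:
    """Schema-on-read: store anything, apply the schema while reading."""
    out = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            out.append({"id": int(record["id"]), "name": str(record["name"])})
        except (KeyError, ValueError, TypeError):
            continue  # malformed records only surface at read time
    return out


strict_store = []
write_validated(strict_store, {"id": 1, "name": "Alice"})
try:
    write_validated(strict_store, {"id": "oops", "name": "Bob"})
    rejected = False
except ValueError:
    rejected = True

loose_store = ['{"id": "2", "name": "Bob"}', '{"name": "no id"}']
readable = read_interpreted(loose_store)
print(rejected, readable)
```

In the strict path the writer pays the validation cost once; in the loose path every reader repeats the parsing and coercion work, which is where the slower reads come from.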

Applications

  1. Data warehouse — Parquet on S3 + Snowflake/BigQuery.
  2. ML training — TFRecord, Parquet, HuggingFace datasets.
  3. API — JSON + JSON Schema / OpenAPI.
  4. Streaming — Avro / Protobuf in Kafka.
  5. LLM tool calling — Pydantic schema → JSON.

💻 Patterns

1. Parquet — write/read with schema

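import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})  # example frame
# Column types are stored in the Parquet file's schema
df.to_parquet("data.parquet", engine="pyarrow", compression="zstd")
df2 = pd.read_parquet("data.parquet", columns=["id", "value"])  # column projection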

2. Polars (lazy, 2026 fast)

import polars as pl
lf = pl.scan_parquet("data/*.parquet")
result = (
    lf.filter(pl.col("value") > 100)
      .group_by("category")
      .agg(pl.col("value").mean())
      .collect()
)

3. Arrow (zero-copy interchange)

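import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})  # example frame
table = pa.Table.from_pandas(df)
pq.write_table(table, "data.parquet")
# The in-memory Arrow table can be handed to pandas, polars, and duckdb, zero-copy where the types allow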

4. Pydantic (Python schema validation, 2026)

from pydantic import BaseModel, Field
class User(BaseModel):
    id: int
    name: str = Field(min_length=1)
    email: str
    age: int | None = None

user = User.model_validate({"id": 1, "name": "Alice", "email": "a@b.com"})
schema = User.model_json_schema()  # usable as an LLM tool-calling schema

5. JSON Schema validation

import jsonschema
schema = {"type": "object", "properties": {"id": {"type": "integer"}}, "required": ["id"]}
jsonschema.validate(instance={"id": 42}, schema=schema)

6. SQL (DuckDB on local Parquet, 2026)

import duckdb
con = duckdb.connect()
result = con.sql("""
    SELECT category, AVG(value) FROM 'data/*.parquet'
    WHERE date >= '2026-01-01' GROUP BY category
""").df()

7. Schema evolution (Avro)

# v1 → v2: adding a field with a default keeps the schema backward compatible
schema_v2 = {
    "type": "record",
    "name": "User",  # Avro record schemas require a name
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "tier", "type": "string", "default": "free"}  # NEW
    ]
}
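
Why the default matters can be shown without the Avro library. The sketch below imitates, in plain Python, how a reader using the v2 schema fills in a field that a v1 writer never wrote. The `resolve` helper is hypothetical and covers only this one rule, not Avro's full schema resolution.

```python
# Reader schema mirroring schema_v2; the record name "User" is an assumption.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "tier", "type": "string", "default": "free"},
    ],
}


def resolve(record: dict, schema: dict) -> dict:
    """Fill fields missing from an older record with the reader's defaults."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            # No value and no default: the old data cannot be read.
            raise ValueError(f"missing field without default: {name}")
    return resolved


v1_record = {"id": 1, "name": "Alice"}  # written before "tier" existed
upgraded = resolve(v1_record, reader_schema)
print(upgraded)
```

Had `tier` been added without a default, `resolve` would raise for every v1 record, which is the incompatibility that the default avoids.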

8. LLM structured output (Pydantic + Anthropic)

from anthropic import Anthropic
from pydantic import BaseModel

class Extract(BaseModel):
    name: str
    age: int

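# Tool calling with a schema derived from the Pydantic model
schema = Extract.model_json_schema()
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    tools=[{
        "name": "extract",
        "description": "Extract a person's name and age.",
        "input_schema": schema,
    }],
    tool_choice={"type": "tool", "name": "extract"},  # force the tool call
    messages=[{"role": "user", "content": "Alice is 30 years old."}],
)
# Validate the returned tool input against the same model
tool_use = next(block for block in resp.content if block.type == "tool_use")
result = Extract.model_validate(tool_use.input)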

9. Iceberg / Delta (2026 lakehouse)

# Iceberg: ACID via Parquet data files + manifests + snapshots
from pyiceberg.catalog import load_catalog
cat = load_catalog("local")  # assumes a catalog named "local" is configured
table = cat.load_table("ns.users")
df = table.scan().to_pandas()

Decision criteria

| Situation | Approach |
|---|---|
| Analytic / scan-heavy | Parquet (column) |
| Transactional / row-by-row | RDB / RocksDB |
| Streaming, schema evolves | Avro / Protobuf |
| Human-edited, config | YAML / TOML |
| API contract | JSON + Pydantic / JSON Schema / OpenAPI |
| Data lake | Parquet + Iceberg / Delta |
| ML training | Parquet / HuggingFace datasets / TFRecord |
| LLM tool input | JSON Schema (from Pydantic) |

Default: Parquet for storage + Pydantic for validation + DuckDB/Polars for query.

🔗 Graph

🤖 LLM usage

When to use: schema design, JSON Schema generation, tool calling input/output, data validation guidance, format conversion explanation. When not to use: large-volume parsing; use pandas/polars/duckdb for that.

Anti-patterns

  • CSV for production data: type loss, encoding bugs, no compression. Default to Parquet.
  • JSON for analytics at scale: row-by-row parsing is slow, and files can be up to 100x larger than the equivalent Parquet.
  • No schema validation at the boundary: silent corruption. Enforce with Pydantic / JSON Schema.
  • Schema-on-read on the hot path: latency penalty. Prefer schema-on-write plus indices.
  • Wide tables (1000+ columns): most operations leave most columns unused. Use column projection or split the table.
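
The type loss in the CSV anti-pattern is easy to reproduce with the standard library. This is a small sketch with a made-up record: it round-trips one row through an in-memory CSV and compares what comes back.

```python
import csv
import io

# Illustrative record with four different Python types.
original = [{"id": 1, "score": 9.5, "active": True, "note": None}]

# Write to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "score", "active", "note"])
writer.writeheader()
writer.writerows(original)

# Read it back: CSV has no type information, so everything is a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored[0])
```

The integer, float, and boolean all come back as strings, and `None` becomes an empty string that can no longer be told apart from a genuinely empty text value. Formats with an embedded schema, such as Parquet, avoid this by storing the column types alongside the data.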

🧪 Verification / Duplicates

  • Verified (Apache Parquet/Arrow specs, Pydantic v2 docs, Iceberg spec).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: formats spectrum, Parquet/Pydantic/DuckDB patterns. |