"Data with a schema: a table of rows × columns." Structured data has a strict schema (RDB, Parquet), semi-structured data has a flexible one (JSON, XML), and unstructured data is free-form (text, image). The idea dates back to Codd's 1970 relational model; as of 2026, Parquet/Arrow, JSON Schema, and Pydantic are the common standards.
Key points
Spectrum
Fully structured: RDB, Parquet, CSV (with schema). Each row is a record, each column is a field, and types are fixed.
Semi-structured: JSON, XML, YAML, Avro. The schema is optional and flexible.
Unstructured: text, image, audio, video. No schema.
In the LLM era, unstructured data can be given structure through embeddings.
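To make the spectrum concrete, here is a minimal sketch of moving semi-structured JSON onto a fixed row/column schema. The sample records, the `COLUMNS` mapping, and the `to_row` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical semi-structured records: fields are optional and nesting varies.
raw = [
    '{"id": 1, "name": "Alice", "contact": {"email": "a@b.com"}}',
    '{"id": 2, "name": "Bob"}',
]

# Fixed target schema (column name -> type); an assumption for this sketch.
COLUMNS = {"id": int, "name": str, "email": str}


def to_row(doc: dict) -> tuple:
    """Project one JSON document onto the fixed columns; missing values become None."""
    flat = {
        "id": doc.get("id"),
        "name": doc.get("name"),
        "email": doc.get("contact", {}).get("email"),
    }
    row = []
    for col, typ in COLUMNS.items():
        value = flat[col]
        row.append(None if value is None else typ(value))
    return tuple(row)


rows = [to_row(json.loads(line)) for line in raw]
```

Every row now has the same three columns in the same order, which is what lets a structured store fix the type of each column.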
Representation
Row-oriented: CSV, JSON Lines, RDB OLTP. Record-at-a-time access.
Column-oriented: Parquet, ORC, Arrow. Better for analytic scans and compression.
Document: MongoDB, Elasticsearch. Nested, schema-on-read.
Graph: Neo4j, JanusGraph. Edge-first.
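The row versus column distinction can be sketched in plain Python. The values below are made up, and `run_length_encode` is a hypothetical helper; real columnar formats use more elaborate encodings, so treat this as an illustration of the idea rather than of any file format.

```python
# The same three records in two layouts (values are made up for illustration).
rows = [
    {"id": 1, "category": "a", "value": 10.0},
    {"id": 2, "category": "b", "value": 20.0},
    {"id": 3, "category": "a", "value": 30.0},
]

# Row-oriented to column-oriented: one list per field.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Record-at-a-time access is natural in the row layout.
second_record = rows[1]

# An analytic scan only has to touch one column in the columnar layout.
mean_value = sum(columns["value"]) / len(columns["value"])


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


# Values of one column share a type and often repeat, so simple schemes
# such as run-length encoding already shrink them.
encoded = run_length_encode(sorted(columns["category"]))
```

The aggregate reads only `columns["value"]`, while the row layout would have to walk every field of every record to compute the same mean.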
Schema
Schema-on-write: RDB, Parquet. Validated at write time; strict, fast reads.
Schema-on-read: JSON, MongoDB. Interpreted at read time; flexible, slower reads.
Schema evolution: Avro, Protobuf. Backward/forward compatibility.
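A toy contrast between the first two strategies, assuming a hypothetical two-field `SCHEMA`. Both store classes are invented for this sketch and do not model any particular database; they only show where the validation or interpretation cost is paid.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical schema for illustration


def validate(record: dict) -> dict:
    """Raise if the record does not match SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, typ in SCHEMA.items():
        if not isinstance(record[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    return record


class WriteValidatedStore:
    """Schema-on-write: bad records are rejected before they are stored."""

    def __init__(self):
        self.rows = []

    def write(self, record: dict) -> None:
        self.rows.append(validate(record))

    def read(self) -> list:
        return self.rows  # already validated, no per-read checks


class ReadInterpretedStore:
    """Schema-on-read: anything is stored; structure is applied when reading."""

    def __init__(self):
        self.blobs = []

    def write(self, blob: str) -> None:
        self.blobs.append(blob)

    def read(self) -> list:
        out = []
        for blob in self.blobs:
            doc = json.loads(blob)  # parse and coerce on every read
            out.append({"id": int(doc["id"]), "name": str(doc.get("name", ""))})
        return out


strict = WriteValidatedStore()
strict.write({"id": 1, "name": "Alice"})
try:
    strict.write({"id": "2", "name": "Bob"})  # wrong type for id
    rejected = False
except TypeError:
    rejected = True

loose = ReadInterpretedStore()
loose.write('{"id": "2", "name": "Bob"}')  # accepted as-is, fixed up on read
```

The strict store fails early and reads cheaply; the loose store accepts the same bad record and pays for parsing and coercion each time it is read.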
Applications
Data warehouse — Parquet on S3 + Snowflake/BigQuery.
ML training — TFRecord, Parquet, HuggingFace datasets.
API — JSON + JSON Schema / OpenAPI.
Streaming — Avro / Protobuf in Kafka.
LLM tool calling — Pydantic schema → JSON.
💻 Patterns
1. Parquet — write/read with schema
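import pandas as pd

# Small sample frame so the snippet runs on its own
df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0], "category": ["a", "b"]})
df.to_parquet("data.parquet", engine="pyarrow", compression="zstd")
df2 = pd.read_parquet("data.parquet", columns=["id", "value"])  # column projection

import pyarrow as pa
import pyarrow.parquet as pq

# Same write through the Arrow API, reusing df from above
table = pa.Table.from_pandas(df)
pq.write_table(table, "data.parquet")
# Arrow tables can be handed between pandas, polars, and duckdb, often without copying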
4. Pydantic (Python schema validation, 2026)
from pydantic import BaseModel, Field


class User(BaseModel):
    id: int
    name: str = Field(min_length=1)
    email: str
    age: int | None = None


user = User.model_validate({"id": 1, "name": "Alice", "email": "a@b.com"})
schema = User.model_json_schema()  # usable as an LLM tool-calling schema
import duckdb

con = duckdb.connect()
# Query Parquet files in place; the glob matches every file under data/
result = con.sql("""
SELECT category, AVG(value) FROM 'data/*.parquet'
WHERE date >= '2026-01-01' GROUP BY category
""").df()
7. Schema evolution (Avro)
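# v1 → v2: adding an optional field (with a default) keeps the schema backward compatible
schema_v2 = {
    "type": "record",
    "name": "User",  # Avro records require a name; "User" is a placeholder
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "tier", "type": "string", "default": "free"},  # NEW
    ],
}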
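Why the default matters can be sketched without an Avro library. The `resolve` function below is a hypothetical, simplified imitation of Avro-style schema resolution on already-decoded dicts (real libraries do this on binary data); `READER_FIELDS` mirrors the fields of `schema_v2`, keeping only names and defaults.

```python
# Reader schema mirroring schema_v2; only names and defaults matter in this sketch.
READER_FIELDS = [
    {"name": "id"},
    {"name": "name"},
    {"name": "tier", "default": "free"},  # added in v2
]


def resolve(record: dict, reader_fields: list) -> dict:
    """Project a decoded record onto the reader's fields, filling defaults."""
    out = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


old_record = {"id": 1, "name": "Alice"}               # written with v1
new_record = {"id": 2, "name": "Bob", "tier": "pro"}  # written with v2

# Backward compatibility: a v2 reader can read v1 data via the default.
upgraded = resolve(old_record, READER_FIELDS)

# Forward compatibility: a v1 reader ignores the field it does not know.
downgraded = resolve(new_record, READER_FIELDS[:2])
```

Without the `"default"` entry, `resolve(old_record, READER_FIELDS)` would raise, which is the in-miniature version of a reader failing on old data after a non-optional field is added.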
8. LLM structured output (Pydantic + Anthropic)
from anthropic import Anthropic
from pydantic import BaseModel


class Extract(BaseModel):
    name: str
    age: int


# Tool calling with a schema generated from the Pydantic model
schema = Extract.model_json_schema()
client = Anthropic()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    tools=[{"name": "extract", "input_schema": schema}],
    messages=[{"role": "user", "content": "Alice is 30 years old."}],
)
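The call above only sends the schema; the structured result still has to be pulled out of the response and checked. A minimal sketch, assuming the response content is a list of blocks where tool calls carry `type == "tool_use"`, a `name`, and an `input` dict. `first_tool_input` and `check_extract` are hypothetical helpers, and `fake_content` is a stand-in for `resp.content` so the sketch runs without an API key.

```python
from types import SimpleNamespace


def first_tool_input(content, tool_name: str):
    """Return the input dict of the first tool_use block for tool_name, else None."""
    for block in content:
        if getattr(block, "type", None) == "tool_use" and block.name == tool_name:
            return block.input
    return None


def check_extract(payload: dict) -> dict:
    """Hand-rolled check mirroring the Extract model (name: str, age: int)."""
    if not isinstance(payload.get("name"), str):
        raise ValueError("name must be a string")
    if not isinstance(payload.get("age"), int):
        raise ValueError("age must be an integer")
    return {"name": payload["name"], "age": payload["age"]}


# Stand-in for resp.content; the block shapes are an assumption of this sketch.
fake_content = [
    SimpleNamespace(type="text", text="Calling the tool."),
    SimpleNamespace(type="tool_use", name="extract", input={"name": "Alice", "age": 30}),
]

extracted = check_extract(first_tool_input(fake_content, "extract"))
```

With the real response, `Extract.model_validate(first_tool_input(resp.content, "extract"))` would replace the hand-rolled check. The model may also reply in plain text without calling the tool, in which case `first_tool_input` returns `None` and the caller has to handle that case.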
9. Iceberg / Delta (2026 lakehouse)
# Iceberg: Parquet files plus manifests and snapshots give ACID semantics
from pyiceberg.catalog import load_catalog

cat = load_catalog("local")
table = cat.load_table("ns.users")
df = table.scan().to_pandas()
Decision criteria
| Situation | Approach |
| --- | --- |
| Analytic / scan-heavy | Parquet (column) |
| Transactional / row-by-row | RDB / RocksDB |
| Streaming, evolving schema | Avro / Protobuf |
| Human-edited config | YAML / TOML |
| API contract | JSON + Pydantic / JSON Schema / OpenAPI |
| Data lake | Parquet + Iceberg / Delta |
| ML training | Parquet / HuggingFace datasets / TFRecord |
| LLM tool input | JSON Schema (from Pydantic) |
Default: Parquet for storage + Pydantic for validation + DuckDB/Polars for query.
When to use: schema design, JSON Schema generation, tool-calling input/output, data validation guidance, format conversion explanation.
When not to use: large-volume parsing; use pandas/polars/duckdb for that.
❌ Anti-patterns
CSV for production data: type loss, encoding bugs, no compression. Default to Parquet.
JSON for analytics at scale: row-by-row parsing is slow, and files can be up to roughly 100x larger than the equivalent Parquet.
No schema validation at the boundary: silent corruption. Enforce with Pydantic / JSON Schema.
Schema-on-read on the hot path: latency penalty. Use schema-on-write plus indices.
Wide tables (1000+ columns): most operations leave most columns unused. Use column projection or split the table.
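The CSV type-loss point is easy to reproduce with the standard library. The record below is made up, and the `SCHEMA` mapping and `typed` helper are hypothetical; the sketch shows what a round trip loses and how an explicit schema at the boundary restores it.

```python
import csv
import io

# Made-up record with an int, a float, a leading-zero string, and a None.
records = [
    {"id": 1, "score": 9.5, "zip": "02134", "note": None},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "score", "zip", "note"])
writer.writeheader()
writer.writerows(records)

# Read it back: every value is now a string, and None became "".
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

# An explicit schema applied at the boundary; an assumption of this sketch.
SCHEMA = {"id": int, "score": float, "zip": str, "note": str}


def typed(row: dict) -> dict:
    """Cast each field per SCHEMA, treating the empty string as missing."""
    return {k: (None if row[k] == "" else SCHEMA[k](row[k])) for k in SCHEMA}


restored = typed(loaded[0])
```

Treating `""` as missing is itself a lossy convention, since CSV cannot distinguish an empty string from a null. A format that stores types with the data, such as Parquet, avoids having to guess.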