| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-pdf-format | PDF Format | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
PDF Format
In one line
"A binary container with random access through a cross-reference table." Derived from Adobe's PostScript (1993) and standardized as ISO 32000, it is the dominant fixed-page-layout interchange format. As of 2026 the modern variants are PDF/A-4 (archival) and PDF 2.0, and the format remains a challenging source for LLM extraction because semantic structure is not guaranteed.
Core concepts
File structure
- Header: %PDF-2.0 (or 1.x).
- Body: sequence of indirect objects (N G obj ... endobj).
- Cross-reference table (xref): byte offset of each object.
- Trailer: root + info + size + xref offset.
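The four parts above can be made concrete with a small sketch. It assembles a tiny classic-xref PDF in memory and then finds its objects the way a reader would: from the end of the file, via startxref, into the cross-reference table. The helper names build_minimal_pdf and read_layout are hypothetical, and the sketch assumes a single classic xref section with LF line endings. Real files may use cross-reference streams or incremental updates, so a library such as pypdf or qpdf remains the right tool for actual parsing.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.7\n")                      # header
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                        # offset for the xref entry
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)  # body: indirect objects
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                     # object 0: head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off                # fixed-width 20-byte entries
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_layout(data: bytes) -> dict:
    """Locate the header version, the xref position and in-use object offsets."""
    version = re.match(rb"%PDF-(\d\.\d)", data).group(1).decode()
    # Readers start from the end of the file: the last startxref wins.
    tail = data[data.rfind(b"startxref"):]
    xref_pos = int(tail.split()[1])
    lines = data[xref_pos:].split(b"\n")
    first, count = map(int, lines[1].split())           # subsection header, e.g. "0 4"
    entries = lines[2:2 + count]
    offsets = {
        first + i: int(entry[:10])
        for i, entry in enumerate(entries)
        if entry[17:18] == b"n"                         # "n" = in use, "f" = free
    }
    return {"version": version, "xref_pos": xref_pos, "offsets": offsets}


pdf = build_minimal_pdf()
layout = read_layout(pdf)
for num, off in sorted(layout["offsets"].items()):
    print(num, off, pdf[off:off + 9])
```

Because every lookup goes through a byte offset, editing the file as text shifts the offsets and silently invalidates the table, which is why repair tools rebuild the xref rather than patch it.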
Object types
- Boolean, Number, String (literal () or hex <>), Name (/Name), Array, Dictionary, Stream (filtered binary), Null.
- Page tree (Catalog → Pages → Page) + Resources (Font, XObject, etc.).
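As a rough illustration of how these object types look on disk, the sketch below renders Python values in PDF object syntax. The Name and Ref classes and the to_pdf function are hypothetical helpers invented for this example, not part of any library. Streams are left out (they are a dictionary followed by bytes between stream and endstream), and string handling is simplified to ASCII with only backslash and parenthesis escaping.

```python
from collections import namedtuple


class Name(str):
    """Marks a Python string as a PDF name such as /Type."""


# Indirect reference to object `num`, generation `gen`, written "num gen R"
Ref = namedtuple("Ref", ["num", "gen"])


def to_pdf(value) -> str:
    """Render a Python value in PDF object syntax (illustrative subset)."""
    if value is None:
        return "null"
    if isinstance(value, bool):            # before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):           # PDF reals have no exponent notation
        return f"{value:.6f}".rstrip("0").rstrip(".")
    if isinstance(value, Ref):             # before tuple: namedtuple is a tuple
        return f"{value.num} {value.gen} R"
    if isinstance(value, Name):            # before str: Name is a str subclass
        return "/" + value
    if isinstance(value, bytes):           # hex string form
        return "<" + value.hex().upper() + ">"
    if isinstance(value, str):             # literal string form
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        inner = " ".join(f"/{key} {to_pdf(val)}" for key, val in value.items())
        return "<< " + inner + " >>"
    raise TypeError(f"no PDF representation for {type(value).__name__}")


page = {
    "Type": Name("Page"),
    "Parent": Ref(2, 0),
    "MediaBox": [0, 0, 200, 200],
}
print(to_pdf(page))
```

The page dictionary here mirrors one node of the Catalog → Pages → Page tree: the Parent entry is an indirect reference, which is resolved through the cross-reference table rather than by nesting.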
Applications
- Text/table extraction (RAG ingestion for LLMs).
- Form fill (AcroForm / XFA).
- Digital signature (PAdES).
- Print fidelity (PDF/X for press).
- Archive (PDF/A — embed fonts, no encryption).
💻 Patterns
Text extraction (pypdf, 2026)
from pypdf import PdfReader

reader = PdfReader("doc.pdf")

# Default mode: plain text per page; pages without a text layer yield ""
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# pypdf 5.x: layout mode keeps horizontal positioning, which helps with columns
layout_text = "\n".join(
    p.extract_text(extraction_mode="layout") for p in reader.pages
)
Better extraction with pdfplumber (preserves layout)
import pdfplumber

with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        # Tables: each table is a list of rows, each row a list of cell strings
        for table in page.extract_tables():
            print(table)
        # Words with bbox (x0 = left edge, top = distance from the page top)
        for word in page.extract_words():
            print(word["text"], word["x0"], word["top"])
LLM-grade extraction with Unstructured (2026)
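from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="doc.pdf",
    strategy="hi_res",            # uses a layout-detection model
    infer_table_structure=True,   # Table elements get text_as_html in their metadata
    extract_images_in_pdf=True,   # newer releases appear to prefer extract_image_block_types
)
# Each element is typed: Title, NarrativeText, Table, Image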
Generate PDF (reportlab)
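from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

# Placeholder line items for illustration (undefined in the original snippet)
items = ["Widget x2    20.00", "Shipping      5.00"]

c = canvas.Canvas("out.pdf", pagesize=A4)
c.setFont("Helvetica-Bold", 16)
c.drawString(72, 800, "Invoice #1234")    # origin is bottom-left; units are points
c.setFont("Helvetica", 10)
for i, line in enumerate(items):
    c.drawString(72, 760 - i * 14, line)  # 14 pt line spacing
c.showPage()
c.save()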
Modern HTML→PDF (Playwright, replaces wkhtmltopdf)
import asyncio
from playwright.async_api import async_playwright

async def html_to_pdf(html: str, out: str) -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()   # page.pdf() is supported on Chromium only
        page = await browser.new_page()
        await page.set_content(html)
        await page.pdf(path=out, format="A4", print_background=True)
        await browser.close()

asyncio.run(html_to_pdf("<h1>Invoice #1234</h1>", "out.pdf"))
Sign PDF (PAdES, pyhanko)
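from pyhanko.sign import signers, fields
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter

# SimpleSigner.load takes the private key file first, then the certificate
signer = signers.SimpleSigner.load(key_file="key.pem", cert_file="cert.pem")

# Keep the input open while signing: the incremental writer reads from it
with open("input.pdf", "rb") as inf, open("signed.pdf", "wb") as out:
    w = IncrementalPdfFileWriter(inf)
    fields.append_signature_field(w, sig_field_spec=fields.SigFieldSpec("Sig1", box=(50, 50, 200, 100)))
    meta = signers.PdfSignatureMetadata(
        field_name="Sig1",
        subfilter=fields.SigSeedSubFilter.PADES,  # request a PAdES-style signature
    )
    signers.sign_pdf(w, meta, signer=signer, output=out)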
Repair / linearize (qpdf CLI)
qpdf --linearize input.pdf output.pdf
qpdf --object-streams=generate --compress-streams=y input.pdf small.pdf
qpdf --check input.pdf # validate xref + structure
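When qpdf --check runs inside an ingest pipeline, the exit status is usually more useful than the printed report. The sketch below wraps the call from Python, assuming the exit codes documented in the qpdf manual (0 for success, 2 for errors, 3 for warnings only). The function names and the returned labels are hypothetical choices for this example.

```python
import shutil
import subprocess

# Exit codes as documented by qpdf; anything else is reported as "unknown"
QPDF_EXIT = {0: "ok", 2: "error", 3: "warning"}


def interpret_exit(code: int) -> str:
    """Map a qpdf exit status to a short label."""
    return QPDF_EXIT.get(code, "unknown")


def check_pdf(path: str) -> str:
    """Run `qpdf --check` on a file and classify the result."""
    if shutil.which("qpdf") is None:       # qpdf is not on PATH
        return "qpdf-missing"
    proc = subprocess.run(
        ["qpdf", "--check", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return interpret_exit(proc.returncode)


print(check_pdf("input.pdf"))
```

A "warning" result often means qpdf could recover the file, so one plausible policy is to rewrite such files through qpdf before extraction and to reject only those labeled "error".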
Encrypted PDF
from pypdf import PdfWriter

writer = PdfWriter(clone_from="doc.pdf")
# AES encryption needs an extra crypto dependency such as the cryptography package
writer.encrypt(user_password="user", owner_password="owner", algorithm="AES-256")
with open("encrypted.pdf", "wb") as f:
    writer.write(f)
Decision criteria
| Situation | Tool |
|---|---|
| Text extraction (simple) | pypdf 5.x |
| Layout / tables | pdfplumber |
| LLM RAG ingest | Unstructured + hi_res / Marker / Docling |
| Generation (reports) | reportlab / WeasyPrint |
| HTML → PDF (modern) | Playwright (Chrome headless) |
| Forms / signing | pyhanko + qpdf |
| Repair / optimize | qpdf, mutool |
Defaults: for ingestion, Unstructured (layout-aware); for generation, Playwright (HTML).
🔗 Graph
- Parent: Document Format · ISO Standards
- Variants: PDF/A · PDF/X · XFA
- Applications: RAG Ingestion · Document AI · Digital Signature
- Adjacent: PostScript · OCR · Layout Analysis
🤖 LLM usage
When: questions about form-filled PDFs, choosing an extraction tool, schema mapping. When not: do not have an LLM edit the binary blob directly; use a spec-conformant tool instead.
❌ Anti-patterns
- Regex-based PDF parsing: fragile against binary content and xref offsets. Use a library.
- Single extraction strategy: scanned PDFs need an OCR fallback; use the hi_res strategy.
- No PDF/A for archive: missing fonts cause rendering failures later.
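The OCR-fallback point can be sketched as a small routing check: extract the text layer first, then decide per document whether it is usable. Everything here is an assumption made for illustration: the function name pick_strategy, the "text" and "ocr" labels, and both thresholds are hypothetical and would need tuning on real documents.

```python
def pick_strategy(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Return "text" when the text layer looks usable, otherwise "ocr".

    page_texts: one extracted string (or None) per page.
    A page counts as empty when it has fewer than min_chars_per_page
    non-whitespace-trimmed characters; the document is routed to OCR when
    more than max_empty_ratio of its pages are empty.
    """
    if not page_texts:                     # nothing extracted at all
        return "ocr"
    empty = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return "ocr" if empty / len(page_texts) > max_empty_ratio else "text"


# A scanned document typically yields empty strings from the text layer
print(pick_strategy(["", "   ", None]))
# A born-digital document yields real text on most pages
print(pick_strategy(["Quarterly report " * 10, "Revenue grew in Q2 " * 5]))
```

In a pipeline, the "ocr" branch would hand the file to an OCR-capable path (for example a layout-model strategy), while the "text" branch can stay on the cheaper direct extraction.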
🧪 Verification / duplicates
- Verified (ISO 32000-2:2020, pypdf docs, Unstructured docs, qpdf manual).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — PDF structure + 2026 extraction/generation toolchain |