---
id: wiki-2026-0508-optical-character-recognition
title: Optical Character Recognition
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [OCR, Text Recognition, Document AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ocr, document-ai, tesseract, paddleocr, donut, vision-language]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: paddleocr/transformers }
---

# Optical Character Recognition

## One-liner

Technology for extracting text from images, scans, and PDFs, ranging from the traditional detection + CRNN pipeline to modern Donut/TrOCR/Surya models and direct extraction with VLMs.

## Key Points

- **Pipeline**: detection (DBNet/CRAFT) → recognition (CRNN/Transformer) → post-processing (NMS, layout).
- **Engines**:
  - Tesseract (mature, multilingual, free; lower accuracy)
  - **EasyOCR** (PyTorch, 80+ languages)
  - **PaddleOCR** (strong overall, especially good at Chinese/Korean)
  - **Surya** (modern multilingual, layout-aware)
  - **TrOCR** (Hugging Face, word/line-level recognition)
  - **Donut** (OCR-free document understanding)
- **VLM-OCR**: GPT-5 Vision, Claude Sonnet 4.7, Qwen2.5-VL. Strong on complex forms, tables, and handwriting.
- **Layout**: LayoutLMv3, DocFormer, Nougat (academic papers).
- **Evaluation**: CER (Character Error Rate), WER (Word Error Rate), F1 on key fields.

## 💻 Patterns

```python
# 1. Tesseract: a quick baseline
# brew install tesseract tesseract-lang
# pip install pytesseract pillow
import pytesseract
from PIL import Image

img = Image.open("invoice.png")
text = pytesseract.image_to_string(img, lang="eng+kor")

# Word-level results; use the same languages as above.
data = pytesseract.image_to_data(
    img, lang="eng+kor", output_type=pytesseract.Output.DICT
)
# data holds parallel lists under the keys
# text, conf, left, top, width, height (among others)
```

```python
# 2. EasyOCR: Python-only, multilingual
# pip install easyocr
import easyocr

reader = easyocr.Reader(["en", "ko"])  # downloads models on first use
results = reader.readtext("receipt.jpg", detail=1)
for box, text, conf in results:
    print(f"{conf:.2f}: {text}")
```

```python
# 3. PaddleOCR: strong on Korean/Chinese/English
# pip install paddleocr paddlepaddle
# Note: this is the PaddleOCR 2.x call style; `show_log` and `cls=` are
# not accepted by newer major releases, so pin the version or adapt.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="korean", show_log=False)
res = ocr.ocr("doc.png", cls=True)
# res[0] is None when no text is detected on the page
for line in res[0] or []:
    box, (txt, conf) = line
    print(txt, conf)
```

```python
# 4. Surya: modern, layout + reading order
# pip install surya-ocr
# Note: these imports follow the older function-based surya API; newer
# releases reorganized it, so pin the version or check the project README.
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as det_model, load_processor as det_proc
from surya.model.recognition.model import load_model as rec_model
from surya.model.recognition.processor import load_processor as rec_proc
from PIL import Image

img = Image.open("page.png")
preds = run_ocr([img], [["ko", "en"]], det_model(), det_proc(), rec_model(), rec_proc())
for line in preds[0].text_lines:
    print(line.text, line.confidence)
```

```python
# 5. TrOCR (Hugging Face): line-level recognition
# Input must be a cropped single text line, not a full page.
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

proc = TrOCRProcessor.from_pretrained("microsoft/trocr-large-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-printed")

img = Image.open("line.png").convert("RGB")
pix = proc(img, return_tensors="pt").pixel_values
ids = model.generate(pix)
print(proc.batch_decode(ids, skip_special_tokens=True)[0])
```

```python
# 6. Donut: OCR-free document QA / parsing
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

proc = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Task start token for the CORD-v2 checkpoint
prompt = "<s_cord-v2>"
ids = proc.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
pix = proc(Image.open("receipt.png").convert("RGB"), return_tensors="pt").pixel_values
out = model.generate(pix, decoder_input_ids=ids, max_length=512)

# Strip EOS/PAD and the leading task token before converting to JSON
seq = proc.batch_decode(out)[0]
seq = seq.replace(proc.tokenizer.eos_token, "").replace(proc.tokenizer.pad_token, "")
seq = re.sub(r"<.*?>", "", seq, count=1).strip()
print(proc.token2json(seq))
# Emits JSON directly, e.g. {"menu": [...], "total": ...}
```

```python
# 7. Direct OCR with a VLM (Claude Vision)
# pip install anthropic
import base64

import anthropic

with open("scan.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-sonnet-4-7",  # model ID as written in this note; verify against the current model list
    max_tokens=2048,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
        {"type": "text", "text": "Extract all text. Output JSON with fields: title, date, line_items[{name, qty, price}], total."},
    ]}],
)
print(resp.content[0].text)
```

```python
# 8. PDF → text (pdfplumber with OCR fallback)
# pip install pdfplumber pdf2image pytesseract  (pdf2image also needs poppler)
import pdfplumber
import pytesseract
from pdf2image import convert_from_path


def pdf_to_text(path, lang="eng+kor", dpi=300):
    out = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            t = page.extract_text() or ""
            if len(t.strip()) < 20:  # almost no text layer: treat as a scanned page and OCR it
                img = convert_from_path(path, dpi=dpi, first_page=i + 1, last_page=i + 1)[0]
                t = pytesseract.image_to_string(img, lang=lang)
            out.append(t)
    return "\n\n".join(out)
```

## Decision Criteria

| Situation | Recommendation |
|---|---|
| Quick prototype, printed English | Tesseract |
| General multilingual | PaddleOCR / EasyOCR |
| High-accuracy Korean | PaddleOCR (Korean) / Surya |
| Modern multilingual + layout | Surya |
| Line-level accuracy | TrOCR |
| Receipts, invoices, forms → structured JSON | Donut / VLM |
| Complex tables, handwriting | Claude Sonnet / GPT-5 Vision |
| Academic PDFs (equations) | Nougat |
| Digital PDFs | pdfplumber, with OCR fallback |

## 🔗 Graph

- Related: `[[Document-AI]]`

## 🤖 LLM Usage

- A VLM performs OCR and reasoning in one pass (for example, receipt → automatic per-item category classification).
- Traditional OCR + LLM post-processing: normalize typos and line breaks, infer missing fields.

## ❌ Anti-patterns

- Feeding low-quality images in as-is (preprocessing is required: deskew, binarize, DPI ≥ 300).
- Shipping with Tesseract alone while Korean final consonants (batchim) come out broken.
- Sending a large PDF to a VLM in one request; chunk it per page instead.
- Using raw text while ignoring confidence thresholds.

## 🧪 Verification

- Measure CER/WER against ground truth.
- Field-level F1 on critical fields (date, total, total_tax, and so on).
- Maintain a sample suite covering varied fonts, resolutions, and rotations.

## 🕓 Changelog

- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; added Surya/Donut/VLM.
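
As a supplement to the verification section (CER/WER against ground truth), the two metrics can be sketched in plain Python. This is a minimal sketch, not a reference implementation: `levenshtein`, `cer`, and `wer` are hypothetical helper names introduced here, and the NFC normalization step is an assumption added so that composed and decomposed Hangul compare as equal.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def _nfc(s):
    # Assumption: normalize to NFC so Hangul jamo sequences match composed syllables
    return unicodedata.normalize("NFC", s)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count (whitespace split)."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    truth = "total 12,000 won"
    ocr_out = "total 12,00O won"  # the OCR engine read the last zero as the letter O
    print(f"CER={cer(truth, ocr_out):.3f}  WER={wer(truth, ocr_out):.3f}")
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference. Whitespace-based WER is a rough fit for Korean, where spacing is often inconsistent, so CER is usually the more stable of the two for that language.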