"Unstructured text → structured signal." Text mining is the field of extracting patterns, entities, relationships, and sentiment from large text corpora. The paradigm is shifting from traditional methods (TF-IDF, NER models) to LLM-based extraction (structured output, function calling).
Biomedical literature mining (gene/protein/disease NER).
💻 Patterns
spaCy NER (traditional)
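import spacy

# Load the transformer-based English pipeline (must be installed separately)
nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple acquired Anthropic for $50B in March 2025.")

# Print each recognized entity with its label
for ent in doc.ents:
    print(ent.text, ent.label_)
# Apple ORG, Anthropic ORG, $50B MONEY, March 2025 DATE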
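For the LLM-based side of the shift, the model call itself depends on the provider, so the sketch below covers only the step after it: validating the model's JSON reply against a small schema before trusting it. The `Entity` class, the `ALLOWED_LABELS` set, the `parse_entities` helper, and the `{"entities": [...]}` reply shape are all assumptions made for illustration, not any particular library's API.

```python
import json
from dataclasses import dataclass

# Hypothetical label schema for this sketch
ALLOWED_LABELS = {"ORG", "MONEY", "DATE"}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


def parse_entities(raw: str, source_text: str) -> list[Entity]:
    """Parse a model's JSON reply and keep only entities that pass validation."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: treat as "nothing extracted"
    if not isinstance(payload, dict):
        return []
    items = payload.get("entities", [])
    if not isinstance(items, list):
        return []

    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, label = item.get("text"), item.get("label")
        if not isinstance(text, str) or not isinstance(label, str):
            continue
        if label not in ALLOWED_LABELS:
            continue  # label outside the schema
        if text not in source_text:
            continue  # span does not occur verbatim in the source: likely hallucinated
        entities.append(Entity(text, label))
    return entities


source = "Apple acquired Anthropic for $50B in March 2025."
# Stand-in for a model reply; "Google" is not in the source, "PRICE" is not in the schema
raw_reply = (
    '{"entities": ['
    '{"text": "Apple", "label": "ORG"}, '
    '{"text": "Google", "label": "ORG"}, '
    '{"text": "$50B", "label": "PRICE"}]}'
)
print(parse_entities(raw_reply, source))
# [Entity(text='Apple', label='ORG')]
```

The verbatim-span check is a deliberately strict choice: it rejects paraphrased entities as well as invented ones, which may or may not suit a given schema.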
When to use: querying, extracting from, or classifying an unstructured text corpus; schema-driven extraction; low-to-medium volume.
When not to use: millisecond latency is required (e.g. real-time chat moderation). Use a small distilled model instead.
❌ Anti-patterns
Regex-only complex extraction: brittle. A regex + LLM hybrid degrades more gracefully.
No evaluation set: LLMs hallucinate. Maintain a ground-truth eval set.
Running the LLM over the full document for every query: cache results, or pre-extract into a structured DB.
Skipping Unicode normalization: Korean/CJK text must be NFC-normalized.
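Two of the points above, keeping a ground-truth eval set and NFC-normalizing text, can be sketched together: the example scores predicted entities against a gold set and normalizes both sides first, so that decomposed Hangul (NFD) still matches its precomposed form. The `nfc` and `score` helpers and the sample entities are hypothetical; only `unicodedata.normalize` is a standard-library call.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (precomposed) form of the text."""
    return unicodedata.normalize("NFC", text)


def score(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over (text, label) pairs."""
    pred = {(nfc(text), label) for text, label in predicted}
    ref = {(nfc(text), label) for text, label in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold annotations and model predictions
gold = {("삼성전자", "ORG"), ("서울", "LOC")}

# Simulate text that arrived in decomposed form (NFD), e.g. from some filesystems
decomposed = unicodedata.normalize("NFD", "삼성전자")
predicted = {(decomposed, "ORG"), ("부산", "LOC")}

print(decomposed == "삼성전자")  # False: same glyphs, different code points
print(score(predicted, gold))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Without the normalization step the decomposed prediction would count as a miss, and the scores would drop to zero even though the extraction was correct.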