feat(scoring): completed roadmap Phase 1 & 2 with edge case stability v2.74.0
@@ -0,0 +1,43 @@
# Project Chronicle Guard: Search Engine Roadmap

## 🎯 Current Status: v2.74.0

- [x] **Phase 1: Linguistic Foundation Stabilization** (Completed)
- [x] **Phase 2: Conflict Scoring Refinement** (Completed)
- [ ] **Phase 3: Performance Scaling & Caching** (In Progress)
- [ ] **Phase 4: Excerpt Precision Tuning** (Planned)
- [ ] **Phase 5: Downstream Integration API** (Planned)

---

## 🔬 Phase Details

### Phase 1: Linguistic Foundation (v2.72.0 - v2.74.0)
- **Goal**: Accurate tokenization of mixed Korean/English text and special characters.
- **Achievement**:
  - Bilingual boundary split (e.g., 'Astra의' -> 'Astra', '의').
  - Hangeul monosyllable preservation (e.g., '한', '글').
  - Zero-width character cleaning.

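The three Phase 1 behaviours can be illustrated with a minimal Python sketch. It is not the project's tokenizer: the names `ZERO_WIDTH`, `TOKEN`, and `tokenize` are hypothetical, and the sketch assumes that a token is either a run of Hangul syllables or a run of ASCII letters and digits, with every other character treated as a separator.

```python
import re

# Assumption: these zero-width code points are the ones to strip
# (ZWSP, ZWNJ, ZWJ, word joiner, BOM). The real list may differ.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

# A token is a run of Hangul syllables OR a run of ASCII letters/digits,
# so a script change inside a word forces a split at the boundary.
TOKEN = re.compile(r"[\uac00-\ud7a3]+|[A-Za-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Strip zero-width characters, then split on script boundaries."""
    cleaned = ZERO_WIDTH.sub("", text)
    # No minimum-length filter is applied, so single-syllable Hangul
    # tokens such as '한' or '글' are kept rather than dropped.
    return TOKEN.findall(cleaned)


if __name__ == "__main__":
    print(tokenize("Astra의"))
    print(tokenize("한 글"))
```

Stripping zero-width characters before matching matters here: if one were left inside 'Astra', the regex would treat it as a separator and split the word in two.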
### Phase 2: Conflict Scoring (v2.73.0 - v2.74.0)
- **Goal**: Quantitative risk assessment for information conflicts.
- **Achievement**:
  - Tiered severity logic (NONE, LOW, MEDIUM, HIGH).
  - Substring-based detection to overcome particle interference.
  - Configurable thresholds via `SCORING_CONFIG`.

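As a rough illustration of how tiered severity and substring detection might fit together, here is a Python sketch. Only the tier names and the `SCORING_CONFIG` identifier come from the roadmap; the config keys, the threshold values, and the helpers `conflict_ratio` and `severity` are assumptions made for the example.

```python
# Hypothetical shape and values; the real SCORING_CONFIG is not shown in the roadmap.
SCORING_CONFIG = {"low": 0.2, "medium": 0.5, "high": 0.8}


def conflict_ratio(terms: list[str], text: str) -> float:
    """Fraction of terms found in the text by substring match.

    Substring matching finds 'Astra' inside 'Astra의', where comparing
    whole whitespace-delimited words would miss it because of the
    attached particle.
    """
    if not terms:
        return 0.0
    hits = sum(1 for term in terms if term in text)
    return hits / len(terms)


def severity(score: float, config: dict[str, float] = SCORING_CONFIG) -> str:
    """Map a score in [0, 1] to a tier, checking the highest tier first."""
    if score >= config["high"]:
        return "HIGH"
    if score >= config["medium"]:
        return "MEDIUM"
    if score >= config["low"]:
        return "LOW"
    return "NONE"


if __name__ == "__main__":
    ratio = conflict_ratio(["Astra", "방패"], "Astra의 검은 부러졌다")
    print(ratio, severity(ratio))
```

Passing the thresholds in as a parameter keeps the tiers tunable without touching the scoring logic, which is one plausible reading of "configurable thresholds".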
### Phase 3: Performance Scaling (v2.75.0+)
- **Goal**: Sub-10ms response for 10k+ documents.
- **Action**:
  - Global module-level caching for IDF and tokens.
  - Potential worker thread offloading for heavy scoring.

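One way module-level caching of tokens and IDF values could look is sketched below in Python. Everything here is an assumption for illustration: the whitespace tokenizer stands in for the real one, the smoothed IDF formula is a common variant rather than the project's, and `corpus_key` is a hypothetical identifier the caller must change whenever the corpus changes.

```python
import math
from functools import lru_cache
from typing import Sequence


@lru_cache(maxsize=10_000)
def cached_tokens(text: str) -> tuple[str, ...]:
    """Tokenize once per distinct text (placeholder whitespace tokenizer)."""
    return tuple(text.split())


# Module-level cache: lives as long as the process and is shared by all callers.
_IDF_CACHE: dict[str, dict[str, float]] = {}


def idf_table(corpus_key: str, docs: Sequence[str]) -> dict[str, float]:
    """Return the IDF table for a corpus, computing it only on first use."""
    table = _IDF_CACHE.get(corpus_key)
    if table is None:
        doc_freq: dict[str, int] = {}
        for doc in docs:
            for term in set(cached_tokens(doc)):
                doc_freq[term] = doc_freq.get(term, 0) + 1
        n = len(docs)
        # Smoothed IDF: log((1 + N) / (1 + df)) + 1
        table = {
            term: math.log((1 + n) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }
        _IDF_CACHE[corpus_key] = table
    return table
```

The trade-off in this sketch is staleness: a cached table is only correct while the corpus behind `corpus_key` is unchanged, so any real implementation needs an invalidation rule.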
### Phase 4: Excerpt Precision (Planned)
- **Goal**: Maximize context signal-to-noise ratio.
- **Action**:
  - Density-based window starting point restriction.
  - Multi-stage filtering for optimal text chunking.

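The roadmap does not define "density-based window starting point restriction", so the following Python sketch shows only one possible interpretation: candidate excerpt windows may start only at a token that matches the query, and the window with the most matches per slot wins. The function name `best_excerpt` and the fixed window size are hypothetical, and multi-stage filtering is not covered.

```python
def best_excerpt(tokens: list[str], query_terms: list[str], window: int = 8) -> list[str]:
    """Pick the fixed-size window with the highest query-term density.

    Windows are only allowed to start on a matching token, which avoids
    excerpts that open with unrelated lead-in text.
    """
    terms = set(query_terms)
    starts = [i for i, tok in enumerate(tokens) if tok in terms]
    if not starts:
        # No match anywhere: fall back to the beginning of the text.
        return tokens[:window]

    def density(start: int) -> float:
        chunk = tokens[start:start + window]
        # Divide by the full window size so that short windows at the
        # end of the text are not scored artificially high.
        return sum(tok in terms for tok in chunk) / window

    best = max(starts, key=density)  # earliest start wins on ties
    return tokens[best:best + window]
```

For example, with tokens `a x b c d x y e`, query terms `x` and `y`, and `window=3`, the candidates start at indices 1, 5, and 6, and the window starting at index 5 (`x y e`) has the highest density.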
### Phase 5: Integration (Planned)
- **Goal**: Seamless RAG pipeline integration.
- **Action**:
  - Strict IO schema definition for downstream AI agents.
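
A strict IO schema could be sketched in Python as a frozen dataclass that rejects unknown fields and out-of-range values. The field names (`doc_id`, `severity`, `score`, `excerpt`), the score range, and the `parse_result` helper are hypothetical; only the four severity tiers are taken from the roadmap.

```python
from dataclasses import dataclass, asdict

SEVERITIES = ("NONE", "LOW", "MEDIUM", "HIGH")


@dataclass(frozen=True)
class ConflictResult:
    """Hypothetical output record handed to a downstream agent."""
    doc_id: str
    severity: str
    score: float
    excerpt: str

    def __post_init__(self) -> None:
        if not isinstance(self.doc_id, str) or not isinstance(self.excerpt, str):
            raise ValueError("doc_id and excerpt must be strings")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(self.score, bool) or not isinstance(self.score, (int, float)):
            raise ValueError("score must be a number")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be within [0, 1]")


def parse_result(payload: dict) -> ConflictResult:
    """Build a ConflictResult, rejecting missing or unexpected keys."""
    expected = {"doc_id", "severity", "score", "excerpt"}
    if set(payload) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(payload)}")
    return ConflictResult(**payload)
```

`asdict(result)` turns a validated record back into a plain dictionary for serialization, so the same definition can guard both input and output.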