v2.2.256: core chat large-input chunking and integration + actual context window alignment + model handle race fix

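Fixes, in three layers, the failure where a large input killed the entire turn with "Failed to acquire LM Studio model handle … Operation canceled". Regular chat (the core path) had so far been a single budgeted call, so it collapsed on weak models and large inputs; this change closes that gap.

- Handle race fix: move getModelHandle inside the retry loop. Cancellation and dead-handle class errors are retried automatically once after the SDK is recreated (a genuine user cancellation is respected). Cause: a concurrent load in the lifecycle was aborted, which also killed the JIT lookup that the SDK had coalesced with it.
- Phase 1, actual window alignment: clamp the budget to the measured window from llm.getContextLength() (cached). This prevents server truncation and empty answers when the model was loaded with a window smaller than the configured value. The measured window is shown in the badge.
- Phase 2, core Map-Reduce: when a single input exceeds (effective window × ratio), run chunk → query-aware extraction → integration. Partial and full fallbacks; an honest signal when nothing in the input is relevant. Default concurrency: 2.
- Phase 3, metadata exposure: progress and result badges; opt-in [조각 k] ("chunk k") provenance tags.

5 new settings. The dedicated /meet and /review paths are unchanged. Tests: +25, all 684 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>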
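
As a rough illustration of how the Phase 1 clamp and the Phase 2 trigger described above could fit together, here is a minimal TypeScript sketch. It is not taken from the Astra source: the names `MapReduceSettings`, `effectiveWindow`, and `shouldMapReduce` are hypothetical, token counts are assumed to be computed elsewhere, and only the setting names and the 0.3 to 0.95 ratio bounds come from this commit's package.json changes.

```typescript
// Hypothetical sketch; function and type names are illustrative assumptions,
// not the actual Astra implementation.
interface MapReduceSettings {
  enabled: boolean;     // mirrors g1nation.largeInputMapReduce
  triggerRatio: number; // mirrors g1nation.mapReduceTriggerRatio
}

// Phase 1 idea: never budget against more than the window the model was
// actually loaded with. `measured` stands in for a cached
// llm.getContextLength() result and may be unavailable.
function effectiveWindow(configured: number, measured?: number): number {
  if (measured === undefined || measured <= 0) {
    return configured; // no measurement: fall back to the configured value
  }
  return Math.min(configured, measured);
}

// Phase 2 idea: map-reduce engages when a single input exceeds
// (effective window × ratio). The ratio is clamped to the 0.3–0.95
// bounds declared in the settings schema.
function shouldMapReduce(
  inputTokens: number,
  windowTokens: number,
  settings: MapReduceSettings
): boolean {
  if (!settings.enabled) {
    return false;
  }
  const ratio = Math.min(0.95, Math.max(0.3, settings.triggerRatio));
  return inputTokens > windowTokens * ratio;
}

// Usage: configured for 32768 tokens, but the model was loaded with 8192.
const windowTokens = effectiveWindow(32768, 8192);
const settings: MapReduceSettings = { enabled: true, triggerRatio: 0.6 };
const engage = shouldMapReduce(5000, windowTokens, settings);
```

With these assumed numbers the threshold is 8192 × 0.6 = 4915.2 tokens, so a 5000-token message would take the map-reduce path while a 4000-token one would still go out as a single call.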
2026-06-19 18:05:44 +09:00
parent 6adbc2a6fa
commit 76d5fedfb5
13 changed files with 883 additions and 19 deletions
@@ -2,7 +2,7 @@
  "name": "astra",
  "displayName": "Astra",
  "description": "The personal intelligence layer for Antigravity and VS Code. A private cognitive partner for deep project context, memory, and proactive strategic decision-making.",
-  "version": "2.2.255",
+  "version": "2.2.256",
  "publisher": "g1nation",
  "license": "MIT",
  "icon": "assets/icon.png",
@@ -441,6 +441,37 @@
          "minimum": 0,
          "description": "Optional safety knob, OFF by default (0). Some very small models (≤3B) emit an empty/EOS response when given a prompt near their context window even though it nominally fits. If you observe that with a tiny model, set this to e.g. 8192–16384: for ≤3B models only, Astra then budgets the prompt against this smaller effective window instead of g1nation.contextLength. Never applies to 4B+ models. Leave 0 unless you actually hit the issue — it reduces the output-token budget. Default: 0 (disabled)"
        },
+        "g1nation.largeInputMapReduce": {
+          "type": "boolean",
+          "default": true,
+          "description": "When a single message is too large to fit the model's context window, split it into chunks, extract only the request-relevant facts from each (no hallucination/summary), integrate them, and answer from the condensed result. Turn off to send oversized input in one shot (the server may then truncate it). Default: true"
+        },
+        "g1nation.mapReduceTriggerRatio": {
+          "type": "number",
+          "default": 0.6,
+          "minimum": 0.3,
+          "maximum": 0.95,
+          "description": "Map-reduce kicks in when a single message exceeds (effective context window × this ratio). Lower = engages sooner (safer for big inputs, more LLM calls). Default: 0.6"
+        },
+        "g1nation.mapReduceConcurrency": {
+          "type": "number",
+          "default": 2,
+          "minimum": 1,
+          "maximum": 8,
+          "description": "How many chunk extractions run in parallel. Keep low on a single local GPU (one model serves them sequentially anyway). Default: 2"
+        },
+        "g1nation.mapReduceMaxDepth": {
+          "type": "number",
+          "default": 3,
+          "minimum": 1,
+          "maximum": 6,
+          "description": "Maximum hierarchical-integration depth when the combined extractions still overflow the window. Default: 3"
+        },
+        "g1nation.mapReduceShowProvenance": {
+          "type": "boolean",
+          "default": false,
+          "description": "Tag each extracted block with its source chunk ([조각 k]) so the final answer can be traced back to the part of the input it came from. Default: false"
+        },
        "g1nation.autoContinueOnOutputLimit": {
          "type": "boolean",
          "default": true,