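# Phase R.1: Token-Aware Chunking Optimization

- Date: 2026-04-03
- Status: Planned
- Depends on: R.0 (Current State and Goals)
- Effort: 3 days

---

## 1. Purpose of This Phase

Eliminate information loss at chunk boundaries by implementing exact token counting and overlapping chunks.

---

## 2. Core Tasks

### Task R.1.1: Integrate tiktoken

**Goal:** Count tokens exactly with tiktoken and keep every chunk within an 85% safety margin of the token limit.

**New file:** `backend/app/services/chunker.py`

```python
import tiktoken


class TokenAwareChunker:
    """Token-aware chunker: 85% safety margin plus 10% overlap."""

    def __init__(self, max_tokens: int = 8000, overlap_ratio: float = 0.1):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        # Usable budget per chunk: 85% of the hard limit.
        self.safe_max = int(max_tokens * 0.85)
        # Overlap is a fraction of the safe budget, not of max_tokens.
        self.overlap_tokens = int(self.safe_max * overlap_ratio)

    def count_tokens(self, text: str) -> int:
        # disallowed_special=() makes literal strings such as "<|endoftext|>"
        # in user documents count as plain text instead of raising an error.
        return len(self.encoding.encode(text, disallowed_special=()))
```

---

### Task R.1.2: Implement Smart Sentence Breaking

**Goal:** Break at punctuation or whitespace rather than in the middle of a word. `find_best_breakpoint` is a module-level function in `chunker.py`.

```python
# Candidate break characters: newline, full-width CJK punctuation,
# their ASCII equivalents, and whitespace.
BREAK_POINTS = ['\n', '。', '!', '?', ',', ';', ':',
                '!', '?', ',', ';', ':', ' ', '\t']


def find_best_breakpoint(text: str, max_pos: int) -> int:
    """Find the best break point near max_pos (just after punctuation/whitespace)."""
    max_pos = min(max_pos, len(text))  # never index past the end of the text
    # Scan backwards over a window of at most 200 characters.
    for i in range(max_pos - 1, max(0, max_pos - 200), -1):
        if text[i] in BREAK_POINTS:
            return i + 1
    # No break character in the window: fall back to a hard cut.
    return max_pos
```

---

### Task R.1.3: Implement Overlapping Chunking

**Goal:** Overlap consecutive chunks by 10% of the token budget so that context carries across chunk boundaries. `chunk_with_overlap` is a method of `TokenAwareChunker`.

```python
def chunk_with_overlap(self, text: str) -> list[dict]:
    """Chunk with overlap: the tail of each chunk opens the next one."""
    sentences = self._split_sentences(text)
    chunks: list[dict] = []
    current_chunk = ""
    current_tokens = 0

    for sentence in sentences:
        sentence_tokens = self.count_tokens(sentence)

        if sentence_tokens > self.safe_max:
            # Flush pending text first so chunks stay in document order.
            if current_chunk.strip():
                chunks.append({"content": current_chunk.strip()})
            # Force-split the overlong sentence.
            chunks.extend(self._force_split_long_text(sentence))
            current_chunk, current_tokens = "", 0
            continue

        if current_chunk and current_tokens + sentence_tokens > self.safe_max:
            chunks.append({"content": current_chunk.strip()})
            # Seed the next chunk with the tail of the chunk just emitted
            # (at most overlap_tokens tokens). A chunk may therefore exceed
            # safe_max by up to overlap_tokens, which still fits max_tokens.
            current_chunk = self._create_overlap(current_chunk)
            current_tokens = self.count_tokens(current_chunk)

        current_chunk += sentence
        current_tokens += sentence_tokens

    if current_chunk.strip():
        chunks.append({"content": current_chunk.strip()})

    return chunks
```

---

## 3. Changes to Existing Files

### `backend/app/services/document_service.py`

Integrate the new `TokenAwareChunker`:

```python
from app.services.chunker import TokenAwareChunker


class DocumentService:
    def __init__(self, ...):
        # ... existing init
        self.chunker = TokenAwareChunker()

    def _build_chunks(self, parsed: ParsedDocument) -> list[dict]:
        # Replace the previous logic with overlapping chunking.
        chunks = self.chunker.chunk_with_overlap(parsed.summary)
        for node in parsed.nodes:
            chunks.extend(self.chunker.chunk_with_overlap(node.text))
        return chunks
```

---

## 4. New Tests

**New file:** `backend/tests/services/test_chunker.py`

```python
import pytest

from app.services.chunker import (
    BREAK_POINTS,
    TokenAwareChunker,
    find_best_breakpoint,
)


class TestTokenAwareChunker:
    def test_token_counting(self):
        chunker = TokenAwareChunker(max_tokens=100)
        text = "Hello, world!"
        assert chunker.count_tokens(text) > 0

    def test_overlap_ratio(self):
        chunker = TokenAwareChunker(max_tokens=100, overlap_ratio=0.1)
        # Overlap is 10% of safe_max (85), not of max_tokens: int(8.5) == 8.
        assert chunker.overlap_tokens == 8

    def test_safe_max(self):
        chunker = TokenAwareChunker(max_tokens=100)
        assert chunker.safe_max == 85


class TestSmartBreakpoint:
    def test_find_breakpoint_at_punctuation(self):
        text = "Hello, world! How are you?"
        pos = find_best_breakpoint(text, 15)
        # The nearest break character before position 15 is the space after "!".
        assert pos <= 15
        assert text[pos - 1] in BREAK_POINTS


class TestOverlappingChunker:
    def test_chunks_have_overlap(self):
        chunker = TokenAwareChunker(max_tokens=50, overlap_ratio=0.2)
        long_text = "A" * 200 + "." + "B" * 200
        chunks = chunker.chunk_with_overlap(long_text)
        assert len(chunks) >= 2
```

---

## 5. Acceptance Criteria

- [ ] tiktoken is integrated correctly, with a token-count error below 1%
- [ ] Overlong sentences are never split in the middle of a word
- [ ] Overlapping chunks preserve context continuity
- [ ] Unit-test coverage > 80%
- [ ] Integration test passes (document upload → chunking → retrieval)

---

## 6. Changed Files

| File | Action | Description |
|------|--------|-------------|
| `backend/app/services/chunker.py` | Add | Token-aware chunker |
| `backend/app/services/document_service.py` | Modify | Integrate the new chunker |
| `backend/tests/services/test_chunker.py` | Add | Chunker unit tests |

---

## 7. Effort Estimate

| Task | Estimate |
|------|----------|
| R.1.1 tiktoken integration | 0.5 days |
| R.1.2 Smart sentence breaking | 0.5 days |
| R.1.3 Overlapping chunking | 1 day |
| Testing + debugging | 1 day |
| **R.1 total** | **3 days** |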
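
Task R.1.3 calls three helpers, `_split_sentences`, `_force_split_long_text`, and `_create_overlap`, that this plan names but does not define. The block below is a minimal sketch of one way they might look, not the planned implementation. Several things in it are assumptions: the helpers are written as free functions taking a hypothetical `count_tokens` callable (so the sketch runs without tiktoken), the parameter lists are invented for illustration, the set of sentence terminators is a guess, and the overlap is built from whole trailing sentences rather than a raw token slice.

```python
import re
from typing import Callable

# Hypothetical type for whatever counts tokens, e.g. TokenAwareChunker.count_tokens.
TokenCounter = Callable[[str], int]

# Assumed sentence terminators (CJK and ASCII) and fallback break characters.
_SENTENCE = re.compile(r"[^。!?!?.\n]*[。!?!?.\n]+|[^。!?!?.\n]+")
_BREAKS = "\n。!?,;:!?,;: \t"


def split_sentences(text: str) -> list[str]:
    """Split after sentence terminators; joining the pieces restores the text."""
    return _SENTENCE.findall(text)


def create_overlap(chunk: str, overlap_tokens: int, count_tokens: TokenCounter) -> str:
    """Return the trailing whole sentences of chunk that fit in overlap_tokens."""
    tail: list[str] = []
    used = 0
    for sentence in reversed(split_sentences(chunk)):
        cost = count_tokens(sentence)
        if used + cost > overlap_tokens:
            break
        tail.append(sentence)
        used += cost
    return "".join(reversed(tail))


def _last_break(text: str, max_pos: int) -> int:
    """Same backward scan as find_best_breakpoint, restated so this block runs alone."""
    for i in range(max_pos - 1, max(0, max_pos - 200), -1):
        if text[i] in _BREAKS:
            return i + 1
    return max_pos


def force_split_long_text(text: str, safe_max: int, count_tokens: TokenCounter) -> list[dict]:
    """Cut an overlong sentence into pieces of at most safe_max tokens each."""
    chunks: list[dict] = []
    rest = text
    while rest and count_tokens(rest) > safe_max:
        # Binary-search the longest prefix that fits the budget. This assumes
        # the token count does not shrink as the prefix grows.
        lo, hi = 1, len(rest)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if count_tokens(rest[:mid]) <= safe_max:
                lo = mid
            else:
                hi = mid - 1
        cut = _last_break(rest, lo)  # back up to punctuation/whitespace if any
        piece = rest[:cut].strip()
        if piece:
            chunks.append({"content": piece})
        rest = rest[cut:]
    if rest.strip():
        chunks.append({"content": rest.strip()})
    return chunks
```

With a whitespace word counter standing in for `count_tokens`, `force_split_long_text("a b c d e f g h i j", 4, lambda s: len(s.split()))` yields three pieces: `a b c d`, `e f g h`, and `i j`. Two limitations are worth keeping in mind. If the last sentence of a chunk is larger than the overlap budget, `create_overlap` returns an empty string, so that boundary gets no overlap. The binary search also relies on token counts growing with prefix length, which holds only approximately for BPE tokenizers.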