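Phase R.1: Token-Aware Chunking Optimization
Date: 2026-04-03 Status: Planned Depends on: R.0 (Current State and Goals) Effort: 3 days
1. Purpose of This Phase
Eliminate the loss of information at chunk boundaries by implementing exact token counting and overlapping chunks.
2. Core Tasks
Task R.1.1: Integrate tiktoken
Goal: Count tokens exactly with tiktoken and cap each chunk at 85% of max_tokens as a safety margin
New file: backend/app/services/chunker.py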
import tiktoken


class TokenAwareChunker:
    """Token-aware chunker: 85% safety margin + 10% overlap."""

    def __init__(self, max_tokens: int = 8000, overlap_ratio: float = 0.1):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        # Usable budget per chunk: 85% of max_tokens.
        self.safe_max = int(max_tokens * 0.85)
        # The overlap is a fraction of the safe budget, not of max_tokens.
        self.overlap_tokens = int(self.safe_max * overlap_ratio)

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))
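The arithmetic behind the defaults can be checked in isolation. The sketch below mirrors the two formulas in `__init__`; `token_budget` and its `safety_ratio` parameter are hypothetical names introduced only for illustration, and the snippet does not need tiktoken.

```python
def token_budget(max_tokens: int, safety_ratio: float = 0.85,
                 overlap_ratio: float = 0.1) -> tuple[int, int]:
    """Return (safe_max, overlap_tokens) with the same formulas as TokenAwareChunker."""
    safe_max = int(max_tokens * safety_ratio)
    # The overlap is taken from the safe budget, so it is slightly
    # smaller than overlap_ratio * max_tokens.
    overlap_tokens = int(safe_max * overlap_ratio)
    return safe_max, overlap_tokens


print(token_budget(8000))  # (6800, 680)
print(token_budget(100))   # (85, 8)
```

Note that `max_tokens=100` yields an overlap of 8 tokens, not 10, because the 10% is applied to `safe_max`. With the defaults, a chunk that starts with a full overlap and then takes one maximal sentence could reach 6800 + 680 = 7480 tokens, which still fits under 8000, assuming the overlap helper never returns more than `overlap_tokens`.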
Task R.1.2: Implement Smart Sentence Breaking
Goal: Cut at a break point (punctuation or whitespace) so that no word is split in the middle
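# Both ASCII and full-width punctuation count as break points.
BREAK_POINTS = ['\n', '。', '!', '?', ',', ';', ':', '!', '?', ',', ';', ':', ' ', '\t']


def find_best_breakpoint(text: str, max_pos: int) -> int:
    """Find the best cut position at or before max_pos (just after punctuation/whitespace)."""
    # max_pos is a character offset, not a token count; clamp it to the text length.
    max_pos = min(max_pos, len(text))
    # Scan backwards over at most 200 characters, including index 0.
    for i in range(max_pos - 1, max(0, max_pos - 200) - 1, -1):
        if text[i] in BREAK_POINTS:
            return i + 1
    return max_pos  # no break point nearby: fall back to a hard cut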
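To illustrate how the breakpoint search could drive a forced split, here is a minimal sketch. It repeats the backward scan from the plan (with `max_pos` clamped to the text length) so that the snippet runs on its own. `split_by_chars` is a hypothetical helper, and it budgets in characters rather than tokens, which is a simplification.

```python
BREAK_POINTS = ['\n', '。', '!', '?', ',', ';', ':', ' ', '\t']


def find_best_breakpoint(text: str, max_pos: int) -> int:
    """Backward scan for a break point, as in the plan, clamped to len(text)."""
    max_pos = min(max_pos, len(text))
    for i in range(max_pos - 1, max(0, max_pos - 200) - 1, -1):
        if text[i] in BREAK_POINTS:
            return i + 1
    return max_pos


def split_by_chars(text: str, max_chars: int) -> list[str]:
    """Hypothetical helper: cut text into pieces of at most max_chars characters."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    pieces = []
    while text:
        if len(text) <= max_chars:
            # The remainder already fits, so keep it whole.
            pieces.append(text)
            break
        cut = find_best_breakpoint(text, max_chars)
        pieces.append(text[:cut])
        text = text[cut:]
    return pieces


print(split_by_chars("Hello, world! How are you?", 15))
# ['Hello, world! ', 'How are you?']
```

Joining the pieces reproduces the input exactly, so no characters are lost at the cut points. A token-based variant would need to re-count tokens for each candidate piece.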
Task R.1.3: Implement Overlapping Chunks
Goal: Overlap consecutive chunks by 10% of the token budget to preserve context continuity
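    def chunk_with_overlap(self, text: str) -> list[dict]:
        """Chunk text with overlap: the tail of each chunk opens the next one."""
        # _split_sentences, _force_split_long_text and _create_overlap are
        # helper methods of this class that are not spelled out in this plan.
        sentences = self._split_sentences(text)
        chunks = []
        current_chunk = ""
        current_tokens = 0
        for sentence in sentences:
            sentence_tokens = self.count_tokens(sentence)
            if sentence_tokens > self.safe_max:
                # Flush the pending chunk first so the output keeps document order.
                if current_chunk.strip():
                    chunks.append({"content": current_chunk.strip()})
                current_chunk, current_tokens = "", 0
                # Force-split the overlong sentence.
                chunks.extend(self._force_split_long_text(sentence))
                continue
            if current_tokens + sentence_tokens > self.safe_max:
                chunks.append({"content": current_chunk.strip()})
                # Seed the next chunk with the tail of the chunk just closed.
                current_chunk = self._create_overlap(current_chunk)
                current_tokens = self.count_tokens(current_chunk)
            current_chunk += sentence
            current_tokens += sentence_tokens
        if current_chunk.strip():
            chunks.append({"content": current_chunk.strip()})
        return chunks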
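The plan does not define the sentence splitter or the overlap helper, so the following is only one possible shape for them, written as standalone functions. The terminator set, the function names, and the injected `count_tokens` callable are all assumptions. The example uses `len` as a stand-in counter (one "token" per character) so it runs without tiktoken; in the chunker, `self.count_tokens` would be passed instead.

```python
import re
from typing import Callable

# Assumption: a sentence ends at one of these terminators or at a newline.
_SENTENCE_END = re.compile(r'(?<=[。!?!?.\n])')


def split_sentences(text: str) -> list[str]:
    """Split after each terminator, keeping the terminator with its sentence."""
    return [s for s in _SENTENCE_END.split(text) if s]


def create_overlap(chunk: str, overlap_tokens: int,
                   count_tokens: Callable[[str], int]) -> str:
    """Return the trailing sentences of chunk that fit within overlap_tokens."""
    tail: list[str] = []
    used = 0
    for sentence in reversed(split_sentences(chunk)):
        cost = count_tokens(sentence)
        if used + cost > overlap_tokens:
            break
        tail.insert(0, sentence)
        used += cost
    return "".join(tail)


# Stand-in counter: len() treats every character as one token.
print(repr(create_overlap("Aaa. Bbbbb. Cc.", 12, len)))  # ' Bbbbb. Cc.'
print(repr(create_overlap("Aaa. Bbbbb. Cc.", 3, len)))   # ''
```

Working at sentence granularity keeps the overlap readable, at the cost of returning an empty overlap when the last sentence alone exceeds the budget. Summing per-sentence counts is also only an approximation of the token count of the joined text, so a real implementation may want to re-count the final string.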
3. Changes to Existing Files
backend/app/services/document_service.py
Integrate the new TokenAwareChunker:
from app.services.chunker import TokenAwareChunker


class DocumentService:
    def __init__(self, ...):
        # ... existing init
        self.chunker = TokenAwareChunker()

    def _build_chunks(self, parsed: ParsedDocument) -> list[dict]:
        # Overlapping chunking replaces the previous logic.
        chunks = self.chunker.chunk_with_overlap(parsed.summary)
        for node in parsed.nodes:
            chunks.extend(self.chunker.chunk_with_overlap(node.text))
        return chunks
4. New Tests
New file: backend/tests/services/test_chunker.py
import pytest

from app.services.chunker import TokenAwareChunker, find_best_breakpoint


class TestTokenAwareChunker:
    def test_token_counting(self):
        chunker = TokenAwareChunker(max_tokens=100)
        text = "Hello, world!"
        assert chunker.count_tokens(text) > 0

    def test_overlap_ratio(self):
        chunker = TokenAwareChunker(max_tokens=100, overlap_ratio=0.1)
        # The overlap is 10% of safe_max (85), truncated: int(8.5) == 8.
        assert chunker.overlap_tokens == 8

    def test_safe_max(self):
        chunker = TokenAwareChunker(max_tokens=100)
        assert chunker.safe_max == 85


class TestSmartBreakpoint:
    def test_find_breakpoint_at_punctuation(self):
        text = "Hello, world! How are you?"
        # Offset 13 sits right after the "!" at index 12.
        pos = find_best_breakpoint(text, 13)
        assert pos == 13
        assert text[pos - 1] == '!'


class TestOverlappingChunker:
    def test_chunks_have_overlap(self):
        chunker = TokenAwareChunker(max_tokens=50, overlap_ratio=0.2)
        long_text = "A" * 200 + "." + "B" * 200
        chunks = chunker.chunk_with_overlap(long_text)
        assert len(chunks) >= 2
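5. Acceptance Criteria
- tiktoken is integrated correctly, with a token-count error below 1%
- Overlong sentences are never split in the middle of a word
- Overlapping chunks preserve context continuity across chunk boundaries
- Unit test coverage > 80%
- Integration test passes (document upload → chunking → retrieval)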
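The continuity criterion can be made checkable by measuring how much text two consecutive chunks actually share. The sketch below is one way to do that; `shared_overlap` and `assert_chunks_overlap` are hypothetical helpers, and the check assumes chunks are dicts with a `"content"` key, as produced by `chunk_with_overlap`.

```python
def shared_overlap(prev: str, nxt: str) -> str:
    """Longest suffix of prev that is also a prefix of nxt (empty if none)."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev.endswith(nxt[:size]):
            return nxt[:size]
    return ""


def assert_chunks_overlap(chunks: list[dict], min_chars: int = 1) -> None:
    """Fail if any pair of consecutive chunks shares fewer than min_chars characters."""
    for prev, nxt in zip(chunks, chunks[1:]):
        common = shared_overlap(prev["content"], nxt["content"])
        assert len(common) >= min_chars, f"no overlap before {nxt['content'][:20]!r}"


chunks = [{"content": "alpha. beta."}, {"content": "beta. gamma."}]
print(shared_overlap(chunks[0]["content"], chunks[1]["content"]))  # beta.
assert_chunks_overlap(chunks, min_chars=5)
```

Chunks produced by a forced split of an overlong sentence may not overlap with their neighbours, so a test built on this check would probably need to exclude them or use inputs without overlong sentences.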
6. Changed Files
| File | Action | Description |
|---|---|---|
| backend/app/services/chunker.py | Add | Token-aware chunker |
| backend/app/services/document_service.py | Modify | Integrate the new chunker |
| backend/tests/services/test_chunker.py | Add | Chunker unit tests |
7. Effort Estimate
| Task | Estimate |
|---|---|
| R.1.1 tiktoken integration | 0.5 days |
| R.1.2 Smart sentence breaking | 0.5 days |
| R.1.3 Overlapping chunks | 1 day |
| Testing + debugging | 1 day |
| R.1 total | 3 days |