Phase R.1: Token-Aware Chunking Optimization

Date: 2026-04-03 | Status: Planned | Depends on: R.0 (Current State and Goals) | Effort: 3 days


1. Purpose of This Phase

Fix the loss of information across chunk boundaries by implementing exact token counting and overlapping chunks.


2. Core Tasks

Task R.1.1: Integrate tiktoken

Goal: Use tiktoken to count tokens exactly, with an 85% safety margin.

New file: backend/app/services/chunker.py

import tiktoken

class TokenAwareChunker:
    """Token-aware chunker: 85% safety margin + 10% overlap."""
    
    def __init__(self, max_tokens: int = 8000, overlap_ratio: float = 0.1):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        self.safe_max = int(max_tokens * 0.85)
        self.overlap_tokens = int(self.safe_max * overlap_ratio)
    
    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))
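
As a quick check of the arithmetic behind the defaults, the sketch below recomputes the two derived limits without needing tiktoken. The helper name `derive_limits` and its `safety` parameter are hypothetical; they only mirror the two assignments in `__init__` above.

```python
def derive_limits(max_tokens: int = 8000, overlap_ratio: float = 0.1,
                  safety: float = 0.85) -> tuple[int, int]:
    """Mirror the constructor arithmetic: safe limit first, then overlap."""
    safe_max = int(max_tokens * safety)              # 85% of the hard limit
    overlap_tokens = int(safe_max * overlap_ratio)   # 10% of the *safe* limit
    return safe_max, overlap_tokens


print(derive_limits())      # defaults: (6800, 680)
print(derive_limits(100))   # small limit: (85, 8)
```

Because the overlap is derived from `safe_max` rather than `max_tokens`, a 100-token limit yields an 8-token overlap, not 10.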

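Task R.1.2: Implement smart sentence breaking

Goal: Break at natural break points (punctuation or whitespace) and avoid cutting in the middle of a word.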

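# Break characters: newline, CJK punctuation (assumed set: 。!?;,:), space, tab
BREAK_POINTS = ['\n', '。', '!', '?', ';', ',', ':', ' ', '\t']

def find_best_breakpoint(text: str, max_pos: int) -> int:
    """Find the best break point near max_pos (at punctuation/whitespace)."""
    max_pos = min(max_pos, len(text))  # guard against indexing past the end
    # Scan backward from max_pos over a window of up to 200 characters
    for i in range(max_pos - 1, max(0, max_pos - 200), -1):
        if text[i] in BREAK_POINTS:
            return i + 1
    return max_pos  # no break point in the window: hard cut at max_pos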
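
To illustrate how the break-point search behaves, here is a small self-contained sketch that repeatedly cuts a string at the nearest break point. Note that `max_pos` is a character index, not a token count, so a caller would still need to translate its token budget into a character position. The set of full-width punctuation, the `window` parameter, and the `split_at_breakpoints` helper are assumptions made for this example.

```python
# Assumed break characters: newline, CJK full-width punctuation, space, tab.
BREAK_POINTS = frozenset("\n。!?;,: \t")


def find_best_breakpoint(text: str, max_pos: int, window: int = 200) -> int:
    """Return the cut position at or before max_pos, preferring a break char."""
    max_pos = min(max_pos, len(text))
    for i in range(max_pos - 1, max(0, max_pos - window), -1):
        if text[i] in BREAK_POINTS:
            return i + 1          # cut just after the break character
    return max_pos                # nothing found: hard cut


def split_at_breakpoints(text: str, max_chars: int) -> list[str]:
    """Hypothetical driver: cut text into pieces of at most max_chars chars."""
    pieces = []
    start = 0
    while start < len(text):
        end = start + find_best_breakpoint(text[start:], max_chars)
        pieces.append(text[start:end])
        start = end
    return pieces


text = "今天天气很好,我们去公园散步。明天可能下雨!"
for piece in split_at_breakpoints(text, 10):
    print(piece)
```

Each piece ends on a punctuation mark here; text without any break character inside the window falls back to a hard cut at `max_pos`.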

Task R.1.3: Implement overlapping chunks

Goal: Overlap consecutive chunks by 10% of their tokens to preserve context continuity.

def chunk_with_overlap(self, text: str) -> list[dict]:
    """Chunk with overlap: the tail of each chunk becomes the head of the next."""
    sentences = self._split_sentences(text)
    chunks = []
    current_chunk = ""
    current_tokens = 0
    
    for sentence in sentences:
        sentence_tokens = self.count_tokens(sentence)
        
        if sentence_tokens > self.safe_max:
            # Flush the pending chunk first so output stays in document order
            if current_chunk.strip():
                chunks.append({"content": current_chunk.strip()})
            # Force-split the overlong sentence
            chunks.extend(self._force_split_long_text(sentence))
            current_chunk, current_tokens = "", 0
            continue
        
        if current_chunk and current_tokens + sentence_tokens > self.safe_max:
            chunks.append({"content": current_chunk.strip()})
            # Seed the next chunk with the tail of the chunk just emitted
            current_chunk = self._create_overlap(current_chunk)
            current_tokens = self.count_tokens(current_chunk)
        
        current_chunk += sentence
        current_tokens += sentence_tokens
    
    if current_chunk.strip():
        chunks.append({"content": current_chunk.strip()})
    
    return chunks
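
The method above relies on helpers (`_split_sentences`, `_create_overlap`) that this plan does not define. The standalone sketch below shows one possible approach, sentence-aligned overlap, in which the overlap is built from whole trailing sentences rather than a raw token slice. Everything here is an assumption for illustration: the function names, the sentence-ending regex (which naively treats every ASCII period as a terminator), and the word-count stand-in for the tiktoken counter. Forced splitting of overlong sentences is omitted.

```python
import re
from typing import Callable

# Assumption: a sentence ends after CJK/ASCII terminal punctuation or a newline.
_SENTENCE_END = re.compile(r"(?<=[。!?.!?\n])")


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation, keeping the delimiter."""
    return [s for s in _SENTENCE_END.split(text) if s]


def create_overlap(chunk_sentences: list[str], overlap_tokens: int,
                   count_tokens: Callable[[str], int]) -> list[str]:
    """Return the trailing sentences of a chunk that fit the overlap budget."""
    tail: list[str] = []
    used = 0
    for sentence in reversed(chunk_sentences):
        cost = count_tokens(sentence)
        if used + cost > overlap_tokens:
            break
        tail.append(sentence)
        used += cost
    return list(reversed(tail))


def chunk_with_overlap(text: str, safe_max: int, overlap_tokens: int,
                       count_tokens: Callable[[str], int]) -> list[str]:
    """Greedy sentence packing; each new chunk starts with the previous tail."""
    chunks: list[str] = []
    current: list[str] = []       # sentences in the chunk being built
    current_tokens = 0
    for sentence in split_sentences(text):
        cost = count_tokens(sentence)
        if current and current_tokens + cost > safe_max:
            chunks.append("".join(current).strip())
            current = create_overlap(current, overlap_tokens, count_tokens)
            current_tokens = sum(count_tokens(s) for s in current)
        current.append(sentence)
        current_tokens += cost
    if current:
        chunks.append("".join(current).strip())
    return chunks


def word_count(s: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(s.split())


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_with_overlap(text, safe_max=7, overlap_tokens=3,
                                count_tokens=word_count):
    print(chunk)
```

With a 7-word limit and a 3-word overlap budget, each chunk holds two sentences and repeats the last sentence of its predecessor. If the final sentence of a chunk alone exceeds the overlap budget, this variant produces no overlap at all, which is the trade-off for never cutting inside a sentence.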

3. Changes to Existing Files

backend/app/services/document_service.py

Integrate the new TokenAwareChunker:

from app.services.chunker import TokenAwareChunker

class DocumentService:
    def __init__(self, ...):
        # ... existing init
        self.chunker = TokenAwareChunker()
    
    def _build_chunks(self, parsed: ParsedDocument) -> list[dict]:
        # Replace the previous logic with overlapping chunking
        chunks = self.chunker.chunk_with_overlap(parsed.summary)
        for node in parsed.nodes:
            chunks.extend(self.chunker.chunk_with_overlap(node.text))
        return chunks

4. New Tests

New file: backend/tests/services/test_chunker.py

import pytest
from app.services.chunker import TokenAwareChunker, find_best_breakpoint

class TestTokenAwareChunker:
    def test_token_counting(self):
        chunker = TokenAwareChunker(max_tokens=100)
        text = "Hello, world!"
        assert chunker.count_tokens(text) > 0
    
    def test_overlap_ratio(self):
        chunker = TokenAwareChunker(max_tokens=100, overlap_ratio=0.1)
        assert chunker.overlap_tokens == 8  # int(85 * 0.1): overlap is derived from safe_max
    
    def test_safe_max(self):
        chunker = TokenAwareChunker(max_tokens=100)
        assert chunker.safe_max == 85

class TestSmartBreakpoint:
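    def test_find_breakpoint_at_punctuation(self):
        text = "你好世界。今天天气不错"
        pos = find_best_breakpoint(text, 7)
        assert text[pos - 1] == "。"

    def test_find_breakpoint_at_whitespace(self):
        text = "Hello, world! How are you?"
        pos = find_best_breakpoint(text, 15)
        # Scanning backward from index 14, the first break point is the space at index 13
        assert text[pos - 1] == " "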

class TestOverlappingChunker:
    def test_chunks_have_overlap(self):
        chunker = TokenAwareChunker(max_tokens=50, overlap_ratio=0.2)
        long_text = "A" * 200 + "." + "B" * 200
        chunks = chunker.chunk_with_overlap(long_text)
        assert len(chunks) >= 2
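
The overlap test above only checks the number of chunks. A stricter check could confirm that each chunk actually begins with text from the end of its predecessor. The sketch below is one way to do that; both helper names and the `min_chars` threshold are hypothetical, and it assumes chunks are dicts with a "content" key as in `chunk_with_overlap`.

```python
def shared_overlap(prev: str, nxt: str) -> str:
    """Longest suffix of `prev` that is also a prefix of `nxt`."""
    limit = min(len(prev), len(nxt))
    for size in range(limit, 0, -1):
        if prev.endswith(nxt[:size]):
            return nxt[:size]
    return ""


def assert_chunks_overlap(chunks: list[dict], min_chars: int = 1) -> None:
    """Fail if any two consecutive chunks share fewer than min_chars chars."""
    for prev, nxt in zip(chunks, chunks[1:]):
        common = shared_overlap(prev["content"], nxt["content"])
        assert len(common) >= min_chars, (
            f"expected >= {min_chars} overlapping chars, got {len(common)}"
        )


chunks = [
    {"content": "One two three. Four five six."},
    {"content": "Four five six. Seven eight nine."},
]
assert_chunks_overlap(chunks, min_chars=5)
print(shared_overlap(chunks[0]["content"], chunks[1]["content"]))
```

A `min_chars` of 1 can pass by coincidence when two chunks merely share a boundary character, so a threshold closer to the expected overlap length gives a more meaningful check.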

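5. Acceptance Criteria

  • tiktoken is integrated correctly; token count error < 1%
  • Overlong sentences are never split in the middle of a word
  • Overlapping chunks preserve context continuity
  • Unit test coverage > 80%
  • Integration tests pass (document upload → chunking → retrieval)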

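6. Changed Files

| File | Action | Description |
| --- | --- | --- |
| backend/app/services/chunker.py | Add | Token-aware chunker |
| backend/app/services/document_service.py | Modify | Integrate the new chunker |
| backend/tests/services/test_chunker.py | Add | Unit tests for the chunker |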

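7. Effort Estimate

| Task | Estimate |
| --- | --- |
| R.1.1 tiktoken integration | 0.5 days |
| R.1.2 Smart sentence breaking | 0.5 days |
| R.1.3 Overlapping chunks | 1 day |
| Testing + debugging | 1 day |
| R.1 total | 3 days |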