# Phase R.1: Token-Aware Chunking Optimization

Date: 2026-04-03
Status: Planned
Depends on: R.0 (Current State and Goals)
Effort: 3 days

---

## 1. Purpose of This Phase

Eliminate the loss of information across chunk boundaries by implementing exact token counting and overlapping chunks.

---

## 2. Core Tasks

### Task R.1.1: Integrate tiktoken

**Goal:** Use tiktoken to count tokens exactly, and cap each chunk at 85% of the token limit as a safety margin.

**New file:** `backend/app/services/chunker.py`

```python
import tiktoken


class TokenAwareChunker:
    """Token-aware chunker: 85% safety margin plus 10% overlap."""

    def __init__(self, max_tokens: int = 8000, overlap_ratio: float = 0.1):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        # Keep 15% headroom below the hard token limit
        self.safe_max = int(max_tokens * 0.85)
        # Overlap is a fraction of safe_max, not of max_tokens
        self.overlap_tokens = int(self.safe_max * overlap_ratio)

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))
```

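The limits derived in the constructor are easy to misread, because the overlap is taken from the safe limit rather than from `max_tokens`, and both values are truncated. The sketch below reproduces only that arithmetic so the numbers can be checked without installing tiktoken. `TokenBudget` and `token_budget` are hypothetical names used for illustration and are not part of the planned module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """Derived limits for one chunker configuration (illustrative helper)."""
    safe_max: int        # largest chunk the chunker will emit, in tokens
    overlap_tokens: int  # tokens carried over from one chunk to the next


def token_budget(max_tokens: int, safety_ratio: float = 0.85,
                 overlap_ratio: float = 0.1) -> TokenBudget:
    # Same arithmetic as TokenAwareChunker.__init__: int() truncates, it never rounds up
    safe_max = int(max_tokens * safety_ratio)
    return TokenBudget(safe_max, int(safe_max * overlap_ratio))


print(token_budget(8000))  # TokenBudget(safe_max=6800, overlap_tokens=680)
print(token_budget(100))   # TokenBudget(safe_max=85, overlap_tokens=8)
```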
---

### Task R.1.2: Smart Sentence Breaking

**Goal:** Split at natural break points (punctuation or whitespace) rather than in the middle of a word.

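```python
BREAK_POINTS = ['\n', '。', '!', '?', ',', ';', ':', ' ', '\t']


def find_best_breakpoint(text: str, max_pos: int) -> int:
    """Find the best split position at or before max_pos (just after punctuation or whitespace)."""
    max_pos = min(max_pos, len(text))  # guard against indexing past the end of text
    # Scan backwards over at most 200 characters
    for i in range(max_pos - 1, max(0, max_pos - 200), -1):
        if text[i] in BREAK_POINTS:
            return i + 1  # split just after the break character
    return max_pos  # no break point nearby: fall back to a hard cut
```
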
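One way `find_best_breakpoint` could be used to cut an overlong passage into pieces is sketched below. This is an assumption about usage, not the planned implementation of `_force_split_long_text`: the helper `split_at_breakpoints` is a hypothetical name, and it measures pieces in characters for simplicity, whereas a token-aware version would need to check each piece with `count_tokens` and shrink it until it fits. `find_best_breakpoint` is repeated here only so the block runs on its own.

```python
BREAK_POINTS = ['\n', '。', '!', '?', ',', ';', ':', ' ', '\t']


def find_best_breakpoint(text: str, max_pos: int) -> int:
    max_pos = min(max_pos, len(text))
    for i in range(max_pos - 1, max(0, max_pos - 200), -1):
        if text[i] in BREAK_POINTS:
            return i + 1
    return max_pos


def split_at_breakpoints(text: str, max_chars: int) -> list[str]:
    """Cut text into pieces of at most max_chars characters, preferring break points."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    pieces = []
    start = 0
    while start < len(text):
        remaining = text[start:]
        if len(remaining) <= max_chars:
            pieces.append(remaining)  # the rest fits in one piece
            break
        cut = find_best_breakpoint(remaining, max_chars)
        pieces.append(remaining[:cut])
        start += cut  # cut >= 1, so the loop always advances
    return pieces


print(split_at_breakpoints("alpha beta gamma delta epsilon", 12))
# ['alpha beta ', 'gamma delta ', 'epsilon']
```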
---

### Task R.1.3: Overlapping Chunks

**Goal:** Overlap consecutive chunks by 10% of their tokens to preserve context continuity.

```python
def chunk_with_overlap(self, text: str) -> list[dict]:
    """Chunk with overlap: the tail of each chunk becomes the head of the next."""
    sentences = self._split_sentences(text)
    chunks = []
    current_chunk = ""
    current_tokens = 0

    for sentence in sentences:
        sentence_tokens = self.count_tokens(sentence)

        if sentence_tokens > self.safe_max:
            # Flush pending text first so chunks stay in document order
            if current_chunk.strip():
                chunks.append({"content": current_chunk.strip()})
            current_chunk, current_tokens = "", 0
            # Force-split the overlong sentence
            forced = self._force_split_long_text(sentence)
            chunks.extend(forced)
            continue

        if current_tokens + sentence_tokens > self.safe_max:
            chunks.append({"content": current_chunk.strip()})
            # Build the overlap from the tail of the chunk just emitted
            # (assumes _create_overlap takes that chunk's text)
            current_chunk = self._create_overlap(current_chunk)
            current_tokens = self.count_tokens(current_chunk)

        current_chunk += sentence
        current_tokens += sentence_tokens

    if current_chunk.strip():
        chunks.append({"content": current_chunk.strip()})

    return chunks
```

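The method above relies on two helpers, `_split_sentences` and `_create_overlap`, whose bodies are not given in this plan. The sketch below shows one possible shape for them as standalone functions. Everything in it is an assumption made for illustration: the names `split_sentences` and `tail_overlap`, the set of sentence terminators, and the choice to build the overlap from whole trailing sentences instead of slicing raw tokens (which avoids cutting a multi-byte character in half, at the cost of an overlap that can be shorter than the budget, or empty when the last sentence alone exceeds it). The word-count tokenizer is a stand-in so the block runs without tiktoken; in the chunker it would be `TokenAwareChunker.count_tokens`.

```python
import re
from typing import Callable

# A sentence is a run of non-terminator characters plus its trailing terminators;
# the second alternative catches terminators at the very start of the text.
_SENTENCE_RE = re.compile(r"[^。.!?!?\n]+[。.!?!?\n]*|[。.!?!?\n]+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping terminators so "".join(result) == text."""
    return _SENTENCE_RE.findall(text)


def tail_overlap(chunk: str, overlap_tokens: int,
                 count_tokens: Callable[[str], int]) -> str:
    """Return the trailing sentences of chunk that fit within overlap_tokens."""
    tail: list[str] = []
    used = 0
    for sentence in reversed(split_sentences(chunk)):
        cost = count_tokens(sentence)
        if used + cost > overlap_tokens:
            break  # adding this sentence would exceed the overlap budget
        tail.append(sentence)
        used += cost
    return "".join(reversed(tail))


def count_words(text: str) -> int:
    # Stand-in tokenizer for the demo: one token per whitespace-separated word
    return len(text.split())


chunk = "One two three. Four five. Six seven eight nine."
print(repr(tail_overlap(chunk, 6, count_words)))
# ' Four five. Six seven eight nine.'
```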
---

## 3. Changes to Existing Files

### `backend/app/services/document_service.py`

Integrate the new `TokenAwareChunker`:

```python
from app.services.chunker import TokenAwareChunker


class DocumentService:
    def __init__(self, ...):
        # ... existing init
        self.chunker = TokenAwareChunker()

    def _build_chunks(self, parsed: ParsedDocument) -> list[dict]:
        # Replace the original logic with overlapping chunking
        chunks = self.chunker.chunk_with_overlap(parsed.summary)
        for node in parsed.nodes:
            chunks.extend(self.chunker.chunk_with_overlap(node.text))
        return chunks
```

---

## 4. New Tests

**New file:** `backend/tests/services/test_chunker.py`

```python
import pytest
from app.services.chunker import TokenAwareChunker, find_best_breakpoint


class TestTokenAwareChunker:
    def test_token_counting(self):
        chunker = TokenAwareChunker(max_tokens=100)
        text = "Hello, world!"
        assert chunker.count_tokens(text) > 0

    def test_overlap_ratio(self):
        chunker = TokenAwareChunker(max_tokens=100, overlap_ratio=0.1)
        # Overlap is 10% of safe_max (85), truncated: int(8.5) == 8
        assert chunker.overlap_tokens == 8

    def test_safe_max(self):
        chunker = TokenAwareChunker(max_tokens=100)
        assert chunker.safe_max == 85


class TestSmartBreakpoint:
    def test_find_breakpoint_at_punctuation(self):
        text = "Hello, world! How are you?"
        # max_pos=13 makes '!' (index 12) the first character the backward scan sees;
        # with max_pos=15 the scan would stop at the space at index 13 instead
        pos = find_best_breakpoint(text, 13)
        assert text[pos - 1] in [',', '!', '?', '。']


class TestOverlappingChunker:
    def test_chunks_have_overlap(self):
        chunker = TokenAwareChunker(max_tokens=50, overlap_ratio=0.2)
        long_text = "A" * 200 + "." + "B" * 200
        chunks = chunker.chunk_with_overlap(long_text)
        assert len(chunks) >= 2
```

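`test_chunks_have_overlap` checks only the number of chunks, so it would still pass if consecutive chunks shared no text. A stricter check could look like the sketch below. `shared_overlap` and `assert_chunks_overlap` are hypothetical helpers, not part of the planned test file, and the check is meaningful only for chunk pairs produced by the overlap path: chunks coming from a forced split of an overlong sentence may legitimately share nothing.

```python
def shared_overlap(prev: str, nxt: str) -> str:
    """Return the longest suffix of prev that is also a prefix of nxt."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev.endswith(nxt[:size]):
            return nxt[:size]
    return ""


def assert_chunks_overlap(chunks: list[dict]) -> None:
    """Fail if any two consecutive chunks share no text at their boundary."""
    for prev, nxt in zip(chunks, chunks[1:]):
        assert shared_overlap(prev["content"], nxt["content"]), (
            f"no overlap between {prev['content'][-20:]!r} "
            f"and {nxt['content'][:20]!r}"
        )


chunks = [
    {"content": "One two. Three four."},
    {"content": "Three four. Five six."},
]
assert_chunks_overlap(chunks)
print(repr(shared_overlap(chunks[0]["content"], chunks[1]["content"])))
# 'Three four.'
```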
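---

## 5. Acceptance Criteria

- [ ] tiktoken is integrated correctly, with a token-count error below 1%
- [ ] Overlong sentences are never split in the middle of a word
- [ ] Overlapping chunks preserve context continuity
- [ ] Unit test coverage > 80%
- [ ] Integration test passes (document upload → chunking → retrieval)
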
---

## 6. Changed Files

| File | Action | Description |
|------|------|------|
| `backend/app/services/chunker.py` | Add | Token-aware chunker |
| `backend/app/services/document_service.py` | Modify | Integrate the new chunker |
| `backend/tests/services/test_chunker.py` | Add | Chunker unit tests |

---

## 7. Effort Estimate

| Task | Estimate |
|------|------|
| R.1.1 tiktoken integration | 0.5 days |
| R.1.2 Smart sentence breaking | 0.5 days |
| R.1.3 Overlapping chunks | 1 day |
| Testing + debugging | 1 day |
| **R.1 total** | **3 days** |