# Phase R: RAG System Upgrade Initiative

Date: 2026-04-03
Status: Planned
Source of inspiration: VCPToolBox TagMemo V6 architecture

---
|
## R.0 Current State and Goals

### R.0.1 Current Jarvis RAG Architecture

```
User uploads document → DocumentService (parse/chunk) → ChromaDB (vector store) → KnowledgeService (retrieval)
```

**Core files:**
- `backend/app/services/document_service.py` - document upload, parsing, and chunking
- `backend/app/services/knowledge_service.py` - ChromaDB vector retrieval and hybrid retrieval
- `backend/app/models/document.py` - Document/DocumentChunk data models
|
### R.0.2 Current Capability Matrix

| Capability | Status | Notes |
|------|------|------|
| Multi-format document parsing | ✅ | PDF/MD/TXT/DOCX/CSV/XLSX |
| Structured chunking | ✅ | Based on heading hierarchy, tables, and paragraphs |
| Vector retrieval | ✅ | ChromaDB semantic similarity |
| Keyword retrieval | ✅ | SQL LIKE |
| Hybrid retrieval | ✅ | Weighted vector + keyword |
| Rerank | ✅ | `semantic * 0.7 + keyword * 0.2 + title * 0.1` |
| Context enrichment | ✅ | Automatically fetches the previous/next chunk |
|
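### R.0.3 Current Gaps

| Gap | Severity | Impact |
|------|----------|------|
| No overlapping chunks | 🟡 Medium | Information is lost across chunk boundaries |
| Single-index architecture | 🟡 Medium | Cannot tier knowledge by type or importance |
| No dynamic weights | 🟡 Medium | Retrieval strategy is static and does not adapt to query type |
| No tag system | 🟡 Medium | Cannot use semantic tags to enhance retrieval |
| No lazy loading | 🟢 Low | High memory usage with large numbers of documents |
| No forgetting mechanism | 🟢 Low | Storage grows without bound |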
|
### R.0.4 Key Ideas Borrowed from VCPToolBox TagMemo

```
Diary file change → TextChunker (token-aware chunking, 85% + 10% overlap)
                  → EmbeddingUtils (concurrent batch embedding)
                  → SQLite (metadata) + VexusIndex (Rust HNSW vector index)
```

**TagMemo V6 retrieval flow:**
```
Query → EPA analysis (logic depth L / resonance R) → residual pyramid → TagBoost (β dynamic weight)
      → LIF spike propagation (2 hops) → vector fusion → VexusIndex search
```
|
### R.0.5 Target Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         User Query                          │
└─────────────────────────┬───────────────────────────────────┘
                          │
              ┌───────────┴───────────┐
              │    Query Analyzer     │ ← new in R.3
              │   (query profiling)   │
              └───────────┬───────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
  ┌─────────────┐  ┌─────────────┐ ┌───────────────┐
  │   Default   │  │  Important  │ │ Code/Meeting  │
  │ Collection  │  │ Collection  │ │  Collections  │
  └──────┬──────┘  └──────┬──────┘ └───────┬───────┘
         │                │                │
         └────────────────┼────────────────┘
                          ▼
       ┌─────────────────────────────────────┐
       │          Dynamic Reranker           │ ← new in R.3
       │ (Core Tag Boost + dynamic weights)  │
       └──────────────────┬──────────────────┘
                          │
                          ▼
                  ┌───────────────┐
                  │ Search Result │
                  └───────────────┘
```

---
|
## R.1 Token-Aware Chunking Optimization

**Goal:** Eliminate information loss across chunk boundaries through exact token counting and overlapping chunks.

### R.1.1 Core Tasks

#### Task R.1.1.1: Integrate tiktoken

```python
# services/chunker.py (new)
import tiktoken


class TokenAwareChunker:
    """Token-aware chunker: 85% safety margin + 10% overlap."""

    def __init__(self, max_tokens: int = 8000, overlap_ratio: float = 0.1):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        # Use only 85% of the limit to leave headroom.
        self.safe_max = int(max_tokens * 0.85)
        self.overlap_tokens = int(self.safe_max * overlap_ratio)

    def count_tokens(self, text: str) -> int:
        # disallowed_special=() treats strings such as "<|endoftext|>" as plain
        # text; by default encode() raises when a document contains them.
        return len(self.encoding.encode(text, disallowed_special=()))
```
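
As a quick check of the numbers behind the 85% margin and the 10% overlap, the budget can be worked through directly. This is a minimal sketch: `chunk_budget` and its percentage parameters are hypothetical names, and integer percentages are used here only to sidestep float truncation.

```python
def chunk_budget(max_tokens: int = 8000, safety_pct: int = 85, overlap_pct: int = 10) -> tuple[int, int]:
    """Return (safe_max, overlap_tokens) for a given model token limit."""
    safe_max = max_tokens * safety_pct // 100        # usable tokens per chunk
    overlap_tokens = safe_max * overlap_pct // 100   # tokens repeated from the previous chunk
    return safe_max, overlap_tokens


safe_max, overlap_tokens = chunk_budget()
print(safe_max, overlap_tokens)  # 6800 680
```

With the defaults, each chunk holds at most 6800 tokens, of which up to 680 repeat the end of the previous chunk.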
|
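#### Task R.1.1.2: Implement smart sentence breaking

```python
# Candidate break characters; the one nearest to max_pos wins.
# Extend with full-width punctuation if the corpus is mostly Chinese.
BREAK_POINTS = {'\n', '。', '!', '?', ',', ';', ':', ' ', '\t'}
LOOKBACK_CHARS = 200  # how far back from max_pos to search


def find_best_breakpoint(text: str, max_pos: int) -> int:
    """Find the best break near max_pos (at punctuation or whitespace).

    max_pos is a character offset, not a token count.
    """
    max_pos = min(max_pos, len(text))  # guard against IndexError
    for i in range(max_pos - 1, max(0, max_pos - LOOKBACK_CHARS), -1):
        if text[i] in BREAK_POINTS:
            return i + 1
    return max_pos  # no break character found: hard split
```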
|
#### Task R.1.1.3: Implement overlapping chunking

```python
# Method of TokenAwareChunker. _split_sentences, _create_overlap, and
# _force_split_long_text are helpers that still need to be written.
def chunk_with_overlap(self, text: str) -> list[dict]:
    """Chunk with overlap: the tail of each chunk starts the next one."""
    sentences = self._split_sentences(text)
    chunks: list[dict] = []
    current: list[str] = []  # sentences in the chunk being built
    current_tokens = 0

    for sentence in sentences:
        sentence_tokens = self.count_tokens(sentence)

        if sentence_tokens > self.safe_max:
            # Flush pending text first so chunks stay in document order,
            # then force-split the overlong sentence.
            if current:
                chunks.append({"content": "".join(current).strip()})
                current, current_tokens = [], 0
            chunks.extend(self._force_split_long_text(sentence))
            continue

        if current and current_tokens + sentence_tokens > self.safe_max:
            chunks.append({"content": "".join(current).strip()})
            # Build the overlap from the tail of the chunk just emitted.
            # It may push the next chunk slightly past safe_max; the 85%
            # margin absorbs that.
            current = self._create_overlap(current)
            current_tokens = self.count_tokens("".join(current))

        current.append(sentence)
        current_tokens += sentence_tokens

    if "".join(current).strip():
        chunks.append({"content": "".join(current).strip()})

    return chunks
```
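
The helpers referenced above are not defined in this plan. The following is a minimal, self-contained sketch of how sentence splitting and tail overlap could fit together. The function names, the punctuation set, and the use of `len` (characters) as a stand-in token counter are all assumptions for illustration; a real implementation would pass a tiktoken-based counter and handle overlong sentences separately.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation or newlines, keeping the delimiter."""
    return [p for p in re.split(r"(?<=[。.!?\n])", text) if p]


def tail_overlap(sentences: list[str], overlap_tokens: int,
                 count_tokens: Callable[[str], int]) -> list[str]:
    """Return the trailing sentences that fit within overlap_tokens."""
    tail: list[str] = []
    used = 0
    for s in reversed(sentences):
        n = count_tokens(s)
        if used + n > overlap_tokens:
            break
        tail.insert(0, s)
        used += n
    return tail


def chunk_with_overlap(text: str, safe_max: int, overlap_tokens: int,
                       count_tokens: Callable[[str], int] = len) -> list[str]:
    """Greedy sentence packing; each new chunk starts with the previous chunk's tail."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for s in split_sentences(text):
        n = count_tokens(s)
        if current and used + n > safe_max:
            chunks.append("".join(current))
            current = tail_overlap(current, overlap_tokens, count_tokens)
            used = sum(count_tokens(x) for x in current)
        current.append(s)
        used += n
    if current:
        chunks.append("".join(current))
    return chunks


print(chunk_with_overlap("A1. B2. C3. D4.", safe_max=8, overlap_tokens=4))
# ['A1. B2.', ' B2. C3.', ' C3. D4.']
```

Each chunk repeats the last sentence of its predecessor, which is the continuity property the acceptance criteria ask for.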
|
### R.1.2 Acceptance Criteria

- [ ] tiktoken is integrated correctly, with token-count error < 1%
- [ ] Overlong sentences are never split in the middle of a word
- [ ] Overlapping chunks preserve context continuity
- [ ] Unit test coverage > 80%

### R.1.3 Changed Files

| File | Action | Notes |
|------|------|------|
| `services/chunker.py` | Add | Token-aware chunker |
| `services/document_service.py` | Modify | Integrate the new chunker |
| `tests/test_chunker.py` | Add | Chunker unit tests |

### R.1.4 Effort Estimate

| Task | Estimate |
|------|------|
| R.1.1.1 tiktoken integration | 0.5 days |
| R.1.1.2 Smart sentence breaking | 0.5 days |
| R.1.1.3 Overlapping chunking | 1 day |
| Testing + debugging | 1 day |
| **R.1 total** | **3 days** |

---
|
## R.2 Multi-Index Architecture

**Goal:** Tier knowledge by type and importance, with lazy loading and LRU eviction.

### R.2.1 Core Tasks

#### Task R.2.1.1: Design the Collection separation strategy

```python
# services/multi_index.py (new)
class MultiIndexManager:
    """Multi-index manager: one collection per knowledge type."""

    INDEX_STRATEGIES = {
        "default": {"name": "user_{user_id}_default", "description": "General documents"},
        "important": {"name": "user_{user_id}_important", "description": "Important documents (1.2x boost)"},
        "code": {"name": "user_{user_id}_code", "description": "Code snippets"},
        "meeting": {"name": "user_{user_id}_meeting", "description": "Meeting notes"},
    }

    def __init__(self, chroma_client):
        self.chroma_client = chroma_client

    def get_collection(self, user_id: str, index_type: str = "default"):
        if index_type not in self.INDEX_STRATEGIES:
            raise ValueError(f"unknown index type: {index_type!r}")
        name = self.INDEX_STRATEGIES[index_type]["name"].format(user_id=user_id)
        return self.chroma_client.get_or_create_collection(name=name)
```
|
#### Task R.2.1.2: Implement lazy loading + LRU TTL

```python
import time
from threading import Lock
from typing import Any, Callable


class LazyIndexLoader:
    """Lazily loads indexes and evicts them after a TTL of inactivity."""

    def __init__(self, ttl_seconds: int = 7200):
        self._cache: dict[str, Any] = {}
        self._last_used: dict[str, float] = {}
        self._lock = Lock()
        self._ttl = ttl_seconds

    def get_or_load(self, key: str, loader_fn: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._cache:
                self._last_used[key] = time.time()
                return self._cache[key]

            # Loading under the lock blocks other callers but prevents double loads.
            value = loader_fn()
            self._cache[key] = value
            self._last_used[key] = time.time()
            return value

    def sweep(self) -> None:
        """Evict indexes that have been idle for longer than the TTL."""
        now = time.time()
        with self._lock:  # keep eviction from racing with get_or_load
            expired = [k for k, t in self._last_used.items() if now - t > self._ttl]
            for k in expired:
                del self._cache[k]
                del self._last_used[k]
```
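
The loader above evicts by TTL only, while the task title also mentions LRU. One way to combine the two is sketched below under stated assumptions: `LruTtlCache`, `max_entries`, and the injectable `clock` are hypothetical names that do not appear in the plan, and the clock parameter exists only to make expiry testable without sleeping.

```python
import time
from collections import OrderedDict
from threading import Lock
from typing import Any, Callable


class LruTtlCache:
    """Bounded cache: drops idle entries by TTL and the least recently used by size."""

    def __init__(self, max_entries: int = 4, ttl_seconds: float = 7200.0,
                 clock: Callable[[], float] = time.monotonic):
        self._items = OrderedDict()  # key -> (last_used, value), oldest first
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = Lock()

    def get_or_load(self, key: str, loader_fn: Callable[[], Any]) -> Any:
        with self._lock:
            now = self._clock()
            # Drop everything that has been idle for longer than the TTL.
            for k in [k for k, (used, _) in self._items.items() if now - used > self._ttl]:
                del self._items[k]
            if key in self._items:
                _, value = self._items.pop(key)
            else:
                value = loader_fn()
            self._items[key] = (now, value)  # re-insert as most recently used
            while len(self._items) > self._max:
                self._items.popitem(last=False)  # evict the least recently used
            return value

    def keys(self) -> list[str]:
        with self._lock:
            return list(self._items)
```

Expiry is checked on every access here, so no separate sweep thread is needed; a periodic sweep would still be useful if indexes must be released while the cache is idle.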
|
#### Task R.2.1.3: Implement importance-aware retrieval

```python
async def retrieve_with_importance(
    self,
    query: str,
    user_id: str,
    importance_threshold: float = 0.0,  # NOTE: not used yet in this draft
    top_k: int = 5,
) -> list[SearchResult]:
    """Importance-aware retrieval that favors high-importance documents."""

    # 1. Retrieve from the default index (over-fetch to leave room for merging)
    default_results = await self.retrieve(query, user_id, top_k=top_k * 2)

    # 2. Retrieve from the important index
    important_results = await self.retrieve(
        query, user_id,
        collection_name=f"user_{user_id}_important",
        top_k=top_k,
    )

    # 3. Merge, boosting important documents.
    # Default results are scaled by 0.8 and important ones by 1.2,
    # so the effective relative boost is 1.2 / 0.8 = 1.5x.
    scored = []
    for r in default_results:
        scored.append((r.score * 0.8, r))
    for r in important_results:
        scored.append((r.score * 1.2, r))  # important documents: 1.2x

    scored.sort(key=lambda x: x[0], reverse=True)
    return [r for _, r in scored[:top_k]]
```
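
The merge step can be exercised on its own. The sketch below is a self-contained illustration, not the project's implementation: `Hit`, `merge_weighted`, and the per-chunk deduplication are assumptions added here, on the premise that the same chunk could be returned by more than one collection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float


def merge_weighted(groups: dict[str, list[Hit]], weights: dict[str, float],
                   top_k: int) -> list[Hit]:
    """Scale each group's scores by its weight, keep the best score per chunk, return top_k."""
    best: dict[str, Hit] = {}
    for source, hits in groups.items():
        w = weights.get(source, 1.0)
        for h in hits:
            boosted = Hit(h.chunk_id, h.score * w)
            if h.chunk_id not in best or boosted.score > best[h.chunk_id].score:
                best[h.chunk_id] = boosted
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


groups = {
    "default": [Hit("a", 0.9), Hit("b", 0.5)],
    "important": [Hit("b", 0.5), Hit("c", 0.7)],
}
merged = merge_weighted(groups, {"default": 0.8, "important": 1.2}, top_k=2)
print([h.chunk_id for h in merged])  # ['c', 'a']
```

Chunk `c` overtakes `a` despite a lower raw score, which is the intended effect of the importance weighting.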
|
### R.2.2 Acceptance Criteria

- [ ] Multiple Collections are created successfully
- [ ] Lazy loading works (an index loads on first access and stays unloaded otherwise)
- [ ] TTL eviction works (an index unloads after 2 hours without access)
- [ ] The importance-aware retrieval boost takes effect

### R.2.3 Changed Files

| File | Action | Notes |
|------|------|------|
| `services/multi_index.py` | Add | Multi-index manager |
| `services/knowledge_service.py` | Modify | Integrate multi-index support |
| `models/document.py` | Modify | Add the importance field |
| `tests/test_multi_index.py` | Add | Multi-index unit tests |

### R.2.4 Effort Estimate

| Task | Estimate |
|------|------|
| R.2.1.1 Collection separation strategy | 1 day |
| R.2.1.2 Lazy loading + LRU | 1 day |
| R.2.1.3 Importance-aware retrieval | 0.5 days |
| Testing + debugging | 1.5 days |
| **R.2 total** | **4 days** |

---
|
## R.3 Dynamic Weight Enhancement

**Goal:** Adjust the retrieval strategy dynamically based on query characteristics, with core-tag boosting.

### R.3.1 Core Tasks

#### Task R.3.1.1: Implement query profiling

```python
# services/query_analyzer.py (new)
import re
from dataclasses import dataclass


@dataclass
class QueryProfile:
    logic_depth: float       # logic depth (0-1): how explicit the intent is
    is_code_related: bool    # whether the query is about code
    is_table_related: bool   # whether the query is about tables
    keyword_density: float   # keyword density
    is_conversational: bool  # whether the query is conversational


class QueryAnalyzer:
    CODE_KEYWORDS = {'code', 'function', 'class', 'api', 'python', 'js', 'bug'}
    # '数据' = data, '统计' = statistics
    TABLE_KEYWORDS = {'table', 'sheet', 'excel', 'csv', 'column', 'row', '数据', '统计'}

    def analyze(self, query: str) -> QueryProfile:
        # NOTE: \w+ keeps an unbroken run of CJK characters as one token, so the
        # Chinese keywords above match only when they stand alone in the query.
        words = set(re.findall(r'\w+', query.lower()))

        # _calc_logic_depth and _is_conversational still need to be written.
        return QueryProfile(
            logic_depth=self._calc_logic_depth(query),
            is_code_related=bool(words & self.CODE_KEYWORDS),
            is_table_related=bool(words & self.TABLE_KEYWORDS),
            # Distinct tokens per character of the query
            keyword_density=len(words) / max(len(query), 1),
            is_conversational=self._is_conversational(query),
        )
```
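
Because `\w+` treats a run of Chinese characters as a single token, set intersection misses keywords such as `数据` inside a longer phrase. A possible workaround is sketched below, assuming substring matching is acceptable for CJK keywords; `matches_any` and the `_CJK` pattern are hypothetical names introduced for this example.

```python
import re

CODE_KEYWORDS = {'code', 'function', 'class', 'api', 'python', 'js', 'bug'}
TABLE_KEYWORDS = {'table', 'sheet', 'excel', 'csv', 'column', 'row', '数据', '统计'}

_CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block


def matches_any(query: str, keywords: set[str]) -> bool:
    """ASCII keywords must match a whole token; CJK keywords match as substrings."""
    lowered = query.lower()
    tokens = set(re.findall(r"[a-z0-9_]+", lowered))
    for kw in keywords:
        if _CJK.search(kw):
            if kw in lowered:
                return True
        elif kw in tokens:
            return True
    return False


query = "上个季度的销售数据统计"  # "sales data statistics for last quarter"
print(bool(set(re.findall(r"\w+", query)) & TABLE_KEYWORDS))  # False: one long token
print(matches_any(query, TABLE_KEYWORDS))                     # True
```

Whole-token matching is kept for ASCII keywords so that, for example, "classify" does not trigger the `class` keyword.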
|
#### Task R.3.1.2: Implement the dynamic reranker

```python
# services/dynamic_reranker.py (new)
import json

# Import path assumed; adjust to the project's package layout.
from services.query_analyzer import QueryAnalyzer, QueryProfile


class DynamicReranker:
    def rerank(self, query: str, results: list[SearchResult]) -> list[SearchResult]:
        profile = QueryAnalyzer().analyze(query)

        # Adjust weights by query type
        weights = self._get_weights(profile)
        # beta depends only on the query, so for beta > 0 multiplying every
        # score by it rescales the scores without changing their order.
        beta = self._calc_beta(profile)

        scored = []
        for r in results:
            score = r.score * weights["semantic"]
            score += self._keyword_score(query, r.content) * weights["keyword"]
            score += self._title_score(query, r.document_title) * weights["title"]

            # Bonus for table content
            if profile.is_table_related:
                meta = json.loads(r.metadata_ or "{}")
                if meta.get("content_type") == "table_schema":
                    score += 0.25

            score *= beta
            scored.append((score, r))

        scored.sort(key=lambda x: x[0], reverse=True)
        return [r for _, r in scored]

    def _get_weights(self, profile: QueryProfile) -> dict:
        # Each weight set sums to 1.0; the checks run in priority order.
        if profile.is_code_related:
            return {"semantic": 0.55, "keyword": 0.35, "title": 0.10}
        elif profile.is_table_related:
            return {"semantic": 0.50, "keyword": 0.30, "title": 0.20}
        elif profile.is_conversational:
            return {"semantic": 0.85, "keyword": 0.10, "title": 0.05}
        else:
            return {"semantic": 0.70, "keyword": 0.20, "title": 0.10}
```
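
To see how the weight sets change the outcome, the scoring formula can be isolated from the retrieval code. The following is an illustrative sketch: the weight values are the ones listed above, while `WEIGHT_PROFILES`, `combined_score`, `rank`, and the candidate signal values are made up for the example.

```python
WEIGHT_PROFILES = {
    "code": {"semantic": 0.55, "keyword": 0.35, "title": 0.10},
    "table": {"semantic": 0.50, "keyword": 0.30, "title": 0.20},
    "conversational": {"semantic": 0.85, "keyword": 0.10, "title": 0.05},
    "default": {"semantic": 0.70, "keyword": 0.20, "title": 0.10},
}


def combined_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the semantic, keyword, and title signals."""
    return sum(weights[name] * signals[name] for name in weights)


def rank(candidates: dict[str, dict[str, float]], profile: str) -> list[str]:
    """Return candidate ids ordered by combined score under the given profile."""
    weights = WEIGHT_PROFILES[profile]
    return sorted(candidates, key=lambda cid: combined_score(candidates[cid], weights),
                  reverse=True)


candidates = {
    "prose": {"semantic": 0.9, "keyword": 0.1, "title": 0.0},
    "snippet": {"semantic": 0.6, "keyword": 0.9, "title": 0.5},
}
print(rank(candidates, "conversational"))  # ['prose', 'snippet']
print(rank(candidates, "code"))            # ['snippet', 'prose']
```

The same two candidates swap places between the conversational and code profiles, which is the behavior the dynamic weights are meant to produce.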
|
#### Task R.3.1.3: Implement the core tag system

```python
# Add tag fields to models/document.py
class DocumentChunk(Base):
    # ... existing fields ...
    tags = Column(JSON, default=list)         # e.g. ["important", "code", "architecture"]
    is_core = Column(Boolean, default=False)  # whether this is a core chunk


# services/core_tag_search.py (new)
import json
from typing import Optional


class CoreTagAwareSearch:
    CORE_BOOST_FACTOR = 1.33  # 33% boost

    async def search(self, query: str, user_id: str,
                     core_tags: Optional[list[str]] = None) -> list[SearchResult]:
        results = await self.base_search(query, user_id)

        if core_tags:
            for r in results:
                meta = json.loads(r.metadata_ or "{}")
                chunk_tags = meta.get("tags", [])

                # Boost once if the chunk carries any of the requested tags.
                if any(tag in chunk_tags for tag in core_tags):
                    r.score *= self.CORE_BOOST_FACTOR

        return sorted(results, key=lambda x: x.score, reverse=True)
```
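
The boost logic can also be written so that it does not mutate the incoming results, which avoids compounding the factor if a result list is reused. This is a minimal sketch under that assumption; `TaggedHit` and `boost_by_tags` are hypothetical names, and only the 1.33 factor comes from the plan.

```python
from dataclasses import dataclass, replace

CORE_BOOST_FACTOR = 1.33  # 33% boost, as in the plan


@dataclass(frozen=True)
class TaggedHit:
    chunk_id: str
    score: float
    tags: frozenset = frozenset()


def boost_by_tags(hits: list[TaggedHit], core_tags: list[str],
                  factor: float = CORE_BOOST_FACTOR) -> list[TaggedHit]:
    """Return a new list, best first, with hits carrying any core tag boosted once."""
    wanted = set(core_tags)
    boosted = [replace(h, score=h.score * factor) if wanted & h.tags else h for h in hits]
    return sorted(boosted, key=lambda h: h.score, reverse=True)


hits = [
    TaggedHit("a", 0.80),
    TaggedHit("b", 0.65, frozenset({"architecture"})),
]
print([h.chunk_id for h in boost_by_tags(hits, ["architecture"])])  # ['b', 'a']
print([h.chunk_id for h in boost_by_tags(hits, [])])                # ['a', 'b']
```

Here `b` rises above `a` only when its tag is requested, and the original `hits` list keeps its unboosted scores.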
|
### R.3.2 Acceptance Criteria

- [ ] Query profiling is accurate (detects code, table, and conversational queries)
- [ ] Dynamic weights adjust according to query type
- [ ] Core-tag retrieval applies a 1.33x boost
- [ ] Rerank integration tests pass

### R.3.3 Changed Files

| File | Action | Notes |
|------|------|------|
| `services/query_analyzer.py` | Add | Query profiling |
| `services/dynamic_reranker.py` | Add | Dynamic reranker |
| `services/core_tag_search.py` | Add | Core-tag retrieval |
| `services/knowledge_service.py` | Modify | Integrate dynamic weights |
| `models/document.py` | Modify | Add the tags/is_core fields |
| `tests/test_dynamic_reranker.py` | Add | Dynamic reranker tests |

### R.3.4 Effort Estimate

| Task | Estimate |
|------|------|
| R.3.1.1 Query profiling | 1 day |
| R.3.1.2 Dynamic reranker | 1 day |
| R.3.1.3 Core tag system | 1 day |
| Testing + debugging | 1.5 days |
| **R.3 total** | **4.5 days** |

---
|
## R.4 Advanced Features (Optional)

**Goal:** Explore more advanced RAG enhancement techniques.

### R.4.1 Task R.4.1.1: Semantic deduplication

```python
class SemanticDeduplicator:
    DEDUP_THRESHOLD = 0.88

    def deduplicate(self, results, embeddings) -> list:
        """Remove redundant retrieval results."""
        # Compute the cosine similarity matrix
        # Greedy deduplication
        ...
```
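
One way the two commented steps could look is sketched below in plain Python. It assumes `results` arrive sorted best first and that `embeddings[i]` belongs to `results[i]`; `cosine` and `greedy_dedup` are hypothetical names, and similarities are computed pairwise on demand rather than as a full matrix.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_dedup(results: list, embeddings: list[list[float]],
                 threshold: float = 0.88) -> list:
    """Keep a result only if it is below the threshold against every kept result."""
    kept: list[int] = []
    for i in range(len(results)):
        if all(cosine(embeddings[i], embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [results[i] for i in kept]


results = ["a", "a-dup", "b"]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(greedy_dedup(results, embeddings))  # ['a', 'b']
```

Because earlier results win ties, the highest-ranked member of each near-duplicate group is the one that survives.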
|
### R.4.2 Task R.4.2.1: Semantic bucketing

```python
class SemanticBucketing:
    async def bucket_by_topic(self, results, embeddings) -> dict:
        """Automatically organize retrieval results by topic."""
        # Use a clustering algorithm
        ...
```
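
The plan does not name a clustering algorithm. As a placeholder for prototyping, a simple leader-based grouping is sketched here: each result joins the first bucket whose leader embedding is similar enough, otherwise it starts a new bucket. The function names and the 0.8 threshold are assumptions made for this example, not values from the plan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bucket_by_similarity(results: list, embeddings: list[list[float]],
                         threshold: float = 0.8) -> list[list]:
    """Group results by similarity to the first member (leader) of each bucket."""
    buckets: list[tuple[list[float], list]] = []  # (leader embedding, members)
    for result, emb in zip(results, embeddings):
        for leader, members in buckets:
            if cosine(leader, emb) >= threshold:
                members.append(result)
                break
        else:
            buckets.append((emb, [result]))  # no bucket matched: start a new one
    return [members for _, members in buckets]


print(bucket_by_similarity(["a1", "b1", "a2"], [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]))
# [['a1', 'a2'], ['b1']]
```

This runs in a single pass and needs no fixed number of clusters, at the cost of depending on input order; a library clustering algorithm could replace it once the prototype shows bucketing is useful.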
|
### R.4.3 Task R.4.3.1: EPA analysis (exploratory)

```python
class EPAModule:
    """
    EPA: Embedding Projection Analysis
    High complexity; to be explored in Phase R.4.
    """
    pass  # not implemented for now
```

### R.4.4 Acceptance Criteria

- [ ] Semantic deduplication tests pass
- [ ] Semantic bucketing prototype is complete
- [ ] EPA analysis design is complete (implementation optional)
|
### R.4.5 Effort Estimate

| Task | Estimate |
|------|------|
| R.4.1.1 Semantic deduplication | 1.5 days |
| R.4.2.1 Semantic bucketing | 2 days |
| R.4.3.1 EPA design | 1 day |
| **R.4 total (optional)** | **4.5 days** |

---
|
## R.5 Phase Summary and Deliverables

### R.5.1 Full Implementation Path

```
R.0 ───────────────────────────────────────────────────────┐
│  Current state and goals                                 │
│  - Current architecture analysis                         │
│  - Gap identification                                    │
│  - Lessons from VCPToolBox                               │
└──────────────────────────────────────────────────────────┘
    │
    ▼
R.1 ───────────────────────────────────────────────────────┐
│  Token-aware chunking optimization                       │
│  - tiktoken integration                                  │
│  - Smart sentence breaking                               │
│  - Overlapping chunking                                  │
│                                                          │
│  Effort: 3 days                                          │
└──────────────────────────────────────────────────────────┘
    │
    ▼
R.2 ───────────────────────────────────────────────────────┐
│  Multi-index architecture                                │
│  - Collection separation strategy                        │
│  - Lazy loading + LRU TTL                                │
│  - Importance-aware retrieval                            │
│                                                          │
│  Depends on: R.1                                         │
│  Effort: 4 days                                          │
└──────────────────────────────────────────────────────────┘
    │
    ▼
R.3 ───────────────────────────────────────────────────────┐
│  Dynamic weight enhancement                              │
│  - QueryAnalyzer                                         │
│  - DynamicReranker                                       │
│  - CoreTagAwareSearch                                    │
│                                                          │
│  Depends on: R.1                                         │
│  Effort: 4.5 days                                        │
└──────────────────────────────────────────────────────────┘
    │
    ▼
R.4 ───────────────────────────────────────────────────────┐
│  Advanced features (optional)                            │
│  - Semantic deduplication                                │
│  - Semantic bucketing                                    │
│  - EPA analysis design                                   │
│                                                          │
│  Effort: 4.5 days (optional)                             │
└──────────────────────────────────────────────────────────┘
```
|
### R.5.2 Total Effort Estimate

| Phase | Effort |
|-------|--------|
| R.1 | 3 days |
| R.2 | 4 days |
| R.3 | 4.5 days |
| R.4 (optional) | 4.5 days |
| **R.1-R.3 (required)** | **11.5 days** |
| **R.1-R.4 (including optional)** | **16 days** |

### R.5.3 Deliverables

| Deliverable | Phase |
|------|-----------|
| `services/chunker.py` | R.1 |
| `services/multi_index.py` | R.2 |
| `services/query_analyzer.py` | R.3 |
| `services/dynamic_reranker.py` | R.3 |
| `services/core_tag_search.py` | R.3 |
| `models/document.py` updates | R.2, R.3 |
| Unit test coverage > 80% | R.1, R.2, R.3 |
| Integration tests passing | R.1, R.2, R.3 |
|
### R.5.4 Relationship to Phases 1-5

| Phase | RAG collaboration |
|-------|-------------|
| Phase 1 | Foundation hardening: Task Schema tracks RAG tasks |
| Phase 2 | Collaboration: RAG tasks can be delegated to the Librarian Agent |
| Phase 3 | Dynamic: supports dynamic multi-index selection |
| Phase 4 | Visualization: visualizes the RAG retrieval process |
| Phase 5 | Advanced: EPA analysis, semantic bucketing |
| **Phase R** | **Independent RAG upgrade path that can proceed in parallel with Phases 1-5** |

---
|
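## R.6 Risks and Caveats

| Risk | Impact | Mitigation |
|------|------|----------|
| Inaccurate token counts | Low | Count exactly with tiktoken and verify repeatedly |
| Retrieval becomes more complex after index separation | Medium | Provide a unified retrieval interface that hides the internal logic |
| Dynamic weights are hard to tune | Medium | Expose configuration options so users can adjust them |
| EPA is highly complex to implement | High | Optional in Phase R.4; not implemented for now |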