feat(agents): Phase 8.4-10.5 built-in plugins, bundled skills, coordinator
255
development-doc/plan/rag-update/phase-r-4-advanced.md
Normal file
@@ -0,0 +1,255 @@
# Phase R.4: Advanced Features (Optional)

Date: 2026-04-03
Status: Planned (optional)
Effort: 4.5 days

---

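## 1. Purpose of This Phase

Explore more advanced RAG enhancement techniques: semantic deduplication, semantic bucketing, and Embedding Projection Analysis (EPA).

> **Note:** This phase covers optional features and does not affect core functionality. Decide whether to implement it based on actual needs.

---
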
## 2. Core Tasks

### Task R.4.1: Semantic Deduplication

**Goal:** Remove redundant retrieval results.

**New file:** `backend/app/services/deduplicator.py`

```python
from __future__ import annotations  # lets the SearchResult annotation stay unresolved at import time

import numpy as np

# SearchResult is the project's existing retrieval result type; its import is omitted in this plan.


class SemanticDeduplicator:
    """Semantic deduplication: removes redundant retrieval results."""

    DEDUP_THRESHOLD = 0.88  # cosine-similarity threshold

    def deduplicate(
        self,
        results: list[SearchResult],
        embeddings: list[np.ndarray],
    ) -> list[SearchResult]:
        if len(results) != len(embeddings):
            raise ValueError("results and embeddings must have the same length")
        if len(results) <= 1:
            return results

        # Build the symmetric cosine-similarity matrix.
        n = len(results)
        similarity_matrix = np.zeros((n, n))

        for i in range(n):
            for j in range(i + 1, n):
                sim = self._cosine_similarity(embeddings[i], embeddings[j])
                similarity_matrix[i][j] = sim
                similarity_matrix[j][i] = sim

        # Greedy deduplication: an earlier kept result suppresses later near-duplicates,
        # so results should arrive sorted by descending score.
        keep = [True] * n
        for i in range(n):
            if not keep[i]:
                continue
            for j in range(i + 1, n):
                if keep[j] and similarity_matrix[i][j] > self.DEDUP_THRESHOLD:
                    keep[j] = False

        return [r for r, k in zip(results, keep) if k]

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0.0:
            return 0.0  # treat a zero vector as similar to nothing
        return float(np.dot(a, b) / denom)
```

---

### Task R.4.2: Semantic Bucketing (Optional)

**Goal:** Organize retrieval results by topic automatically.

**New file:** `backend/app/services/semantic_bucket.py`

```python
from __future__ import annotations  # lets the SearchResult annotation stay unresolved at import time

import asyncio
from collections import defaultdict

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# SearchResult is the project's existing retrieval result type; its import is omitted in this plan.


class SemanticBucketing:
    """Semantic bucketing: groups retrieval results by topic automatically."""

    MAX_CLUSTERS = 5  # upper bound on the number of topic buckets

    async def bucket_by_topic(
        self,
        results: list[SearchResult],
        embeddings: list[np.ndarray],
    ) -> dict[str, list[SearchResult]]:
        # With MAX_CLUSTERS or fewer results, each result ends up in its own bucket.
        n_clusters = min(self.MAX_CLUSTERS, len(results))
        if n_clusters < 2:
            return {"default": results}

        # Hierarchical (agglomerative) clustering is CPU-bound, so run it in a
        # worker thread to avoid blocking the event loop.
        clusterer = AgglomerativeClustering(n_clusters=n_clusters)
        labels = await asyncio.to_thread(clusterer.fit_predict, np.array(embeddings))

        buckets: dict[str, list[SearchResult]] = defaultdict(list)
        for r, label in zip(results, labels):
            buckets[f"topic_{label}"].append(r)

        # Sort the items inside each bucket by descending score.
        return {
            name: sorted(items, key=lambda x: x.score, reverse=True)
            for name, items in buckets.items()
        }
```

---

### Task R.4.3: EPA Analysis Design (Exploratory)

**Goal:** Design an approach for semantic-space projection analysis.

```python
import numpy as np


class EPAModule:
    """
    EPA: Embedding Projection Analysis

    Analyzes how a vector projects into semantic space to identify:
    - Logic Depth: how focused the intent is
    - Entropy: how scattered the information is
    - Resonance: how strongly the query relates across domains

    Note: this is an advanced, high-complexity module; exploring it later is recommended.
    """

    def project(self, vector: np.ndarray) -> dict:
        """
        Return the semantic projection result:
        - logic_depth: 0 to 1, high = focused intent
        - entropy: 0 to 1, high = scattered information
        - resonance: degree of cross-domain resonance
        - dominant_axes: the dominant semantic axes
        """
        raise NotImplementedError("EPA module is still exploratory")

    def detect_cross_domain_resonance(self, vector: np.ndarray) -> dict:
        """
        Detect cross-domain resonance:
        - triggered when a query touches several orthogonal semantic axes at once
        - returns the resonance strength and the main domains involved
        """
        raise NotImplementedError("EPA module is still exploratory")
```

---

## 3. New Tests

```python
# backend/tests/services/test_deduplicator.py
import numpy as np

# Imports of SemanticDeduplicator and SearchResult depend on the project layout and are omitted here.


class TestSemanticDeduplicator:
    def test_deduplicate_similar_results(self):
        dedup = SemanticDeduplicator()

        # The remaining SearchResult fields are elided ("...") in this plan.
        results = [
            SearchResult(chunk_id="1", score=0.9, ...),
            SearchResult(chunk_id="2", score=0.85, ...),
            SearchResult(chunk_id="3", score=0.8, ...),
        ]
        embeddings = [
            np.array([0.1, 0.2, 0.3]),
            np.array([0.11, 0.21, 0.31]),  # nearly parallel to the first
            # Clearly dissimilar to the first (cosine similarity about 0.06).
            # [0.9, 0.8, 0.7] would score about 0.883, above the 0.88 threshold.
            np.array([0.9, 0.1, -0.3]),
        ]

        deduped = dedup.deduplicate(results, embeddings)
        # Only the near-duplicate of the first result should be dropped.
        assert [r.chunk_id for r in deduped] == ["1", "3"]
```

---

## 4. Acceptance Criteria

- [ ] Semantic deduplication tests pass
- [ ] Semantic bucketing prototype completed (optional)
- [ ] EPA analysis design completed (implementation optional)

---

## 5. Changed Files

| File | Action | Description |
|------|--------|-------------|
| `backend/app/services/deduplicator.py` | Added | Semantic deduplication |
| `backend/app/services/semantic_bucket.py` | Added (optional) | Semantic bucketing |
| `backend/tests/services/test_deduplicator.py` | Added | Deduplication tests |

---

## 6. Effort Estimate

| Task | Estimate | Priority |
|------|----------|----------|
| R.4.1 Semantic deduplication | 1.5 days | Required |
| R.4.2 Semantic bucketing | 2 days | Optional |
| R.4.3 EPA design | 1 day | Optional |
| **R.4 total (required only)** | **1.5 days** | |
| **R.4 total (including optional)** | **4.5 days** | |

---

## 7. Detailed EPA Design (for Future Reference)

### 7.1 Core Concept

EPA (Embedding Projection Analysis) is inspired by VCPToolBox TagMemo V6. It analyzes the projection characteristics of a query vector in semantic space.

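### 7.2 Key Metrics

| Metric | Definition | How it is computed |
|--------|------------|--------------------|
| Logic Depth | How focused the intent is | Judged from the entropy of the projection |
| Entropy | How scattered the information is | Entropy of the vector's distribution |
| Resonance | Cross-domain resonance | Degree to which the query spans multiple semantic axes |
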
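
The table says Logic Depth is judged from the projection entropy but gives no formula. The sketch below shows one possible reading, not the plan's definition. It assumes the query has already been projected onto a set of semantic axes, treats each axis's squared projection as its energy share, uses Shannon entropy normalized by `log(k)` so that the value lies in [0, 1], and takes Logic Depth as the complement of that entropy. The function name `projection_metrics` and its `projections` parameter are hypothetical.

```python
import math


def projection_metrics(projections: list[float]) -> dict[str, float]:
    """Illustrative entropy / logic-depth scores from per-axis projections.

    `projections` holds the query's coordinates on a set of semantic axes
    (how those axes are obtained is outside this sketch).
    """
    energies = [p * p for p in projections]
    total = sum(energies)
    if total == 0.0 or len(energies) < 2:
        # No signal, or a single axis: there is nothing to spread across.
        return {"entropy": 0.0, "logic_depth": 1.0 if total > 0.0 else 0.0}

    # Energy share per axis, treated as a probability distribution.
    shares = [e / total for e in energies]
    # Shannon entropy divided by log(k), clamped against rounding error.
    raw = -sum(s * math.log(s) for s in shares if s > 0.0)
    entropy = min(1.0, max(0.0, raw / math.log(len(shares))))
    # Assumption: logic depth is the complement of normalized entropy.
    return {"entropy": entropy, "logic_depth": 1.0 - entropy}
```

Under these assumptions a query concentrated on one axis, such as `[1.0, 0.0, 0.0, 0.0]`, gets entropy 0 and logic depth 1, while a query spread evenly over all axes gets entropy close to 1 and logic depth close to 0.
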
### 7.3 Dynamic Beta Formula

```
β = σ(L · log(1 + R) - S · noise_penalty)
```

- L: Logic Depth
- R: Resonance
- S: noise level
- σ: normalization function

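
The formula can be evaluated directly once σ and `noise_penalty` are fixed. The plan specifies neither, so the sketch below assumes σ is the logistic sigmoid and treats `noise_penalty` as a constant weight defaulting to 1.0. It also assumes R is non-negative. The function name `dynamic_beta` is hypothetical.

```python
import math


def dynamic_beta(
    logic_depth: float,
    resonance: float,
    noise: float,
    noise_penalty: float = 1.0,  # assumed constant weight; not defined in the plan
) -> float:
    """beta = sigma(L * log(1 + R) - S * noise_penalty), with sigma assumed to be the sigmoid."""
    if resonance < 0.0:
        raise ValueError("resonance is assumed to be non-negative")
    x = logic_depth * math.log1p(resonance) - noise * noise_penalty
    return 1.0 / (1.0 + math.exp(-x))
```

With this choice of σ, β is 0.5 when all inputs are zero, rises above 0.5 as logic depth and resonance grow, and falls below 0.5 as noise dominates. For example, `dynamic_beta(1.0, 1.0, 0.0)` evaluates the sigmoid at `log 2`, which is about 0.667.
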
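### 7.4 Residual Pyramid

Peel the query vector apart in multiple stages:

1. First-round match → capture the dominant semantics
2. Compute the residual → extract weak signals that were masked
3. Search recursively → continue until 90% of the energy is explained
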
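
The three steps above resemble a matching-pursuit loop, and the sketch below follows that interpretation. That is an assumption, since the plan does not name an algorithm. It matches the residual against a set of labeled anchor vectors, subtracts the matched component, and stops once the explained share of the squared norm reaches the target. The names `residual_pyramid`, `anchors`, `energy_target`, and `max_levels` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def _dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def residual_pyramid(
    query: list[float],
    anchors: dict[str, list[float]],
    energy_target: float = 0.9,
    max_levels: int = 8,
) -> tuple[list[tuple[str, float]], float]:
    """Peel `query` into anchor components until `energy_target` of its energy is explained.

    Returns (levels, explained), where levels is a list of (label, coefficient).
    """
    total = _dot(query, query)
    if total == 0.0:
        return [], 0.0
    # Normalize anchors to unit length so coefficients are plain projections.
    units = {}
    for label, vec in anchors.items():
        norm = math.sqrt(_dot(vec, vec))
        if norm > 0.0:
            units[label] = [v / norm for v in vec]

    residual = list(query)
    levels: list[tuple[str, float]] = []
    explained = 0.0
    for _ in range(max_levels):
        if explained >= energy_target or not units:
            break
        # Step 1: best match for whatever signal is still unexplained.
        label, unit = max(units.items(), key=lambda kv: abs(_dot(residual, kv[1])))
        coeff = _dot(residual, unit)
        if abs(coeff) < 1e-12:
            break  # the remaining signal is orthogonal to every anchor
        # Step 2: subtract the matched component, exposing weaker signals.
        residual = [r - coeff * u for r, u in zip(residual, unit)]
        levels.append((label, coeff))
        # Step 3: repeat until enough of the original energy is accounted for.
        explained = 1.0 - _dot(residual, residual) / total
    return levels, explained
```

For the query `[3.0, 1.0, 0.1]` with axis-aligned anchors, the first level explains just under 90% of the energy, so a second level is peeled off before the loop stops.
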
### 7.5 LIF Spike Diffusion

Simulate neuron-style spike propagation:

1. Activate the seed nodes
2. Spread outward along the co-occurrence matrix (limited to 2 hops)
3. Filter out noise with a threshold
4. Topological associations emerge

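
The four steps above can be sketched as hop-limited spreading activation over a co-occurrence graph. This is a simplified illustration under stated assumptions, not the TagMemo implementation. The co-occurrence matrix is represented as a nested dict of weights in [0, 1], incoming signal is summed per hop only (potential is not carried over between hops), and the `threshold` and `leak` values are arbitrary. All names are hypothetical. In the example graph used to check it, seeding `"rag"` fires `"retrieval"` and then `"embedding"`, while weak links and nodes three hops away stay inactive.

```python
def spike_diffusion(
    cooccurrence: dict[str, dict[str, float]],
    seeds: list[str],
    threshold: float = 0.3,  # assumed firing threshold
    leak: float = 0.8,  # assumed per-hop attenuation
    max_hops: int = 2,
) -> dict[str, float]:
    """Spread activation from `seeds` over a co-occurrence graph.

    Returns every node that fired, mapped to the potential it had when it fired.
    """
    fired = {seed: 1.0 for seed in seeds}  # step 1: seeds fire at full strength
    frontier = dict(fired)

    for _ in range(max_hops):  # step 2: outward diffusion, hop-limited
        incoming: dict[str, float] = {}
        for node, potential in frontier.items():
            for neighbor, weight in cooccurrence.get(node, {}).items():
                if neighbor in fired:
                    continue  # a node fires at most once
                # Integrate: sum the leaked signal arriving from firing neighbors.
                incoming[neighbor] = incoming.get(neighbor, 0.0) + potential * weight * leak
        # Step 3: only nodes whose summed potential reaches the threshold fire.
        frontier = {n: p for n, p in incoming.items() if p >= threshold}
        if not frontier:
            break
        fired.update(frontier)

    return fired  # step 4: the fired set is the emergent association neighborhood
```
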
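---

## 8. Risks and Caveats

| Risk | Impact | Mitigation |
|------|--------|------------|
| EPA is complex to implement | High | Phase R.4 is optional; do not implement EPA for now |
| Clustering compute cost | Medium | Cap the number of clusters and use an efficient algorithm |
| Tuning the deduplication threshold | Medium | Expose it as a configuration option that users can adjust |
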
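
The last mitigation, exposing the deduplication threshold as a configuration option, could look like the sketch below. The plan does not define a configuration mechanism, so the `DedupConfig` class and the `RAG_DEDUP_THRESHOLD` environment variable are hypothetical names used for illustration. Only the 0.88 default comes from `DEDUP_THRESHOLD` in Task R.4.1.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    """Tunable settings for semantic deduplication (illustrative)."""

    threshold: float = 0.88  # default taken from DEDUP_THRESHOLD in this plan

    def __post_init__(self) -> None:
        # Values outside (0, 1] are rejected here.
        if not 0.0 < self.threshold <= 1.0:
            raise ValueError("threshold must be in (0, 1]")

    @classmethod
    def from_env(cls, var: str = "RAG_DEDUP_THRESHOLD") -> "DedupConfig":
        """Read the threshold from an environment variable, else use the default."""
        raw = os.environ.get(var)
        return cls() if raw is None else cls(threshold=float(raw))
```

A deduplicator could then accept a `DedupConfig` in its constructor and compare similarities against `config.threshold` instead of the class constant.
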