JARVIS/development-doc/plan/memory-update/phase-m-4-auto-extraction.md

# Phase M.4: Automatic Learning from Conversations (Auto Memory Extraction)

Date: 2026-04-05
Status: Planning
Depends on: M.1 (importance scoring)
Effort: 3 days

---

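## 1. Purpose of This Phase

Have Jarvis **automatically** extract memories from each conversation once it ends, with no manual trigger from the user.

Current problems:
- `POST /brain/learn/run` must be triggered manually, and users will not call it after every conversation
- Without automatic learning, the M.1 scoring system and the M.2 forgetting system both lack input
- The memory store stagnates over time instead of growing richer with use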

---

## 2. Core Architecture

```
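Conversation ends
    │
    ▼
ConversationEndHook
    │
    ▼
MemoryExtractor
    ├── extract_facts()       # facts: you live in Beijing, you use Python
    ├── extract_preferences() # preferences: you like short answers
    ├── extract_goals()       # goals: you want to learn Rust
    ├── extract_pain_points() # pain points: asking the same kind of question repeatedly
    └── extract_events()      # events: important things mentioned today
         │
         ▼
    ImportanceScorer (M.1)   # score before storing in UserMemory
         │
         ▼
    Deduplication check      # avoid storing near-duplicate memories
         │
         ▼
    UserMemory / BrainMemory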
```

---

## 3. Core Implementation

### 3.1 MemoryExtractor

```python
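from __future__ import annotations  # defer evaluation of the project types below

# Message, ExtractedMemory, and UserMemory are project types defined elsewhere.


class MemoryExtractor:
    async def extract_from_conversation(
        self,
        user_id: str,
        messages: list[Message],
    ) -> list[ExtractedMemory]:
        """
        Extract memory entries from one conversation.
        Calls the LLM for structured extraction and returns the memories to be stored.
        """

    async def deduplicate(
        self,
        new_memories: list[ExtractedMemory],
        existing_memories: list[UserMemory],
    ) -> list[ExtractedMemory]:
        """
        Compare against existing memories by similarity and filter out duplicates.
        A similarity above 0.85 counts as a duplicate: update the existing memory instead of adding a new one.
        """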
```
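The signatures above reference `Message` and `ExtractedMemory` without defining them. As a minimal sketch, the data shapes could look like the following. The field names (`role`, `content`, `type`, `confidence`) and the `render_conversation` helper are assumptions made for illustration, not taken from the Jarvis codebase; the type names and the memory types come from this plan.

```python
from dataclasses import dataclass
from typing import Literal

# The five memory types listed in the architecture diagram.
MemoryType = Literal["fact", "preference", "goal", "pain_point", "event"]


@dataclass(frozen=True)
class Message:  # hypothetical shape; the real model may differ
    role: str  # e.g. "user" or "assistant"
    content: str


@dataclass(frozen=True)
class ExtractedMemory:  # hypothetical shape; the real model may differ
    type: MemoryType
    content: str
    confidence: float  # 0.0 to 1.0, as reported by the LLM


def render_conversation(messages: list[Message]) -> str:
    """Flatten messages into the text block substituted into the prompt."""
    # Empty or whitespace-only messages carry no information and are skipped.
    return "\n".join(
        f"{m.role}: {m.content.strip()}" for m in messages if m.content.strip()
    )
```

Keeping both types immutable (`frozen=True`) is a design choice for this sketch: extracted candidates pass through scoring and deduplication, and immutable values make it harder for one stage to silently alter another stage's input.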

### 3.2 LLM Extraction Prompt (Structured Output)

```python
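# Assuming the template is filled with EXTRACT_PROMPT.format(conversation_text=...),
# the literal JSON braces are doubled so that str.format does not treat them as fields.
EXTRACT_PROMPT = """
Extract memory information about the user from the conversation below and return it as JSON:

Conversation:
{conversation_text}

Extract the following types:
- fact: objective facts about the user (occupation, location, skills, etc.)
- preference: the user's preferences and habits
- goal: goals or plans the user mentioned
- pain_point: problems that recur or clearly trouble the user
- event: important events that happened today

Output format:
[
  {{"type": "fact", "content": "...", "confidence": 0.9}},
  {{"type": "goal", "content": "...", "confidence": 0.7}}
]

Extract only explicitly stated information; do not guess.
"""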
```
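The prompt asks for a JSON array, but an LLM reply may be wrapped in prose, contain unknown types, or omit fields. A defensive parser along these lines could sit between the LLM call and the scorer. This is a sketch under stated assumptions: `parse_extraction`, the `min_confidence` cutoff of 0.5, and the `ExtractedMemory` shape are hypothetical and not specified in this plan.

```python
import json
from dataclasses import dataclass

ALLOWED_TYPES = {"fact", "preference", "goal", "pain_point", "event"}


@dataclass(frozen=True)
class ExtractedMemory:  # hypothetical shape, defined here so the block runs alone
    type: str
    content: str
    confidence: float


def parse_extraction(raw: str, min_confidence: float = 0.5) -> list[ExtractedMemory]:
    """Parse the LLM reply, dropping malformed or low-confidence items."""
    # Replies are often wrapped in prose or code fences; keep the outermost array.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    memories: list[ExtractedMemory] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        mem_type, content = item.get("type"), item.get("content")
        if not isinstance(mem_type, str) or mem_type not in ALLOWED_TYPES:
            continue
        if not isinstance(content, str) or not content.strip():
            continue
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            continue
        # min_confidence is an assumed cutoff; the plan does not specify one.
        if min_confidence <= confidence <= 1.0:
            memories.append(ExtractedMemory(mem_type, content.strip(), confidence))
    return memories
```

Returning an empty list on any parse failure keeps the background task quiet when the model misbehaves; whether to log or retry in that case is left open here.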

### 3.3 Trigger Timing

```python
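# Triggered asynchronously when the conversation ends, in the conversation router
# routers/conversation.py

@router.post("/api/conversations/{conversation_id}/end")
async def end_conversation(conversation_id: str, ...):
    # Existing logic...

    # Trigger memory extraction in the background without blocking the response.
    # Note: a background task's return value is discarded, so the scheduled callable
    # must also score, deduplicate, and persist what extract_from_conversation returns.
    background_tasks.add_task(
        memory_extractor.extract_from_conversation,
        user_id=current_user.id,
        messages=messages,
    )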
```
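Because a background task's return value is thrown away, scheduling `extract_from_conversation` on its own would extract memories without storing them. One possible shape for a single coroutine covering the chain from section 2 is sketched below. `run_extraction_pipeline` and its callable parameters (`extract`, `is_new`, `store`) are hypothetical names; the M.1 scoring step is omitted and would slot in before `store`.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Sequence

logger = logging.getLogger("memory_extraction")


async def run_extraction_pipeline(
    user_id: str,
    messages: Sequence[object],
    extract: Callable[[str, Sequence[object]], Awaitable[list]],
    is_new: Callable[[object, str], Awaitable[bool]],
    store: Callable[[str, object], Awaitable[None]],
) -> int:
    """Extract, deduplicate, and store memories; return how many were stored.

    Never raises: a failed extraction must not affect the conversation endpoint.
    """
    try:
        candidates = await extract(user_id, messages)
        stored = 0
        for memory in candidates:
            # is_new is expected to reinforce the existing memory when it returns False.
            if await is_new(memory, user_id):
                await store(user_id, memory)
                stored += 1
        return stored
    except Exception:
        logger.exception("memory extraction failed for user %s", user_id)
        return 0


if __name__ == "__main__":
    # Stand-in callables, for demonstration only.
    async def fake_extract(user_id, messages):
        return ["lives in Beijing", "wants to learn Rust"]

    async def always_new(memory, user_id):
        return True

    async def print_store(user_id, memory):
        print(f"store for {user_id}: {memory}")

    asyncio.run(run_extraction_pipeline("u1", [], fake_extract, always_new, print_store))
```

Assuming the router uses FastAPI's `BackgroundTasks`, the coroutine can be passed to `add_task` with its arguments in place of `extract_from_conversation`.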

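Extraction is also **triggered automatically on session timeout**: a conversation with no new messages for more than 30 minutes is treated as ended.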

```python
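# scheduler_service.py
# Note: if `scheduler` is an APScheduler instance, the decorator is named `scheduled_job`;
# `scheduled_task` is kept here as written in the plan.
@scheduler.scheduled_task("interval", minutes=30)
async def check_idle_conversations():
    """Check for idle conversations and trigger memory extraction."""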
```
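The body of `check_idle_conversations` is left open above. Its selection step could be sketched as a pure function over conversation records, which keeps it easy to unit test. `ConversationState`, its fields, and `find_idle_conversations` are hypothetical names introduced for illustration; only the 30-minute threshold comes from this plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(minutes=30)  # threshold stated in this section


@dataclass
class ConversationState:  # hypothetical record; field names are assumptions
    conversation_id: str
    last_message_at: datetime
    extracted: bool = False  # set once memories have been extracted


def find_idle_conversations(
    conversations: list[ConversationState],
    now: datetime,
    timeout: timedelta = IDLE_TIMEOUT,
) -> list[ConversationState]:
    """Return conversations that have timed out and were not extracted yet."""
    return [
        c
        for c in conversations
        if not c.extracted and now - c.last_message_at > timeout
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        ConversationState("a", now - timedelta(minutes=45)),
        ConversationState("b", now - timedelta(minutes=5)),
    ]
    print([c.conversation_id for c in find_idle_conversations(sample, now)])
```

The `extracted` flag guards against extracting the same conversation on every run. Note also that a job running every 30 minutes with a 30-minute threshold can pick up a conversation as late as roughly 60 minutes after its last message; a shorter check interval would reduce that delay.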

---

## 4. Deduplication Logic

```python
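# Simple similarity check (use Mem0's built-in semantic deduplication, or plain string matching).
# Note: this per-memory signature differs from the list-based one in 3.1; reconcile the two during implementation.
async def deduplicate(self, new_memory: ExtractedMemory, user_id: str) -> bool:
    """
    Return True if the memory is new, False if it already exists (the existing memory is reinforced instead).
    """
    existing = await self.memory_service.search(
        query=new_memory.content,
        user_id=user_id,
        top_k=3,
    )
    for mem in existing:
        if similarity(mem.content, new_memory.content) > 0.85:
            # Bump frequency_count on the existing memory instead of creating a new one
            await self.memory_service.reinforce(mem.id)
            return False
    return True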
```
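The `similarity` helper used above is not defined in this plan. For the "plain string matching" option, a minimal sketch based on the standard library's `difflib.SequenceMatcher` could look like this. The normalization step and the `is_duplicate` helper are assumptions for illustration; the semantic option via Mem0 is not shown.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # value from this plan


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after light normalization."""
    # Lowercase and collapse whitespace so trivial differences do not count.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def is_duplicate(candidate: str, existing: list[str]) -> bool:
    """True if any existing memory exceeds the similarity threshold."""
    return any(similarity(candidate, e) > SIMILARITY_THRESHOLD for e in existing)


if __name__ == "__main__":
    stored = ["User lives in Beijing", "User wants to learn Rust"]
    print(is_duplicate("user  lives in beijing", stored))
    print(is_duplicate("User prefers short answers", stored))
```

Character-level matching only catches near-verbatim repeats. Paraphrases such as "lives in Beijing" and "is based in Beijing" may fall below the threshold, which is the case the semantic option is meant to cover.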

---

## 5. Core Files

### 5.1 New Files

| File | Responsibility |
|------|------|
| `services/memory/memory_extractor.py` | Conversation memory extraction |
| `tests/services/test_memory_extractor.py` | Extraction tests |

### 5.2 Modified Files

| File | Change |
|------|---------|
| `routers/conversation.py` | Trigger extraction when a conversation ends |
| `services/scheduler_service.py` | Add the idle-conversation check |

---

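## 6. Acceptance Criteria

| Criterion | Description |
|------|------|
| Automatic trigger | Extraction completes within 30 seconds of the conversation ending |
| Accurate extraction | The fact/goal/pain_point types are identified correctly |
| Effective deduplication | Duplicate content creates no new entry; it only reinforces the existing memory |
| Non-blocking | Extraction runs as a background task and does not affect response time |
| Unit test coverage | > 80% |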

---

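## 7. Effort Estimate

| Task | Effort |
|------|--------|
| MemoryExtractor implementation | 1 day |
| LLM prompt tuning | 0.5 day |
| Deduplication logic | 0.5 day |
| Trigger integration (conversation end + scheduler) | 0.5 day |
| Testing | 0.5 day |
| **Total** | **3 days** |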