JARVIS/docs/superpowers/specs/2026-03-20-knowledge-brain-phase-1-blueprint.md

# Jarvis Knowledge Brain Phase 1 Blueprint

## 1. Phase 1 Goal
Phase 1 establishes the first production-ready version of Jarvis's event-driven knowledge brain. The objective is not to finish the entire intelligence system, but to create the minimum architecture that lets Jarvis ingest key user actions from across the product, learn from them on a daily schedule, store only high-value knowledge, and retrieve that knowledge during future conversations.

Phase 1 should make the brain real in six ways:
1. unify source events across core modules;
2. create an intermediate candidate-learning layer;
3. promote durable knowledge into long-term brain memory;
4. maintain tags and time-aware traceability;
5. expose APIs for inspection and management;
6. allow the chat system to retrieve brain knowledge during answers.

---

## 2. Scope Boundaries

### In scope
- New persistence models for brain events, candidates, memories, tags, and relationships.
- Ingestion of source signals from conversations, knowledge documents, todos, kanban tasks, and forum posts.
- A daily autonomous learning pipeline that tags, scores, deduplicates, and upgrades knowledge.
- Retrieval integration for future responses.
- Brain dashboard APIs.
- A new frontend brain module structure replacing the current graph-only mental model.

### Out of scope for phase 1
- Full graph-native reasoning engine.
- Fully autonomous suggestion orchestration across all screens.
- Complex reinforcement-learning-style adaptation.
- Fine-grained user-tunable learning policy UI.
- Automatic deletion and archival heuristics beyond simple status transitions.

---

## 3. Target Architecture
Phase 1 should introduce a four-layer brain pipeline:

1. **Source Records**
   Existing domain tables remain the source of truth: messages, documents/chunks, todos, tasks, forum posts/replies.

2. **BrainEvent**
   A normalized event layer representing meaningful user/system actions. This is the single intake format for downstream learning.

3. **BrainCandidate**
   AI-generated candidate knowledge distilled from one or more events. Candidates are scored, tagged, typed, and traced back to source events.

4. **BrainMemory**
   Durable long-term memory that Jarvis can retrieve during future interactions. This becomes the brain's core persistence layer.

Graph visualization should be treated as a **projection layer**, not the primary storage model. In later phases, graph nodes and edges can be generated from BrainMemory records and their relationships.

---

## 4. Data Model Additions

### 4.1 BrainEvent
Purpose: normalized raw learning input.

Recommended fields:
- `id`
- `user_id`
- `source_type` (`conversation`, `document`, `todo`, `task`, `forum_post`, `forum_reply`)
- `source_id`
- `event_type` (`created`, `updated`, `completed`, `mentioned`, `uploaded`, `resolved`, `marked_important`, etc.)
- `occurred_at`
- `event_date`
- `title`
- `content_summary`
- `raw_excerpt`
- `metadata_` (JSON; source-specific facts such as conversation_id, task status, folder path)
- `importance_signal` (numeric seed score)
- `is_user_pinned`
- `processed_at`
- `status` (`pending`, `processed`, `ignored`)

Indexes:
- `(user_id, event_date)`
- `(user_id, source_type, source_id)`
- `(user_id, status, occurred_at)`
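
As a rough illustration of the fields and indexes above, the following sketch creates the table in an in-memory SQLite database using Python's standard `sqlite3` module. The table name `brain_events`, the column types, the defaults, and the index names are all assumptions made for the example; the project itself would presumably express this as a SQLAlchemy model plus a migration.

```python
import sqlite3

# Hypothetical DDL: table name, column types, defaults, and index names are assumptions.
BRAIN_EVENT_DDL = """
CREATE TABLE brain_events (
    id                INTEGER PRIMARY KEY,
    user_id           INTEGER NOT NULL,
    source_type       TEXT    NOT NULL,
    source_id         TEXT    NOT NULL,
    event_type        TEXT    NOT NULL,
    occurred_at       TEXT    NOT NULL,               -- ISO 8601 timestamp
    event_date        TEXT    NOT NULL,               -- ISO 8601 date
    title             TEXT,
    content_summary   TEXT,
    raw_excerpt       TEXT,
    metadata_         TEXT    NOT NULL DEFAULT '{}',  -- JSON stored as text
    importance_signal REAL    NOT NULL DEFAULT 0.0,
    is_user_pinned    INTEGER NOT NULL DEFAULT 0,
    processed_at      TEXT,
    status            TEXT    NOT NULL DEFAULT 'pending'
                      CHECK (status IN ('pending', 'processed', 'ignored'))
);
CREATE INDEX ix_brain_events_user_date   ON brain_events (user_id, event_date);
CREATE INDEX ix_brain_events_user_source ON brain_events (user_id, source_type, source_id);
CREATE INDEX ix_brain_events_user_status ON brain_events (user_id, status, occurred_at);
"""


def create_brain_event_schema(conn: sqlite3.Connection) -> None:
    """Create the sketched brain_events table and its three indexes."""
    conn.executescript(BRAIN_EVENT_DDL)


def index_names(conn: sqlite3.Connection) -> list[str]:
    """Return the names of the explicitly created indexes on brain_events."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'brain_events' AND sql IS NOT NULL "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

The `(user_id, status, occurred_at)` index is the one the daily learning job would lean on when it scans for pending events in time order.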

### 4.2 BrainCandidate
Purpose: intermediate learned knowledge awaiting acceptance into durable memory.

Recommended fields:
- `id`
- `user_id`
- `candidate_type` (`preference`, `habit`, `project_fact`, `decision`, `solution`, `topic`, `goal`, `temporary_focus`)
- `title`
- `summary`
- `importance_score`
- `confidence_score`
- `time_scope` (`short_term`, `phase`, `long_term`)
- `valid_from`
- `valid_to`
- `source_event_ids` (JSON array)
- `reasoning_trace` (short explanation of why the system extracted it)
- `status` (`new`, `promoted`, `rejected`, `merged`)
- `created_at`
- `reviewed_at`
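
The candidate lifecycle implied by the `status` and `reviewed_at` fields could look like the sketch below. It uses a plain dataclass rather than an ORM model, omits `valid_from`, `valid_to`, and `created_at` for brevity, and assumes that only `new` candidates can be reviewed and that a reviewed candidate is never reopened. The blueprint does not state those transition rules, and the `review()` method name is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

CANDIDATE_STATUSES = {"new", "promoted", "rejected", "merged"}


@dataclass
class BrainCandidate:
    """In-memory stand-in for a BrainCandidate row (not an ORM model)."""

    id: int
    user_id: int
    candidate_type: str
    title: str
    summary: str
    importance_score: float
    confidence_score: float
    time_scope: str = "short_term"
    source_event_ids: list[int] = field(default_factory=list)
    reasoning_trace: str = ""
    status: str = "new"
    reviewed_at: Optional[datetime] = None

    def review(self, new_status: str, now: Optional[datetime] = None) -> None:
        """Move a `new` candidate to a terminal status and stamp reviewed_at.

        Assumption: reviewed candidates are never reopened.
        """
        if new_status not in CANDIDATE_STATUSES - {"new"}:
            raise ValueError(f"unsupported target status: {new_status!r}")
        if self.status != "new":
            raise ValueError(f"candidate already reviewed as {self.status!r}")
        self.status = new_status
        self.reviewed_at = now or datetime.now(timezone.utc)
```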

### 4.3 BrainMemory
Purpose: durable brain knowledge used at retrieval time.

Recommended fields:
- `id`
- `user_id`
- `memory_type` (`preference`, `habit`, `goal`, `project_fact`, `decision`, `solution`, `topic_profile`)
- `title`
- `content`
- `importance`
- `confidence`
- `timeline_date`
- `first_learned_at`
- `last_reinforced_at`
- `reinforcement_count`
- `status` (`active`, `archived`, `deleted`)
- `origin_candidate_id`
- `origin_source_types` (JSON array)
- `metadata_` (JSON)
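
One possible reading of `last_reinforced_at` and `reinforcement_count` is sketched below: each time the learning job sees fresh evidence for an existing memory, it bumps the counter and timestamp instead of creating a duplicate. The `reinforce()` helper, the fixed confidence step, and the rule that only active memories are reinforced are assumptions for illustration, not part of the blueprint.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class BrainMemory:
    """Trimmed stand-in for a BrainMemory row; only reinforcement fields shown."""

    id: int
    title: str
    confidence: float
    first_learned_at: datetime
    last_reinforced_at: datetime
    reinforcement_count: int = 0
    status: str = "active"


def reinforce(
    memory: BrainMemory, seen_at: datetime, confidence_step: float = 0.05
) -> BrainMemory:
    """Record one more supporting observation for an existing memory.

    Assumptions: confidence rises by a fixed step capped at 1.0, and only
    active memories are reinforced.
    """
    if memory.status != "active":
        raise ValueError("only active memories can be reinforced")
    memory.reinforcement_count += 1
    # Never move the timestamp backwards if events arrive out of order.
    memory.last_reinforced_at = max(memory.last_reinforced_at, seen_at)
    memory.confidence = min(1.0, memory.confidence + confidence_step)
    return memory
```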

### 4.4 BrainTag
Purpose: independent tagging layer for brain browsing, filtering, and scoring.

Recommended fields:
- `id`
- `user_id`
- `name`
- `category` (`topic`, `value`, `time`, `source`)
- `priority` (`important`, `secondary`)
- `score`
- `last_seen_at`
- `created_at`
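
To make the relationship between `score`, `priority`, and `last_seen_at` concrete, here is a minimal sketch of a daily tag refresh. The exponential decay factor, the threshold of 3.0, and the `refresh_tag()` helper are all hypothetical; the blueprint only says that scores and priority levels are refreshed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class BrainTag:
    """Trimmed stand-in for a BrainTag row."""

    name: str
    category: str
    score: float
    last_seen_at: datetime
    priority: str = "secondary"


def refresh_tag(
    tag: BrainTag,
    hits_today: int,
    now: datetime,
    decay: float = 0.9,
    important_threshold: float = 3.0,
) -> BrainTag:
    """Decay the old score, add today's hits, and re-derive the priority.

    Assumption: priority is purely score-driven. A real implementation would
    likely need to respect manual promote/demote actions as well.
    """
    tag.score = tag.score * decay + hits_today
    if hits_today > 0:
        tag.last_seen_at = now
    tag.priority = "important" if tag.score >= important_threshold else "secondary"
    return tag
```

With these example parameters, a tag that stops appearing drifts back to `secondary` over successive runs, which is one way to feed the cleanup actions described for the secondary tags panel.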

### 4.5 Link Tables
Add many-to-many link tables:
- `brain_event_tags`
- `brain_candidate_tags`
- `brain_memory_tags`
- optional `brain_memory_events` for direct memory-to-event traceability beyond JSON arrays

These link tables are critical because phase 1 needs tag filters and timeline tracing before advanced graph projection exists.
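
A tag filter over `brain_memory_tags` reduces to a two-join query. The sketch below uses `sqlite3` with heavily trimmed tables; the table names `brain_memories` and `brain_tags`, the column names `memory_id` and `tag_id`, and the `memories_for_tag()` helper are assumptions.

```python
import sqlite3

# Trimmed, hypothetical schema: only the columns needed for the join are shown.
SCHEMA = """
CREATE TABLE brain_memories (
    id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, title TEXT NOT NULL
);
CREATE TABLE brain_tags (
    id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, name TEXT NOT NULL
);
CREATE TABLE brain_memory_tags (
    memory_id INTEGER NOT NULL REFERENCES brain_memories (id),
    tag_id    INTEGER NOT NULL REFERENCES brain_tags (id),
    PRIMARY KEY (memory_id, tag_id)
);
"""


def memories_for_tag(conn: sqlite3.Connection, user_id: int, tag_name: str) -> list[str]:
    """Return titles of the user's memories that carry the given tag."""
    rows = conn.execute(
        """
        SELECT m.title
        FROM brain_memories AS m
        JOIN brain_memory_tags AS mt ON mt.memory_id = m.id
        JOIN brain_tags AS t ON t.id = mt.tag_id
        WHERE m.user_id = ? AND t.user_id = ? AND t.name = ?
        ORDER BY m.id
        """,
        (user_id, user_id, tag_name),
    ).fetchall()
    return [title for (title,) in rows]
```

The composite primary key on the link table prevents the same tag from being attached to a memory twice.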

---

## 5. Ingestion Strategy
Phase 1 should not rewrite existing modules. Instead, it should add thin ingestion hooks near existing write paths.

### Conversation ingestion
Trigger points:
- after user message creation
- after assistant completion
- after memory extraction / summary creation

Event examples:
- important user instruction
- explicit “remember this” request
- repeated topic cluster
- conversation-derived decision or unresolved goal

### Document ingestion
Trigger points:
- after upload success
- after indexing completes
- after manual chunk edits

Event examples:
- document uploaded
- document indexed
- high-value section discovered
- document summary available

### Todo ingestion
Trigger points:
- todo created
- todo completed
- AI-generated todo created

Event examples:
- planned work item
- recurring operational duty
- completion signal reflecting actual user focus

### Task/Kanban ingestion
Trigger points:
- task created
- task status changed
- task completed
- priority changed

Event examples:
- declared project goal
- active workstream
- resolved milestone

### Forum ingestion
Trigger points:
- post created
- reply created
- forum instruction executed or referenced

Event examples:
- public project decision
- repeated operational issue
- reusable explanation or solution

Implementation note: source ingestion may create BrainEvent rows inline in the write path or via lightweight background tasks. In either case, a slow or failed BrainEvent write must not block or fail the original user flow.
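
One way to keep ingestion hooks thin is to wrap the event write so that it can never raise into the caller. The sketch below shows the idea for a todo write path; `emit_brain_event_safely()`, `create_todo()`, the logger name, and the payload shape are hypothetical, and a real hook could hand the payload to a background task instead of writing inline.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("brain.ingestion")


def emit_brain_event_safely(
    write_event: Callable[[dict[str, Any]], None], payload: dict[str, Any]
) -> bool:
    """Try to record a BrainEvent; never raise into the caller's write path."""
    try:
        write_event(payload)
        return True
    except Exception:
        # Ingestion is best-effort: log the failure and let the user flow continue.
        logger.exception("BrainEvent ingestion failed; continuing original flow")
        return False


def create_todo(title: str, write_event: Callable[[dict[str, Any]], None]) -> dict[str, Any]:
    """Hypothetical todo write path with a thin ingestion hook at the end."""
    todo = {"id": 1, "title": title, "status": "open"}  # stands in for the real insert
    emit_brain_event_safely(
        write_event,
        {
            "source_type": "todo",
            "source_id": todo["id"],
            "event_type": "created",
            "title": title,
        },
    )
    return todo
```

Even when `write_event` raises, `create_todo()` still returns the new todo, which is the behavior the implementation note asks for.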

---

## 6. Learning and Promotion Pipeline
Phase 1 should add a new daily scheduler workflow dedicated to the brain.

### New scheduler job: `brain_daily_learning_task`
Suggested run: once daily, after the bulk of user activity (for example at 01:00). The run time can become configurable per user in a later phase.

Pipeline steps:
1. collect `BrainEvent` rows with status `pending` for the target date;
2. cluster by source, topic, and repeated patterns;
3. ask the LLM to produce candidate knowledge with tags and importance explanations;
4. deduplicate against existing `BrainMemory` by semantic and rule-based matching;
5. promote high-confidence candidates into `BrainMemory`;
6. mark low-value candidates as `rejected`, or leave them at status `new` as observation-only candidates;
7. refresh tag scores and priority levels;
8. mark consumed events as processed.
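
The steps above can be sketched as a single function with the LLM call injected as a plain callable. This is a deliberately simplified skeleton under several assumptions: clustering is by a precomputed `topic` string only, deduplication is a case-insensitive title match rather than semantic matching, step 7 (tag refresh) is omitted, and the names `run_daily_learning`, `Event`, `Candidate`, and `RunResult` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    id: int
    topic: str
    summary: str
    status: str = "pending"


@dataclass
class Candidate:
    title: str
    confidence: float
    source_event_ids: list[int]
    status: str = "new"


@dataclass
class RunResult:
    promoted: list[str] = field(default_factory=list)
    merged: list[str] = field(default_factory=list)
    kept: list[str] = field(default_factory=list)


def run_daily_learning(
    events: list[Event],
    existing_titles: set[str],
    extract: Callable[[str, list[Event]], Candidate],
    promote_threshold: float = 0.8,
) -> RunResult:
    """Run one simplified learning pass over pending events."""
    result = RunResult()
    pending = [e for e in events if e.status == "pending"]        # step 1
    clusters: dict[str, list[Event]] = {}
    for event in pending:                                         # step 2 (topic only)
        clusters.setdefault(event.topic, []).append(event)
    for topic, cluster in clusters.items():
        candidate = extract(topic, cluster)                       # step 3 (LLM stand-in)
        if candidate.title.lower() in existing_titles:            # step 4 (rule-based only)
            candidate.status = "merged"
            result.merged.append(candidate.title)
        elif candidate.confidence >= promote_threshold:           # step 5
            candidate.status = "promoted"
            existing_titles.add(candidate.title.lower())
            result.promoted.append(candidate.title)
        else:                                                     # step 6 (stays `new`)
            result.kept.append(candidate.title)
    for event in pending:                                         # step 8
        event.status = "processed"
    return result
```

Because events are marked `processed` at the end of the pass, running the function a second time over the same events produces no new candidates, which addresses the duplicate-memory risk from repeated daily runs.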

### Promotion rules for phase 1
Promote automatically when any of these are true:
- user explicitly requested the system to remember something;
- the same topic appears across multiple sources;
- a solution/decision was formed and looks reusable;
- a stable preference or habit is seen repeatedly;
- a task/todo/forum thread confirms relevance with user action.

Keep as candidate-only when:
- information is recent but not yet stable;
- importance is uncertain;
- it appears only once without reinforcement.

Reject when:
- content is obviously transient;
- it is too generic to help future answers;
- it duplicates active memory without adding new value.
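
The three rule groups above map naturally onto a small decision function. In this sketch the signal names, the "two or more" thresholds for repetition and cross-source appearance, and the precedence (reject checks run before promote checks) are assumptions; the blueprint does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSignals:
    """Hypothetical boolean and count signals derived for one candidate."""

    explicit_remember_request: bool = False
    distinct_source_types: int = 1
    is_reusable_solution_or_decision: bool = False
    repeat_count: int = 1
    confirmed_by_user_action: bool = False
    is_transient: bool = False
    is_too_generic: bool = False
    duplicates_active_memory: bool = False


def decide(signals: CandidateSignals) -> str:
    """Return 'promote', 'keep', or 'reject' for a candidate.

    Assumption: any reject condition overrides every promote condition.
    """
    if signals.is_transient or signals.is_too_generic or signals.duplicates_active_memory:
        return "reject"
    if (
        signals.explicit_remember_request
        or signals.distinct_source_types >= 2
        or signals.is_reusable_solution_or_decision
        or signals.repeat_count >= 2
        or signals.confirmed_by_user_action
    ):
        return "promote"
    # Seen once, nothing confirms it yet: stay in the candidate layer.
    return "keep"
```

Keeping the rules in one pure function makes them easy to unit test and to tune before any user-facing policy UI exists.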

---

## 7. Retrieval Integration
Phase 1 must let chat use the brain in a controlled way.

### New retrieval service
Add a dedicated `brain_retrieval_service` or extend `memory_service` with brain-aware retrieval APIs.

Responsibilities:
- retrieve top relevant `BrainMemory` rows by query, tags, time context, and importance;
- optionally retrieve recent `BrainEvent` summaries for recency-sensitive answers;
- merge `BrainMemory` results with the existing `UserMemory` and `MemorySummary` data into one retrieval result shape;
- support limits to avoid prompt bloat.

### Retrieval policy
At answer time:
- always consider long-term `BrainMemory`;
- include recent event summaries only when the question appears time-sensitive or project-state-sensitive;
- cap injected brain context to a small curated set.

Recommended first integration path:
- extend `build_memory_context()` to append a new `【知识大脑】` ("Knowledge Brain") block built from `BrainMemory` retrieval.
- keep existing conversation summary logic intact.

This gives immediate product value without requiring a full prompt orchestration rewrite.
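
A minimal sketch of the capped context block follows. It assumes each retrieved memory already carries a `relevance` score from an earlier search step, and it ranks by a weighted blend of relevance and importance. The 0.7/0.3 weights, the `max_items` and `max_chars` limits, and the `build_brain_block()` helper are illustrative assumptions, not values from the blueprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedMemory:
    title: str
    content: str
    importance: float  # 0.0 to 1.0
    relevance: float   # 0.0 to 1.0, e.g. from a semantic search step


def build_brain_block(
    memories: list[RetrievedMemory], max_items: int = 5, max_chars: int = 600
) -> str:
    """Render the top memories as a prompt block, capped by count and size."""
    ranked = sorted(
        memories,
        key=lambda m: m.relevance * 0.7 + m.importance * 0.3,
        reverse=True,
    )
    lines: list[str] = []
    used = 0
    for memory in ranked[:max_items]:
        line = f"- {memory.title}: {memory.content}"
        # The character budget counts bullet lines only, not the header.
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line)
    if not lines:
        return ""  # nothing to inject; leave the prompt untouched
    return "【知识大脑】\n" + "\n".join(lines)
```

Returning an empty string when nothing qualifies lets `build_memory_context()` skip the block entirely rather than emit an empty header.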

---

## 8. Backend Services to Add or Refactor

### New services
1. `brain_event_service.py`
   - normalize incoming source data into BrainEvent rows
   - provide source-specific helper constructors

2. `brain_learning_service.py`
   - run daily candidate extraction
   - score, dedupe, and promote memories

3. `brain_tag_service.py`
   - manage tags, scoring, priority updates, and cleanup suggestions

4. `brain_retrieval_service.py`
   - retrieve relevant memories and recent events for chat and UI

### Existing services to extend
- `memory_service.py`: integrate BrainMemory retrieval and possibly migrate `UserMemory` into the new model later
- `scheduler_service.py`: register brain daily learning job
- `agent_service.py`: inject retrieved brain context into chat pipeline
- `document_service.py`, `todo_service.py`, task/forum write paths: emit BrainEvent rows

---

## 9. API Plan
Phase 1 should add a dedicated `/api/brain` router.

### Read APIs
- `GET /api/brain/overview`
  - counts: active memories, candidates, important tags, recent events
  - today's learning summary

- `GET /api/brain/memories`
  - filters: tag, type, status, date range, source type

- `GET /api/brain/candidates`
  - filters: status, date, score threshold

- `GET /api/brain/tags`
  - segmented into important and secondary

- `GET /api/brain/timeline`
  - grouped by day/week; includes events, candidate promotions, reinforced memories

- `GET /api/brain/memory/{id}`
  - full traceability including linked events and tags
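
For the `GET /api/brain/timeline` endpoint listed above, day and week grouping could be computed as in the following sketch. The input tuple shape, the response shape (`bucket` and `items` keys), the ISO week label format, and the `group_timeline()` helper are assumptions.

```python
from collections import defaultdict
from datetime import date


def group_timeline(
    entries: list[tuple[date, str, str]], granularity: str = "day"
) -> list[dict]:
    """Group (date, kind, title) entries into day or ISO-week buckets, newest first."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for entry_date, kind, title in entries:
        if granularity == "day":
            key = entry_date.isoformat()
        elif granularity == "week":
            iso = entry_date.isocalendar()  # (ISO year, ISO week, ISO weekday)
            key = f"{iso[0]}-W{iso[1]:02d}"
        else:
            raise ValueError(f"unsupported granularity: {granularity!r}")
        buckets[key].append({"kind": kind, "title": title})
    # Zero-padded keys sort correctly as plain strings.
    return [{"bucket": key, "items": buckets[key]} for key in sorted(buckets, reverse=True)]
```

Here `kind` would distinguish events, candidate promotions, and reinforced memories so the frontend can badge each timeline entry.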

### Write/management APIs
- `POST /api/brain/memory/{id}/promote`
- `POST /api/brain/memory/{id}/archive`
- `DELETE /api/brain/memory/{id}`
- `POST /api/brain/tag/{id}/promote`
- `POST /api/brain/tag/{id}/demote`
- `DELETE /api/brain/tag/{id}`
- `POST /api/brain/learn/run`
  - manual trigger for daily learning pipeline

### Compatibility note
Do not remove `/api/graph` in phase 1. Keep it as a legacy projection route while the new brain module is introduced.

---

## 10. Frontend Module Structure
The current `知识大脑` ("Knowledge Brain") nav item should stop meaning “graph only” and become a real brain dashboard.

### Route strategy
Preferred phase 1 structure:
- `/brain` → new knowledge brain dashboard
- `/graph` → graph view tab or subview under the brain module, retained for relation visualization

### Brain dashboard sections
1. **Overview header**
   - total active memories
   - today's learned items
   - important tags count
   - last learning run

2. **Important tags panel**
   - AI-ranked important tags
   - click to filter related memories and timeline entries

3. **Secondary tags panel**
   - lower-priority tags with cleanup actions

4. **Recent learned knowledge**
   - newly promoted memories
   - reasons and source badges

5. **Timeline panel**
   - daily grouped events and promotions
   - support time-based backtracking

6. **Graph subview**
   - optional tab or secondary panel for relation projection

### User actions in phase 1
- delete memory
- archive memory
- promote/demote tag priority
- manually trigger learning run
- inspect why a memory exists

This is enough to make the brain visible and manageable even before advanced graph reasoning exists.

---

## 11. Suggested Delivery Breakdown

### Step 1: Persistence foundation
- add brain models and migrations
- add SQLAlchemy registrations and schemas

### Step 2: Event ingestion
- emit BrainEvent rows from conversation/document/todo/task/forum flows

### Step 3: Learning workflow
- implement daily learning job and manual trigger API

### Step 4: Retrieval integration
- wire BrainMemory into chat context assembly

### Step 5: Brain dashboard backend
- add overview, memories, tags, timeline endpoints

### Step 6: Brain dashboard frontend
- add `/brain` page and move graph into a subview or separate tab

---

## 12. Risks and Guardrails

### Main risks
- over-collection leading to noisy memories;
- prompt bloat from injecting too much brain context;
- duplicate memory creation across repeated daily runs;
- unclear distinction between candidate and durable memory;
- UI becoming graph-centric again instead of brain-centric.

### Guardrails
- enforce candidate layer before promotion;
- cap retrieval size strictly;
- keep source traceability for every promoted memory;
- make tag cleanup explicit in UI;
- treat graph as a projection, not the source of truth.

---

## 13. Phase 1 Success Criteria
Phase 1 is successful when all of the following are true:
- the system creates normalized BrainEvent rows from all five major source domains;
- a scheduled daily learning job produces candidates and promotes high-value memories;
- Jarvis can retrieve durable brain memories during future answers;
- the frontend exposes a real brain dashboard with tags, recent knowledge, and timeline;
- users can inspect and clean what the system learned;
- the old graph page is no longer the only visible representation of the brain.