Refine knowledge brain workflow

Align the brain prompts, graph view, and startup defaults with the latest phase 1 flow so local runs and navigation stay consistent.
2026-03-22 22:42:47 +08:00
parent 67ea3d2682
commit 6f594631e9
23 changed files with 1508 additions and 526 deletions
--- a/docs/superpowers/plans/2026-03-20-knowledge-brain-phase-1-task-breakdown.md
+++ b/docs/superpowers/plans/2026-03-20-knowledge-brain-phase-1-task-breakdown.md
@@ -0,0 +1,555 @@
+# Jarvis Knowledge Brain Phase 1 Task Breakdown
+
+## Goal
+Turn the phase-1 knowledge brain blueprint into an execution-ready development task list tied to the current codebase.
+
+---
+
+## A. Backend Persistence Tasks
+
+### A1. Add new brain models
+Create new SQLAlchemy models under `backend/app/models/`:
+- `brain_event.py`
+- `brain_candidate.py`
+- `brain_memory.py`
+- `brain_tag.py`
+- optional link-table definitions in `brain_relations.py` or colocated within the above files
+
+Core entities to add:
+- `BrainEvent`
+- `BrainCandidate`
+- `BrainMemory`
+- `BrainTag`
+- `BrainEventTag`
+- `BrainCandidateTag`
+- `BrainMemoryTag`
+- optional `BrainMemoryEvent`
+
+Acceptance criteria:
+- All models inherit from the project base model pattern.
+- All required enums/status fields are defined.
+- User ownership and timeline fields exist.
+- Link tables support tag filtering and source traceability.
+
+### A2. Register models in model exports
+Update:
+- `backend/app/models/__init__.py`
+
+Acceptance criteria:
+- New brain models are imported and available during metadata initialization.
+
+### A3. Add migration / schema evolution support
+Depending on the project's current migration approach, add the required DB migration path for the new tables.
+
+Acceptance criteria:
+- New tables can be created in local/dev environments without breaking existing tables.
+- Indexes for `user_id`, status, and date-based access patterns are included.
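A minimal sketch of what an additive, re-runnable schema step could look like. It uses the standard-library `sqlite3` module rather than the project's real migration tooling, which the plan leaves open. The table name `brain_events`, its columns, and the index names are assumptions for illustration only.

```python
import sqlite3

# Hypothetical table and column names; the real models would be SQLAlchemy classes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS brain_events (
    id          INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    source_type TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',
    occurred_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS ix_brain_events_user_id ON brain_events (user_id);
CREATE INDEX IF NOT EXISTS ix_brain_events_status ON brain_events (status);
CREATE INDEX IF NOT EXISTS ix_brain_events_user_date
    ON brain_events (user_id, occurred_at);
"""


def ensure_brain_tables(conn: sqlite3.Connection) -> None:
    """Create the brain table and its indexes if missing; safe to call repeatedly."""
    conn.executescript(SCHEMA)


def list_indexes(conn: sqlite3.Connection, table: str) -> set:
    """Return the names of the indexes defined on `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    ).fetchall()
    return {name for (name,) in rows}
```

Because every statement uses `IF NOT EXISTS`, running the step against a database that already has the table leaves existing data untouched.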
+
+### A4. Add Pydantic schemas
+Create new schema files under `backend/app/schemas/`:
+- `brain.py`
+
+Schema groups to add:
+- overview response
+- memory list/detail response
+- candidate list response
+- tag response
+- timeline response
+- manual learning trigger response
+- memory/tag management payloads
+
+Acceptance criteria:
+- Schemas match the intended `/api/brain` response shapes.
+- Timeline and traceability structures are explicit, not loosely typed blobs.
+
+---
+
+## B. Backend Service Tasks
+
+### B1. Create brain event ingestion service
+Add:
+- `backend/app/services/brain_event_service.py`
+
+Responsibilities:
+- normalize source records into `BrainEvent`
+- expose helpers such as:
+  - `record_conversation_event(...)`
+  - `record_document_event(...)`
+  - `record_todo_event(...)`
+  - `record_task_event(...)`
+  - `record_forum_event(...)`
+
+Acceptance criteria:
+- Each helper accepts current source-domain inputs without forcing those modules to understand brain internals.
+- Event creation is idempotent enough to avoid obvious duplicate rows for the same source update.
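One way the idempotency criterion could be approached is a dedupe key derived from the fields that identify a single source update. The sketch below is an assumption, not the planned implementation: `InMemoryEventStore`, `dedupe_key`, `record_event`, and the `revision` argument are all hypothetical names, and a dict stands in for the `BrainEvent` table.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class InMemoryEventStore:
    """Stand-in for the BrainEvent table, keyed by a dedupe hash."""
    events: dict = field(default_factory=dict)


def dedupe_key(user_id: str, source_type: str, source_id: str, revision: str) -> str:
    """Hash the fields that identify one source update."""
    raw = json.dumps([user_id, source_type, source_id, revision])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def record_event(store: InMemoryEventStore, user_id: str, source_type: str,
                 source_id: str, revision: str, payload: dict) -> bool:
    """Record the event unless the same source update was already recorded.

    Returns True when a new event was stored, False for a duplicate.
    """
    key = dedupe_key(user_id, source_type, source_id, revision)
    if key in store.events:
        return False
    store.events[key] = {
        "user_id": user_id,
        "source_type": source_type,
        "source_id": source_id,
        "payload": payload,
    }
    return True
```

In a real table the same effect could come from a unique constraint on the key column, so that concurrent writers cannot both insert the row.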
+
+### B2. Create brain learning service
+Add:
+- `backend/app/services/brain_learning_service.py`
+
+Responsibilities:
+- load pending `BrainEvent`s for a given date/user scope
+- cluster related events
+- call the LLM to create candidate knowledge
+- score and dedupe candidates
+- promote high-confidence candidates into `BrainMemory`
+- mark processed events and candidate statuses
+
+Acceptance criteria:
+- Service supports both manual run and scheduler run.
+- Promotion/rejection decisions are explicit and testable.
+- Source event traceability is preserved.
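To make "explicit and testable" concrete, the promotion rule can be a pure function over a candidate. This is a sketch under stated assumptions: the `Decision` values, the `Candidate` fields, and both thresholds (`0.8`, `0.3`) are invented for illustration and are not taken from the plan.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    PROMOTE = "promote"
    KEEP_CANDIDATE = "keep_candidate"
    REJECT = "reject"


@dataclass(frozen=True)
class Candidate:
    text: str
    confidence: float        # assumed to be a 0.0 to 1.0 score from the scoring step
    source_event_ids: tuple  # ids of the BrainEvents the candidate was derived from


def decide(candidate: Candidate, promote_at: float = 0.8,
           reject_below: float = 0.3) -> Decision:
    """Return an explicit decision for one candidate."""
    if not candidate.source_event_ids:
        # Without source events the memory could not be traced back, so reject it.
        return Decision.REJECT
    if candidate.confidence >= promote_at:
        return Decision.PROMOTE
    if candidate.confidence < reject_below:
        return Decision.REJECT
    return Decision.KEEP_CANDIDATE
```

Keeping the rule free of database and LLM calls means it can be unit-tested with plain values, while the service applies the returned decision to rows.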
+
+### B3. Create brain tag service
+Add:
+- `backend/app/services/brain_tag_service.py`
+
+Responsibilities:
+- attach and score tags
+- split tags into important vs secondary
+- update tag scores after learning runs
+- support cleanup recommendations
+
+Acceptance criteria:
+- Important/secondary classification is persisted, not only computed in the UI.
+- Tag lookups support filtering memories and timeline entries.
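A small sketch of the important/secondary split where the classification is written back onto the tag instead of being derived in the UI. The `Tag` fields and the `important_at` threshold are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    name: str
    score: float
    priority: str = "secondary"  # persisted value: "important" or "secondary"


def reclassify_tags(tags: list, important_at: float = 5.0) -> dict:
    """Update each tag's stored priority and return tag names grouped by priority."""
    for tag in tags:
        tag.priority = "important" if tag.score >= important_at else "secondary"
    grouped = {"important": [], "secondary": []}
    # Highest score first; ties are broken alphabetically for stable output.
    for tag in sorted(tags, key=lambda t: (-t.score, t.name)):
        grouped[tag.priority].append(tag.name)
    return grouped
```

The grouped result has the same shape the tags endpoint would need if it returns tags split by priority.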
+
+### B4. Create brain retrieval service
+Add:
+- `backend/app/services/brain_retrieval_service.py`
+
+Responsibilities:
+- retrieve relevant `BrainMemory` records by query
+- optionally retrieve recent events for recency-sensitive prompts
+- format results for chat injection and API responses
+
+Acceptance criteria:
+- Retrieval has strict limits to prevent prompt bloat.
+- Results support filtering by tags, source type, and time range.
+
+### B5. Refactor or extend memory service
+Update:
+- `backend/app/services/memory_service.py`
+
+Tasks:
+- keep existing summary and `UserMemory` behavior intact
+- extend `build_memory_context()` to append a `【知识大脑】` ("Knowledge Brain") block from `BrainRetrievalService`
+- keep memory context size bounded
+
+Acceptance criteria:
+- Existing conversation summary behavior continues to work.
+- Chat can consume `BrainMemory` without requiring a full prompt architecture rewrite.
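The bounded-context requirement could be enforced with both an item cap and a character budget. The helper below is a hypothetical sketch: `append_brain_block` and its default limits are assumptions, and the real `build_memory_context()` signature is not shown in this plan.

```python
BRAIN_HEADER = "【知识大脑】"  # block label taken from the plan


def append_brain_block(base_context: str, memories: list,
                       max_items: int = 5, max_chars: int = 600) -> str:
    """Append a bounded knowledge-brain block to an existing memory context."""
    lines = []
    used = 0
    for memory in memories[:max_items]:
        line = f"- {memory.strip()}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the character budget
        lines.append(line)
        used += len(line)
    if not lines:
        # Nothing to add, so the existing context is returned unchanged.
        return base_context
    parts = [base_context] if base_context else []
    return "\n".join([*parts, BRAIN_HEADER, *lines])
```

Returning the base context unchanged when no memory fits keeps the existing summary behavior intact for users with an empty brain.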
+
+---
+
+## C. Source Ingestion Integration Tasks
+
+### C1. Conversation → BrainEvent
+Update likely files:
+- `backend/app/services/agent_service.py`
+- possibly `backend/app/services/memory_service.py`
+
+Hook points:
+- after user message persistence
+- after assistant response persistence
+- after summary/memory extraction
+
+Acceptance criteria:
+- Important conversation actions produce normalized `BrainEvent`s.
+- Explicit “remember this” signals are captured as stronger events.
+
+### C2. Document → BrainEvent
+Update likely files:
+- `backend/app/routers/document.py`
+- `backend/app/services/document_service.py`
+- `backend/app/services/knowledge_service.py`
+
+Hook points:
+- upload success
+- indexing completion
+- chunk edit / reindex
+
+Acceptance criteria:
+- Document lifecycle milestones become `BrainEvent`s.
+- Source metadata includes document identity and folder context.
+
+### C3. Todo → BrainEvent
+Update likely files:
+- `backend/app/routers/todo.py`
+- `backend/app/services/todo_service.py`
+
+Hook points:
+- todo creation
+- completion
+- AI-generated todo creation
+
+Acceptance criteria:
+- Todo events reflect both planning and completion signals.
+- AI-generated todos are distinguishable from manual ones.
+
+### C4. Task/Kanban → BrainEvent
+Update likely files:
+- `backend/app/routers/task.py`
+
+Hook points:
+- task creation
+- status change
+- completion
+- priority change
+
+Acceptance criteria:
+- Task state changes create meaningful workstream events.
+- Duplicate writes are avoided on no-op updates.
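A no-op guard can compare the tracked fields before and after the update and emit an event only when something changed. The field list and function names below are assumptions; the actual task model is not shown in this plan.

```python
TRACKED_FIELDS = ("status", "priority")  # assumed set of fields worth an event


def changed_fields(before: dict, after: dict, tracked=TRACKED_FIELDS) -> dict:
    """Return {field: (old, new)} for tracked fields whose value changed."""
    return {
        name: (before.get(name), after.get(name))
        for name in tracked
        if before.get(name) != after.get(name)
    }


def should_emit_task_event(before: dict, after: dict) -> bool:
    """True only when at least one tracked field actually changed."""
    return bool(changed_fields(before, after))
```

The dict returned by `changed_fields` could also be stored in the event payload so the timeline can show what changed.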
+
+### C5. Forum → BrainEvent
+Update likely files:
+- `backend/app/routers/forum.py`
+- optionally `backend/app/services/scheduler_service.py`
+
+Hook points:
+- post created
+- reply created
+- forum instruction execution
+
+Acceptance criteria:
+- Forum posts/replies that matter to project state become brain events.
+- Source traceability includes whether the event came from a post, reply, or executed instruction.
+
+---
+
+## D. Scheduler and Daily Learning Tasks
+
+### D1. Add daily brain learning job
+Update:
+- `backend/app/services/scheduler_service.py`
+
+Add:
+- `brain_daily_learning_task()`
+
+Responsibilities:
+- run daily for pending events
+- invoke `BrainLearningService`
+- log promoted/rejected counts
+
+Acceptance criteria:
+- Job is registered in `start_scheduler()`.
+- Job can run safely when there are no pending events.
+
+### D2. Add manual trigger path
+Update or add:
+- `backend/app/routers/scheduler.py` or the new `backend/app/routers/brain.py`
+
+Acceptance criteria:
+- Developers/users can manually run learning for testing.
+- Trigger returns a useful summary, not only a started flag.
+
+### D3. Decide scheduler ownership model for phase 1
+The current scheduler is global. Decide whether phase 1 runs:
+- for all users in one job, or
+- per user loop inside one job
+
+Acceptance criteria:
+- No hard-coded `user_id="default"` behavior remains in new brain learning flow.
+- User iteration strategy is explicit.
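If the per-user loop is chosen, the job body might look like the sketch below. It assumes pending events are plain dicts with a `user_id` key and that `learn_for_user` is a callable wrapping `BrainLearningService`; both are illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, Iterable


def run_daily_learning(pending_events: Iterable,
                       learn_for_user: Callable) -> dict:
    """Group pending events by user and run learning once per user.

    Returns {user_id: run_stats}. With no pending events the result is empty
    and `learn_for_user` is never called.
    """
    by_user = defaultdict(list)
    for event in pending_events:
        by_user[event["user_id"]].append(event)
    return {
        user_id: learn_for_user(user_id, events)
        for user_id, events in by_user.items()
    }
```

Because users are discovered from the events themselves, no default user id is needed, and an empty day is a natural no-op.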
+
+---
+
+## E. Backend API Tasks
+
+### E1. Add brain router
+Create:
+- `backend/app/routers/brain.py`
+
+Register in:
+- `backend/app/main.py`
+- `backend/app/routers/__init__.py` if needed
+
+### E2. Implement overview endpoint
+Endpoint:
+- `GET /api/brain/overview`
+
+Should return:
+- active memory count
+- candidate count
+- important tag count
+- recent event count
+- last learning run info
+- today’s promoted/rejected summary
+
+### E3. Implement memory endpoints
+Endpoints:
+- `GET /api/brain/memories`
+- `GET /api/brain/memory/{id}`
+- `POST /api/brain/memory/{id}/archive`
+- `DELETE /api/brain/memory/{id}`
+- optional `POST /api/brain/memory/{id}/promote` if candidate-to-memory management is exposed here
+
+Acceptance criteria:
+- Memory detail shows source traceability and tags.
+- List endpoint supports the pagination and filters needed by the UI.
+
+### E4. Implement candidate endpoints
+Endpoints:
+- `GET /api/brain/candidates`
+- optional promote/reject endpoints if candidates are user-manageable in phase 1
+
+Acceptance criteria:
+- Candidate status and scoring are inspectable.
+
+### E5. Implement tag endpoints
+Endpoints:
+- `GET /api/brain/tags`
+- `POST /api/brain/tag/{id}/promote`
+- `POST /api/brain/tag/{id}/demote`
+- `DELETE /api/brain/tag/{id}`
+
+Acceptance criteria:
+- API groups tags by important vs secondary.
+- Manual cleanup actions are supported.
+
+### E6. Implement timeline endpoint
+Endpoint:
+- `GET /api/brain/timeline`
+
+Acceptance criteria:
+- Timeline groups records by day or returns a structure easily grouped by day in the UI.
+- Includes event entries and memory promotion entries.
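Server-side grouping by day could be sketched as follows. The entry shape (`occurred_at` as an ISO 8601 string, `kind` as `"event"` or `"memory_promotion"`) is an assumption made for this example.

```python
from collections import defaultdict
from datetime import datetime


def group_timeline_by_day(entries: list) -> list:
    """Group timeline entries by calendar day, newest day and newest entry first."""
    buckets = defaultdict(list)
    for entry in entries:
        day = datetime.fromisoformat(entry["occurred_at"]).date().isoformat()
        buckets[day].append(entry)
    return [
        {
            "date": day,
            "entries": sorted(
                buckets[day], key=lambda e: e["occurred_at"], reverse=True
            ),
        }
        for day in sorted(buckets, reverse=True)
    ]
```

Sorting on the raw string assumes all timestamps share one format and offset; mixed offsets would need sorting on parsed datetimes instead.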
+
+### E7. Implement learning trigger endpoint
+Endpoint:
+- `POST /api/brain/learn/run`
+
+Acceptance criteria:
+- Supports manual learning run for current user or all users, depending on phase-1 policy.
+- Returns meaningful run stats.
+
+---
+
+## F. Chat Integration Tasks
+
+### F1. Inject knowledge brain into chat context
+Update:
+- `backend/app/services/agent_service.py`
+- `backend/app/services/memory_service.py`
+
+Acceptance criteria:
+- Relevant `BrainMemory` items appear in prompt context.
+- Context remains concise and bounded.
+- Existing response flow remains stable.
+
+### F2. Add retrieval policy guardrails
+Tasks:
+- define per-query memory limits
+- choose when to include recent events
+- prefer important/high-confidence memories
+
+Acceptance criteria:
+- Brain retrieval does not overwhelm standard conversation context.
+- Time-sensitive answers can still include recent context when needed.
+
+---
+
+## G. Frontend Route and Navigation Tasks
+
+### G1. Introduce a real brain route
+Update likely files:
+- `frontend/src/app/router/routes.ts`
+- `frontend/src/app/navigation/nav.ts`
+
+Tasks:
+- add `/brain`
+- make `知识大脑` ("Knowledge Brain") point to `/brain`
+- keep `/graph` available as a subview or secondary route
+
+Acceptance criteria:
+- Brain is no longer represented only by the graph page.
+
+### G2. Define frontend brain API client
+Add:
+- `frontend/src/api/brain.ts`
+
+Methods:
+- `getOverview`
+- `getMemories`
+- `getMemoryDetail`
+- `getCandidates`
+- `getTags`
+- `getTimeline`
+- `runLearning`
+- memory/tag management actions
+
+Acceptance criteria:
+- API client matches backend router contract.
+
+---
+
+## H. Frontend Brain Dashboard Tasks
+
+### H1. Create new brain page
+Add:
+- `frontend/src/pages/brain/index.vue`
+
+Core page sections:
+- overview header
+- important tags panel
+- secondary tags panel
+- recent learned knowledge section
+- timeline section
+- graph tab/subview entry
+
+Acceptance criteria:
+- Page is useful even before graph projection is upgraded.
+- Dashboard reflects the brain, not just visualized relationships.
+
+### H2. Add page composable/state logic
+Add:
+- `frontend/src/pages/brain/composables/useBrainView.ts`
+
+Responsibilities:
+- fetch overview/tags/memories/timeline
+- manage filters and selected tags
+- trigger manual learning run
+- manage loading/error states
+
+Acceptance criteria:
+- Page logic stays separate from template complexity.
+
+### H3. Add memory list/detail components
+Suggested additions:
+- `frontend/src/components/brain/BrainMemoryList.vue`
+- `frontend/src/components/brain/BrainMemoryDetail.vue`
+- `frontend/src/components/brain/BrainTagPanel.vue`
+- `frontend/src/components/brain/BrainTimeline.vue`
+
+Acceptance criteria:
+- User can inspect why a memory exists.
+- User can archive/delete memories and promote/demote tags.
+
+### H4. Reposition graph as brain subview
+Possible approaches:
+- keep current `frontend/src/pages/graph/index.vue` but link it from `/brain`
+- or wrap the graph page as one tab inside the brain page
+
+Acceptance criteria:
+- Existing graph functionality remains accessible.
+- Product framing changes from “brain = graph” to “brain includes graph”.
+
+---
+
+## I. Testing Tasks
+
+### I1. Backend model/service tests
+Add tests for:
+- event creation
+- candidate generation status changes
+- promotion into `BrainMemory`
+- tag priority updates
+- timeline aggregation
+
+Suggested locations:
+- `backend/tests/backend/app/services/`
+- `backend/tests/backend/app/routers/`
+
+### I2. Retrieval integration tests
+Add tests for:
+- memory context injection
+- retrieval limits
+- recency-sensitive event inclusion
+
+### I3. API tests
+Add tests for:
+- `/api/brain/overview`
+- `/api/brain/memories`
+- `/api/brain/tags`
+- `/api/brain/timeline`
+- `/api/brain/learn/run`
+
+### I4. Frontend tests
+Add tests for:
+- brain composable fetch flow
+- filter behavior
+- manual learning run UI flow
+- tag grouping and memory rendering
+
+---
+
+## J. Recommended Execution Order
+
+### Wave 1: Foundation
+1. A1-A4 persistence and schemas
+2. B1 brain event service
+3. E1 add router skeleton
+
+### Wave 2: Ingestion
+4. C1-C5 connect all source domains to `BrainEvent`
+
+### Wave 3: Learning
+5. B2 brain learning service
+6. B3 brain tag service
+7. D1-D3 scheduler/manual learning
+
+### Wave 4: Retrieval
+8. B4 brain retrieval service
+9. B5 memory service integration
+10. F1-F2 chat injection and guardrails
+
+### Wave 5: Product surface
+11. E2-E7 complete `/api/brain` endpoints
+12. G1-G2 routing + API client
+13. H1-H4 dashboard and graph repositioning
+
+### Wave 6: Reliability
+14. I1-I4 tests and refinement
+
+---
+
+## K. Files Most Likely to Change in Phase 1
+
+### Backend new files
+- `backend/app/models/brain_event.py`
+- `backend/app/models/brain_candidate.py`
+- `backend/app/models/brain_memory.py`
+- `backend/app/models/brain_tag.py`
+- `backend/app/schemas/brain.py`
+- `backend/app/services/brain_event_service.py`
+- `backend/app/services/brain_learning_service.py`
+- `backend/app/services/brain_tag_service.py`
+- `backend/app/services/brain_retrieval_service.py`
+- `backend/app/routers/brain.py`
+
+### Backend existing files
+- `backend/app/models/__init__.py`
+- `backend/app/main.py`
+- `backend/app/services/memory_service.py`
+- `backend/app/services/agent_service.py`
+- `backend/app/services/scheduler_service.py`
+- `backend/app/routers/document.py`
+- `backend/app/routers/todo.py`
+- `backend/app/routers/task.py`
+- `backend/app/routers/forum.py`
+- possibly `backend/app/services/document_service.py`
+- possibly `backend/app/services/knowledge_service.py`
+
+### Frontend new files
+- `frontend/src/api/brain.ts`
+- `frontend/src/pages/brain/index.vue`
+- `frontend/src/pages/brain/composables/useBrainView.ts`
+- brain-related components under `frontend/src/components/brain/`
+
+### Frontend existing files
+- `frontend/src/app/router/routes.ts`
+- `frontend/src/app/navigation/nav.ts`
+- optionally `frontend/src/pages/graph/index.vue`
+
+---
+
+## L. Phase 1 “Definition of Done” Checklist
+- [ ] Brain persistence models exist and are queryable.
+- [ ] All five core domains emit `BrainEvent`s.
+- [ ] Daily learning creates `BrainCandidate`s and promotes durable `BrainMemory`s.
+- [ ] Tag priority is stored and manageable.
+- [ ] Chat can retrieve relevant brain knowledge.
+- [ ] `/api/brain` endpoints support dashboard and management actions.
+- [ ] `/brain` dashboard exists and is usable without relying on the graph page.
+- [ ] Graph remains available as a secondary/projection view.
+- [ ] Automated tests cover ingestion, promotion, retrieval, and UI basics.
--- a/docs/superpowers/plans/2026-03-20-knowledge-brain-phase-1-task-plan.md
+++ b/docs/superpowers/plans/2026-03-20-knowledge-brain-phase-1-task-plan.md
@@ -0,0 +1,27 @@
+# Task Plan: Jarvis Knowledge Brain Phase 1 Blueprint
+
+## Goal
+Create a practical phase-1 implementation blueprint for the event-driven knowledge brain, covering backend models, services, scheduler jobs, retrieval integration, APIs, and frontend brain module structure.
+
+## Phases
+- [x] Phase 1: Plan and setup
+- [x] Phase 2: Research/gather information
+- [x] Phase 3: Draft blueprint
+- [x] Phase 4: Review and deliver
+
+## Key Questions
+1. Which new persistence models are required for an event-driven knowledge brain?
+2. How should existing conversation, document, todo, task, and forum data flow into the brain?
+3. What should phase 1 include versus defer to later phases?
+4. How should the frontend brain module be structured before full graph intelligence exists?
+
+## Decisions Made
+- Use an event-driven brain architecture instead of extending the current graph-only flow.
+- Keep the current graph as a projection/view layer, not the brain source of truth.
+- Phase 1 should prioritize unified ingestion, candidate generation, long-term memory storage, and retrieval integration.
+
+## Errors Encountered
+- None yet.
+
+## Status
+**Completed** - Separate implementation plan drafted in `knowledge_ingestion_plan.md` and supporting notes updated.
--- a/docs/superpowers/plans/2026-03-20-knowledge-ingestion-normalization-plan.md
+++ b/docs/superpowers/plans/2026-03-20-knowledge-ingestion-normalization-plan.md
@@ -0,0 +1,210 @@
+# Knowledge Ingestion Normalization Plan
+
+## Goal
+Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.
+
+## Scope
+- Backend document parsing and normalization flow
+- Document persistence model updates
+- Incremental retrieval/indexing integration
+- Backfill/reindex strategy for existing documents
+- Test strategy for parser, router, and migration behavior
+
+## Non-Goals
+- Full parent-child chunk graph tables in this phase
+- Rewriting all chunking logic to markdown-first immediately
+- Replacing all non-PDF parsers with a new framework
+- Solving every OCR/image-understanding case in the first pass
+
+## Architecture Decisions
+- **PDF parser:** MinerU
+- **Other parsers:** keep current implementations for DOCX/XLSX/CSV/MD/TXT
+- **Canonical intermediate representation:** `ParsedDocument + structured_markdown`
+- **Canonical persisted content:** add `normalized_content` to `documents`
+- **Hierarchy model:** metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
+- **Migration strategy:** additive schema change + on-demand rebuild/reindex
+
+## Target Flow
+1. Upload file
+2. Parse by type
+   - PDF -> MinerU -> normalize to ParsedDocument
+   - Other formats -> current parser -> ParsedDocument
+3. Render `ParsedDocument` into `structured_markdown`
+4. Persist document record including `normalized_content`
+5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
+6. Index into vector store
+7. Serve preview from `normalized_content`
+
+## Data Model Changes
+### documents table
+Add fields:
+- `normalized_content TEXT NULL`
+- `normalized_format VARCHAR(50) NULL` (for example, `structured_markdown`)
+- optional later: `normalization_version VARCHAR(50) NULL`
+
+### document_chunks metadata
+Enrich chunk metadata with lightweight hierarchy keys:
+- `chunk_level`
+- `parent_key`
+- `block_key`
+- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)
+
+Rationale:
+- Supports grouped retrieval and contextual reconstruction
+- Avoids introducing a relational chunk tree prematurely
+
+## Backend Implementation Steps
+### Phase 1: Schema and persistence
+Files:
+- `backend/app/models/document.py`
+- `backend/app/database.py`
+- `backend/app/schemas/document.py`
+- tests under `backend/tests/backend/app`
+
+Changes:
+- Add `normalized_content` and `normalized_format` to `Document`
+- Extend `ensure_document_columns()` to backfill the new columns for existing databases
+- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion if not required yet)
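An additive column guard in the spirit of `ensure_document_columns()` might look like this. The sketch uses the standard-library `sqlite3` module and a hypothetical `ensure_columns` helper; the real function's signature and database backend are not shown in this plan.

```python
import sqlite3

# Column names and types taken from the plan's "documents table" section.
NEW_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_columns(conn: sqlite3.Connection) -> list:
    """Add any missing nullable columns to `documents`; return the names added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added
```

Existing rows get `NULL` in the new columns, which is what lets the preview path detect documents that still need a backfill.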
+
+### Phase 2: Introduce structured markdown renderer
+Files:
+- `backend/app/services/document_service.py`
+- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially
+
+Changes:
+- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
+- Keep current per-format parsing functions
+- After parsing, render once and store into `document.normalized_content`
+- Add `normalized_format='structured_markdown'`
+
+Rendering guidance:
+- headings -> markdown headings
+- paragraphs/text -> plain markdown paragraphs
+- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
+- PDF page boundaries -> explicit page markers
+- preserve contextual markers in metadata even if markdown cannot express everything perfectly
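The rendering guidance could translate into a renderer along these lines. The `ParsedNode` and `ParsedDocument` fields here are hypothetical stand-ins, since the project's real classes are not shown, and the HTML-comment page marker is just one possible choice of explicit marker.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsedNode:
    kind: str                   # "heading", "paragraph", "table", or "page_break"
    text: str = ""
    level: int = 1              # heading depth, used only for headings
    rows: list = field(default_factory=list)  # table rows, header row first
    page_number: Optional[int] = None


@dataclass
class ParsedDocument:
    nodes: list


def render_structured_markdown(parsed: ParsedDocument) -> str:
    """Render parsed nodes into markdown blocks separated by blank lines."""
    blocks = []
    for node in parsed.nodes:
        if node.kind == "heading":
            blocks.append(f"{'#' * node.level} {node.text}")
        elif node.kind == "table" and node.rows:
            header, *body = node.rows
            lines = [
                "| " + " | ".join(header) + " |",
                "| " + " | ".join("---" for _ in header) + " |",
            ]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            blocks.append("\n".join(lines))
        elif node.kind == "page_break":
            blocks.append(f"<!-- page {node.page_number} -->")
        elif node.text:
            blocks.append(node.text)
    return "\n\n".join(blocks)
```

This sketch does not handle the "too large/wide" table case from the guidance; that branch would need its own size check and fenced output.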
+
+### Phase 3: MinerU integration for PDF
+Files:
+- `backend/app/services/document_service.py`
+- `backend/pyproject.toml` / lockfile if dependencies are added
+- config if MinerU requires configurable paths/options
+
+Changes:
+- Replace PDF branch with MinerU-backed parsing
+- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
+- Preserve page and block order
+- Represent image blocks as markdown placeholders plus metadata
+
+Image policy:
+- First pass: extract image block references, page number, nearby text, and optional captions
+- Do not perform full image understanding for every image in phase 1
+- Design metadata so high-value image understanding can be added later
+
+### Phase 4: Chunk metadata enrichment
+Files:
+- `backend/app/services/document_service.py`
+- `backend/app/services/knowledge_service.py`
+- tests
+
+Changes:
+- Extend `_build_chunks()` to include lightweight hierarchy metadata:
+  - section headings become natural parent keys
+  - row batches / sheet blocks get stable block keys
+  - PDF page/section blocks preserve ordered grouping
+- Keep current retrieval behavior, but let `_get_related_chunks()` benefit from richer metadata if helpful
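As an illustration of metadata-based hierarchy, the sketch below assigns `chunk_level`, `parent_key`, and `block_key` while walking nodes in document order. The node shape and the key format (`doc:s1`, `doc:b3`) are assumptions, and the real `_build_chunks()` is not reproduced here.

```python
def attach_hierarchy(nodes: list, document_id: str) -> list:
    """Build chunks whose metadata links each block to its nearest preceding heading."""
    chunks = []
    current_section = None
    section_count = 0
    for position, node in enumerate(nodes):
        if node["kind"] == "heading":
            section_count += 1
            current_section = f"{document_id}:s{section_count}"
            metadata = {
                "chunk_level": "section",
                "parent_key": None,
                "block_key": current_section,
            }
        else:
            metadata = {
                "chunk_level": "block",
                "parent_key": current_section,  # None for text before any heading
                "block_key": f"{document_id}:b{position}",
            }
        chunks.append({"content": node["text"], "metadata": metadata})
    return chunks
```

Keys derived from position stay stable only while the parsed node order is unchanged, so a rebuild after re-parsing should regenerate all keys together.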
+
+### Phase 5: Preview and rebuild behavior
+Files:
+- `backend/app/routers/document.py`
+- `backend/app/services/document_service.py`
+
+Changes:
+- `get_document_content()` should prefer `normalized_content`
+- Fall back to legacy file reading only when normalized content is absent
+- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
+
+### Phase 6: Backfill strategy
+Approach:
+- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
+- Existing documents can be migrated lazily:
+  - when opened
+  - when reindexed
+  - or via an admin/batch rebuild command later
+
+This avoids a risky one-shot migration.
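Lazy backfill on open could be as small as the sketch below. It assumes the document is available as a mutable mapping and that `rebuild_normalized` is a callable that re-parses and renders the file; both names are hypothetical.

```python
from typing import Callable


def open_document(document: dict, rebuild_normalized: Callable) -> str:
    """Return preview content, backfilling normalized content on first open."""
    if not document.get("normalized_content"):
        # Legacy row: regenerate once and store the result for later reads.
        document["normalized_content"] = rebuild_normalized(document)
        document["normalized_format"] = "structured_markdown"
    return document["normalized_content"]
```

A document whose rebuild produces an empty string would be rebuilt on every open, so a real implementation may want a separate "backfilled" marker.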
+
+## Error Handling Changes
+Current issue:
+- Upload route can leak parser/dependency problems as generic 500s.
+
+Changes:
+- Convert expected parser/business errors to explicit 4xx responses where appropriate
+- For missing optional parser dependencies, return clear messages such as:
+  - `DOCX parsing dependency missing: python-docx`
+  - `PDF parsing dependency missing/configuration invalid`
+- Keep true unexpected exceptions as 500s
+
+Files:
+- `backend/app/routers/document.py`
+- `backend/app/services/document_service.py`
+
+## Testing Plan
+### Backend unit/integration tests
+1. Schema migration test for new `documents` columns
+2. Renderer tests:
+   - markdown headings preserved
+   - section paths retained in metadata
+   - xlsx/csv table blocks rendered predictably
+   - pdf page markers preserved from MinerU mapping
+3. Upload tests:
+   - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
+   - PDF upload stores `normalized_content`
+   - missing dependency returns clear error instead of generic 500 where applicable
+4. Rebuild/reindex tests:
+   - normalized content regenerated
+   - chunks rebuilt with hierarchy metadata
+5. Retrieval tests:
+   - related chunk lookup still works with enriched metadata
+
+### Frontend tests
+Only if the UI surfaces normalized preview directly in this phase:
+- knowledge view preview prefers normalized content from API
+- no regression in upload and refresh persistence behavior
+
+## Suggested Execution Order
+1. Add schema fields + migration guard
+2. Add structured markdown renderer for current parsers
+3. Store normalized content on upload
+4. Update content preview to read normalized content first
+5. Enrich chunk metadata with lightweight hierarchy keys
+6. Integrate MinerU for PDF
+7. Add rebuild/backfill path
+8. Expand tests
+
+## Risks and Mitigations
+### Risk: MinerU integration complexity
+Mitigation:
+- isolate MinerU to PDF branch only
+- keep internal ParsedDocument contract stable
+
+### Risk: markdown rendering loses structure
+Mitigation:
+- preserve critical structure in metadata
+- use explicit block markers for page/sheet/table boundaries
+
+### Risk: broad retrieval regressions
+Mitigation:
+- keep chunking source node-based initially
+- change one layer at a time
+
+### Risk: old documents lack normalized content
+Mitigation:
+- lazy backfill during preview/reindex
+
+## Deliverable Recommendation
+Implement in small PR-sized slices:
+1. schema + normalized renderer + preview fallback
+2. hierarchy metadata enrichment
+3. MinerU PDF integration
+4. rebuild/backfill tooling