Refine knowledge brain workflow
Align the brain prompts, graph view, and startup defaults with the latest phase 1 flow so local runs and navigation stay consistent.
@@ -0,0 +1,555 @@
|
||||
# Jarvis Knowledge Brain Phase 1 Task Breakdown
|
||||
|
||||
## Goal
|
||||
Turn the phase-1 knowledge brain blueprint into an execution-ready development task list tied to the current codebase.
|
||||
|
||||
---
|
||||
|
||||
## A. Backend Persistence Tasks
|
||||
|
||||
### A1. Add new brain models
|
||||
Create new SQLAlchemy models under `backend/app/models/`:
|
||||
- `brain_event.py`
|
||||
- `brain_candidate.py`
|
||||
- `brain_memory.py`
|
||||
- `brain_tag.py`
|
||||
- optional link-table definitions in `brain_relations.py` or colocated within the above files
|
||||
|
||||
Core entities to add:
|
||||
- `BrainEvent`
|
||||
- `BrainCandidate`
|
||||
- `BrainMemory`
|
||||
- `BrainTag`
|
||||
- `BrainEventTag`
|
||||
- `BrainCandidateTag`
|
||||
- `BrainMemoryTag`
|
||||
- optional `BrainMemoryEvent`
|
||||
|
||||
Acceptance criteria:
|
||||
- All models inherit from the project base model pattern.
|
||||
- All required enums/status fields are defined.
|
||||
- User ownership and timeline fields exist.
|
||||
- Link tables support tag filtering and source traceability.
|
||||
|
||||
### A2. Register models in model exports
|
||||
Update:
|
||||
- `backend/app/models/__init__.py`
|
||||
|
||||
Acceptance criteria:
|
||||
- New brain models are imported and available during metadata initialization.
|
||||
|
||||
### A3. Add migration / schema evolution support
|
||||
Depending on current project migration approach, add the required DB migration path for the new tables.
|
||||
|
||||
Acceptance criteria:
|
||||
- New tables can be created in local/dev environments without breaking existing tables.
|
||||
- Indexes for `user_id`, status, and date-based access patterns are included.
|
||||
|
||||
### A4. Add Pydantic schemas
|
||||
Create new schema files under `backend/app/schemas/`:
|
||||
- `brain.py`
|
||||
|
||||
Schema groups to add:
|
||||
- overview response
|
||||
- memory list/detail response
|
||||
- candidate list response
|
||||
- tag response
|
||||
- timeline response
|
||||
- manual learning trigger response
|
||||
- memory/tag management payloads
|
||||
|
||||
Acceptance criteria:
|
||||
- Schemas match the intended `/api/brain` response shapes.
|
||||
- Timeline and traceability structures are explicit, not loosely typed blobs.
|
||||
|
||||
---
|
||||
|
||||
## B. Backend Service Tasks
|
||||
|
||||
### B1. Create brain event ingestion service
|
||||
Add:
|
||||
- `backend/app/services/brain_event_service.py`
|
||||
|
||||
Responsibilities:
|
||||
- normalize source records into `BrainEvent`
|
||||
- expose helpers such as:
  - `record_conversation_event(...)`
  - `record_document_event(...)`
  - `record_todo_event(...)`
  - `record_task_event(...)`
  - `record_forum_event(...)`
|
||||
|
||||
Acceptance criteria:
|
||||
- Each helper accepts current source-domain inputs without forcing those modules to understand brain internals.
|
||||
- Event creation is idempotent enough to avoid obvious duplicate rows for the same source update.
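One way to meet the "idempotent enough" criterion is to derive a deterministic key from the source identity plus its revision, and skip inserts whose key already exists. The sketch below is only illustrative: the field names (`source_type`, `source_id`, `action`, `source_updated_at`) are assumptions, not the actual `BrainEvent` columns, and the in-memory set stands in for a unique database index.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceUpdate:
    # Hypothetical shape; the real BrainEvent columns are not fixed by this plan.
    user_id: str
    source_type: str        # e.g. "todo", "document"
    source_id: str
    action: str             # e.g. "created", "completed"
    source_updated_at: str  # ISO timestamp of the source row revision


def event_dedupe_key(update: SourceUpdate) -> str:
    """Return a deterministic key: the same source update always hashes the same."""
    raw = "|".join([
        update.user_id,
        update.source_type,
        update.source_id,
        update.action,
        update.source_updated_at,
    ])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryEventStore:
    """Stand-in for a table with a unique index on the dedupe key."""

    def __init__(self) -> None:
        self._keys: set[str] = set()

    def record(self, update: SourceUpdate) -> bool:
        """Record the update; return False if it was already recorded."""
        key = event_dedupe_key(update)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

With this shape, a retried write for the same source revision is dropped, while a later revision of the same source row still produces a new event.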
|
||||
|
||||
### B2. Create brain learning service
|
||||
Add:
|
||||
- `backend/app/services/brain_learning_service.py`
|
||||
|
||||
Responsibilities:
|
||||
- load pending `BrainEvent`s for a given date/user scope
|
||||
- cluster related events
|
||||
- call the LLM to create candidate knowledge
|
||||
- score and dedupe candidates
|
||||
- promote high-confidence candidates into `BrainMemory`
|
||||
- mark processed events and candidate statuses
|
||||
|
||||
Acceptance criteria:
|
||||
- Service supports both manual run and scheduler run.
|
||||
- Promotion/rejection decisions are explicit and testable.
|
||||
- Source event traceability is preserved.
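To keep promotion and rejection "explicit and testable", the decision can live in a pure function that takes a score and returns an enum value. This is a sketch under assumptions: the thresholds, the `CandidateScore` fields, and the `Decision` names are hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    PROMOTE = "promote"
    KEEP_CANDIDATE = "keep_candidate"
    REJECT = "reject"


@dataclass(frozen=True)
class CandidateScore:
    confidence: float        # 0.0-1.0, e.g. from the LLM or a heuristic scorer
    source_event_count: int  # how many BrainEvents back this candidate
    is_duplicate: bool       # True if an equivalent memory already exists


# Hypothetical thresholds, not values taken from the plan.
PROMOTE_THRESHOLD = 0.8
REJECT_THRESHOLD = 0.3


def decide(score: CandidateScore) -> Decision:
    """Map a candidate score to an explicit decision with no side effects."""
    if score.is_duplicate:
        return Decision.REJECT
    if score.source_event_count == 0:
        # Nothing to trace back to, so traceability could not be preserved.
        return Decision.REJECT
    if score.confidence >= PROMOTE_THRESHOLD:
        return Decision.PROMOTE
    if score.confidence < REJECT_THRESHOLD:
        return Decision.REJECT
    return Decision.KEEP_CANDIDATE
```

Because `decide` has no database or LLM dependency, both the scheduler run and the manual run can share it, and unit tests can cover every branch directly.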
|
||||
|
||||
### B3. Create brain tag service
|
||||
Add:
|
||||
- `backend/app/services/brain_tag_service.py`
|
||||
|
||||
Responsibilities:
|
||||
- attach and score tags
|
||||
- split tags into important vs secondary
|
||||
- update tag scores after learning runs
|
||||
- support cleanup recommendations
|
||||
|
||||
Acceptance criteria:
|
||||
- Important/secondary classification is persisted, not only computed in the UI.
|
||||
- Tag lookups support filtering memories and timeline entries.
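A possible rule for the persisted important/secondary split is sketched here. It assumes a numeric tag score and lets a manual promote/demote action override the computed tier; the threshold value and the `manual_tier` parameter are illustrative assumptions.

```python
from typing import Optional

IMPORTANT_THRESHOLD = 0.6  # hypothetical cut-off, not defined in the plan


def resolve_tier(score: float, manual_tier: Optional[str] = None) -> str:
    """Return the tier to persist; a manual choice wins over the score."""
    if manual_tier in ("important", "secondary"):
        return manual_tier
    return "important" if score >= IMPORTANT_THRESHOLD else "secondary"
```

Storing the returned value on the tag row keeps the classification out of the UI, and keeping the override separate means a learning run can refresh scores without undoing a user's manual cleanup.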
|
||||
|
||||
### B4. Create brain retrieval service
|
||||
Add:
|
||||
- `backend/app/services/brain_retrieval_service.py`
|
||||
|
||||
Responsibilities:
|
||||
- retrieve relevant `BrainMemory` records by query
|
||||
- optionally retrieve recent events for recency-sensitive prompts
|
||||
- format results for chat injection and API responses
|
||||
|
||||
Acceptance criteria:
|
||||
- Retrieval has strict limits to prevent prompt bloat.
|
||||
- Results support filtering by tags, source type, and time range.
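The filtering and limit criteria above could look roughly like the following. This sketch deliberately omits query relevance scoring and works on an in-memory list; the `MemoryRecord` fields and the default limit of 5 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class MemoryRecord:
    # Hypothetical projection of a BrainMemory row.
    text: str
    confidence: float
    source_type: str
    created_at: datetime
    tags: frozenset = field(default_factory=frozenset)


def select_memories(
    records: Iterable[MemoryRecord],
    *,
    tags: Optional[Iterable[str]] = None,
    source_type: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
    limit: int = 5,
) -> List[MemoryRecord]:
    """Filter by tag/source/time range, then return at most `limit` records."""
    wanted = set(tags or ())
    hits = [
        r for r in records
        if (not wanted or wanted & r.tags)
        and (source_type is None or r.source_type == source_type)
        and (since is None or r.created_at >= since)
        and (until is None or r.created_at <= until)
    ]
    # Prefer high-confidence memories, breaking ties by recency.
    hits.sort(key=lambda r: (r.confidence, r.created_at), reverse=True)
    return hits[: max(limit, 0)]
```

The hard `limit` is what prevents prompt bloat: however many rows match, the caller never receives more than the configured number.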
|
||||
|
||||
### B5. Refactor or extend memory service
|
||||
Update:
|
||||
- `backend/app/services/memory_service.py`
|
||||
|
||||
Tasks:
|
||||
- keep existing summary and `UserMemory` behavior intact
|
||||
- extend `build_memory_context()` to append a `【知识大脑】` block from `BrainRetrievalService`
|
||||
- keep memory context size bounded
|
||||
|
||||
Acceptance criteria:
|
||||
- Existing conversation summary behavior continues to work.
|
||||
- Chat can consume `BrainMemory` without requiring a full prompt architecture rewrite.
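As an illustration of how `build_memory_context()` might append the brain block while staying bounded, here is a standalone helper. The function name `append_brain_block` and the item and character budgets are assumptions; the real integration would receive its memories from `BrainRetrievalService`.

```python
from typing import List

BRAIN_HEADER = "【知识大脑】"


def append_brain_block(
    base_context: str,
    memories: List[str],
    max_items: int = 5,
    max_chars: int = 600,
) -> str:
    """Append a size-bounded brain block; leave the context untouched if nothing fits."""
    lines: List[str] = []
    used = len(BRAIN_HEADER)
    for text in memories[:max_items]:
        line = f"- {text.strip()}"
        # +1 accounts for the newline that joins this line to the block.
        if used + 1 + len(line) > max_chars:
            break
        lines.append(line)
        used += 1 + len(line)
    if not lines:
        return base_context
    block = "\n".join([BRAIN_HEADER, *lines])
    return f"{base_context}\n\n{block}" if base_context else block
```

Returning the base context unchanged when there are no memories keeps the existing summary and `UserMemory` behavior intact for users who have no brain data yet.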
|
||||
|
||||
---
|
||||
|
||||
## C. Source Ingestion Integration Tasks
|
||||
|
||||
### C1. Conversation → BrainEvent
|
||||
Update likely files:
|
||||
- `backend/app/services/agent_service.py`
|
||||
- possibly `backend/app/services/memory_service.py`
|
||||
|
||||
Hook points:
|
||||
- after user message persistence
|
||||
- after assistant response persistence
|
||||
- after summary/memory extraction
|
||||
|
||||
Acceptance criteria:
|
||||
- Important conversation actions produce normalized `BrainEvent`s.
|
||||
- Explicit “remember this” signals are captured as stronger events.
|
||||
|
||||
### C2. Document → BrainEvent
|
||||
Update likely files:
|
||||
- `backend/app/routers/document.py`
|
||||
- `backend/app/services/document_service.py`
|
||||
- `backend/app/services/knowledge_service.py`
|
||||
|
||||
Hook points:
|
||||
- upload success
|
||||
- indexing completion
|
||||
- chunk edit / reindex
|
||||
|
||||
Acceptance criteria:
|
||||
- Document lifecycle milestones become `BrainEvent`s.
|
||||
- Source metadata includes document identity and folder context.
|
||||
|
||||
### C3. Todo → BrainEvent
|
||||
Update likely files:
|
||||
- `backend/app/routers/todo.py`
|
||||
- `backend/app/services/todo_service.py`
|
||||
|
||||
Hook points:
|
||||
- todo creation
|
||||
- completion
|
||||
- AI-generated todo creation
|
||||
|
||||
Acceptance criteria:
|
||||
- Todo events reflect both planning and completion signals.
|
||||
- AI-generated todos are distinguishable from manual ones.
|
||||
|
||||
### C4. Task/Kanban → BrainEvent
|
||||
Update likely files:
|
||||
- `backend/app/routers/task.py`
|
||||
|
||||
Hook points:
|
||||
- task creation
|
||||
- status change
|
||||
- completion
|
||||
- priority change
|
||||
|
||||
Acceptance criteria:
|
||||
- Task state changes create meaningful workstream events.
|
||||
- Duplicate writes are avoided on no-op updates.
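A small diff helper is one way to avoid writing events for no-op updates: compare the tracked fields before and after the update, and only emit an event when something changed. The tracked field names below are assumptions based on the hook points listed above.

```python
from typing import Any, Dict, Tuple

TRACKED_FIELDS = ("status", "priority")  # hypothetical; mirrors the hook points


def changed_fields(
    before: Dict[str, Any],
    after: Dict[str, Any],
    tracked: Tuple[str, ...] = TRACKED_FIELDS,
) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old, new)} for tracked fields whose value changed."""
    return {
        name: (before.get(name), after.get(name))
        for name in tracked
        if before.get(name) != after.get(name)
    }
```

A router can call this before invoking `record_task_event(...)` and skip the call when the result is empty; the returned old/new pairs also make useful event metadata.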
|
||||
|
||||
### C5. Forum → BrainEvent
|
||||
Update likely files:
|
||||
- `backend/app/routers/forum.py`
|
||||
- optionally `backend/app/services/scheduler_service.py`
|
||||
|
||||
Hook points:
|
||||
- post created
|
||||
- reply created
|
||||
- forum instruction execution
|
||||
|
||||
Acceptance criteria:
|
||||
- Forum posts/replies that matter to project state become brain events.
|
||||
- Source traceability includes whether the event came from a post, reply, or executed instruction.
|
||||
|
||||
---
|
||||
|
||||
## D. Scheduler and Daily Learning Tasks
|
||||
|
||||
### D1. Add daily brain learning job
|
||||
Update:
|
||||
- `backend/app/services/scheduler_service.py`
|
||||
|
||||
Add:
|
||||
- `brain_daily_learning_task()`
|
||||
|
||||
Responsibilities:
|
||||
- run daily for pending events
|
||||
- invoke `BrainLearningService`
|
||||
- log promoted/rejected counts
|
||||
|
||||
Acceptance criteria:
|
||||
- Job is registered in `start_scheduler()`.
|
||||
- Job can run safely when there are no pending events.
|
||||
|
||||
### D2. Add manual trigger path
|
||||
Update or add:
|
||||
- `backend/app/routers/scheduler.py` or the new `backend/app/routers/brain.py`
|
||||
|
||||
Acceptance criteria:
|
||||
- Developers/users can manually run learning for testing.
|
||||
- Trigger returns a useful summary, not only a started flag.
|
||||
|
||||
### D3. Decide scheduler ownership model for phase 1
|
||||
Current scheduler is global. Decide whether phase 1 runs:
|
||||
- for all users in one job, or
|
||||
- as a per-user loop inside one job
|
||||
|
||||
Acceptance criteria:
|
||||
- No hard-coded `user_id="default"` behavior remains in new brain learning flow.
|
||||
- User iteration strategy is explicit.
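If the per-user loop option is chosen, the job body might resemble this sketch. The `learn_for_user` callable is a hypothetical stand-in for a `BrainLearningService` call, and the stats keys are assumptions; the point is that user iteration is explicit and one user's failure does not stop the rest.

```python
from typing import Callable, Dict, Iterable


def run_daily_learning(
    user_ids: Iterable[str],
    learn_for_user: Callable[[str], Dict[str, int]],
) -> dict:
    """Run learning once per user and aggregate promoted/rejected counts."""
    totals = {"users": 0, "promoted": 0, "rejected": 0, "failed_users": []}
    for user_id in user_ids:
        totals["users"] += 1
        try:
            stats = learn_for_user(user_id)
        except Exception:
            # Isolate failures so one user's error does not block the others.
            totals["failed_users"].append(user_id)
            continue
        totals["promoted"] += stats.get("promoted", 0)
        totals["rejected"] += stats.get("rejected", 0)
    return totals
```

The aggregated dictionary doubles as the "useful summary" a manual trigger could return, and an empty user list yields zero counts rather than an error.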
|
||||
|
||||
---
|
||||
|
||||
## E. Backend API Tasks
|
||||
|
||||
### E1. Add brain router
|
||||
Create:
|
||||
- `backend/app/routers/brain.py`
|
||||
|
||||
Register in:
|
||||
- `backend/app/main.py`
|
||||
- `backend/app/routers/__init__.py` if needed
|
||||
|
||||
### E2. Implement overview endpoint
|
||||
Endpoint:
|
||||
- `GET /api/brain/overview`
|
||||
|
||||
Should return:
|
||||
- active memory count
|
||||
- candidate count
|
||||
- important tag count
|
||||
- recent event count
|
||||
- last learning run info
|
||||
- today’s promoted/rejected summary
|
||||
|
||||
### E3. Implement memory endpoints
|
||||
Endpoints:
|
||||
- `GET /api/brain/memories`
|
||||
- `GET /api/brain/memory/{id}`
|
||||
- `POST /api/brain/memory/{id}/archive`
|
||||
- `DELETE /api/brain/memory/{id}`
|
||||
- optional `POST /api/brain/memory/{id}/promote` if candidate-to-memory management is exposed here
|
||||
|
||||
Acceptance criteria:
|
||||
- Memory detail shows source traceability and tags.
|
||||
- List endpoint supports pagination/filters needed by UI.
|
||||
|
||||
### E4. Implement candidate endpoints
|
||||
Endpoints:
|
||||
- `GET /api/brain/candidates`
|
||||
- optional promote/reject endpoints if candidates are user-manageable in phase 1
|
||||
|
||||
Acceptance criteria:
|
||||
- Candidate status and scoring are inspectable.
|
||||
|
||||
### E5. Implement tag endpoints
|
||||
Endpoints:
|
||||
- `GET /api/brain/tags`
|
||||
- `POST /api/brain/tag/{id}/promote`
|
||||
- `POST /api/brain/tag/{id}/demote`
|
||||
- `DELETE /api/brain/tag/{id}`
|
||||
|
||||
Acceptance criteria:
|
||||
- API groups tags by important vs secondary.
|
||||
- Manual cleanup actions are supported.
|
||||
|
||||
### E6. Implement timeline endpoint
|
||||
Endpoint:
|
||||
- `GET /api/brain/timeline`
|
||||
|
||||
Acceptance criteria:
|
||||
- Timeline groups records by day or returns a structure easily grouped by day in UI.
|
||||
- Includes event entries and memory promotion entries.
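Server-side day grouping for the timeline can be sketched as follows. The entry shape (a dict with an `occurred_at` datetime) is an assumption; event entries and memory promotion entries are treated the same way and would be distinguished by some other field.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_day(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group timeline entries by calendar day, newest day and newest entry first."""
    buckets: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for entry in entries:
        day = entry["occurred_at"].date().isoformat()
        buckets[day].append(entry)
    return [
        {
            "date": day,
            "entries": sorted(items, key=lambda e: e["occurred_at"], reverse=True),
        }
        for day, items in sorted(buckets.items(), reverse=True)
    ]
```

ISO date strings sort chronologically, so sorting the bucket keys in reverse is enough to put the most recent day first.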
|
||||
|
||||
### E7. Implement learning trigger endpoint
|
||||
Endpoint:
|
||||
- `POST /api/brain/learn/run`
|
||||
|
||||
Acceptance criteria:
|
||||
- Supports manual learning run for current user or all users, depending on phase-1 policy.
|
||||
- Returns meaningful run stats.
|
||||
|
||||
---
|
||||
|
||||
## F. Chat Integration Tasks
|
||||
|
||||
### F1. Inject knowledge brain into chat context
|
||||
Update:
|
||||
- `backend/app/services/agent_service.py`
|
||||
- `backend/app/services/memory_service.py`
|
||||
|
||||
Acceptance criteria:
|
||||
- Relevant `BrainMemory` items appear in prompt context.
|
||||
- Context remains concise and bounded.
|
||||
- Existing response flow remains stable.
|
||||
|
||||
### F2. Add retrieval policy guardrails
|
||||
Tasks:
|
||||
- define per-query memory limits
|
||||
- choose when to include recent events
|
||||
- prefer important/high-confidence memories
|
||||
|
||||
Acceptance criteria:
|
||||
- Brain retrieval does not overwhelm standard conversation context.
|
||||
- Time-sensitive answers can still include recent context when needed.
|
||||
|
||||
---
|
||||
|
||||
## G. Frontend Route and Navigation Tasks
|
||||
|
||||
### G1. Introduce a real brain route
|
||||
Update likely files:
|
||||
- `frontend/src/app/router/routes.ts`
|
||||
- `frontend/src/app/navigation/nav.ts`
|
||||
|
||||
Tasks:
|
||||
- add `/brain`
|
||||
- make `知识大脑` point to `/brain`
|
||||
- keep `/graph` available as a subview or secondary route
|
||||
|
||||
Acceptance criteria:
|
||||
- Brain is no longer represented only by the graph page.
|
||||
|
||||
### G2. Define frontend brain API client
|
||||
Add:
|
||||
- `frontend/src/api/brain.ts`
|
||||
|
||||
Methods:
|
||||
- `getOverview`
|
||||
- `getMemories`
|
||||
- `getMemoryDetail`
|
||||
- `getCandidates`
|
||||
- `getTags`
|
||||
- `getTimeline`
|
||||
- `runLearning`
|
||||
- memory/tag management actions
|
||||
|
||||
Acceptance criteria:
|
||||
- API client matches backend router contract.
|
||||
|
||||
---
|
||||
|
||||
## H. Frontend Brain Dashboard Tasks
|
||||
|
||||
### H1. Create new brain page
|
||||
Add:
|
||||
- `frontend/src/pages/brain/index.vue`
|
||||
|
||||
Core page sections:
|
||||
- overview header
|
||||
- important tags panel
|
||||
- secondary tags panel
|
||||
- recent learned knowledge section
|
||||
- timeline section
|
||||
- graph tab/subview entry
|
||||
|
||||
Acceptance criteria:
|
||||
- Page is useful even before graph projection is upgraded.
|
||||
- Dashboard reflects the brain, not just visualized relationships.
|
||||
|
||||
### H2. Add page composable/state logic
|
||||
Add:
|
||||
- `frontend/src/pages/brain/composables/useBrainView.ts`
|
||||
|
||||
Responsibilities:
|
||||
- fetch overview/tags/memories/timeline
|
||||
- manage filters and selected tags
|
||||
- trigger manual learning run
|
||||
- manage loading/error states
|
||||
|
||||
Acceptance criteria:
|
||||
- Page logic stays separate from template complexity.
|
||||
|
||||
### H3. Add memory list/detail components
|
||||
Suggested additions:
|
||||
- `frontend/src/components/brain/BrainMemoryList.vue`
|
||||
- `frontend/src/components/brain/BrainMemoryDetail.vue`
|
||||
- `frontend/src/components/brain/BrainTagPanel.vue`
|
||||
- `frontend/src/components/brain/BrainTimeline.vue`
|
||||
|
||||
Acceptance criteria:
|
||||
- User can inspect why a memory exists.
|
||||
- User can archive/delete memories and promote/demote tags.
|
||||
|
||||
### H4. Reposition graph as brain subview
|
||||
Possible approaches:
|
||||
- keep current `frontend/src/pages/graph/index.vue` but link it from `/brain`
|
||||
- or wrap the graph page as one tab inside the brain page
|
||||
|
||||
Acceptance criteria:
|
||||
- Existing graph functionality remains accessible.
|
||||
- Product framing changes from “brain = graph” to “brain includes graph”.
|
||||
|
||||
---
|
||||
|
||||
## I. Testing Tasks
|
||||
|
||||
### I1. Backend model/service tests
|
||||
Add tests for:
|
||||
- event creation
|
||||
- candidate generation status changes
|
||||
- promotion into `BrainMemory`
|
||||
- tag priority updates
|
||||
- timeline aggregation
|
||||
|
||||
Suggested locations:
|
||||
- `backend/tests/backend/app/services/`
|
||||
- `backend/tests/backend/app/routers/`
|
||||
|
||||
### I2. Retrieval integration tests
|
||||
Add tests for:
|
||||
- memory context injection
|
||||
- retrieval limits
|
||||
- recency-sensitive event inclusion
|
||||
|
||||
### I3. API tests
|
||||
Add tests for:
|
||||
- `/api/brain/overview`
|
||||
- `/api/brain/memories`
|
||||
- `/api/brain/tags`
|
||||
- `/api/brain/timeline`
|
||||
- `/api/brain/learn/run`
|
||||
|
||||
### I4. Frontend tests
|
||||
Add tests for:
|
||||
- brain composable fetch flow
|
||||
- filter behavior
|
||||
- manual learning run UI flow
|
||||
- tag grouping and memory rendering
|
||||
|
||||
---
|
||||
|
||||
## J. Recommended Execution Order
|
||||
|
||||
### Wave 1: Foundation
|
||||
1. A1-A4 persistence and schemas
|
||||
2. B1 brain event service
|
||||
3. E1 add router skeleton
|
||||
|
||||
### Wave 2: Ingestion
|
||||
4. C1-C5 connect all source domains to `BrainEvent`
|
||||
|
||||
### Wave 3: Learning
|
||||
5. B2 brain learning service
|
||||
6. B3 brain tag service
|
||||
7. D1-D3 scheduler/manual learning
|
||||
|
||||
### Wave 4: Retrieval
|
||||
8. B4 brain retrieval service
|
||||
9. B5 memory service integration
|
||||
10. F1-F2 chat injection and guardrails
|
||||
|
||||
### Wave 5: Product surface
|
||||
11. E2-E7 complete `/api/brain` endpoints
|
||||
12. G1-G2 routing + API client
|
||||
13. H1-H4 dashboard and graph repositioning
|
||||
|
||||
### Wave 6: Reliability
|
||||
14. I1-I4 tests and refinement
|
||||
|
||||
---
|
||||
|
||||
## K. Files Most Likely to Change in Phase 1
|
||||
|
||||
### Backend new files
|
||||
- `backend/app/models/brain_event.py`
|
||||
- `backend/app/models/brain_candidate.py`
|
||||
- `backend/app/models/brain_memory.py`
|
||||
- `backend/app/models/brain_tag.py`
|
||||
- `backend/app/schemas/brain.py`
|
||||
- `backend/app/services/brain_event_service.py`
|
||||
- `backend/app/services/brain_learning_service.py`
|
||||
- `backend/app/services/brain_tag_service.py`
|
||||
- `backend/app/services/brain_retrieval_service.py`
|
||||
- `backend/app/routers/brain.py`
|
||||
|
||||
### Backend existing files
|
||||
- `backend/app/models/__init__.py`
|
||||
- `backend/app/main.py`
|
||||
- `backend/app/services/memory_service.py`
|
||||
- `backend/app/services/agent_service.py`
|
||||
- `backend/app/services/scheduler_service.py`
|
||||
- `backend/app/routers/document.py`
|
||||
- `backend/app/routers/todo.py`
|
||||
- `backend/app/routers/task.py`
|
||||
- `backend/app/routers/forum.py`
|
||||
- possibly `backend/app/services/document_service.py`
|
||||
- possibly `backend/app/services/knowledge_service.py`
|
||||
|
||||
### Frontend new files
|
||||
- `frontend/src/api/brain.ts`
|
||||
- `frontend/src/pages/brain/index.vue`
|
||||
- `frontend/src/pages/brain/composables/useBrainView.ts`
|
||||
- brain-related components under `frontend/src/components/brain/`
|
||||
|
||||
### Frontend existing files
|
||||
- `frontend/src/app/router/routes.ts`
|
||||
- `frontend/src/app/navigation/nav.ts`
|
||||
- optionally `frontend/src/pages/graph/index.vue`
|
||||
|
||||
---
|
||||
|
||||
## L. Phase 1 “Definition of Done” Checklist
|
||||
- [ ] Brain persistence models exist and are queryable.
|
||||
- [ ] All five core domains emit `BrainEvent`s.
|
||||
- [ ] Daily learning creates `BrainCandidate`s and promotes durable `BrainMemory`s.
|
||||
- [ ] Tag priority is stored and manageable.
|
||||
- [ ] Chat can retrieve relevant brain knowledge.
|
||||
- [ ] `/api/brain` endpoints support dashboard and management actions.
|
||||
- [ ] `/brain` dashboard exists and is usable without relying on the graph page.
|
||||
- [ ] Graph remains available as a secondary/projection view.
|
||||
- [ ] Automated tests cover ingestion, promotion, retrieval, and UI basics.
|
||||
@@ -0,0 +1,27 @@
|
||||
# Task Plan: Jarvis Knowledge Brain Phase 1 Blueprint
|
||||
|
||||
## Goal
|
||||
Create a practical phase-1 implementation blueprint for the event-driven knowledge brain, covering backend models, services, scheduler jobs, retrieval integration, APIs, and frontend brain module structure.
|
||||
|
||||
## Phases
|
||||
- [x] Phase 1: Plan and setup
|
||||
- [x] Phase 2: Research/gather information
|
||||
- [x] Phase 3: Draft blueprint
|
||||
- [x] Phase 4: Review and deliver
|
||||
|
||||
## Key Questions
|
||||
1. Which new persistence models are required for an event-driven knowledge brain?
|
||||
2. How should existing conversation, document, todo, task, and forum data flow into the brain?
|
||||
3. What should phase 1 include versus defer to later phases?
|
||||
4. How should the frontend brain module be structured before full graph intelligence exists?
|
||||
|
||||
## Decisions Made
|
||||
- Use an event-driven brain architecture instead of extending the current graph-only flow.
|
||||
- Keep the current graph as a projection/view layer, not the brain source of truth.
|
||||
- Phase 1 should prioritize unified ingestion, candidate generation, long-term memory storage, and retrieval integration.
|
||||
|
||||
## Errors Encountered
|
||||
- None yet.
|
||||
|
||||
## Status
|
||||
**Completed** - Separate implementation plan drafted in `knowledge_ingestion_plan.md` and supporting notes updated.
|
||||
@@ -0,0 +1,210 @@
|
||||
# Knowledge Ingestion Normalization Plan
|
||||
|
||||
## Goal
|
||||
Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.
|
||||
|
||||
## Scope
|
||||
- Backend document parsing and normalization flow
|
||||
- Document persistence model updates
|
||||
- Incremental retrieval/indexing integration
|
||||
- Backfill/reindex strategy for existing documents
|
||||
- Test strategy for parser, router, and migration behavior
|
||||
|
||||
## Non-Goals
|
||||
- Full parent-child chunk graph tables in this phase
|
||||
- Rewriting all chunking logic to markdown-first immediately
|
||||
- Replacing all non-PDF parsers with a new framework
|
||||
- Solving every OCR/image-understanding case in the first pass
|
||||
|
||||
## Architecture Decisions
|
||||
- **PDF parser:** MinerU
|
||||
- **Other parsers:** keep current implementations for DOCX/XLSX/CSV/MD/TXT
|
||||
- **Canonical intermediate representation:** `ParsedDocument + structured_markdown`
|
||||
- **Canonical persisted content:** add `normalized_content` to `documents`
|
||||
- **Hierarchy model:** metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
|
||||
- **Migration strategy:** additive schema change + on-demand rebuild/reindex
|
||||
|
||||
## Target Flow
|
||||
1. Upload file
2. Parse by type
   - PDF -> MinerU -> normalize to `ParsedDocument`
   - Other formats -> current parser -> `ParsedDocument`
3. Render `ParsedDocument` into `structured_markdown`
4. Persist document record including `normalized_content`
5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
6. Index into vector store
7. Serve preview from `normalized_content`
|
||||
|
||||
## Data Model Changes
|
||||
### documents table
|
||||
Add fields:
|
||||
- `normalized_content TEXT NULL`
|
||||
- `normalized_format VARCHAR(50) NULL` (value like `structured_markdown`)
|
||||
- optional later: `normalization_version VARCHAR(50) NULL`
|
||||
|
||||
### document_chunks metadata
|
||||
Enrich chunk metadata with lightweight hierarchy keys:
|
||||
- `chunk_level`
|
||||
- `parent_key`
|
||||
- `block_key`
|
||||
- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)
|
||||
|
||||
Rationale:
|
||||
- Supports grouped retrieval and contextual reconstruction
|
||||
- Avoids introducing a relational chunk tree prematurely
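The plan does not pin down exactly how the three hierarchy keys are computed. One possible interpretation, sketched below, hashes the document id and section path so that keys are stable across rebuilds; the meaning assigned to each key here is an assumption.

```python
import hashlib
from typing import Dict, List, Optional, Union


def _stable_key(*parts: str) -> str:
    """Short deterministic key derived from the given parts."""
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()[:16]


def hierarchy_metadata(
    document_id: str,
    section_path: List[str],
    block_index: int,
) -> Dict[str, Union[int, Optional[str]]]:
    """Build lightweight hierarchy keys for one chunk (assumed semantics)."""
    section_key = _stable_key(document_id, *section_path)
    return {
        # Depth of the enclosing section; 0 means no heading above the chunk.
        "chunk_level": len(section_path),
        # Chunks under the same heading share this key.
        "parent_key": section_key if section_path else None,
        # Unique per block within its section.
        "block_key": _stable_key(document_id, *section_path, f"block:{block_index}"),
    }
```

Because the keys are plain metadata values, grouped retrieval can collect sibling chunks by `parent_key` without any foreign-key table, which matches the rationale above.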
|
||||
|
||||
## Backend Implementation Steps
|
||||
### Phase 1: Schema and persistence
|
||||
Files:
|
||||
- `backend/app/models/document.py`
|
||||
- `backend/app/database.py`
|
||||
- `backend/app/schemas/document.py`
|
||||
- tests under `backend/tests/backend/app`
|
||||
|
||||
Changes:
|
||||
- Add `normalized_content` and `normalized_format` to `Document`
|
||||
- Extend `ensure_document_columns()` to backfill the new columns for existing databases
|
||||
- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion if not required yet)
|
||||
|
||||
### Phase 2: Introduce structured markdown renderer
|
||||
Files:
|
||||
- `backend/app/services/document_service.py`
|
||||
- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially
|
||||
|
||||
Changes:
|
||||
- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
|
||||
- Keep current per-format parsing functions
|
||||
- After parsing, render once and store into `document.normalized_content`
|
||||
- Add `normalized_format='structured_markdown'`
|
||||
|
||||
Rendering guidance:
|
||||
- headings -> markdown headings
|
||||
- paragraphs/text -> plain markdown paragraphs
|
||||
- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
|
||||
- PDF page boundaries -> explicit page markers
|
||||
- preserve contextual markers in metadata even if markdown cannot express everything perfectly
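The rendering guidance can be illustrated with a minimal renderer. The `ParsedNode` and `ParsedDocument` shapes below are hypothetical stand-ins (the real classes live in `document_service.py` and may differ), and the HTML-comment page marker is just one possible marker format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedNode:
    # Hypothetical node shape, for illustration only.
    kind: str                          # "heading" | "paragraph" | "table"
    text: str = ""
    level: int = 1                     # heading depth
    page_number: Optional[int] = None  # set for PDF-derived nodes
    rows: List[List[str]] = field(default_factory=list)


@dataclass
class ParsedDocument:
    nodes: List[ParsedNode]


def _render_table(rows: List[List[str]]) -> str:
    """Render rows as a markdown table; the first row is the header."""
    if not rows:
        return ""

    def fmt(cells: List[str]) -> str:
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


def render_structured_markdown(parsed: ParsedDocument) -> str:
    """Render nodes in order, emitting a marker whenever the page changes."""
    parts: List[str] = []
    current_page: Optional[int] = None
    for node in parsed.nodes:
        if node.page_number is not None and node.page_number != current_page:
            current_page = node.page_number
            parts.append(f"<!-- page: {current_page} -->")
        if node.kind == "heading":
            parts.append("#" * min(max(node.level, 1), 6) + " " + node.text.strip())
        elif node.kind == "table":
            parts.append(_render_table(node.rows))
        else:
            parts.append(node.text.strip())
    return "\n\n".join(p for p in parts if p)
```

This sketch does not handle the "too large/wide" table case; a real renderer would likely switch to a fenced block above some size limit, as the guidance suggests.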
|
||||
|
||||
### Phase 3: MinerU integration for PDF
|
||||
Files:
|
||||
- `backend/app/services/document_service.py`
|
||||
- `backend/pyproject.toml` / lockfile if dependencies are added
|
||||
- config if MinerU requires configurable paths/options
|
||||
|
||||
Changes:
|
||||
- Replace PDF branch with MinerU-backed parsing
|
||||
- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
|
||||
- Preserve page and block order
|
||||
- Represent image blocks as markdown placeholders plus metadata
|
||||
|
||||
Image policy:
|
||||
- First pass: extract image block references, page number, nearby text, and optional captions
|
||||
- Do not perform full image understanding for every image in phase 1
|
||||
- Design metadata so high-value image understanding can be added later
|
||||
|
||||
### Phase 4: Chunk metadata enrichment
|
||||
Files:
|
||||
- `backend/app/services/document_service.py`
|
||||
- `backend/app/services/knowledge_service.py`
|
||||
- tests
|
||||
|
||||
Changes:
|
||||
- Extend `_build_chunks()` to include lightweight hierarchy metadata:
  - section headings become natural parent keys
  - row batches / sheet blocks get stable block keys
  - PDF page/section blocks preserve ordered grouping
|
||||
- Keep current retrieval behavior, but let `_get_related_chunks()` benefit from richer metadata if helpful
|
||||
|
||||
### Phase 5: Preview and rebuild behavior
|
||||
Files:
|
||||
- `backend/app/routers/document.py`
|
||||
- `backend/app/services/document_service.py`
|
||||
|
||||
Changes:
|
||||
- `get_document_content()` should prefer `normalized_content`
|
||||
- Fall back to legacy file reading only when normalized content is absent
|
||||
- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
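The preview preference order can be expressed as a tiny helper. This is a sketch, not the actual `get_document_content()` signature: the `read_legacy` callable and the returned source label are assumptions, and an empty string is treated the same as missing content.

```python
from typing import Callable, Optional, Tuple


def get_preview_content(
    normalized_content: Optional[str],
    read_legacy: Callable[[], str],
) -> Tuple[str, str]:
    """Return (content, source), reading the legacy file only when needed."""
    if normalized_content:
        return normalized_content, "normalized"
    # The legacy read is lazy: it is never invoked when normalized content exists.
    return read_legacy(), "legacy"
```

Returning the source label makes it easy for a caller to decide whether to trigger a lazy backfill after serving a legacy preview.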
|
||||
|
||||
### Phase 6: Backfill strategy
|
||||
Approach:
|
||||
- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
|
||||
- Existing documents can be migrated lazily:
  - when opened
  - when reindexed
  - or via an admin/batch rebuild command later
|
||||
|
||||
This avoids a risky one-shot migration.
|
||||
|
||||
## Error Handling Changes
|
||||
Current issue:
|
||||
- Upload route can leak parser/dependency problems as generic 500s.
|
||||
|
||||
Changes:
|
||||
- Convert expected parser/business errors to explicit 4xx responses where appropriate
|
||||
- For missing optional parser dependencies, return clear messages such as:
  - `DOCX parsing dependency missing: python-docx`
  - `PDF parsing dependency missing/configuration invalid`
|
||||
- Keep true unexpected exceptions as 500s
|
||||
|
||||
Files:
|
||||
- `backend/app/routers/document.py`
|
||||
- `backend/app/services/document_service.py`
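One way to separate expected parser failures from true 500s is a small exception hierarchy that carries its own status code, which the router translates into a response. The class names and the specific status codes here are assumptions; the plan leaves the exact codes open, and the code chosen for a missing dependency in particular is a judgment call.

```python
from typing import Tuple


class DocumentParseError(Exception):
    """Base class for expected, user-facing parse failures."""
    status_code = 422


class UnsupportedFileType(DocumentParseError):
    status_code = 415


class ParserDependencyMissing(DocumentParseError):
    # Assumed code; the plan only requires a clear message, not a generic 500.
    status_code = 503


def to_error_response(exc: Exception) -> Tuple[int, str]:
    """Map an exception to (status_code, message) for the upload route."""
    if isinstance(exc, DocumentParseError):
        return exc.status_code, str(exc)
    # Unexpected errors stay 500 and do not leak internal details.
    return 500, "Internal server error"
```

In the service layer, an `ImportError` for an optional parser could be re-raised as `ParserDependencyMissing("DOCX parsing dependency missing: python-docx")`, so the router never has to inspect import failures itself.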
|
||||
|
||||
## Testing Plan
|
||||
### Backend unit/integration tests
|
||||
1. Schema migration test for new `documents` columns
2. Renderer tests:
   - markdown headings preserved
   - section paths retained in metadata
   - XLSX/CSV table blocks rendered predictably
   - PDF page markers preserved from MinerU mapping
3. Upload tests:
   - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
   - PDF upload stores `normalized_content`
   - missing dependency returns clear error instead of generic 500 where applicable
4. Rebuild/reindex tests:
   - normalized content regenerated
   - chunks rebuilt with hierarchy metadata
5. Retrieval tests:
   - related chunk lookup still works with enriched metadata
|
||||
|
||||
### Frontend tests
|
||||
Only if the UI surfaces normalized preview directly in this phase:
|
||||
- knowledge view preview prefers normalized content from API
|
||||
- no regression in upload and refresh persistence behavior
|
||||
|
||||
## Suggested Execution Order
|
||||
1. Add schema fields + migration guard
|
||||
2. Add structured markdown renderer for current parsers
|
||||
3. Store normalized content on upload
|
||||
4. Update content preview to read normalized content first
|
||||
5. Enrich chunk metadata with lightweight hierarchy keys
|
||||
6. Integrate MinerU for PDF
|
||||
7. Add rebuild/backfill path
|
||||
8. Expand tests
|
||||
|
||||
## Risks and Mitigations
|
||||
### Risk: MinerU integration complexity
|
||||
Mitigation:
|
||||
- isolate MinerU to PDF branch only
|
||||
- keep internal ParsedDocument contract stable
|
||||
|
||||
### Risk: markdown rendering loses structure
|
||||
Mitigation:
|
||||
- preserve critical structure in metadata
|
||||
- use explicit block markers for page/sheet/table boundaries
|
||||
|
||||
### Risk: broad retrieval regressions
|
||||
Mitigation:
|
||||
- keep chunking source node-based initially
|
||||
- change one layer at a time
|
||||
|
||||
### Risk: old documents lack normalized content
|
||||
Mitigation:
|
||||
- lazy backfill during preview/reindex
|
||||
|
||||
## Deliverable Recommendation
|
||||
Implement in small PR-sized slices:
|
||||
1. schema + normalized renderer + preview fallback
|
||||
2. hierarchy metadata enrichment
|
||||
3. MinerU PDF integration
|
||||
4. rebuild/backfill tooling
|
||||