Refine knowledge brain workflow
Align the brain prompts, graph view, and startup defaults with the latest phase 1 flow so local runs and navigation stay consistent.
@@ -0,0 +1,210 @@
# Knowledge Ingestion Normalization Plan

## Goal
Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.

## Scope
- Backend document parsing and normalization flow
- Document persistence model updates
- Incremental retrieval/indexing integration
- Backfill/reindex strategy for existing documents
- Test strategy for parser, router, and migration behavior

## Non-Goals
- Full parent-child chunk graph tables in this phase
- Rewriting all chunking logic to markdown-first immediately
- Replacing all non-PDF parsers with a new framework
- Solving every OCR/image-understanding case in the first pass

## Architecture Decisions
- **PDF parser:** MinerU
- **Other parsers:** keep current implementations for DOCX/XLSX/CSV/MD/TXT
- **Canonical intermediate representation:** `ParsedDocument + structured_markdown`
- **Canonical persisted content:** add `normalized_content` to `documents`
- **Hierarchy model:** metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
- **Migration strategy:** additive schema change + on-demand rebuild/reindex

## Target Flow
1. Upload file
2. Parse by type
   - PDF -> MinerU -> normalize to `ParsedDocument`
   - Other formats -> current parser -> `ParsedDocument`
3. Render `ParsedDocument` into `structured_markdown`
4. Persist document record including `normalized_content`
5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
6. Index into vector store
7. Serve preview from `normalized_content`

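Step 2 of the flow could be sketched as a small dispatch table. This is an illustration only: the `ParsedNode`/`ParsedDocument` field names, the `PARSERS` table, and `parse_by_type` are assumptions, not the current code, and the PDF branch is a stub rather than a real MinerU call.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class ParsedNode:
    kind: str  # hypothetical values: "heading", "paragraph", ...
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ParsedDocument:
    nodes: list[ParsedNode]


def parse_text(raw: bytes) -> ParsedDocument:
    """Stand-in for an existing plain-text parser: one node per non-empty line."""
    lines = raw.decode("utf-8").splitlines()
    return ParsedDocument([ParsedNode("paragraph", line) for line in lines if line.strip()])


def parse_pdf(raw: bytes) -> ParsedDocument:
    """Placeholder for the MinerU-backed branch (Phase 3); no MinerU call here."""
    raise NotImplementedError("MinerU integration is not wired in this sketch")


# Hypothetical dispatch table: file extension -> parser returning ParsedDocument.
PARSERS: dict[str, Callable[[bytes], ParsedDocument]] = {
    ".txt": parse_text,
    ".pdf": parse_pdf,
}


def parse_by_type(filename: str, raw: bytes) -> ParsedDocument:
    """Pick a parser by extension so every format ends in the same contract."""
    suffix = Path(filename).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename}") from None
    return parser(raw)
```

Keeping every branch behind one `ParsedDocument` return type is what lets steps 3 to 7 stay format-agnostic.
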
## Data Model Changes
### documents table
Add fields:
- `normalized_content TEXT NULL`
- `normalized_format VARCHAR(50) NULL` (value like `structured_markdown`)
- optional later: `normalization_version VARCHAR(50) NULL`

### document_chunks metadata
Enrich chunk metadata with lightweight hierarchy keys:
- `chunk_level`
- `parent_key`
- `block_key`
- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)

Rationale:
- Supports grouped retrieval and contextual reconstruction
- Avoids introducing a relational chunk tree prematurely

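As a rough illustration of grouped retrieval without a relational chunk tree, a hit can be expanded to its siblings by comparing `parent_key` values. Only the metadata key names come from this plan; the key formats, level numbers, and the `expand_hit` helper are assumptions.

```python
# Hypothetical chunk records. Only the metadata key names come from the plan;
# the key formats and level numbers are made up for illustration.
CHUNKS = [
    {"text": "Setup",
     "metadata": {"chunk_level": 1, "parent_key": None, "block_key": "sec:setup"}},
    {"text": "Install the CLI.",
     "metadata": {"chunk_level": 2, "parent_key": "sec:setup", "block_key": "sec:setup/0"}},
    {"text": "Run the server.",
     "metadata": {"chunk_level": 2, "parent_key": "sec:setup", "block_key": "sec:setup/1"}},
    {"text": "Usage notes.",
     "metadata": {"chunk_level": 2, "parent_key": "sec:usage", "block_key": "sec:usage/0"}},
]


def expand_hit(hit: dict, chunks: list[dict]) -> list[dict]:
    """Return the hit plus its siblings (same parent_key), in stored order."""
    parent_key = hit["metadata"].get("parent_key")
    if parent_key is None:
        # Top-level chunk: nothing to group with in this sketch.
        return [hit]
    return [c for c in chunks if c["metadata"].get("parent_key") == parent_key]
```

Because the grouping is a plain metadata comparison, it needs no join table and can be dropped or reshaped later without a schema change.
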
## Backend Implementation Steps
### Phase 1: Schema and persistence
Files:
- `backend/app/models/document.py`
- `backend/app/database.py`
- `backend/app/schemas/document.py`
- tests under `backend/tests/backend/app`

Changes:
- Add `normalized_content` and `normalized_format` to `Document`
- Extend `ensure_document_columns()` to add the new columns to existing databases
- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion if not required yet)

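A minimal sketch of the additive migration guard, assuming a SQLite database and a function that receives an open connection. The real `ensure_document_columns()` signature and database engine are not stated in this plan, so both are assumptions.

```python
import sqlite3

# Columns to add; both are nullable by default in SQLite.
NEW_DOCUMENT_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `documents`; return the names that were added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, column_type in NEW_DOCUMENT_COLUMNS.items():
        if name not in existing:
            # Names come from the constant above, never from user input.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {column_type}")
            added.append(name)
    conn.commit()
    return added
```

Running it twice is harmless: the second call finds both columns and adds nothing, which is what makes it safe to call on every startup.
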
### Phase 2: Introduce structured markdown renderer
Files:
- `backend/app/services/document_service.py`
- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially

Changes:
- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
- Keep current per-format parsing functions
- After parsing, render once and store into `document.normalized_content`
- Set `normalized_format='structured_markdown'`

Rendering guidance:
- headings -> markdown headings
- paragraphs/text -> plain markdown paragraphs
- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
- PDF page boundaries -> explicit page markers
- preserve contextual markers in metadata even if markdown cannot express everything perfectly

### Phase 3: MinerU integration for PDF
Files:
- `backend/app/services/document_service.py`
- `backend/pyproject.toml` / lockfile if dependencies are added
- config if MinerU requires configurable paths/options

Changes:
- Replace PDF branch with MinerU-backed parsing
- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
- Preserve page and block order
- Represent image blocks as markdown placeholders plus metadata

Image policy:
- First pass: extract image block references, page number, nearby text, and optional captions
- Do not perform full image understanding for every image in phase 1
- Design metadata so high-value image understanding can be added later

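The mapping step might be sketched as follows. The input block shape (`type`, `text`, `page`, `order`, `ref`, `caption`) is a hypothetical intermediate format chosen for illustration; it is not MinerU's actual output schema, which has to be checked against the MinerU version in use.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedNode:
    kind: str
    text: str
    metadata: dict = field(default_factory=dict)


def map_blocks_to_nodes(blocks: list[dict]) -> list[ParsedNode]:
    """Map hypothetical parser blocks to ParsedNode, keeping page/block order."""
    nodes = []
    # Sort explicitly so page and block order never depends on input order.
    for block in sorted(blocks, key=lambda b: (b["page"], b["order"])):
        meta = {"page_number": block["page"], "content_type": block["type"]}
        if block["type"] == "image":
            caption = block.get("caption", "")
            # Keep the reference and caption in metadata so image
            # understanding can be layered on later without re-parsing.
            meta["image_ref"] = block["ref"]
            meta["caption"] = caption
            text = f"![{caption or 'image'}]({block['ref']})"
        else:
            text = block["text"]
        nodes.append(ParsedNode(block["type"], text, meta))
    return nodes
```

The image branch emits only a markdown placeholder, in line with the first-pass image policy: no image understanding happens here.
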
### Phase 4: Chunk metadata enrichment
Files:
- `backend/app/services/document_service.py`
- `backend/app/services/knowledge_service.py`
- tests

Changes:
- Extend `_build_chunks()` to include lightweight hierarchy metadata:
  - section headings become natural parent keys
  - row batches / sheet blocks get stable block keys
  - PDF page/section blocks preserve ordered grouping
- Keep current retrieval behavior, but let `_get_related_chunks()` benefit from richer metadata if helpful

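One way to derive stable keys is to hash the document id together with the section or sheet name. This is a sketch under assumptions: `add_hierarchy_metadata` and `_stable_key` are hypothetical helpers, `section_path` is assumed to be a `/`-separated string, and the meaning of `chunk_level` (path depth) is a guess rather than a decision from this plan.

```python
import hashlib


def _stable_key(*parts: str) -> str:
    """Short deterministic key; identical inputs give identical keys on rebuild."""
    digest = hashlib.sha1("/".join(parts).encode("utf-8")).hexdigest()
    return digest[:12]


def add_hierarchy_metadata(doc_id: str, chunks: list[dict]) -> list[dict]:
    """Fill in parent_key, block_key, and chunk_level on each chunk's metadata."""
    counters: dict[str, int] = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        # The section heading (or the sheet name for spreadsheets) acts as parent.
        section = meta.get("section_path") or meta.get("sheet_name") or ""
        parent_key = _stable_key(doc_id, section)
        index = counters.get(parent_key, 0)
        counters[parent_key] = index + 1
        meta["parent_key"] = parent_key
        # Position within the parent keeps ordered grouping reconstructible.
        meta["block_key"] = f"{parent_key}:{index}"
        meta["chunk_level"] = section.count("/") + 1 if section else 0
    return chunks
```

Because the keys depend only on the document id, the section, and the chunk position, a reindex of unchanged content reproduces the same keys.
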
### Phase 5: Preview and rebuild behavior
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`

Changes:
- `get_document_content()` should prefer `normalized_content`
- Fall back to legacy file reading only when normalized content is absent
- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing

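The preview preference can be sketched as below. The `Document` fields and the `read_legacy` callback are assumptions; the real `get_document_content()` signature is not given in this plan. "Absent" is interpreted here as `None`, so an empty normalized document is still served as-is.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    file_path: str
    normalized_content: Optional[str] = None


def get_document_content(
    document: Document,
    read_legacy: Callable[[Document], str],
) -> str:
    """Prefer persisted normalized content; read the raw file only as a fallback."""
    if document.normalized_content is not None:
        return document.normalized_content
    return read_legacy(document)
```

Passing the legacy reader in as a callable keeps file access out of the decision logic, so the preference rule can be tested without touching disk.
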
### Phase 6: Backfill strategy
Approach:
- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
- Existing documents can be migrated lazily:
  - when opened
  - when reindexed
  - or via an admin/batch rebuild command later

This avoids a risky one-shot migration.

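Lazy migration amounts to an idempotent "backfill on first touch" helper that the open, reindex, and batch paths can all share. The `ensure_normalized` name, the `Document` fields, and the `rebuild`/`save` callbacks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    id: int
    normalized_content: Optional[str] = None
    normalized_format: Optional[str] = None


def ensure_normalized(
    document: Document,
    rebuild: Callable[[Document], str],
    save: Callable[[Document], None],
) -> bool:
    """Backfill normalized content if missing; return True when a rebuild ran."""
    if document.normalized_content is not None:
        return False  # already migrated, nothing to do
    document.normalized_content = rebuild(document)
    document.normalized_format = "structured_markdown"
    save(document)
    return True
```

A batch command would simply loop over documents and call the same helper, so there is one code path to test regardless of the trigger.
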
## Error Handling Changes
Current issue:
- Upload route can leak parser/dependency problems as generic 500s.

Changes:
- Convert expected parser/business errors to explicit 4xx responses where appropriate
- For missing optional parser dependencies, return clear messages such as:
  - `DOCX parsing dependency missing: python-docx`
  - `PDF parsing dependency missing/configuration invalid`
- Keep true unexpected exceptions as 500s

Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`

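A framework-neutral sketch of the error mapping follows. The exception classes, helper names, and the specific status codes (422 for parser errors, 503 for a missing dependency) are assumptions; the plan only requires explicit responses with clear messages, and the router would translate the returned pair into its own HTTP exception type.

```python
import importlib.util


class ParserError(Exception):
    """Expected, user-facing parsing failure."""
    status_code = 422  # assumed code for unparseable or unsupported input


class ParserDependencyError(ParserError):
    """An optional parser dependency is missing or misconfigured."""
    status_code = 503  # assumed code; the plan does not fix one


def require_module(module_name: str, package: str, label: str) -> None:
    """Raise a clear error when an optional parser module is not installed."""
    if importlib.util.find_spec(module_name) is None:
        raise ParserDependencyError(f"{label} parsing dependency missing: {package}")


def to_http_error(exc: Exception) -> tuple[int, str]:
    """Map an exception to (status, message); unknown errors stay generic 500s."""
    if isinstance(exc, ParserError):
        return exc.status_code, str(exc)
    return 500, "Internal server error"
```

For DOCX the check would be `require_module("docx", "python-docx", "DOCX")`, since `docx` is the import name of the `python-docx` package.
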
## Testing Plan
### Backend unit/integration tests
1. Schema migration test for new `documents` columns
2. Renderer tests:
   - markdown headings preserved
   - section paths retained in metadata
   - XLSX/CSV table blocks rendered predictably
   - PDF page markers preserved from MinerU mapping
3. Upload tests:
   - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
   - PDF upload stores `normalized_content`
   - missing dependency returns clear error instead of generic 500 where applicable
4. Rebuild/reindex tests:
   - normalized content regenerated
   - chunks rebuilt with hierarchy metadata
5. Retrieval tests:
   - related chunk lookup still works with enriched metadata

### Frontend tests
Only if the UI surfaces normalized preview directly in this phase:
- knowledge view preview prefers normalized content from API
- no regression in upload and refresh persistence behavior

## Suggested Execution Order
1. Add schema fields + migration guard
2. Add structured markdown renderer for current parsers
3. Store normalized content on upload
4. Update content preview to read normalized content first
5. Enrich chunk metadata with lightweight hierarchy keys
6. Integrate MinerU for PDF
7. Add rebuild/backfill path
8. Expand tests

## Risks and Mitigations
### Risk: MinerU integration complexity
Mitigation:
- isolate MinerU to PDF branch only
- keep internal `ParsedDocument` contract stable

### Risk: markdown rendering loses structure
Mitigation:
- preserve critical structure in metadata
- use explicit block markers for page/sheet/table boundaries

### Risk: broad retrieval regressions
Mitigation:
- keep chunking source node-based initially
- change one layer at a time

### Risk: old documents lack normalized content
Mitigation:
- lazy backfill during preview/reindex

## Deliverable Recommendation
Implement in small PR-sized slices:
1. schema + normalized renderer + preview fallback
2. hierarchy metadata enrichment
3. MinerU PDF integration
4. rebuild/backfill tooling