Knowledge Ingestion Normalization Plan
Goal
Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.
Scope
- Backend document parsing and normalization flow
- Document persistence model updates
- Incremental retrieval/indexing integration
- Backfill/reindex strategy for existing documents
- Test strategy for parser, router, and migration behavior
Non-Goals
- Full parent-child chunk graph tables in this phase
- Rewriting all chunking logic to markdown-first immediately
- Replacing all non-PDF parsers with a new framework
- Solving every OCR/image-understanding case in the first pass
Architecture Decisions
- PDF parser: MinerU
- Other parsers: keep current implementations for DOCX/XLSX/CSV/MD/TXT
- Canonical intermediate representation: ParsedDocument + structured_markdown
- Canonical persisted content: add normalized_content to documents
- Hierarchy model: metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
- Migration strategy: additive schema change + on-demand rebuild/reindex
Target Flow
- Upload file
- Parse by type
  - PDF -> MinerU -> normalize to ParsedDocument
  - Other formats -> current parser -> ParsedDocument
- Render ParsedDocument into structured_markdown
- Persist document record including normalized_content
- Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
- Index into vector store
- Serve preview from normalized_content
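The "parse by type" step can be sketched as a small dispatch on file extension. This is only an illustration: parse_pdf_with_mineru, parse_with_existing_parser, and select_parser are hypothetical names, and the placeholder bodies stand in for the real MinerU call and the existing per-format parsers, neither of which is shown in this plan.

```python
from pathlib import Path
from typing import Callable, Dict


def parse_pdf_with_mineru(path: str) -> str:
    # Placeholder: the real MinerU invocation is not part of this sketch.
    return f"mineru:{path}"


def parse_with_existing_parser(path: str) -> str:
    # Placeholder for the current DOCX/XLSX/CSV/MD/TXT parsers.
    return f"existing:{path}"


# Only the PDF branch changes; every other format keeps its current parser.
PARSERS: Dict[str, Callable[[str], str]] = {
    ".pdf": parse_pdf_with_mineru,
    ".docx": parse_with_existing_parser,
    ".xlsx": parse_with_existing_parser,
    ".csv": parse_with_existing_parser,
    ".md": parse_with_existing_parser,
    ".txt": parse_with_existing_parser,
}


def select_parser(filename: str) -> Callable[[str], str]:
    """Pick a parser by lowercase file suffix, or raise for unknown types."""
    suffix = Path(filename).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    return PARSERS[suffix]
```

Keeping the routing in one table means the MinerU switch touches a single entry, which matches the intent of isolating MinerU to the PDF branch.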
Data Model Changes
documents table
Add fields:
- normalized_content TEXT NULL
- normalized_format VARCHAR(50) NULL (value like structured_markdown)
- optional later: normalization_version VARCHAR(50) NULL
document_chunks metadata
Enrich chunk metadata with lightweight hierarchy keys:
- chunk_level
- parent_key
- block_key
- existing structural metadata remains (section_path, section_title, page_number, sheet_name, row_start, row_end, content_type)
Rationale:
- Supports grouped retrieval and contextual reconstruction
- Avoids introducing a relational chunk tree prematurely
Backend Implementation Steps
Phase 1: Schema and persistence
Files:
- backend/app/models/document.py
- backend/app/database.py
- backend/app/schemas/document.py
- tests under backend/tests/backend/app
Changes:
- Add normalized_content and normalized_format to Document
- Extend ensure_document_columns() to backfill the new columns for existing databases
- Expose normalized_content only where needed for preview/read APIs (avoid broad API expansion if not required yet)
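A minimal sketch of the additive migration guard, assuming a SQLite database accessed through the standard sqlite3 module. The real ensure_document_columns() body and the actual database engine are not shown in this plan, so the signature and the NEW_DOCUMENT_COLUMNS mapping here are illustrative only.

```python
import sqlite3

# Assumed column set, taken from the data model changes in this plan.
# Columns are nullable by default, so NULL is not spelled out in the DDL.
NEW_DOCUMENT_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list:
    """Add any missing columns to documents and return the names added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, ddl in NEW_DOCUMENT_COLUMNS.items():
        if name not in existing:
            # Additive only: existing rows simply get NULL in the new column.
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {ddl}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT)")
first_run = ensure_document_columns(conn)
second_run = ensure_document_columns(conn)  # nothing left to add
```

Because the guard checks the live schema before altering it, it can run on every startup without failing on databases that already have the columns.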
Phase 2: Introduce structured markdown renderer
Files:
- backend/app/services/document_service.py
- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially
Changes:
- Add _render_structured_markdown(parsed: ParsedDocument) -> str
- Keep current per-format parsing functions
- After parsing, render once and store into document.normalized_content
- Add normalized_format='structured_markdown'
Rendering guidance:
- headings -> markdown headings
- paragraphs/text -> plain markdown paragraphs
- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
- PDF page boundaries -> explicit page markers
- preserve contextual markers in metadata even if markdown cannot express everything perfectly
Phase 3: MinerU integration for PDF
Files:
- backend/app/services/document_service.py
- backend/pyproject.toml / lockfile if dependencies are added
- config if MinerU requires configurable paths/options
Changes:
- Replace PDF branch with MinerU-backed parsing
- Map MinerU output into internal ParsedNode/ParsedDocument
- Preserve page and block order
- Represent image blocks as markdown placeholders plus metadata
Image policy:
- First pass: extract image block references, page number, nearby text, and optional captions
- Do not perform full image understanding for every image in phase 1
- Design metadata so high-value image understanding can be added later
Phase 4: Chunk metadata enrichment
Files:
- backend/app/services/document_service.py
- backend/app/services/knowledge_service.py
- tests
Changes:
- Extend _build_chunks() to include lightweight hierarchy metadata:
  - section headings become natural parent keys
  - row batches / sheet blocks get stable block keys
  - PDF page/section blocks preserve ordered grouping
- Keep current retrieval behavior, but let _get_related_chunks() benefit from richer metadata if helpful
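One way to derive the hierarchy keys is to hash the document id together with the section path, so that rebuilds produce the same identifiers. This is a sketch under assumptions: build_hierarchy_metadata and _stable_key are hypothetical helpers, and the plan does not specify how parent_key or block_key are actually computed.

```python
import hashlib
from typing import List, Optional


def _stable_key(*parts: str) -> str:
    """Deterministic short key; the unit separator avoids accidental collisions."""
    joined = "\x1f".join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def build_hierarchy_metadata(doc_id: str, section_path: List[str], block_index: int) -> dict:
    # parent_key names the enclosing section; top-level blocks have no parent.
    parent_key: Optional[str] = (
        _stable_key(doc_id, *section_path) if section_path else None
    )
    # block_key names this block (row batch, sheet block, or page block).
    block_key = _stable_key(doc_id, *section_path, str(block_index))
    return {
        "chunk_level": len(section_path),
        "parent_key": parent_key,
        "block_key": block_key,
    }


first = build_hierarchy_metadata("doc-1", ["Intro", "Scope"], 0)
second = build_hierarchy_metadata("doc-1", ["Intro", "Scope"], 1)
top_level = build_hierarchy_metadata("doc-1", [], 0)
```

Chunks that share a parent_key can then be grouped at retrieval time without any relational chunk tree, and the existing structural metadata stays untouched alongside these keys.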
Phase 5: Preview and rebuild behavior
Files:
- backend/app/routers/document.py
- backend/app/services/document_service.py
Changes:
- get_document_content() should prefer normalized_content
- Fall back to legacy file reading only when normalized content is absent
- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
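The preview preference can be sketched as below. The Document dataclass is a stand-in for the real model, and the read_legacy callable is a hypothetical hook for the existing file-reading path. Treating only None as "absent" is an assumption; the real code might also treat an empty string as missing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    # Hypothetical minimal stand-in for the real Document model.
    file_path: str
    normalized_content: Optional[str] = None


def get_document_content(doc: Document, read_legacy: Callable[[str], str]) -> str:
    """Return normalized content when present, else read the original file."""
    if doc.normalized_content is not None:
        return doc.normalized_content
    # Legacy path: only reached for documents that predate normalization.
    return read_legacy(doc.file_path)


legacy_reads: List[str] = []


def fake_reader(path: str) -> str:
    legacy_reads.append(path)
    return "legacy text"


preview_new = get_document_content(Document("a.md", "# Title"), fake_reader)
preview_old = get_document_content(Document("b.txt"), fake_reader)
```

In this usage only the second document triggers a legacy read, which is the behavior the fallback rule asks for.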
Phase 6: Backfill strategy
Approach:
- Add a rebuild endpoint or reuse existing reindex flow to backfill normalized_content
- Existing documents can be migrated lazily:
  - when opened
  - when reindexed
  - or via an admin/batch rebuild command later
This avoids a risky one-shot migration.
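Lazy backfill might be implemented as a small guard called from the open and reindex paths. The names here (StoredDocument, ensure_normalized, parse_and_render) are hypothetical, and persistence of the updated record is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredDocument:
    # Hypothetical stand-in for a persisted document row.
    file_path: str
    normalized_content: Optional[str] = None
    normalized_format: Optional[str] = None


def ensure_normalized(doc: StoredDocument, parse_and_render: Callable[[str], str]) -> bool:
    """Backfill normalized content if missing; return True when work was done."""
    if doc.normalized_content is not None:
        return False
    doc.normalized_content = parse_and_render(doc.file_path)
    doc.normalized_format = "structured_markdown"
    return True


def fake_render(path: str) -> str:
    return f"# Rendered from {path}"


old_doc = StoredDocument("legacy.docx")
did_backfill = ensure_normalized(old_doc, fake_render)
did_again = ensure_normalized(old_doc, fake_render)  # already filled
```

The same guard could later be driven in a loop by an admin or batch command, so no separate migration logic would be needed.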
Error Handling Changes
Current issue:
- Upload route can leak parser/dependency problems as generic 500s.
Changes:
- Convert expected parser/business errors to explicit 4xx responses where appropriate
- For missing optional parser dependencies, return clear messages such as:
  - DOCX parsing dependency missing: python-docx
  - PDF parsing dependency missing/configuration invalid
- Keep true unexpected exceptions as 500s
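A framework-neutral sketch of the error mapping follows. ParserDependencyError, require_module, and to_http_error are hypothetical names, and the 422 status is an assumption: the plan only says expected errors should become explicit 4xx responses, not which code to use.

```python
import importlib
from types import ModuleType
from typing import Tuple


class ParserDependencyError(Exception):
    """Expected failure: an optional parser dependency is not installed."""

    status_code = 422  # assumed; the plan does not fix a specific 4xx code


def require_module(module_name: str, message: str) -> ModuleType:
    """Import an optional dependency or raise a clear, expected error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ParserDependencyError(message) from exc


def to_http_error(exc: Exception) -> Tuple[int, str]:
    """Map expected parser errors to 4xx; everything else stays a 500."""
    if isinstance(exc, ParserDependencyError):
        return exc.status_code, str(exc)
    # Unexpected exceptions keep a generic body so internals are not leaked.
    return 500, "Internal server error"
```

The upload route would catch exceptions from the service layer, pass them through to_http_error, and build its response from the returned pair.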
Files:
- backend/app/routers/document.py
- backend/app/services/document_service.py
Testing Plan
Backend unit/integration tests
- Schema migration test for new documents columns
- Renderer tests:
  - markdown headings preserved
  - section paths retained in metadata
  - xlsx/csv table blocks rendered predictably
  - pdf page markers preserved from MinerU mapping
- Upload tests:
  - successful DOCX/XLSX/CSV/MD/TXT upload stores normalized_content
  - PDF upload stores normalized_content
  - missing dependency returns clear error instead of generic 500 where applicable
- Rebuild/reindex tests:
  - normalized content regenerated
  - chunks rebuilt with hierarchy metadata
- Retrieval tests:
  - related chunk lookup still works with enriched metadata
Frontend tests
Only if the UI surfaces normalized preview directly in this phase:
- knowledge view preview prefers normalized content from API
- no regression in upload and refresh persistence behavior
Suggested Execution Order
- Add schema fields + migration guard
- Add structured markdown renderer for current parsers
- Store normalized content on upload
- Update content preview to read normalized content first
- Enrich chunk metadata with lightweight hierarchy keys
- Integrate MinerU for PDF
- Add rebuild/backfill path
- Expand tests
Risks and Mitigations
Risk: MinerU integration complexity
Mitigation:
- isolate MinerU to PDF branch only
- keep internal ParsedDocument contract stable
Risk: markdown rendering loses structure
Mitigation:
- preserve critical structure in metadata
- use explicit block markers for page/sheet/table boundaries
Risk: broad retrieval regressions
Mitigation:
- keep chunking source node-based initially
- change one layer at a time
Risk: old documents lack normalized content
Mitigation:
- lazy backfill during preview/reindex
Deliverable Recommendation
Implement in small PR-sized slices:
- schema + normalized renderer + preview fallback
- hierarchy metadata enrichment
- MinerU PDF integration
- rebuild/backfill tooling