Capture the current local data snapshot and planning artifacts alongside this development batch so the workspace state matches the code changes. This preserves the reference materials and generated files that were kept in the working tree. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Knowledge Ingestion Normalization Plan
Goal
Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.
Scope
- Backend document parsing and normalization flow
- Document persistence model updates
- Incremental retrieval/indexing integration
- Backfill/reindex strategy for existing documents
- Test strategy for parser, router, and migration behavior
Non-Goals
- Full parent-child chunk graph tables in this phase
- Rewriting all chunking logic to markdown-first immediately
- Replacing all non-PDF parsers with a new framework
- Solving every OCR/image-understanding case in the first pass
Architecture Decisions
- PDF parser: MinerU
- Other parsers: keep current implementations for DOCX/XLSX/CSV/MD/TXT
- Canonical intermediate representation: `ParsedDocument` + `structured_markdown`
- Canonical persisted content: add `normalized_content` to `documents`
- Hierarchy model: metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
- Migration strategy: additive schema change + on-demand rebuild/reindex
Target Flow
- Upload file
- Parse by type
  - PDF -> MinerU -> normalize to `ParsedDocument`
  - Other formats -> current parser -> `ParsedDocument`
- Render `ParsedDocument` into `structured_markdown`
- Persist document record including `normalized_content`
- Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
- Index into vector store
- Serve preview from `normalized_content`
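The "parse by type" step can be sketched as a dispatch table that always yields the same intermediate shape. This is a minimal illustration, not the project's code: the fields of `ParsedNode` and `ParsedDocument`, the `PARSERS` table, `parse_txt`, and `parse_by_type` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable


@dataclass
class ParsedNode:
    """One ordered block of parsed content (field names are assumptions)."""
    content_type: str  # e.g. "heading", "paragraph", "table", "image"
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ParsedDocument:
    """Parser-independent intermediate representation."""
    source_format: str
    nodes: list[ParsedNode] = field(default_factory=list)


Parser = Callable[[bytes], ParsedDocument]


def parse_txt(data: bytes) -> ParsedDocument:
    # Stand-in for an existing parser: one paragraph node per blank-line block.
    blocks = [b.strip() for b in data.decode("utf-8").split("\n\n") if b.strip()]
    return ParsedDocument("txt", [ParsedNode("paragraph", b) for b in blocks])


# Hypothetical registry; ".pdf" would point at the MinerU-backed parser.
PARSERS: dict[str, Parser] = {".txt": parse_txt}


def parse_by_type(filename: str, data: bytes) -> ParsedDocument:
    """Pick a parser from the file extension and return a ParsedDocument."""
    suffix = Path(filename).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return parser(data)
```

Because every branch returns a `ParsedDocument`, the render, persist, and chunk steps that follow do not need to know which parser ran.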
Data Model Changes
documents table
Add fields:
- `normalized_content TEXT NULL`
- `normalized_format VARCHAR(50) NULL` (value like `structured_markdown`)
- optional later: `normalization_version VARCHAR(50) NULL`
document_chunks metadata
Enrich chunk metadata with lightweight hierarchy keys:
- `chunk_level`
- `parent_key`
- `block_key`
- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)
Rationale:
- Supports grouped retrieval and contextual reconstruction
- Avoids introducing a relational chunk tree prematurely
Backend Implementation Steps
Phase 1: Schema and persistence
Files:
- `backend/app/models/document.py`
- `backend/app/database.py`
- `backend/app/schemas/document.py`
- tests under `backend/tests/backend/app`
Changes:
- Add `normalized_content` and `normalized_format` to `Document`
- Extend `ensure_document_columns()` to backfill the new columns for existing databases
- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion if not required yet)
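An additive column guard of the kind described for `ensure_document_columns()` might look like the sketch below. It assumes a SQLite database accessed through the standard `sqlite3` module, which this plan does not confirm; the function body and the `NEW_DOCUMENT_COLUMNS` mapping are illustrative only.

```python
import sqlite3

# Column types follow the "Add fields" list. Columns are nullable by default
# in SQLite, so the explicit NULL is omitted here.
NEW_DOCUMENT_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `documents`; return the names that were added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, ddl in NEW_DOCUMENT_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {ddl}")
            added.append(name)
    conn.commit()
    return added
```

The guard is idempotent: a second call finds both columns and changes nothing, and existing rows simply read `NULL` until they are rebuilt.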
Phase 2: Introduce structured markdown renderer
Files:
- `backend/app/services/document_service.py`
- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially
Changes:
- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
- Keep current per-format parsing functions
- After parsing, render once and store into `document.normalized_content`
- Add `normalized_format='structured_markdown'`
Rendering guidance:
- headings -> markdown headings
- paragraphs/text -> plain markdown paragraphs
- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
- PDF page boundaries -> explicit page markers
- preserve contextual markers in metadata even if markdown cannot express everything perfectly
Phase 3: MinerU integration for PDF
Files:
- `backend/app/services/document_service.py`
- `backend/pyproject.toml` / lockfile if dependencies are added
- config if MinerU requires configurable paths/options
Changes:
- Replace PDF branch with MinerU-backed parsing
- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
- Preserve page and block order
- Represent image blocks as markdown placeholders plus metadata
Image policy:
- First pass: extract image block references, page number, nearby text, and optional captions
- Do not perform full image understanding for every image in phase 1
- Design metadata so high-value image understanding can be added later
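One way to read the image policy is a mapper that turns each image block into a markdown placeholder plus metadata. The input dict below is a hypothetical, already-normalized shape (`image_ref`, `page_number`, optional `caption`); it is not MinerU's actual output schema, and `image_block_to_node` and the metadata keys are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedNode:
    content_type: str
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def image_block_to_node(block: dict[str, Any], nearby_text: str = "") -> ParsedNode:
    """Map one image block to a placeholder node without analyzing the image."""
    caption = (block.get("caption") or "").strip()
    alt = caption or "image"
    return ParsedNode(
        content_type="image",
        text=f"![{alt}]({block['image_ref']})",
        metadata={
            "page_number": block["page_number"],
            "image_ref": block["image_ref"],
            "caption": caption or None,
            "nearby_text": nearby_text[:200],  # short context window (assumed size)
            # Reserved slot for a later phase; nothing fills it in the first pass.
            "image_understanding": None,
        },
    )
```

Keeping an empty `image_understanding` slot means a later pass can enrich selected images without changing the node contract.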
Phase 4: Chunk metadata enrichment
Files:
- `backend/app/services/document_service.py`
- `backend/app/services/knowledge_service.py`
- tests
Changes:
- Extend `_build_chunks()` to include lightweight hierarchy metadata:
  - section headings become natural parent keys
  - row batches / sheet blocks get stable block keys
  - PDF page/section blocks preserve ordered grouping
- Keep current retrieval behavior, but let `_get_related_chunks()` benefit from richer metadata if helpful
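A possible shape for the hierarchy keys is sketched below. It assumes `section_path` is a list of heading strings and derives keys by hashing, so a rebuild of the same document yields the same identifiers. The helper names `_stable_key` and `hierarchy_metadata`, and the key derivation itself, are assumptions rather than the existing `_build_chunks()` logic.

```python
import hashlib
from typing import Any


def _stable_key(*parts: Any) -> str:
    """Deterministic short key built from the given parts."""
    raw = "\x1f".join(str(p) for p in parts)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]


def hierarchy_metadata(document_id: int, meta: dict[str, Any]) -> dict[str, Any]:
    """Return `meta` plus chunk_level, parent_key, and block_key."""
    section_path = meta.get("section_path") or []
    if meta.get("sheet_name") is not None:
        # Sheet row batches: the block is identified by sheet and row range.
        block = ("sheet", meta["sheet_name"], meta.get("row_start"), meta.get("row_end"))
    elif meta.get("page_number") is not None:
        # PDF content: the block is identified by page and enclosing section.
        block = ("page", meta["page_number"], *section_path)
    else:
        block = ("section", *section_path)
    parent_key = _stable_key(document_id, "section", *section_path) if section_path else None
    return {
        **meta,
        "chunk_level": len(section_path),
        "parent_key": parent_key,
        "block_key": _stable_key(document_id, *block),
    }
```

Chunks in the same section share a `parent_key`, which lets related-chunk lookup group them without a relational chunk tree.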
Phase 5: Preview and rebuild behavior
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`
Changes:
- `get_document_content()` should prefer `normalized_content`
- Fall back to legacy file reading only when normalized content is absent
- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
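The preview preference reduces to a small guard. In this sketch the `Document` dataclass is a minimal stand-in for the ORM model, and the `read_legacy` callback represents whatever file-reading path exists today; both are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    """Stand-in for the ORM model; only the fields used here."""
    file_path: str
    normalized_content: Optional[str] = None
    normalized_format: Optional[str] = None


def get_document_content(doc: Document, read_legacy: Callable[[str], str]) -> str:
    """Prefer persisted normalized content; read the file only as a fallback."""
    if doc.normalized_content is not None:
        return doc.normalized_content
    return read_legacy(doc.file_path)
```

Checking for `None` rather than truthiness keeps a legitimately empty normalized document from triggering the legacy read.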
Phase 6: Backfill strategy
Approach:
- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
- Existing documents can be migrated lazily:
  - when opened
  - when reindexed
  - or via an admin/batch rebuild command later
This avoids a risky one-shot migration.
Error Handling Changes
Current issue:
- Upload route can leak parser/dependency problems as generic 500s.
Changes:
- Convert expected parser/business errors to explicit 4xx responses where appropriate
- For missing optional parser dependencies, return clear messages such as:
  - `DOCX parsing dependency missing: python-docx`
  - `PDF parsing dependency missing/configuration invalid`
- Keep true unexpected exceptions as 500s
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`
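The split between expected and unexpected failures can be sketched with a small exception hierarchy that the upload route translates into a status and message. The class names, `require_module`, `to_http_error`, and the choice of 422 are assumptions; the plan only asks for explicit 4xx responses where appropriate, and the sketch avoids any web framework so it stays self-contained.

```python
import importlib.util


class DocumentParseError(Exception):
    """Expected parser/business failure that should not surface as a 500."""
    status_code = 422  # assumed; the plan does not fix a specific 4xx code


class ParserDependencyMissing(DocumentParseError):
    """An optional parser dependency is not installed."""

    def __init__(self, file_kind: str, package: str) -> None:
        super().__init__(f"{file_kind} parsing dependency missing: {package}")


def require_module(file_kind: str, module: str, package: str) -> None:
    """Raise a clear error if a top-level `module` cannot be imported."""
    if importlib.util.find_spec(module) is None:
        raise ParserDependencyMissing(file_kind, package)


def to_http_error(exc: Exception) -> tuple[int, str]:
    """Translate an exception into (status, message) for the router."""
    if isinstance(exc, DocumentParseError):
        return exc.status_code, str(exc)
    # Anything else is a true unexpected failure and stays a 500.
    return 500, "Internal server error"
```

A DOCX parser could call `require_module("DOCX", "docx", "python-docx")` before importing, so the route reports the missing package by name rather than a stack trace.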
Testing Plan
Backend unit/integration tests
- Schema migration test for new `documents` columns
- Renderer tests:
  - markdown headings preserved
  - section paths retained in metadata
  - xlsx/csv table blocks rendered predictably
  - pdf page markers preserved from MinerU mapping
- Upload tests:
  - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
  - PDF upload stores `normalized_content`
  - missing dependency returns clear error instead of generic 500 where applicable
- Rebuild/reindex tests:
  - normalized content regenerated
  - chunks rebuilt with hierarchy metadata
- Retrieval tests:
  - related chunk lookup still works with enriched metadata
Frontend tests
Only if the UI surfaces normalized preview directly in this phase:
- knowledge view preview prefers normalized content from API
- no regression in upload and refresh persistence behavior
Suggested Execution Order
- Add schema fields + migration guard
- Add structured markdown renderer for current parsers
- Store normalized content on upload
- Update content preview to read normalized content first
- Enrich chunk metadata with lightweight hierarchy keys
- Integrate MinerU for PDF
- Add rebuild/backfill path
- Expand tests
Risks and Mitigations
Risk: MinerU integration complexity
Mitigation:
- isolate MinerU to PDF branch only
- keep internal ParsedDocument contract stable
Risk: markdown rendering loses structure
Mitigation:
- preserve critical structure in metadata
- use explicit block markers for page/sheet/table boundaries
Risk: broad retrieval regressions
Mitigation:
- keep chunking source node-based initially
- change one layer at a time
Risk: old documents lack normalized content
Mitigation:
- lazy backfill during preview/reindex
Deliverable Recommendation
Implement in small PR-sized slices:
- schema + normalized renderer + preview fallback
- hierarchy metadata enrichment
- MinerU PDF integration
- rebuild/backfill tooling