# Knowledge Ingestion Normalization Plan
## Goal
Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.
## Scope
- Backend document parsing and normalization flow
- Document persistence model updates
- Incremental retrieval/indexing integration
- Backfill/reindex strategy for existing documents
- Test strategy for parser, router, and migration behavior
## Non-Goals
- Full parent-child chunk graph tables in this phase
- Rewriting all chunking logic to markdown-first immediately
- Replacing all non-PDF parsers with a new framework
- Solving every OCR/image-understanding case in the first pass
## Architecture Decisions
- **PDF parser:** MinerU
- **Other parsers:** keep current implementations for DOCX/XLSX/CSV/MD/TXT
- **Canonical intermediate representation:** `ParsedDocument + structured_markdown`
- **Canonical persisted content:** add `normalized_content` to `documents`
- **Hierarchy model:** metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
- **Migration strategy:** additive schema change + on-demand rebuild/reindex
## Target Flow
1. Upload file
2. Parse by type
   - PDF -> MinerU -> normalize to `ParsedDocument`
   - Other formats -> current parser -> `ParsedDocument`
3. Render `ParsedDocument` into `structured_markdown`
4. Persist document record including `normalized_content`
5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
6. Index into vector store
7. Serve preview from `normalized_content`
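Step 2 of the flow (parse by type) can be sketched as a small dispatch table. This is a minimal illustration, not the project's code: the `ParsedNode` fields, `parse_pdf_with_mineru`, `parse_plain_text`, `PARSERS`, and `select_parser` are hypothetical names, and the DOCX/XLSX/CSV entries are omitted for brevity.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class ParsedNode:
    # Hypothetical minimal shape; the real ParsedNode likely carries more fields.
    kind: str  # e.g. "heading", "paragraph"
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ParsedDocument:
    nodes: list[ParsedNode]


def parse_pdf_with_mineru(data: bytes) -> ParsedDocument:
    # Placeholder only: the actual MinerU call and output mapping are not shown.
    raise NotImplementedError("MinerU integration lands in Phase 3")


def parse_plain_text(data: bytes) -> ParsedDocument:
    # Stand-in for the existing TXT/MD parser: one paragraph node per text block.
    blocks = [block.strip() for block in data.decode("utf-8").split("\n\n")]
    return ParsedDocument([ParsedNode("paragraph", b) for b in blocks if b])


# Every parser returns the same ParsedDocument contract, whatever the input type.
PARSERS: dict[str, Callable[[bytes], ParsedDocument]] = {
    ".pdf": parse_pdf_with_mineru,
    ".txt": parse_plain_text,
    ".md": parse_plain_text,
}


def select_parser(filename: str) -> Callable[[bytes], ParsedDocument]:
    """Pick a parser by file extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename}") from None
```

Keeping the dispatch in one table means the PDF branch can be swapped to MinerU later without touching the other formats.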
## Data Model Changes
### documents table
Add fields:
- `normalized_content TEXT NULL`
- `normalized_format VARCHAR(50) NULL` (e.g. `structured_markdown`)
- optional later: `normalization_version VARCHAR(50) NULL`
### document_chunks metadata
Enrich chunk metadata with lightweight hierarchy keys:
- `chunk_level`
- `parent_key`
- `block_key`
- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)
Rationale:
- Supports grouped retrieval and contextual reconstruction
- Avoids introducing a relational chunk tree prematurely
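To illustrate how metadata-only hierarchy can support grouped retrieval, here is a minimal sketch. The key formats (`section:<title>` and `<parent>#<n>`) and the `group_by_parent` helper are assumptions made for this example; the plan does not fix the key format.

```python
from collections import defaultdict

# Example chunk records; the key formats are illustrative assumptions.
chunks = [
    {"text": "Install the package.",
     "metadata": {"chunk_level": 1, "parent_key": "section:Setup", "block_key": "section:Setup#0"}},
    {"text": "Configure the service.",
     "metadata": {"chunk_level": 1, "parent_key": "section:Setup", "block_key": "section:Setup#1"}},
    {"text": "Run a query.",
     "metadata": {"chunk_level": 1, "parent_key": "section:Usage", "block_key": "section:Usage#0"}},
]


def group_by_parent(items: list[dict]) -> dict:
    """Group chunks that share a parent_key, preserving their original order."""
    groups = defaultdict(list)
    for chunk in items:
        groups[chunk["metadata"].get("parent_key")].append(chunk)
    return dict(groups)
```

A retrieval hit on one chunk can then pull in its siblings through `parent_key` without any parent-child tables.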
## Backend Implementation Steps
### Phase 1: Schema and persistence
Files:
- `backend/app/models/document.py`
- `backend/app/database.py`
- `backend/app/schemas/document.py`
- tests under `backend/tests/backend/app`
Changes:
- Add `normalized_content` and `normalized_format` to `Document`
- Extend `ensure_document_columns()` to add the new columns to existing databases
- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion if not required yet)
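An additive, idempotent column guard could look like the sketch below. It assumes a SQLite database accessed through the standard `sqlite3` module, which may not match the project's actual database layer, so treat it as an illustration of the approach rather than the real `ensure_document_columns()`.

```python
import sqlite3

# Column names come from the plan. Both columns are nullable, which is the
# default in SQLite, so no explicit NULL constraint is written here.
NEW_DOCUMENT_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `documents`; return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, column_type in NEW_DOCUMENT_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {column_type}")
            added.append(name)
    conn.commit()
    return added
```

Because the guard only adds what is missing, it is safe to run at every startup, and existing rows simply get `NULL` in the new columns.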
### Phase 2: Introduce structured markdown renderer
Files:
- `backend/app/services/document_service.py`
- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially
Changes:
- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
- Keep current per-format parsing functions
- After parsing, render once and store into `document.normalized_content`
- Set `normalized_format='structured_markdown'`
Rendering guidance:
- headings -> markdown headings
- paragraphs/text -> plain markdown paragraphs
- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
- PDF page boundaries -> explicit page markers
- preserve contextual markers in metadata even if markdown cannot express everything perfectly
### Phase 3: MinerU integration for PDF
Files:
- `backend/app/services/document_service.py`
- `backend/pyproject.toml` / lockfile if dependencies are added
- config if MinerU requires configurable paths/options
Changes:
- Replace PDF branch with MinerU-backed parsing
- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
- Preserve page and block order
- Represent image blocks as markdown placeholders plus metadata
Image policy:
- First pass: extract image block references, page number, nearby text, and optional captions
- Do not perform full image understanding for every image in phase 1
- Design metadata so high-value image understanding can be added later
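One way to represent an image block under this policy is sketched below. The input `block` dict is a hypothetical intermediate shape, not MinerU's actual output format, and `image_block_to_node` and the `image_understanding` key are illustrative names.

```python
def image_block_to_node(block: dict, nearby_text: str = "") -> dict:
    """Turn an extracted image block into a markdown placeholder plus metadata.

    `block` is an assumed shape: {"image_ref": str, "page_number": int, "caption": str}.
    """
    caption = (block.get("caption") or "").strip()
    image_ref = block["image_ref"]
    return {
        "kind": "image",
        # The placeholder keeps the image visible in the normalized markdown.
        "text": f"![{caption or 'image'}]({image_ref})",
        "metadata": {
            "content_type": "image",
            "page_number": block.get("page_number"),
            "image_ref": image_ref,
            "caption": caption or None,
            "nearby_text": nearby_text.strip()[:200],
            # Reserved slot so later image understanding can be attached
            # without changing the metadata shape.
            "image_understanding": None,
        },
    }
```

Reserving the `image_understanding` slot up front is one way to keep later enrichment additive.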
### Phase 4: Chunk metadata enrichment
Files:
- `backend/app/services/document_service.py`
- `backend/app/services/knowledge_service.py`
- tests
Changes:
- Extend `_build_chunks()` to include lightweight hierarchy metadata:
  - section headings become natural parent keys
  - row batches / sheet blocks get stable block keys
  - PDF page/section blocks preserve ordered grouping
- Keep current retrieval behavior, but let `_get_related_chunks()` benefit from richer metadata if helpful
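A sketch of how hierarchy keys might be derived from the existing structural metadata follows. It assumes `section_path` is a list of heading titles and uses readable string keys; both are assumptions, and `hierarchy_metadata` is a hypothetical helper rather than part of `_build_chunks()` today.

```python
def hierarchy_metadata(meta: dict, ordinal: int) -> dict:
    """Return a copy of `meta` enriched with chunk_level, parent_key, block_key.

    `ordinal` is the chunk's position within its parent, which keeps
    ordered grouping for section and page blocks.
    """
    section_path = list(meta.get("section_path") or [])
    if meta.get("sheet_name") is not None:
        # Sheet blocks: the row range makes the key stable across rebuilds.
        parent_key = f"sheet:{meta['sheet_name']}"
        block_key = f"{parent_key}#rows:{meta.get('row_start')}-{meta.get('row_end')}"
    elif section_path:
        # Section headings act as the natural parent.
        parent_key = "section:" + "/".join(section_path)
        block_key = f"{parent_key}#{ordinal}"
    else:
        # No section information: fall back to page-level grouping.
        parent_key = f"page:{meta.get('page_number')}"
        block_key = f"{parent_key}#{ordinal}"
    return {
        **meta,
        "chunk_level": len(section_path),
        "parent_key": parent_key,
        "block_key": block_key,
    }
```

Readable keys are convenient for debugging; hashing them would be a reasonable alternative if section titles can be very long.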
### Phase 5: Preview and rebuild behavior
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`
Changes:
- `get_document_content()` should prefer `normalized_content`
- Fall back to legacy file reading only when normalized content is absent
- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
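The preview preference can be sketched as below. The `Document` dataclass is a simplified stand-in for the real model, and passing the legacy reader as a `read_legacy` callable is an assumption made to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    # Simplified stand-in for the real Document model.
    file_path: str
    normalized_content: Optional[str] = None


def get_document_content(document: Document, read_legacy: Callable[[str], str]) -> str:
    """Prefer persisted normalized content; use the legacy reader only when absent."""
    if document.normalized_content is not None:
        return document.normalized_content
    return read_legacy(document.file_path)
```

Checking `is not None` rather than truthiness means a document that legitimately normalizes to an empty string does not trigger a legacy read on every preview.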
### Phase 6: Backfill strategy
Approach:
- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
- Existing documents can be migrated lazily:
  - when opened
  - when reindexed
  - or via an admin/batch rebuild command later
This avoids a risky one-shot migration.
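A lazy backfill hook might look like this minimal sketch. `ensure_normalized`, the `normalize` and `save` callables, and the simplified `Document` are hypothetical; the real flow would call the existing parse and persistence code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    # Simplified stand-in for the real Document model.
    file_path: str
    normalized_content: Optional[str] = None
    normalized_format: Optional[str] = None


def ensure_normalized(
    document: Document,
    normalize: Callable[[Document], str],
    save: Callable[[Document], None],
) -> bool:
    """Backfill normalized content on first access; return True if work was done."""
    if document.normalized_content is not None:
        return False
    document.normalized_content = normalize(document)
    document.normalized_format = "structured_markdown"
    save(document)
    return True
```

Calling this from both the open and reindex paths would cover the lazy cases listed above, and a later batch command could reuse the same function.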
## Error Handling Changes
Current issue:
- Upload route can leak parser/dependency problems as generic 500s.
Changes:
- Convert expected parser/business errors to explicit 4xx responses where appropriate
- For missing optional parser dependencies, return clear messages such as:
  - `DOCX parsing dependency missing: python-docx`
  - `PDF parsing dependency missing/configuration invalid`
- Keep true unexpected exceptions as 500s
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`
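The error mapping could be sketched with a small exception hierarchy, kept framework-agnostic here. The class names, `require_module`, `to_error_response`, and the specific status codes (415, 422, 503) are illustrative assumptions; the plan only says expected errors should get explicit, clear responses.

```python
import importlib


class DocumentParseError(Exception):
    """Expected parser/business error; carries the HTTP status to return."""
    status_code = 422


class UnsupportedFileTypeError(DocumentParseError):
    status_code = 415


class ParserDependencyMissingError(DocumentParseError):
    # Assumed choice: a missing server-side dependency is reported as 503
    # with a clear message instead of a generic 500.
    status_code = 503


def require_module(module_name: str, package_name: str, label: str):
    """Import an optional parser dependency or raise a descriptive error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ParserDependencyMissingError(
            f"{label} parsing dependency missing: {package_name}"
        ) from exc


def to_error_response(exc: Exception) -> tuple[int, str]:
    """Map an exception to (status_code, detail) for the upload route."""
    if isinstance(exc, DocumentParseError):
        return exc.status_code, str(exc)
    # Truly unexpected failures stay as 500 and do not leak internals.
    return 500, "Internal server error"
```

For example, the DOCX branch could call `require_module("docx", "python-docx", "DOCX")` before parsing, and the router would translate the result of `to_error_response` into its own response type.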
## Testing Plan
### Backend unit/integration tests
1. Schema migration test for new `documents` columns
2. Renderer tests:
   - markdown headings preserved
   - section paths retained in metadata
   - XLSX/CSV table blocks rendered predictably
   - PDF page markers preserved from MinerU mapping
3. Upload tests:
   - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
   - PDF upload stores `normalized_content`
   - missing dependency returns clear error instead of generic 500 where applicable
4. Rebuild/reindex tests:
   - normalized content regenerated
   - chunks rebuilt with hierarchy metadata
5. Retrieval tests:
   - related chunk lookup still works with enriched metadata
### Frontend tests
Only if the UI surfaces normalized preview directly in this phase:
- knowledge view preview prefers normalized content from API
- no regression in upload and refresh persistence behavior
## Suggested Execution Order
1. Add schema fields + migration guard
2. Add structured markdown renderer for current parsers
3. Store normalized content on upload
4. Update content preview to read normalized content first
5. Enrich chunk metadata with lightweight hierarchy keys
6. Integrate MinerU for PDF
7. Add rebuild/backfill path
8. Expand tests
## Risks and Mitigations
### Risk: MinerU integration complexity
Mitigation:
- isolate MinerU to PDF branch only
- keep internal ParsedDocument contract stable
### Risk: markdown rendering loses structure
Mitigation:
- preserve critical structure in metadata
- use explicit block markers for page/sheet/table boundaries
### Risk: broad retrieval regressions
Mitigation:
- keep chunking source node-based initially
- change one layer at a time
### Risk: old documents lack normalized content
Mitigation:
- lazy backfill during preview/reindex
## Deliverable Recommendation
Implement in small PR-sized slices:
1. schema + normalized renderer + preview fallback
2. hierarchy metadata enrichment
3. MinerU PDF integration
4. rebuild/backfill tooling