# Knowledge Ingestion Normalization Plan

## Goal
Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.

## Scope
- Backend document parsing and normalization flow
- Document persistence model updates
- Incremental retrieval/indexing integration
- Backfill/reindex strategy for existing documents
- Test strategy for parser, router, and migration behavior

## Non-Goals
- Full parent-child chunk graph tables in this phase
- Rewriting all chunking logic to markdown-first immediately
- Replacing all non-PDF parsers with a new framework
- Solving every OCR/image-understanding case in the first pass

## Architecture Decisions
- **PDF parser:** MinerU
- **Other parsers:** keep current implementations for DOCX/XLSX/CSV/MD/TXT
- **Canonical intermediate representation:** `ParsedDocument` + `structured_markdown`
- **Canonical persisted content:** add `normalized_content` to `documents`
- **Hierarchy model:** metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
- **Migration strategy:** additive schema change + on-demand rebuild/reindex
## Target Flow
1. Upload file
2. Parse by type
   - PDF -> MinerU -> normalize to `ParsedDocument`
   - Other formats -> current parser -> `ParsedDocument`
3. Render `ParsedDocument` into `structured_markdown`
4. Persist document record including `normalized_content`
5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
6. Index into vector store
7. Serve preview from `normalized_content`
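Steps 2 to 4 of the flow can be sketched as a small router. This is a minimal illustration, not the existing service code: `ParsedDocument` is reduced to a list of text blocks, and `parse_plain_text`, `PARSERS`, `render_structured_markdown`, and `ingest` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class ParsedDocument:
    # Simplified stand-in: the real ParsedDocument presumably carries typed nodes.
    blocks: list[str] = field(default_factory=list)


def parse_plain_text(data: bytes) -> ParsedDocument:
    """Stand-in for an existing TXT/MD parser: one block per blank-line-separated run."""
    raw_blocks = data.decode("utf-8").split("\n\n")
    return ParsedDocument([b.strip() for b in raw_blocks if b.strip()])


# Step 2: route by extension. A ".pdf" entry would point at the MinerU adapter.
PARSERS: dict[str, Callable[[bytes], ParsedDocument]] = {
    ".txt": parse_plain_text,
    ".md": parse_plain_text,
}


def render_structured_markdown(parsed: ParsedDocument) -> str:
    """Step 3, reduced here to joining blocks with blank lines."""
    return "\n\n".join(parsed.blocks)


def ingest(filename: str, data: bytes) -> dict:
    suffix = Path(filename).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    parsed = parser(data)
    # Step 4: the record that would be persisted; later steps read from it.
    return {
        "filename": filename,
        "normalized_content": render_structured_markdown(parsed),
        "normalized_format": "structured_markdown",
    }
```

Keeping the routing table separate from the renderer means adding the MinerU branch later only touches `PARSERS`, provided the adapter returns the same `ParsedDocument` shape.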
## Data Model Changes
### documents table
Add fields:
- `normalized_content TEXT NULL`
- `normalized_format VARCHAR(50) NULL` (value like `structured_markdown`)
- optional later: `normalization_version VARCHAR(50) NULL`

### document_chunks metadata
Enrich chunk metadata with lightweight hierarchy keys:
- `chunk_level`
- `parent_key`
- `block_key`
- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)

Rationale:
- Supports grouped retrieval and contextual reconstruction
- Avoids introducing a relational chunk tree prematurely
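To illustrate the rationale, the sketch below shows how a flat list of chunks could be regrouped by `parent_key` to rebuild the context around a retrieval hit, without any parent-child table. The chunk dictionaries, the key values, and the helper names `group_by_parent` and `expand_context` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical chunk records; only the metadata key names come from the plan.
chunks = [
    {"text": "Install the CLI.",
     "metadata": {"chunk_level": 2, "parent_key": "guide/setup", "block_key": "b1"}},
    {"text": "Set the API token.",
     "metadata": {"chunk_level": 2, "parent_key": "guide/setup", "block_key": "b2"}},
    {"text": "Run the first query.",
     "metadata": {"chunk_level": 2, "parent_key": "guide/usage", "block_key": "b3"}},
]


def group_by_parent(items: list[dict]) -> dict[str, list[dict]]:
    """Bucket chunks by parent_key, keeping their original order inside each bucket."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        groups[item["metadata"]["parent_key"]].append(item)
    return dict(groups)


def expand_context(hit: dict, items: list[dict]) -> str:
    """Return the text of every chunk that shares the hit's parent_key."""
    siblings = group_by_parent(items)[hit["metadata"]["parent_key"]]
    return "\n".join(chunk["text"] for chunk in siblings)
```

Because grouping is derived from metadata at read time, the key scheme can change with a reindex rather than a schema migration.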
## Backend Implementation Steps
### Phase 1: Schema and persistence
Files:
- `backend/app/models/document.py`
- `backend/app/database.py`
- `backend/app/schemas/document.py`
- tests under `backend/tests/backend/app`

Changes:
- Add `normalized_content` and `normalized_format` to `Document`
- Extend `ensure_document_columns()` to add the new columns to existing databases
- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion until it is required)
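A minimal sketch of the additive migration guard, assuming SQLite and the standard-library `sqlite3` driver (the plan does not name the database). The body of `ensure_document_columns()` is illustrative only; the real function's signature may differ.

```python
import sqlite3

# Columns are nullable by default in SQLite, so NULL is left implicit here.
NEW_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `documents` and return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, column_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {column_type}")
            added.append(name)
    conn.commit()
    return added
```

The guard is idempotent: running it at every startup is safe, because columns that already exist are skipped.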
### Phase 2: Introduce structured markdown renderer
Files:
- `backend/app/services/document_service.py`
- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially

Changes:
- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
- Keep current per-format parsing functions
- After parsing, render once and store the result in `document.normalized_content`
- Set `normalized_format='structured_markdown'`

Rendering guidance:
- headings -> markdown headings
- paragraphs/text -> plain markdown paragraphs
- CSV/XLSX tables -> markdown table blocks, or fenced structured table blocks when tables are too large/wide
- PDF page boundaries -> explicit page markers
- preserve contextual markers in metadata even if markdown cannot express everything perfectly
### Phase 3: MinerU integration for PDF
Files:
- `backend/app/services/document_service.py`
- `backend/pyproject.toml` / lockfile if dependencies are added
- config if MinerU requires configurable paths/options

Changes:
- Replace PDF branch with MinerU-backed parsing
- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
- Preserve page and block order
- Represent image blocks as markdown placeholders plus metadata

Image policy:
- First pass: extract image block references, page number, nearby text, and optional captions
- Do not perform full image understanding for every image in the first pass
- Design metadata so high-value image understanding can be added later
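The mapping step and the image policy might be sketched as follows. The input block dictionaries (`type`, `text`, `page`, `order`, `ref`, `caption`) are a made-up shape used only for illustration, not MinerU's actual output schema, which should be checked against the MinerU documentation. `map_blocks` and the `ParsedNode` fields are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedNode:
    kind: str
    text: str
    page_number: int
    metadata: dict = field(default_factory=dict)


def map_blocks(blocks: list[dict]) -> list[ParsedNode]:
    """Convert parser blocks (assumed shape) into ParsedNode objects in reading order."""
    nodes: list[ParsedNode] = []
    last_text = ""
    # Sort by (page, order) so page and block order survive the mapping.
    for block in sorted(blocks, key=lambda b: (b["page"], b["order"])):
        if block["type"] == "image":
            caption = block.get("caption", "")
            nodes.append(ParsedNode(
                kind="image",
                text=f"![{caption}]({block['ref']})",  # markdown placeholder
                page_number=block["page"],
                metadata={
                    "image_ref": block["ref"],
                    "caption": caption,
                    "nearby_text": last_text,
                    "image_understanding": None,  # reserved for a later pass
                },
            ))
        else:
            last_text = block["text"]
            nodes.append(ParsedNode(block["type"], block["text"], block["page"]))
    return nodes
```

Reserving an empty `image_understanding` slot keeps the first pass cheap while leaving a stable place for later enrichment of high-value images.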
### Phase 4: Chunk metadata enrichment
Files:
- `backend/app/services/document_service.py`
- `backend/app/services/knowledge_service.py`
- tests

Changes:
- Extend `_build_chunks()` to include lightweight hierarchy metadata:
  - section headings become natural parent keys
  - row batches / sheet blocks get stable block keys
  - PDF page/section blocks preserve ordered grouping
- Keep current retrieval behavior, but allow `_get_related_chunks()` to use the richer metadata where it improves results
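One possible way to derive the keys is shown below, as a sketch only. The slug-based `parent_key`, the hashed `block_key`, the use of section depth as `chunk_level`, and the helper names are all assumptions; the plan does not prescribe a key format.

```python
import hashlib
import re


def _slug(value: str) -> str:
    """Lowercase and collapse runs of non-alphanumeric characters into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def make_parent_key(section_path: list[str]) -> str:
    """Section headings become the parent key; chunks outside any section share 'root'."""
    return "/".join(_slug(part) for part in section_path) or "root"


def make_block_key(document_id: str, parent_key: str, index: int) -> str:
    """Stable across rebuilds as long as the document id, section, and position match."""
    raw = f"{document_id}:{parent_key}:{index}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()[:12]


def build_chunk_metadata(document_id: str, nodes: list[dict]) -> list[dict]:
    """nodes: dicts with 'text' and 'section_path' (a list of heading titles)."""
    chunks = []
    counters: dict[str, int] = {}
    for node in nodes:
        parent_key = make_parent_key(node["section_path"])
        index = counters.get(parent_key, 0)
        counters[parent_key] = index + 1
        chunks.append({
            "text": node["text"],
            "metadata": {
                "chunk_level": len(node["section_path"]),
                "parent_key": parent_key,
                "block_key": make_block_key(document_id, parent_key, index),
            },
        })
    return chunks
```

Deriving `block_key` from position rather than content keeps it stable when text is re-extracted with small differences, at the cost of shifting keys when blocks are inserted mid-section.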
### Phase 5: Preview and rebuild behavior
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`

Changes:
- `get_document_content()` should prefer `normalized_content`
- Fall back to legacy file reading only when normalized content is absent
- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
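The read preference can be sketched as below. The `Document` dataclass, the injectable `read_legacy` callable, and the returned source label are assumptions for illustration; the real `get_document_content()` may take different arguments.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Document:
    file_path: str
    normalized_content: Optional[str] = None


def read_legacy_file(path: str) -> str:
    """Legacy behavior: read the stored upload directly from disk."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def get_document_content(
    document: Document,
    read_legacy: Callable[[str], str] = read_legacy_file,
) -> tuple[str, str]:
    """Return (content, source), where source is 'normalized' or 'legacy_file'."""
    # Only NULL counts as absent; an empty string is a valid normalized result.
    if document.normalized_content is not None:
        return document.normalized_content, "normalized"
    return read_legacy(document.file_path), "legacy_file"
```

Returning the source alongside the content makes it easy for callers and tests to tell which path served the preview.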
### Phase 6: Backfill strategy
Approach:
- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
- Existing documents can be migrated lazily:
  - when opened
  - when reindexed
  - or via an admin/batch rebuild command later

This avoids a risky one-shot migration.
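A batch variant of the lazy backfill might look like the sketch below. Documents are plain dictionaries here and `rebuild` stands in for whatever re-parses and re-renders a stored file; both, along with the `limit` parameter, are assumptions for this example.

```python
from typing import Callable, Optional


def backfill_normalized(
    documents: list[dict],
    rebuild: Callable[[dict], str],
    limit: Optional[int] = None,
) -> int:
    """Fill in normalized content where it is missing; return how many were updated."""
    updated = 0
    for doc in documents:
        if doc.get("normalized_content") is not None:
            continue  # already migrated, so repeated runs are harmless
        if limit is not None and updated >= limit:
            break  # cap the work done in a single run
        doc["normalized_content"] = rebuild(doc)
        doc["normalized_format"] = "structured_markdown"
        updated += 1
    return updated
```

Because already-migrated documents are skipped, the same function can back an on-open hook (a list of one document) and a later admin command that processes the backlog in small batches.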
## Error Handling Changes
Current issue:
- Upload route can leak parser/dependency problems as generic 500s.

Changes:
- Convert expected parser/business errors to explicit 4xx responses where appropriate
- For missing optional parser dependencies, return clear messages such as:
  - `DOCX parsing dependency missing: python-docx`
  - `PDF parsing dependency missing/configuration invalid`
- Keep true unexpected exceptions as 500s
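The error translation could be sketched framework-independently as below. The exception classes, `require_module`, `to_http_error`, and in particular the status codes 415 and 422 are placeholders chosen for illustration; the plan does not fix specific codes.

```python
import importlib


class ParserDependencyError(Exception):
    """An optional parser dependency is missing or misconfigured."""


class UnsupportedDocumentError(Exception):
    """The uploaded file cannot be handled by any parser."""


def require_module(module_name: str, message: str):
    """Import an optional dependency, converting ImportError into a clear domain error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ParserDependencyError(message) from exc


def to_http_error(exc: Exception) -> tuple[int, str]:
    """Map an exception to (status_code, detail) for the router to return."""
    if isinstance(exc, UnsupportedDocumentError):
        return 415, str(exc)
    if isinstance(exc, ParserDependencyError):
        return 422, str(exc)
    # Unexpected failures stay 500 and do not leak internal details.
    return 500, "Internal server error"
```

In the DOCX branch this would be used as `require_module("docx", "DOCX parsing dependency missing: python-docx")`, since `docx` is the import name of the python-docx package.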
Files:
- `backend/app/routers/document.py`
- `backend/app/services/document_service.py`

## Testing Plan
### Backend unit/integration tests
1. Schema migration test for new `documents` columns
2. Renderer tests:
   - markdown headings preserved
   - section paths retained in metadata
   - XLSX/CSV table blocks rendered predictably
   - PDF page markers preserved from MinerU mapping
3. Upload tests:
   - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
   - PDF upload stores `normalized_content`
   - missing dependency returns clear error instead of generic 500 where applicable
4. Rebuild/reindex tests:
   - normalized content regenerated
   - chunks rebuilt with hierarchy metadata
5. Retrieval tests:
   - related chunk lookup still works with enriched metadata

### Frontend tests
Only if the UI surfaces normalized preview directly in this phase:
- knowledge view preview prefers normalized content from API
- no regression in upload and refresh persistence behavior
## Suggested Execution Order
1. Add schema fields + migration guard
2. Add structured markdown renderer for current parsers
3. Store normalized content on upload
4. Update content preview to read normalized content first
5. Enrich chunk metadata with lightweight hierarchy keys
6. Integrate MinerU for PDF
7. Add rebuild/backfill path
8. Expand tests
## Risks and Mitigations
### Risk: MinerU integration complexity
Mitigation:
- isolate MinerU to PDF branch only
- keep internal `ParsedDocument` contract stable

### Risk: markdown rendering loses structure
Mitigation:
- preserve critical structure in metadata
- use explicit block markers for page/sheet/table boundaries

### Risk: broad retrieval regressions
Mitigation:
- keep chunking source node-based initially
- change one layer at a time

### Risk: old documents lack normalized content
Mitigation:
- lazy backfill during preview/reindex
## Deliverable Recommendation
Implement in small PR-sized slices:
1. schema + normalized renderer + preview fallback
2. hierarchy metadata enrichment
3. MinerU PDF integration
4. rebuild/backfill tooling