DESKTOP-72TV0V4\caoxiaozhu 90ea732584 Add local project snapshots and plans

Capture the current local data snapshot and planning artifacts alongside
this development batch so the workspace state matches the code changes.
This preserves the reference materials and generated files that were
kept in the working tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-22 13:49:03 +08:00

Knowledge Ingestion Normalization Plan

Goal

Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.

Scope

Backend document parsing and normalization flow
Document persistence model updates
Incremental retrieval/indexing integration
Backfill/reindex strategy for existing documents
Test strategy for parser, router, and migration behavior

Non-Goals

Full parent-child chunk graph tables in this phase
Rewriting all chunking logic to markdown-first immediately
Replacing all non-PDF parsers with a new framework
Solving every OCR/image-understanding case in the first pass

Architecture Decisions

PDF parser: MinerU
Other parsers: keep current implementations for DOCX/XLSX/CSV/MD/TXT
Canonical intermediate representation: ParsedDocument + structured_markdown
Canonical persisted content: add normalized_content to documents
Hierarchy model: metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
Migration strategy: additive schema change + on-demand rebuild/reindex

Target Flow

Upload file
Parse by type
- PDF -> MinerU -> normalize to ParsedDocument
- Other formats -> current parser -> ParsedDocument
Render ParsedDocument into structured_markdown
Persist document record including normalized_content
Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
Index into vector store
Serve preview from normalized_content
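
The flow above can be sketched as a dispatch-by-extension step followed by a single render. This is a minimal sketch, not the existing code: `ingest`, `PARSERS`, `parse_pdf`, `parse_text`, and the dict-based node shape are hypothetical names chosen for illustration, and the MinerU branch is left as a placeholder.

```python
import os
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParsedDocument:
    # Assumed minimal shape; the real ParsedDocument is defined by the backend.
    nodes: list[dict] = field(default_factory=list)


def parse_pdf(data: bytes) -> ParsedDocument:
    # Placeholder for the MinerU-backed branch (Phase 3).
    raise NotImplementedError("MinerU integration is not wired into this sketch")


def parse_text(data: bytes) -> ParsedDocument:
    # Stand-in for the existing MD/TXT parser: one node per paragraph.
    paragraphs = [p.strip() for p in data.decode("utf-8").split("\n\n")]
    return ParsedDocument([{"kind": "paragraph", "text": p} for p in paragraphs if p])


# Every parser returns the same intermediate type, whatever the input format.
PARSERS: dict[str, Callable[[bytes], ParsedDocument]] = {
    ".pdf": parse_pdf,
    ".md": parse_text,
    ".txt": parse_text,
}


def ingest(filename: str, data: bytes, render: Callable[[ParsedDocument], str]) -> dict:
    suffix = os.path.splitext(filename)[1].lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    parsed = parser(data)
    # Render once; the result is what gets persisted and served for preview.
    return {
        "filename": filename,
        "normalized_content": render(parsed),
        "normalized_format": "structured_markdown",
        "node_count": len(parsed.nodes),
    }
```

Chunk building and vector indexing are omitted here; they would consume `parsed.nodes` after the record is persisted.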

Data Model Changes

documents table

Add fields:

normalized_content TEXT NULL
normalized_format VARCHAR(50) NULL (value like structured_markdown)
optional later: normalization_version VARCHAR(50) NULL

document_chunks metadata

Enrich chunk metadata with lightweight hierarchy keys:

chunk_level
parent_key
block_key
existing structural metadata remains (section_path, section_title, page_number, sheet_name, row_start, row_end, content_type)

Rationale:

Supports grouped retrieval and contextual reconstruction
Avoids introducing a relational chunk tree prematurely
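
As an illustration of grouped retrieval without a relational chunk tree, siblings of a hit can be recovered by matching `parent_key` in metadata. The chunk dict shape, the sample key values, and the helper names `group_by_parent` and `expand_hit` are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical chunks; only the metadata keys come from the plan.
EXAMPLE_CHUNKS = [
    {"content": "Intro text", "metadata": {"chunk_level": "block", "parent_key": "sec-intro", "block_key": "b1"}},
    {"content": "More intro", "metadata": {"chunk_level": "block", "parent_key": "sec-intro", "block_key": "b2"}},
    {"content": "Other part", "metadata": {"chunk_level": "block", "parent_key": "sec-other", "block_key": "b3"}},
]


def group_by_parent(chunks: list[dict]) -> dict:
    """Bucket chunks by parent_key, preserving their original order."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["metadata"].get("parent_key")].append(chunk)
    return dict(groups)


def expand_hit(hit: dict, chunks: list[dict]) -> list[dict]:
    """Return every chunk sharing the hit's parent_key (the hit included)."""
    parent = hit["metadata"].get("parent_key")
    if parent is None:
        return [hit]
    return group_by_parent(chunks).get(parent, [hit])
```

Because grouping is a plain metadata lookup, it works with the existing chunk storage and needs no new tables.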

Backend Implementation Steps

Phase 1: Schema and persistence

Files:

backend/app/models/document.py
backend/app/database.py
backend/app/schemas/document.py
tests under backend/tests/backend/app

Changes:

Add normalized_content and normalized_format to Document
Extend ensure_document_columns() to add the new columns to existing databases
Expose normalized_content only where needed for preview/read APIs (avoid broad API expansion if not required yet)
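
A minimal sketch of an additive column guard, assuming a SQLite database and a function that takes a connection. The real `ensure_document_columns()` signature and database backend are not stated in this plan, so both are assumptions here.

```python
import sqlite3

# Columns are nullable by default, so no explicit NULL constraint is needed.
NEW_DOCUMENT_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `documents`; return the names that were added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, column_type in NEW_DOCUMENT_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {column_type}")
            added.append(name)
    conn.commit()
    return added
```

The guard is idempotent: a second call finds both columns present and changes nothing, which makes it safe to run on every startup.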

Phase 2: Introduce structured markdown renderer

Files:

backend/app/services/document_service.py
possibly a new helper module if the renderer gets too large, but prefer keeping it local initially

Changes:

Add _render_structured_markdown(parsed: ParsedDocument) -> str
Keep current per-format parsing functions
After parsing, render once and store into document.normalized_content
Set normalized_format='structured_markdown' on the same document record

Rendering guidance:

headings -> markdown headings
paragraphs/text -> plain markdown paragraphs
CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
PDF page boundaries -> explicit page markers
preserve contextual markers in metadata even if markdown cannot express everything perfectly

Phase 3: MinerU integration for PDF

Files:

backend/app/services/document_service.py
backend/pyproject.toml / lockfile if dependencies are added
config if MinerU requires configurable paths/options

Changes:

Replace PDF branch with MinerU-backed parsing
Map MinerU output into internal ParsedNode/ParsedDocument
Preserve page and block order
Represent image blocks as markdown placeholders plus metadata

Image policy:

First pass: extract image block references, page number, nearby text, and optional captions
Do not perform full image understanding for every image in the first pass
Design metadata so high-value image understanding can be added later
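
One possible shape for an image block under this policy is a markdown placeholder plus a metadata dict with a reserved slot for later understanding. The input keys (`image_ref`, `page_number`, `caption`, `nearby_text`) are hypothetical and do not describe MinerU's actual output format, which would need to be mapped into something like this.

```python
def image_block_to_node(block: dict) -> dict:
    """Map a (hypothetical) extracted image block to a placeholder node."""
    caption = (block.get("caption") or "").strip()
    # Use the caption as alt text when present, else a positional description.
    alt_text = caption or f"image on page {block['page_number']}"
    return {
        "content_type": "image",
        "markdown": f"![{alt_text}]({block['image_ref']})",
        "metadata": {
            "page_number": block["page_number"],
            "image_ref": block["image_ref"],
            "caption": caption or None,
            "nearby_text": block.get("nearby_text", ""),
            # Reserved for a later pass; stays None in the first pass.
            "image_understanding": None,
        },
    }
```

Keeping `image_understanding` as an explicit empty field means a later enrichment job can fill it in without changing the metadata schema.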

Phase 4: Chunk metadata enrichment

Files:

backend/app/services/document_service.py
backend/app/services/knowledge_service.py
tests

Changes:

Extend _build_chunks() to include lightweight hierarchy metadata:
- section headings become natural parent keys
- row batches / sheet blocks get stable block keys
- PDF page/section blocks preserve ordered grouping
Keep current retrieval behavior, but let _get_related_chunks() benefit from richer metadata if helpful
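
A sketch of how hierarchy keys might be derived while walking nodes in order. It is written as a standalone `build_chunks` rather than the existing `_build_chunks()`, whose signature is not shown in this plan; the dict node shape, the `chunk_level` values, and the hash-based key scheme are all assumptions.

```python
import hashlib


def _stable_key(*parts: str) -> str:
    """Deterministic short key, so rebuilds produce the same identifiers."""
    joined = "\x1f".join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def build_chunks(doc_id: str, nodes: list[dict]) -> list[dict]:
    chunks = []
    section_path: list[str] = []
    current_section_key = None
    for index, node in enumerate(nodes):
        if node["kind"] == "heading":
            # Truncate the path to the enclosing level, then append this heading.
            section_path = section_path[: node["level"] - 1] + [node["text"]]
            parent_key = _stable_key(doc_id, *section_path[:-1]) if len(section_path) > 1 else None
            block_key = _stable_key(doc_id, *section_path)
            current_section_key = block_key
            chunk_level = "section"
        else:
            parent_key = current_section_key
            block_key = _stable_key(doc_id, str(index), node["text"])
            chunk_level = "block"
        chunks.append({
            "content": node["text"],
            "metadata": {
                "chunk_level": chunk_level,
                "parent_key": parent_key,
                "block_key": block_key,
                "section_path": " / ".join(section_path),
            },
        })
    return chunks
```

A heading's `block_key` doubles as the `parent_key` of everything beneath it, so a section and its blocks can be regrouped from metadata alone.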

Phase 5: Preview and rebuild behavior

Files:

backend/app/routers/document.py
backend/app/services/document_service.py

Changes:

get_document_content() should prefer normalized_content
Fall back to legacy file reading only when normalized content is absent
Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
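
The preference order and the rebuild ordering can be sketched as follows. Documents are plain dicts here and the collaborators (`read_legacy_file`, `parse`, `render`, `build_chunks`, `index`) are injected callables; these shapes are assumptions, not the existing service signatures.

```python
from typing import Callable, Optional


def get_document_content(document: dict, read_legacy_file: Callable[[str], str]) -> str:
    normalized: Optional[str] = document.get("normalized_content")
    # `is not None` so an empty but normalized document is not re-read from disk.
    if normalized is not None:
        return normalized
    # Legacy path: only reached for documents ingested before normalization existed.
    return read_legacy_file(document["file_path"])


def rebuild_document(document: dict, parse, render, build_chunks, index) -> list:
    parsed = parse(document["file_path"])
    # Regenerate normalized content first, so preview and chunks share one parse.
    document["normalized_content"] = render(parsed)
    document["normalized_format"] = "structured_markdown"
    chunks = build_chunks(parsed)
    index(document, chunks)
    return chunks
```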

Phase 6: Backfill strategy

Approach:

Add a rebuild endpoint or reuse existing reindex flow to backfill normalized_content
Existing documents can be migrated lazily:
- when opened
- when reindexed
- or via an admin/batch rebuild command later

This avoids a risky one-shot migration.
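
A later admin or batch rebuild command might be as small as the loop below. The function name, the dict-based document records, and the `limit` parameter are hypothetical; `normalize` stands in for whatever parse-and-render call the service ends up exposing.

```python
from typing import Callable, Iterable, Optional


def backfill_normalized_content(
    documents: Iterable[dict],
    normalize: Callable[[dict], str],
    limit: Optional[int] = None,
) -> int:
    """Fill normalized_content where it is missing; return how many were migrated."""
    migrated = 0
    for document in documents:
        if limit is not None and migrated >= limit:
            break
        if document.get("normalized_content") is not None:
            continue  # already migrated, so reruns are harmless
        try:
            content = normalize(document)
        except Exception:
            # One unparseable file should not abort the batch; it stays pending.
            continue
        document["normalized_content"] = content
        document["normalized_format"] = "structured_markdown"
        migrated += 1
    return migrated
```

Running it with a small `limit` allows the backfill to proceed in batches, and failed documents remain eligible for the lazy paths above.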

Error Handling Changes

Current issue:

Upload route can leak parser/dependency problems as generic 500s.

Changes:

Convert expected parser/business errors to explicit 4xx responses where appropriate
For missing optional parser dependencies, return clear messages such as:
- DOCX parsing dependency missing: python-docx
- PDF parsing dependency missing or configuration invalid
Keep true unexpected exceptions as 500s
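
A framework-neutral sketch of this mapping is shown below. The exception classes, `require_module`, `to_http_error`, and the specific status codes (415 and 422) are assumptions; the plan only asks for explicit 4xx responses with clear messages, and the router would translate the returned pair into its own HTTP error type.

```python
import importlib.util


class UnsupportedDocumentError(ValueError):
    """The uploaded file type has no parser."""


class ParserDependencyError(RuntimeError):
    """An optional parser dependency is missing or misconfigured."""


def require_module(module_name: str, package_name: str, label: str) -> None:
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ParserDependencyError(f"{label} parsing dependency missing: {package_name}")


# Expected errors and the status codes assumed for them in this sketch.
EXPECTED_ERRORS = (
    (UnsupportedDocumentError, 415),
    (ParserDependencyError, 422),
)


def to_http_error(exc: Exception) -> tuple[int, str]:
    for error_type, status in EXPECTED_ERRORS:
        if isinstance(exc, error_type):
            return status, str(exc)
    # Anything else is a genuine bug: keep it a 500 and hide internal details.
    return 500, "Internal server error"
```

For example, the DOCX branch could call `require_module("docx", "python-docx", "DOCX")` before parsing, so a missing package surfaces as a readable message instead of an unhandled ImportError.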

Files:

backend/app/routers/document.py
backend/app/services/document_service.py

Testing Plan

Backend unit/integration tests

Schema migration test for new documents columns
Renderer tests:
- markdown headings preserved
- section paths retained in metadata
- xlsx/csv table blocks rendered predictably
- pdf page markers preserved from MinerU mapping
Upload tests:
- successful DOCX/XLSX/CSV/MD/TXT upload stores normalized_content
- PDF upload stores normalized_content
- missing dependency returns clear error instead of generic 500 where applicable
Rebuild/reindex tests:
- normalized content regenerated
- chunks rebuilt with hierarchy metadata
Retrieval tests:
- related chunk lookup still works with enriched metadata

Frontend tests

Only if the UI surfaces normalized preview directly in this phase:

knowledge view preview prefers normalized content from API
no regression in upload and refresh persistence behavior

Suggested Execution Order

Add schema fields + migration guard
Add structured markdown renderer for current parsers
Store normalized content on upload
Update content preview to read normalized content first
Enrich chunk metadata with lightweight hierarchy keys
Integrate MinerU for PDF
Add rebuild/backfill path
Expand tests

Risks and Mitigations

Risk: MinerU integration complexity

Mitigation:

isolate MinerU to PDF branch only
keep internal ParsedDocument contract stable

Risk: markdown rendering loses structure

Mitigation:

preserve critical structure in metadata
use explicit block markers for page/sheet/table boundaries

Risk: broad retrieval regressions

Mitigation:

keep chunking source node-based initially
change one layer at a time

Risk: old documents lack normalized content

Mitigation:

lazy backfill during preview/reindex

Deliverable Recommendation

Implement in small PR-sized slices:

schema + normalized renderer + preview fallback
hierarchy metadata enrichment
MinerU PDF integration
rebuild/backfill tooling
