JARVIS/docs/superpowers/plans/2026-03-20-knowledge-ingestion-normalization-plan.md
DESKTOP-72TV0V4\caoxiaozhu 6f594631e9 Refine knowledge brain workflow
Align the brain prompts, graph view, and startup defaults with the
latest phase 1 flow so local runs and navigation stay consistent.
2026-03-22 22:42:47 +08:00

Knowledge Ingestion Normalization Plan

Goal

Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.

Scope

  • Backend document parsing and normalization flow
  • Document persistence model updates
  • Incremental retrieval/indexing integration
  • Backfill/reindex strategy for existing documents
  • Test strategy for parser, router, and migration behavior

Non-Goals

  • Full parent-child chunk graph tables in this phase
  • Rewriting all chunking logic to markdown-first immediately
  • Replacing all non-PDF parsers with a new framework
  • Solving every OCR/image-understanding case in the first pass

Architecture Decisions

  • PDF parser: MinerU
  • Other parsers: keep current implementations for DOCX/XLSX/CSV/MD/TXT
  • Canonical intermediate representation: ParsedDocument + structured_markdown
  • Canonical persisted content: add normalized_content to documents
  • Hierarchy model: metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
  • Migration strategy: additive schema change + on-demand rebuild/reindex

Target Flow

  1. Upload file
  2. Parse by type
    • PDF -> MinerU -> normalize to ParsedDocument
    • Other formats -> current parser -> ParsedDocument
  3. Render ParsedDocument into structured_markdown
  4. Persist document record including normalized_content
  5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
  6. Index into vector store
  7. Serve preview from normalized_content
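
The routing in step 2 can be sketched as a suffix lookup. This is a minimal illustration, not the project's code: select_parser and the "mineru"/"legacy" labels are hypothetical names, and only the PDF-versus-everything-else rule comes from the plan.

```python
from pathlib import Path

# Routing rule from the plan: PDF goes to MinerU, the other listed
# formats keep their current parsers. The labels are hypothetical.
MINERU_SUFFIXES = {".pdf"}
LEGACY_SUFFIXES = {".docx", ".xlsx", ".csv", ".md", ".txt"}


def select_parser(filename: str) -> str:
    """Return which parser family should handle the uploaded file."""
    suffix = Path(filename).suffix.lower()
    if suffix in MINERU_SUFFIXES:
        return "mineru"
    if suffix in LEGACY_SUFFIXES:
        return "legacy"
    # Unsupported types fail early so the router can answer with a 4xx.
    raise ValueError(f"Unsupported file type: {suffix or filename}")
```

Both branches are expected to return the same ParsedDocument shape, so steps 3 to 7 do not need to know which parser ran.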

Data Model Changes

documents table

Add fields:

  • normalized_content TEXT NULL
  • normalized_format VARCHAR(50) NULL (for example, structured_markdown)
  • optional later: normalization_version VARCHAR(50) NULL

document_chunks metadata

Enrich chunk metadata with lightweight hierarchy keys:

  • chunk_level
  • parent_key
  • block_key
  • existing structural metadata remains (section_path, section_title, page_number, sheet_name, row_start, row_end, content_type)

Rationale:

  • Supports grouped retrieval and contextual reconstruction
  • Avoids introducing a relational chunk tree prematurely

Backend Implementation Steps

Phase 1: Schema and persistence

Files:

  • backend/app/models/document.py
  • backend/app/database.py
  • backend/app/schemas/document.py
  • tests under backend/tests/backend/app

Changes:

  • Add normalized_content and normalized_format to Document
  • Extend ensure_document_columns() to add the new columns to existing databases when they are missing
  • Expose normalized_content only where needed for preview/read APIs (avoid broad API expansion if not required yet)
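
An additive column guard might look like the sketch below. It assumes a SQLite database accessed through the standard-library sqlite3 module; the real ensure_document_columns() in backend/app/database.py may use a different driver or ORM, and the signature shown here is hypothetical.

```python
import sqlite3

# Columns from the "Data Model Changes" section. SQLite columns are
# nullable unless declared NOT NULL, so no explicit NULL is needed.
NEW_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns(conn: sqlite3.Connection) -> list:
    """Add any missing columns to documents; return the names added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, ddl_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {ddl_type}")
            added.append(name)
    conn.commit()
    return added
```

The guard is idempotent: a second call finds both columns and changes nothing, which keeps it safe to run on every startup.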

Phase 2: Introduce structured markdown renderer

Files:

  • backend/app/services/document_service.py
  • possibly a new helper module if the renderer gets too large, but prefer keeping it local initially

Changes:

  • Add _render_structured_markdown(parsed: ParsedDocument) -> str
  • Keep current per-format parsing functions
  • After parsing, render once and store into document.normalized_content
  • Set normalized_format='structured_markdown' on the same record

Rendering guidance:

  • headings -> markdown headings
  • paragraphs/text -> plain markdown paragraphs
  • CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
  • PDF page boundaries -> explicit page markers
  • preserve contextual markers in metadata even if markdown cannot express everything perfectly

Phase 3: MinerU integration for PDF

Files:

  • backend/app/services/document_service.py
  • backend/pyproject.toml / lockfile if dependencies are added
  • config if MinerU requires configurable paths/options

Changes:

  • Replace PDF branch with MinerU-backed parsing
  • Map MinerU output into internal ParsedNode/ParsedDocument
  • Preserve page and block order
  • Represent image blocks as markdown placeholders plus metadata

Image policy:

  • First pass: extract image block references, page number, nearby text, and optional captions
  • Do not perform full image understanding for every image in phase 1
  • Design metadata so high-value image understanding can be added later
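
One way to represent an image block under this policy is a markdown placeholder paired with a metadata record. The function name, the reference format, and the metadata keys other than page_number and content_type are assumptions made for this sketch; it does not reflect MinerU's actual output format.

```python
from typing import Dict, Optional, Tuple


def image_block_placeholder(
    page_number: int,
    index: int,
    caption: Optional[str] = None,
    nearby_text: str = "",
) -> Tuple[str, Dict[str, object]]:
    """Return (markdown placeholder, metadata) for one extracted image block."""
    # Hypothetical reference scheme: stable per page and per position on the page.
    ref = f"page-{page_number}-image-{index}"
    markdown = f"![{caption or 'image'}]({ref})"
    metadata = {
        "content_type": "image",
        "image_ref": ref,
        "page_number": page_number,
        "caption": caption,
        # Keep a short excerpt so retrieval has some context for the image.
        "nearby_text": nearby_text[:200],
        # Left False in phase 1; a later pass can fill in image understanding.
        "image_understood": False,
    }
    return markdown, metadata
```

Keeping the reference in both the markdown and the metadata lets a later image-understanding pass find the block again without re-parsing the PDF.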

Phase 4: Chunk metadata enrichment

Files:

  • backend/app/services/document_service.py
  • backend/app/services/knowledge_service.py
  • tests

Changes:

  • Extend _build_chunks() to include lightweight hierarchy metadata:
    • section headings become natural parent keys
    • row batches / sheet blocks get stable block keys
    • PDF page/section blocks preserve ordered grouping
  • Keep current retrieval behavior; _get_related_chunks() may use parent_key and block_key to group related chunks where that improves results
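
A possible way to derive the hierarchy keys is to hash the document id together with the section path, so keys stay stable across rebuilds as long as the structure does not change. The helper names and the hashing scheme are assumptions for this sketch; the plan itself only names chunk_level, parent_key, and block_key.

```python
import hashlib
from typing import Dict, List, Optional


def _stable_key(*parts: object) -> str:
    """Derive a short deterministic key from the given parts."""
    raw = "\x1f".join(str(part) for part in parts)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]


def hierarchy_metadata(
    document_id: str, section_path: List[str], ordinal: int
) -> Dict[str, Optional[object]]:
    """Build the lightweight hierarchy keys for one chunk.

    section_path is the list of enclosing headings; ordinal is the chunk's
    position inside that section (for example a row batch index).
    """
    parent_key = (
        _stable_key("section", document_id, *section_path) if section_path else None
    )
    return {
        "chunk_level": len(section_path),
        "parent_key": parent_key,
        "block_key": _stable_key("block", document_id, *section_path, ordinal),
    }
```

Chunks under the same heading share a parent_key, which is enough for grouped retrieval without a relational chunk tree.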

Phase 5: Preview and rebuild behavior

Files:

  • backend/app/routers/document.py
  • backend/app/services/document_service.py

Changes:

  • get_document_content() should prefer normalized_content
  • Fall back to legacy file reading only when normalized content is absent
  • Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
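
The preview preference can be sketched as below. resolve_preview_content is a hypothetical helper, and the legacy reader is passed in as a callable so that the file is only touched when the fallback is actually needed.

```python
from typing import Callable, Optional, Tuple


def resolve_preview_content(
    normalized_content: Optional[str],
    read_legacy: Callable[[], str],
) -> Tuple[str, str]:
    """Return (content, source), preferring stored normalized content."""
    if normalized_content:
        return normalized_content, "normalized"
    # NULL or empty normalized_content: the document predates the new
    # pipeline, so read the original file the way the current code does.
    return read_legacy(), "legacy"
```

Returning the source alongside the content gives the caller a natural place to trigger lazy backfill whenever "legacy" comes back.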

Phase 6: Backfill strategy

Approach:

  • Add a rebuild endpoint or reuse existing reindex flow to backfill normalized_content
  • Existing documents can be migrated lazily:
    • when opened
    • when reindexed
    • or via an admin/batch rebuild command later

This avoids a risky one-shot migration.

Error Handling Changes

Current issue:

  • Upload route can leak parser/dependency problems as generic 500s.

Changes:

  • Convert expected parser/business errors to explicit 4xx responses where appropriate
  • For missing optional parser dependencies, return clear messages such as:
    • DOCX parsing dependency missing: python-docx
    • PDF parsing dependency missing/configuration invalid
  • Keep true unexpected exceptions as 500s

Files:

  • backend/app/routers/document.py
  • backend/app/services/document_service.py
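
The error mapping described above could be sketched like this. ParserDependencyError, require_module, and error_response are hypothetical names, and the specific status codes (422, 400) are illustrative choices rather than decisions made in this plan. The only library fact relied on is that the python-docx package is imported as docx.

```python
import importlib.util
from typing import Tuple


class ParserDependencyError(RuntimeError):
    """An optional parser package is not installed or not configured."""


def require_module(module_name: str, package_name: str, label: str) -> None:
    """Raise a clear error when a top-level optional module is unavailable."""
    if importlib.util.find_spec(module_name) is None:
        raise ParserDependencyError(
            f"{label} parsing dependency missing: {package_name}"
        )


def error_response(exc: Exception) -> Tuple[int, str]:
    """Map an exception from the upload path to (status code, message)."""
    if isinstance(exc, ParserDependencyError):
        return 422, str(exc)
    if isinstance(exc, ValueError):
        # Expected business errors such as an unsupported file type.
        return 400, str(exc)
    # Anything else is unexpected: keep it a 500 and hide the details.
    return 500, "Internal server error"
```

A DOCX branch would call require_module("docx", "python-docx", "DOCX") before parsing, and the router would translate whatever is raised through error_response.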

Testing Plan

Backend unit/integration tests

  1. Schema migration test for new documents columns
  2. Renderer tests:
    • markdown headings preserved
    • section paths retained in metadata
    • xlsx/csv table blocks rendered predictably
    • pdf page markers preserved from MinerU mapping
  3. Upload tests:
    • successful DOCX/XLSX/CSV/MD/TXT upload stores normalized_content
    • PDF upload stores normalized_content
    • missing dependency returns clear error instead of generic 500 where applicable
  4. Rebuild/reindex tests:
    • normalized content regenerated
    • chunks rebuilt with hierarchy metadata
  5. Retrieval tests:
    • related chunk lookup still works with enriched metadata

Frontend tests

Only if the UI surfaces normalized preview directly in this phase:

  • knowledge view preview prefers normalized content from API
  • no regression in upload and refresh persistence behavior

Suggested Execution Order

  1. Add schema fields + migration guard
  2. Add structured markdown renderer for current parsers
  3. Store normalized content on upload
  4. Update content preview to read normalized content first
  5. Enrich chunk metadata with lightweight hierarchy keys
  6. Integrate MinerU for PDF
  7. Add rebuild/backfill path
  8. Expand tests

Risks and Mitigations

Risk: MinerU integration complexity

Mitigation:

  • isolate MinerU to PDF branch only
  • keep internal ParsedDocument contract stable

Risk: markdown rendering loses structure

Mitigation:

  • preserve critical structure in metadata
  • use explicit block markers for page/sheet/table boundaries

Risk: broad retrieval regressions

Mitigation:

  • keep chunking source node-based initially
  • change one layer at a time

Risk: old documents lack normalized content

Mitigation:

  • lazy backfill during preview/reindex

Deliverable Recommendation

Implement in small PR-sized slices:

  1. schema + normalized renderer + preview fallback
  2. hierarchy metadata enrichment
  3. MinerU PDF integration
  4. rebuild/backfill tooling