Refine knowledge brain workflow

Align the brain prompts, graph view, and startup defaults with the latest phase 1 flow so local runs and navigation stay consistent.
2026-03-22 22:42:47 +08:00
parent 67ea3d2682
commit 6f594631e9
23 changed files with 1508 additions and 526 deletions
--- a/docs/superpowers/plans/2026-03-20-knowledge-ingestion-normalization-plan.md
+++ b/docs/superpowers/plans/2026-03-20-knowledge-ingestion-normalization-plan.md
@@ -0,0 +1,210 @@
+# Knowledge Ingestion Normalization Plan
+
+## Goal
+Introduce a unified structured-markdown ingestion pipeline for the knowledge center: MinerU for PDF, existing parsers for DOCX/XLSX/CSV/MD/TXT, persisted normalized content, and lightweight hierarchical chunk semantics.
+
+## Scope
+- Backend document parsing and normalization flow
+- Document persistence model updates
+- Incremental retrieval/indexing integration
+- Backfill/reindex strategy for existing documents
+- Test strategy for parser, router, and migration behavior
+
+## Non-Goals
+- Full parent-child chunk graph tables in this phase
+- Rewriting all chunking logic to markdown-first immediately
+- Replacing all non-PDF parsers with a new framework
+- Solving every OCR/image-understanding case in the first pass
+
+## Architecture Decisions
+- **PDF parser:** MinerU
+- **Other parsers:** keep current implementations for DOCX/XLSX/CSV/MD/TXT
+- **Canonical intermediate representation:** `ParsedDocument + structured_markdown`
+- **Canonical persisted content:** add `normalized_content` to `documents`
+- **Hierarchy model:** metadata-based lightweight semantics, not hard foreign-key parent-child chunk tables
+- **Migration strategy:** additive schema change + on-demand rebuild/reindex
+
+## Target Flow
+1. Upload file
+2. Parse by type
+   - PDF -> MinerU -> normalize to ParsedDocument
+   - Other formats -> current parser -> ParsedDocument
+3. Render `ParsedDocument` into `structured_markdown`
+4. Persist document record including `normalized_content`
+5. Build chunks (initially still from nodes, enriched with lightweight hierarchy metadata)
+6. Index into vector store
+7. Serve preview from `normalized_content`
+
+## Data Model Changes
+### documents table
+Add fields:
+- `normalized_content TEXT NULL`
+- `normalized_format VARCHAR(50) NULL` (value like `structured_markdown`)
+- optional later: `normalization_version VARCHAR(50) NULL`
+
+### document_chunks metadata
+Enrich chunk metadata with lightweight hierarchy keys:
+- `chunk_level`
+- `parent_key`
+- `block_key`
+- existing structural metadata remains (`section_path`, `section_title`, `page_number`, `sheet_name`, `row_start`, `row_end`, `content_type`)
+
+Rationale:
+- Supports grouped retrieval and contextual reconstruction
+- Avoids introducing a relational chunk tree prematurely
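As a rough illustration of the metadata shape described above, the sketch below builds one chunk's metadata dict. The helper name `make_chunk_metadata` and the key formats (a `/`-joined section path as `parent_key`, `parent#index` as `block_key`) are assumptions made for this example, not something the plan specifies.

```python
from typing import Any


def make_chunk_metadata(section_path: list[str], chunk_index: int, **structural: Any) -> dict[str, Any]:
    """Build chunk metadata with lightweight hierarchy keys (illustrative only)."""
    # Assumption: the parent key is simply the joined section path.
    parent_key = "/".join(section_path) or None
    return {
        "chunk_level": len(section_path),
        "parent_key": parent_key,
        # Assumption: a block key only needs to be stable within one document.
        "block_key": f"{parent_key or 'root'}#{chunk_index}",
        "section_path": section_path,
        "section_title": section_path[-1] if section_path else None,
        # Existing structural metadata (page_number, sheet_name, ...) passes through unchanged.
        **structural,
    }


meta = make_chunk_metadata(["Overview", "Scope"], 0, page_number=3, content_type="paragraph")
```

Because the hierarchy lives in plain metadata, no foreign keys or extra tables are needed to group chunks by `parent_key` later.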
+
+## Backend Implementation Steps
+### Phase 1: Schema and persistence
+Files:
+- `backend/app/models/document.py`
+- `backend/app/database.py`
+- `backend/app/schemas/document.py`
+- tests under `backend/tests/backend/app`
+
+Changes:
+- Add `normalized_content` and `normalized_format` to `Document`
+- Extend `ensure_document_columns()` to add the new columns to existing databases
+- Expose `normalized_content` only where needed for preview/read APIs (avoid broad API expansion if not required yet)
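An additive column guard of the kind described in this phase might look like the sketch below. It assumes a SQLite database and uses the standard-library `sqlite3` module; the function name `ensure_document_columns_sketch` is hypothetical, and the project's real `ensure_document_columns()` may work differently.

```python
import sqlite3

# Columns from the plan; both are nullable, which is the default in SQLite.
NEW_COLUMNS = {
    "normalized_content": "TEXT",
    "normalized_format": "VARCHAR(50)",
}


def ensure_document_columns_sketch(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `documents` and return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
    added = []
    for name, ddl in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE documents ADD COLUMN {name} {ddl}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT)")
first = ensure_document_columns_sketch(conn)
second = ensure_document_columns_sketch(conn)  # second run is a no-op
```

Running the guard twice is safe, which is what makes it usable at startup against both fresh and existing databases.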
+
+### Phase 2: Introduce structured markdown renderer
+Files:
+- `backend/app/services/document_service.py`
+- possibly a new helper module if the renderer gets too large, but prefer keeping it local initially
+
+Changes:
+- Add `_render_structured_markdown(parsed: ParsedDocument) -> str`
+- Keep current per-format parsing functions
+- After parsing, render once and store into `document.normalized_content`
+- Add `normalized_format='structured_markdown'`
+
+Rendering guidance:
+- headings -> markdown headings
+- paragraphs/text -> plain markdown paragraphs
+- CSV/XLSX tables -> markdown table blocks or fenced structured table blocks when tables are too large/wide
+- PDF page boundaries -> explicit page markers
+- preserve contextual markers in metadata even if markdown cannot express everything perfectly
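The rendering guidance above can be sketched roughly as follows. The `ParsedNode` and `ParsedDocument` shapes here are hypothetical stand-ins (the real classes are not shown in this plan), and the HTML-comment page marker is one possible convention chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsedNode:  # hypothetical shape; the real class may differ
    kind: str  # "heading" | "paragraph" | "table" | "page_break"
    text: str = ""
    level: int = 1
    rows: list[list[str]] = field(default_factory=list)
    page_number: Optional[int] = None


@dataclass
class ParsedDocument:  # hypothetical shape; the real class may differ
    nodes: list[ParsedNode]


def render_structured_markdown(parsed: ParsedDocument) -> str:
    """Render nodes in order, separating blocks with a blank line."""
    blocks = []
    for node in parsed.nodes:
        if node.kind == "heading":
            blocks.append("#" * node.level + " " + node.text)
        elif node.kind == "page_break":
            # Assumption: page boundaries are marked with an HTML comment.
            blocks.append(f"<!-- page: {node.page_number} -->")
        elif node.kind == "table" and node.rows:
            header, *body = node.rows
            lines = [
                "| " + " | ".join(header) + " |",
                "| " + " | ".join("---" for _ in header) + " |",
            ]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            blocks.append("\n".join(lines))
        else:
            blocks.append(node.text)
    return "\n\n".join(blocks)


doc = ParsedDocument([
    ParsedNode("page_break", page_number=1),
    ParsedNode("heading", "Intro", level=2),
    ParsedNode("paragraph", "Hello."),
    ParsedNode("table", rows=[["a", "b"], ["1", "2"]]),
])
md = render_structured_markdown(doc)
```

The sketch leaves out the large-table fallback to fenced blocks; a width or row-count threshold would decide which form to emit.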
+
+### Phase 3: MinerU integration for PDF
+Files:
+- `backend/app/services/document_service.py`
+- `backend/pyproject.toml` / lockfile if dependencies are added
+- config if MinerU requires configurable paths/options
+
+Changes:
+- Replace PDF branch with MinerU-backed parsing
+- Map MinerU output into internal `ParsedNode`/`ParsedDocument`
+- Preserve page and block order
+- Represent image blocks as markdown placeholders plus metadata
+
+Image policy:
+- First pass: extract image block references, page number, nearby text, and optional captions
+- Do not perform full image understanding for every image in phase 1
+- Design metadata so high-value image understanding can be added later
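One way to represent an image block under this policy is a markdown placeholder paired with a metadata dict. The sketch below is illustrative only: the function name `image_block`, the metadata keys, and the `understood` flag are assumptions, and nothing here reflects MinerU's actual output format.

```python
from typing import Any, Optional


def image_block(
    image_ref: str,
    page_number: int,
    caption: Optional[str] = None,
    nearby_text: str = "",
) -> tuple[str, dict[str, Any]]:
    """Return a markdown placeholder and metadata for one extracted image."""
    alt = caption or f"image on page {page_number}"
    markdown = f"![{alt}]({image_ref})"
    metadata = {
        "content_type": "image",
        "image_ref": image_ref,
        "page_number": page_number,
        "caption": caption,
        "nearby_text": nearby_text,
        # Assumption: a flag that a later image-understanding pass can flip.
        "understood": False,
    }
    return markdown, metadata


placeholder, image_meta = image_block("images/p2-0.png", 2, caption="Figure 1")
```

Keeping the reference and context in metadata means a later pass can enrich selected images without re-parsing the PDF.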
+
+### Phase 4: Chunk metadata enrichment
+Files:
+- `backend/app/services/document_service.py`
+- `backend/app/services/knowledge_service.py`
+- tests
+
+Changes:
+- Extend `_build_chunks()` to include lightweight hierarchy metadata:
+  - section headings become natural parent keys
+  - row batches / sheet blocks get stable block keys
+  - PDF page/section blocks preserve ordered grouping
+- Keep current retrieval behavior, but let `_get_related_chunks()` benefit from richer metadata if helpful
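To show how a related-chunk lookup could use the richer metadata, here is a minimal sketch that finds siblings sharing a `parent_key`. The chunk shape (a dict with `text` and `metadata`) and the function name `related_by_parent` are assumptions; the actual `_get_related_chunks()` is not shown in this plan.

```python
def related_by_parent(chunks: list[dict], target: dict) -> list[dict]:
    """Return the other chunks that share the target's parent_key, in original order."""
    parent = target["metadata"].get("parent_key")
    if parent is None:
        # Without a parent key there is no group to expand into.
        return []
    return [
        chunk
        for chunk in chunks
        if chunk is not target and chunk["metadata"].get("parent_key") == parent
    ]


chunks = [
    {"text": "A1", "metadata": {"parent_key": "Intro"}},
    {"text": "A2", "metadata": {"parent_key": "Intro"}},
    {"text": "B1", "metadata": {"parent_key": "Scope"}},
    {"text": "orphan", "metadata": {}},
]
siblings = related_by_parent(chunks, chunks[0])
```

Chunks without hierarchy metadata simply return no siblings, so existing retrieval behavior is unaffected for documents that have not been rebuilt.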
+
+### Phase 5: Preview and rebuild behavior
+Files:
+- `backend/app/routers/document.py`
+- `backend/app/services/document_service.py`
+
+Changes:
+- `get_document_content()` should prefer `normalized_content`
+- Fall back to legacy file reading only when normalized content is absent
+- Rebuild/reindex paths should regenerate normalized content before chunk rebuild/indexing
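The preview preference described above could be sketched as follows. `DocumentRecord` and `get_preview` are hypothetical names, the legacy reader is injected as a callable so the example stays self-contained, and "absent" is assumed to mean `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DocumentRecord:  # hypothetical stand-in for the Document model
    file_path: str
    normalized_content: Optional[str] = None


def get_preview(doc: DocumentRecord, read_legacy: Callable[[str], str]) -> str:
    """Prefer stored normalized content; read the original file only as a fallback."""
    if doc.normalized_content is not None:
        return doc.normalized_content
    return read_legacy(doc.file_path)


def fake_legacy_reader(path: str) -> str:
    # Stand-in for the existing file-reading path.
    return f"legacy:{path}"


new_doc = DocumentRecord("a.pdf", normalized_content="# Title")
old_doc = DocumentRecord("b.txt")
new_preview = get_preview(new_doc, fake_legacy_reader)
old_preview = get_preview(old_doc, fake_legacy_reader)
```

Whether an empty string should also trigger the fallback is a design choice the plan leaves open.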
+
+### Phase 6: Backfill strategy
+Approach:
+- Add a rebuild endpoint or reuse existing reindex flow to backfill `normalized_content`
+- Existing documents can be migrated lazily:
+  - when opened
+  - when reindexed
+  - or via an admin/batch rebuild command later
+
+This avoids a risky one-shot migration.
+
+## Error Handling Changes
+Current issue:
+- Upload route can leak parser/dependency problems as generic 500s.
+
+Changes:
+- Convert expected parser/business errors to explicit 4xx responses where appropriate
+- For missing optional parser dependencies, return clear messages such as:
+  - `DOCX parsing dependency missing: python-docx`
+  - `PDF parsing dependency missing/configuration invalid`
+- Keep true unexpected exceptions as 500s
+
+Files:
+- `backend/app/routers/document.py`
+- `backend/app/services/document_service.py`
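The error translation described in this section might be structured like the sketch below. The exception classes and the `to_http_error` helper are hypothetical, and the specific status codes are placeholders: the plan only asks for explicit 4xx responses where appropriate and for unexpected failures to remain 500s.

```python
class UnsupportedFormatError(Exception):
    """Hypothetical: the uploaded file type has no parser."""


class ParserDependencyError(Exception):
    """Hypothetical: an optional parser dependency is missing or misconfigured."""


def to_http_error(exc: Exception) -> tuple[int, str]:
    """Map an exception to (status_code, message); status codes are placeholders."""
    if isinstance(exc, UnsupportedFormatError):
        return 415, str(exc)
    if isinstance(exc, ParserDependencyError):
        return 422, str(exc)
    # Anything unexpected stays a 500 and does not leak internal details.
    return 500, "Internal server error"


dep_error = to_http_error(ParserDependencyError("DOCX parsing dependency missing: python-docx"))
unknown_error = to_http_error(RuntimeError("boom"))
```

In the router, the returned pair would be turned into whatever HTTP exception type the web framework provides.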
+
+## Testing Plan
+### Backend unit/integration tests
+1. Schema migration test for new `documents` columns
+2. Renderer tests:
+   - markdown headings preserved
+   - section paths retained in metadata
+   - xlsx/csv table blocks rendered predictably
+   - pdf page markers preserved from MinerU mapping
+3. Upload tests:
+   - successful DOCX/XLSX/CSV/MD/TXT upload stores `normalized_content`
+   - PDF upload stores `normalized_content`
+   - missing dependency returns clear error instead of generic 500 where applicable
+4. Rebuild/reindex tests:
+   - normalized content regenerated
+   - chunks rebuilt with hierarchy metadata
+5. Retrieval tests:
+   - related chunk lookup still works with enriched metadata
+
+### Frontend tests
+Only if the UI surfaces normalized preview directly in this phase:
+- knowledge view preview prefers normalized content from API
+- no regression in upload and refresh persistence behavior
+
+## Suggested Execution Order
+1. Add schema fields + migration guard
+2. Add structured markdown renderer for current parsers
+3. Store normalized content on upload
+4. Update content preview to read normalized content first
+5. Enrich chunk metadata with lightweight hierarchy keys
+6. Integrate MinerU for PDF
+7. Add rebuild/backfill path
+8. Expand tests
+
+## Risks and Mitigations
+### Risk: MinerU integration complexity
+Mitigation:
+- isolate MinerU to PDF branch only
+- keep internal ParsedDocument contract stable
+
+### Risk: markdown rendering loses structure
+Mitigation:
+- preserve critical structure in metadata
+- use explicit block markers for page/sheet/table boundaries
+
+### Risk: broad retrieval regressions
+Mitigation:
+- keep chunking source node-based initially
+- change one layer at a time
+
+### Risk: old documents lack normalized content
+Mitigation:
+- lazy backfill during preview/reindex
+
+## Deliverable Recommendation
+Implement in small PR-sized slices:
+1. schema + normalized renderer + preview fallback
+2. hierarchy metadata enrichment
+3. MinerU PDF integration
+4. rebuild/backfill tooling