feat(ocr): PDF 文本层可用时跳过 worker 调用并补装 poppler-data
- OcrService 提取 PDF 文本层后若有效字符达到阈值,直接构建文档并写入结果缓存,不再触发 OCR worker,仅无文本层时才解析 python_bin/worker_path 调用 worker - _build_text_layer_document 复用 AggregatedOcrDocument 聚合文本层片段,_has_usable_pdf_text_layer 基于 meaningful_char_count 判定 - docker-compose 与 paddleocr bootstrap 脚本补装 poppler-data,保证 PDF 文本层抽取的中文编码正确 - 新增文本层直取与运行时依赖两项 ocr_service 单测
This commit is contained in:
@@ -32,7 +32,7 @@ services:
|
||||
- >
|
||||
apt-get update &&
|
||||
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends
|
||||
python3 python3-pip python3-venv fontconfig openssh-server &&
|
||||
python3 python3-pip python3-venv fontconfig openssh-server poppler-data &&
|
||||
if ! fc-match 'Noto Sans CJK SC' | grep -qi 'Noto'; then if ! timeout "${CJK_FONT_INSTALL_TIMEOUT_SECONDS:-45}" sh -lc 'DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends fonts-noto-cjk fonts-noto-cjk-extra'; then printf '%s\n' '[WARN] CJK font installation timed out or failed; continuing startup without blocking the app.'; fi; fi &&
|
||||
printf '%s\n'
|
||||
'<?xml version="1.0"?>'
|
||||
|
||||
Reference in New Issue
Block a user