X-Financial/server/scripts/bootstrap_paddleocr_mobile.sh
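caoxiaozhu 88e91a5900 feat(ocr): skip the worker call when a PDF text layer is usable, and install poppler-data
- After OcrService extracts the PDF text layer, if the number of valid characters reaches the threshold, it builds the document directly and writes it to the result cache without invoking the OCR worker; python_bin/worker_path are resolved and the worker is called only when no text layer is available
- _build_text_layer_document reuses AggregatedOcrDocument to aggregate the text-layer fragments; _has_usable_pdf_text_layer makes its decision based on meaningful_char_count
- docker-compose and the paddleocr bootstrap script now also install poppler-data, so that Chinese text extracted from PDF text layers is decoded correctly
- Adds two ocr_service unit tests: direct text-layer extraction and runtime dependencies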
2026-06-21 23:23:59 +08:00


#!/usr/bin/env bash
set -euo pipefail
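# Resolve the server root (the parent of this script's directory).
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
OCR_VENV_DIR="${ROOT_DIR}/.venv-ocr312"

# Each of these can be overridden from the environment.
PYTHON_BIN="${PYTHON_BIN:-python3}"
PADDLEPADDLE_VERSION="${PADDLEPADDLE_VERSION:-3.2.2}"
PADDLEOCR_VERSION="${PADDLEOCR_VERSION:-3.6.0}"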
if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
  echo "${PYTHON_BIN} not found; please install Python 3 first." >&2
  exit 1
fi
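
# System packages (requires root): libgl1 and libglib2.0-0 are runtime
# libraries for OpenCV; poppler-utils provides the PDF tools, and poppler-data
# supplies the encoding data needed to extract CJK text correctly.
apt-get update
apt-get install -y --no-install-recommends libgl1 libglib2.0-0 poppler-utils poppler-data

# Create a dedicated virtualenv and install the pinned OCR packages into it.
"${PYTHON_BIN}" -m venv "${OCR_VENV_DIR}"
"${OCR_VENV_DIR}/bin/pip" install --upgrade pip
"${OCR_VENV_DIR}/bin/pip" install "paddlepaddle==${PADDLEPADDLE_VERSION}" "paddleocr==${PADDLEOCR_VERSION}"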
echo "PaddleOCR mobile runtime ${PADDLEOCR_VERSION} / PaddlePaddle ${PADDLEPADDLE_VERSION} installed to ${OCR_VENV_DIR}"
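
The text-layer check described in the commit message can be approximated from the shell, which is a convenient way to confirm that poppler-utils and poppler-data are working after the bootstrap. This is only a rough sketch and not the OcrService implementation: the function names, the `MIN_MEANINGFUL_CHARS` variable, and the default threshold of 20 are hypothetical, and "meaningful" is assumed here to mean simply "non-whitespace".

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical default; the threshold actually used by OcrService is not shown in this file.
MIN_MEANINGFUL_CHARS="${MIN_MEANINGFUL_CHARS:-20}"

# Count the non-whitespace characters read from stdin.
# Note: wc -m counts bytes rather than characters under the C locale.
count_meaningful_chars() {
  tr -d '[:space:]' | wc -m | tr -d '[:space:]'
}

# Exit 0 when the text on stdin has at least $1 meaningful characters
# (default: MIN_MEANINGFUL_CHARS), non-zero otherwise.
has_usable_text_layer() {
  local threshold="${1:-${MIN_MEANINGFUL_CHARS}}"
  local count
  count="$(count_meaningful_chars)"
  [ "${count}" -ge "${threshold}" ]
}
```

With poppler-utils installed, one possible use is `pdftotext -enc UTF-8 sample.pdf - | has_usable_text_layer`, where `sample.pdf` is a placeholder path. A zero exit status suggests the PDF has enough embedded text for OCR to be skipped.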