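- After OcrService extracts the PDF text layer, if the number of meaningful characters reaches the threshold, it builds the document directly and writes it to the result cache without triggering the OCR worker; python_bin/worker_path are resolved and the worker is invoked only when no text layer is available
- _build_text_layer_document reuses AggregatedOcrDocument to aggregate text-layer fragments; _has_usable_pdf_text_layer decides based on meaningful_char_count
- docker-compose and the paddleocr bootstrap script now also install poppler-data, so that Chinese encodings are decoded correctly during PDF text-layer extraction
- Add two ocr_service unit tests: direct text-layer extraction and runtime dependencies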
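The text-layer check described above can be approximated from the shell with `pdftotext` from poppler-utils. This is only a rough sketch of the idea, not the project's `_has_usable_pdf_text_layer` implementation: the function names, the `MIN_MEANINGFUL_CHARS` variable, its default of 50, and the choice to count every non-whitespace character as "meaningful" are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a PDF has a usable text layer via pdftotext.
# MIN_MEANINGFUL_CHARS and both function names are hypothetical,
# not taken from the project.

MIN_MEANINGFUL_CHARS="${MIN_MEANINGFUL_CHARS:-50}"

# Count non-whitespace characters read from stdin.
# Note: wc -m counts characters in a UTF-8 locale, but bytes under LANG=C.
count_meaningful_chars() {
  local count
  count="$(tr -d '[:space:]' | wc -m)"
  echo "$((count))"  # arithmetic expansion strips any padding wc adds
}

# Return 0 when the extracted text reaches the threshold, 1 otherwise.
has_usable_text_layer() {
  local pdf_path="$1"
  local threshold="${2:-${MIN_MEANINGFUL_CHARS}}"
  local count
  # -enc UTF-8 forces UTF-8 output; "-" writes the text to stdout.
  count="$(pdftotext -enc UTF-8 "${pdf_path}" - 2>/dev/null | count_meaningful_chars)"
  [ "${count}" -ge "${threshold}" ]
}
```

A caller could then branch on `has_usable_text_layer scan.pdf` and fall back to the OCR worker only when it returns non-zero. Without poppler-data installed, CJK text in some PDFs may come out garbled or empty, which would make a check like this under-count.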
37 lines
1.2 KiB
Bash
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
OCR_VENV_DIR="${OCR_VENV_DIR:-${ROOT_DIR}/.venv-ocr312}"
PYTHON_BIN="${PYTHON_BIN:-python3.12}"
PADDLEPADDLE_GPU_VERSION="${PADDLEPADDLE_GPU_VERSION:-3.3.0}"
PADDLEOCR_VERSION="${PADDLEOCR_VERSION:-3.6.0}"
PADDLE_GPU_INDEX_URL="${PADDLE_GPU_INDEX_URL:-https://www.paddlepaddle.org.cn/packages/stable/cu126/}"

if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
  echo "${PYTHON_BIN} not found; please install Python 3.12 first." >&2
  exit 1
fi

apt-get update
apt-get install -y --no-install-recommends libgl1 libglib2.0-0 poppler-utils poppler-data

rm -rf "${OCR_VENV_DIR}"
"${PYTHON_BIN}" -m venv "${OCR_VENV_DIR}"
"${OCR_VENV_DIR}/bin/pip" install --upgrade pip
"${OCR_VENV_DIR}/bin/pip" install \
  "paddlepaddle-gpu==${PADDLEPADDLE_GPU_VERSION}" \
  -i "${PADDLE_GPU_INDEX_URL}"
"${OCR_VENV_DIR}/bin/pip" install "paddleocr==${PADDLEOCR_VERSION}"

"${OCR_VENV_DIR}/bin/python" - <<'PY'
import paddle

print("PaddlePaddle:", paddle.__version__)
print("CUDA compiled:", paddle.is_compiled_with_cuda())
print("CUDA device count:", paddle.device.cuda.device_count())
paddle.utils.run_check()
PY

echo "PaddleOCR GPU runtime ${PADDLEOCR_VERSION} installed to ${OCR_VENV_DIR}"
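Because the script installs poppler-utils and poppler-data as runtime dependencies, a small post-install check may help catch a broken environment early. The sketch below is illustrative only: `missing_commands`, `check_pdf_runtime`, and `POPPLER_DATA_DIR` are hypothetical names, `/usr/share/poppler` is assumed to be where Debian/Ubuntu's poppler-data package places its files, and whether the project actually needs `pdftoppm` is also an assumption.

```shell
#!/usr/bin/env bash
# Sketch: report which PDF runtime dependencies are missing after bootstrap.
# All names here are hypothetical; the data directory is an assumption
# about the Debian/Ubuntu poppler-data package layout.

POPPLER_DATA_DIR="${POPPLER_DATA_DIR:-/usr/share/poppler}"

# Print each command name from the arguments that is not on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    if ! command -v "${cmd}" >/dev/null 2>&1; then
      echo "${cmd}"
    fi
  done
}

# Return 0 when the poppler tools and data directory are present, 1 otherwise.
check_pdf_runtime() {
  local missing
  missing="$(missing_commands pdftotext pdftoppm)"
  if [ -n "${missing}" ]; then
    echo "missing poppler-utils commands: ${missing//$'\n'/ }" >&2
    return 1
  fi
  if [ ! -d "${POPPLER_DATA_DIR}" ]; then
    echo "poppler-data not found at ${POPPLER_DATA_DIR}" >&2
    return 1
  fi
}
```

Calling `check_pdf_runtime` at the end of the bootstrap, or from a container health check, would fail fast instead of leaving the encoding problem to surface later during text-layer extraction.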