Files

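DESKTOP-72TV0V4\caoxiaozhu d24b29afe4 feat: improve the AI-Core document parser

- Add parsers for multiple document formats (PDF, Word, Excel, Markdown, etc.)
- Add a base parser and a chained parser
- Add storage and registration mechanisms
- Add a gRPC service implementation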

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-10 15:01:52 +08:00

parser

feat: improve the AI-Core document parser

2026-03-10 15:01:52 +08:00

proto

feat: enhance the AI-Core document parser

2026-03-09 15:42:35 +08:00

service

feat: improve the AI-Core document parser

2026-03-10 15:01:52 +08:00

.gitignore

refactor: restructure algorithm into the ai-core code parsing service

2026-03-09 10:27:08 +08:00

config.example.yaml

feat: enhance the AI-Core document parser

2026-03-09 15:42:35 +08:00

generate_grpc.py

refactor: restructure algorithm into the ai-core code parsing service

2026-03-09 10:27:08 +08:00

main.py

feat: improve the AI-Core document parser

2026-03-10 15:01:52 +08:00

README.md

refactor: restructure the ai-core code layout

2026-03-09 16:08:44 +08:00

requirements.txt

feat: improve the AI-Core document parser

2026-03-10 15:01:52 +08:00

start.bat

chore: improve the AI-Core startup scripts

2026-03-09 12:50:33 +08:00

start.sh

chore: improve the AI-Core startup scripts

2026-03-09 12:50:33 +08:00

README.md

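AI-Core Document Parsing Service

A Python-based gRPC document parsing service that converts a variety of file formats to Markdown.

Features

Supports many file formats: PDF, DOCX, DOC, XLSX, XLS, CSV, Markdown, images, and more
Multiple parsing engines (builtin, markitdown)
gRPC interface for high-performance communication
Can download a file from a URL and parse it
Configurable parsing engine and parameters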

Project Structure

ai-core/
├── main.py                      # Service entry point
├── requirements.txt             # Python dependencies
├── proto/                       # gRPC protocol definitions
│   └── document_parser.proto    # Protocol Buffers definition
├── parser/                      # Document parser modules
│   ├── base_parser.py           # Base parser interface
│   ├── parser.py                # Parser facade
│   ├── registry.py              # Parser registry
│   ├── docx_parser.py           # DOCX parser
│   ├── pdf_parser.py            # PDF parser
│   └── ...
└── service/                     # gRPC service implementation
    └── grpc_server.py           # gRPC server

Installation

1. Install dependencies

pip install -r requirements.txt

2. Generate gRPC code

python -m grpc_tools.protoc \
    --proto_path=proto \
    --python_out=proto \
    --grpc_python_out=proto \
    proto/document_parser.proto
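
The same step can be driven from Python, which is convenient on Windows where the backslash line continuations above do not work. The following is only a minimal sketch: `build_protoc_command` and `generate` are hypothetical helper names, and this is not a copy of the repository's own generate_grpc.py, whose contents are not shown here.

```python
import subprocess
import sys
from pathlib import Path


def build_protoc_command(proto_dir: str = "proto",
                         proto_file: str = "document_parser.proto") -> list[str]:
    """Return an argv list equivalent to the shell command above (hypothetical helper)."""
    proto_path = Path(proto_dir)
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"--proto_path={proto_path.as_posix()}",
        f"--python_out={proto_path.as_posix()}",
        f"--grpc_python_out={proto_path.as_posix()}",
        (proto_path / proto_file).as_posix(),
    ]


def generate(proto_dir: str = "proto") -> None:
    """Run protoc; requires grpcio-tools to be installed in the current interpreter."""
    # check=True raises CalledProcessError if protoc exits with a non-zero status.
    subprocess.run(build_protoc_command(proto_dir), check=True)
```

Calling `generate()` from the project root should produce the same generated modules as the shell command, assuming the proto/ layout shown under Project Structure.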

Usage

Start the service

python main.py --port 50051 --max-workers 10

Parameters:

--port: gRPC service port (default 50051)
--max-workers: maximum number of worker threads (default 10)
--log-level: log level (DEBUG/INFO/WARNING/ERROR, default INFO)
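
For illustration, the three flags above could be declared with `argparse` roughly as follows. This is a sketch under the assumption that main.py uses the standard library for its command line; `build_arg_parser` and `configure_logging` are hypothetical names, and only the flag names, defaults, and level choices come from the list above.

```python
import argparse
import logging


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="AI-Core document parsing service")
    parser.add_argument("--port", type=int, default=50051,
                        help="gRPC service port")
    parser.add_argument("--max-workers", type=int, default=10,
                        help="maximum number of worker threads")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="log level")
    return parser


def configure_logging(level_name: str) -> int:
    """Map a level name such as "DEBUG" to its numeric value and apply it."""
    level = getattr(logging, level_name)
    logging.basicConfig(level=level)
    return level
```

Note that `argparse` exposes `--max-workers` as the attribute `max_workers`, so `build_arg_parser().parse_args(["--port", "6000"])` yields an object with `port == 6000` and `max_workers == 10`.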

gRPC Interface

ParseDocument

Parses a document into Markdown.

message ParseRequest {
  string file_url = 1;                      // File URL (required)
  string file_name = 2;                     // File name (required)
  string file_type = 3;                     // File type (required, e.g. pdf, docx)
  string parser_engine = 4;                 // Parsing engine (optional, defaults to builtin)
  map<string, string> engine_overrides = 5; // Engine parameter overrides (optional)
}

message ParseResponse {
  bool success = 1;                         // Whether parsing succeeded
  string content = 2;                       // Markdown content
  string message = 3;                       // Status or error message
  int32 content_length = 4;                 // Content length
  string file_type = 5;                     // File type
  string parser_engine = 6;                 // Parsing engine that was used
}
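
The "required" and "defaults to builtin" notes above live only in comments: proto3 has no `required` keyword, and an unset string field arrives as an empty string. A server would therefore need to check these itself. The sketch below shows one way to do that on a plain-Python stand-in for the message. `ParseRequestView` and `normalize_request` are hypothetical names, and lowercasing `file_type` is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ParseRequestView:
    """Plain-Python stand-in for the ParseRequest message (hypothetical)."""
    file_url: str = ""
    file_name: str = ""
    file_type: str = ""
    parser_engine: str = ""
    engine_overrides: dict[str, str] = field(default_factory=dict)


def normalize_request(req: ParseRequestView) -> ParseRequestView:
    """Reject requests with empty required fields and fill in the default engine."""
    missing = [name for name in ("file_url", "file_name", "file_type")
               if not getattr(req, name).strip()]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    return ParseRequestView(
        file_url=req.file_url,
        file_name=req.file_name,
        # Assumption: accept "PDF" or ".pdf" and treat both as "pdf".
        file_type=req.file_type.lower().lstrip("."),
        # An empty string means the client did not choose an engine.
        parser_engine=req.parser_engine or "builtin",
        engine_overrides=dict(req.engine_overrides),
    )
```

In a real servicer the `ValueError` would more likely become a response with `success = false` and the text in `message`, or a gRPC `INVALID_ARGUMENT` status; which of the two this service uses is not stated in the README.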

GetSupportedFormats

Returns the list of supported file formats.

GetEngines

Returns the list of available parsing engines.

Go Client Example

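// Needs the imports context, fmt, log, google.golang.org/grpc,
// google.golang.org/grpc/credentials/insecure, and the generated docparser package.
conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
    log.Fatalf("Failed to connect: %v", err)
}
defer conn.Close()

client := docparser.NewDocumentParserClient(conn)

resp, err := client.ParseDocument(context.Background(), &docparser.ParseRequest{
    FileUrl:      "http://localhost:8082/files/abc123.pdf",
    FileName:     "example.pdf",
    FileType:     "pdf",
    ParserEngine: "builtin",
})
if err != nil {
    log.Fatalf("Failed to parse: %v", err)
}
// The RPC can complete without a transport error while reporting success = false.
if !resp.Success {
    log.Fatalf("Parse failed: %s", resp.Message)
}

fmt.Println("Markdown content:")
fmt.Println(resp.Content)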

Supported File Formats

Format	Extensions	Description
PDF	pdf	PDF documents
Word	docx, doc	Microsoft Word documents
Excel	xlsx, xls	Microsoft Excel spreadsheets
CSV	csv	Comma-separated values files
Markdown	md, markdown	Markdown files
Image	jpg, jpeg, png, gif, bmp, tiff, webp	Common image formats
PowerPoint	pptx, ppt	PowerPoint presentations
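
The table above maps naturally onto a lookup from extension to format, which a client can use to fill in `file_type` before calling ParseDocument. The following is a sketch only: `SUPPORTED_FORMATS`, `EXTENSION_TO_FORMAT`, and `detect_format` are hypothetical names, and at runtime the GetSupportedFormats RPC remains the authoritative source.

```python
from pathlib import PurePosixPath

# Format name -> extensions, copied from the table above.
SUPPORTED_FORMATS = {
    "PDF": ("pdf",),
    "Word": ("docx", "doc"),
    "Excel": ("xlsx", "xls"),
    "CSV": ("csv",),
    "Markdown": ("md", "markdown"),
    "Image": ("jpg", "jpeg", "png", "gif", "bmp", "tiff", "webp"),
    "PowerPoint": ("pptx", "ppt"),
}

# Inverted index: extension -> format name.
EXTENSION_TO_FORMAT = {
    ext: name for name, exts in SUPPORTED_FORMATS.items() for ext in exts
}


def detect_format(file_name: str):
    """Return the format name for a file name, or None if the extension is unknown."""
    ext = PurePosixPath(file_name).suffix.lower().lstrip(".")
    return EXTENSION_TO_FORMAT.get(ext)
```

The lookup is case-insensitive, so `detect_format("Report.DOCX")` resolves to the Word entry, while a name with no extension or an unlisted one returns `None`.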

Development

Adding a new parser

Subclass BaseParser
Implement the parse_into_text method
Register it in registry.py
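
The three steps above might look like the sketch below. The `BaseParser` here is a stand-in defined only for this example: the real class in parser/base_parser.py may take different arguments (for instance a file path instead of bytes). `PlainTextParser` and the `PARSERS` dictionary are likewise hypothetical, since the README does not show the registration API of registry.py.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Stand-in for parser/base_parser.py; the real interface may differ."""

    @abstractmethod
    def parse_into_text(self, content: bytes) -> str:
        """Convert raw file bytes to Markdown text."""


class PlainTextParser(BaseParser):
    """Hypothetical parser for .txt files."""

    def parse_into_text(self, content: bytes) -> str:
        # Decode leniently so a stray invalid byte does not abort the whole parse.
        text = content.decode("utf-8", errors="replace")
        # Normalise Windows line endings and end with exactly one newline.
        return text.replace("\r\n", "\n").strip() + "\n"


# Hypothetical registration step: map a file type to its parser class.
PARSERS: dict[str, type[BaseParser]] = {}
PARSERS["txt"] = PlainTextParser
```

Because `parse_into_text` is abstract, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake at registration time rather than on the first request.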

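Adding a new parsing engine

Register it in registry.py with the register() method
Provide a check_available function that checks its dependencies
Add the corresponding parser classes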
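
As a rough illustration of the `register()` and `check_available` pairing, the sketch below keeps a registry of engines and reports only those whose dependencies can be imported. `Engine`, `EngineRegistry`, and `module_installed` are hypothetical names and signatures, not the actual contents of registry.py. The two engine names are the ones listed under Features.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable


@dataclass
class Engine:
    name: str
    check_available: Callable[[], bool]


class EngineRegistry:
    """Hypothetical registry; the real register() signature may differ."""

    def __init__(self) -> None:
        self._engines: dict[str, Engine] = {}

    def register(self, name: str, check_available: Callable[[], bool]) -> None:
        self._engines[name] = Engine(name, check_available)

    def available(self) -> list[str]:
        # Availability is evaluated lazily, each time the list is requested.
        return [n for n, e in self._engines.items() if e.check_available()]


def module_installed(module_name: str) -> Callable[[], bool]:
    """Build a check_available function that tests for a top-level module."""
    def check() -> bool:
        return importlib.util.find_spec(module_name) is not None
    return check


registry = EngineRegistry()
registry.register("builtin", lambda: True)
registry.register("markitdown", module_installed("markitdown"))
```

Checking with `find_spec` avoids importing a heavy optional dependency just to learn whether it exists, so an engine such as markitdown only appears in `registry.available()` when its package is installed.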

License

MIT License
