1. Modified the retry count
2. Added README and requirements.txt
.claude/settings.local.json (new file, 9 lines)
@@ -0,0 +1,9 @@
{
  "permissions": {
    "allow": [
      "Bash(python -m py_compile:*)",
      "Bash(python:*)",
      "Bash(ls:*)"
    ]
  }
}
README.md (535 lines changed)
@@ -1,215 +1,430 @@
# LLM Fine-Tuning Validation System
# LLM Model Evaluation System

[Python 3.7+](https://www.python.org/downloads/)
[MIT License](LICENSE)

Adds two new metrics to model evaluation, an **LLM score** and an **LLM evaluation rationale**, which combine with the traditional metrics (BLEU, ROUGE-L, etc.) to give a comprehensive evaluation scheme.

A comprehensive evaluation system for Large Language Models (LLMs) with concurrent processing, batch progress tracking, and automatic retry mechanisms.

## Core Features

### ✨ Dual Evaluation Modes

1. **Simulated evaluation mode** (default)
   - ✅ Works immediately, no configuration required
   - ✅ Intelligent scoring based on traditional metrics
   - ✅ Fast to run, no extra cost

2. **Real LLM evaluation mode**
   - Uses large language models such as OpenAI GPT
   - Provides semantic-level intelligent scoring
   - Gives detailed evaluation rationale and improvement suggestions

### 📊 Evaluation Dimensions

1. **Answer accuracy** - consistency with the reference answer
2. **Logical clarity** - logic and structure of the answer
3. **Completeness** - whether key information is missing
4. **Expression quality** - fluency and professionalism of the language

Scoring range: 1-10.

## ✨ Features

### 🚀 High-Performance Concurrent Processing
- **True Concurrency**: Utilizes ThreadPoolExecutor for real concurrent execution
- **Configurable Workers**: Set the concurrent thread count via configuration
- **Auto CPU Detection**: Automatically uses all CPU cores by default
- **Batch Processing**: Processes data in batches for efficient resource utilization

### 📊 Intelligent Batch Progress Tracking
- **Dynamic Progress Bars**: Creates progress bars only for the current batch
- **Memory Efficient**: Constant memory usage regardless of batch count
- **Scalable**: Supports 100K+ batches without performance degradation
- **Auto Cleanup**: Automatically closes progress bars after batch completion

### 🔄 Robust API Retry Mechanism
- **Automatic Retry**: Automatically retries failed API calls (sketched below)
- **Exponential Backoff**: Uses a 2^n delay strategy to avoid API overload
- **Configurable**: Set the retry count and delay via the configuration file
- **Smart Error Handling**: Distinguishes retryable from non-retryable errors

### 🌐 Flexible API Support
- **HTTP-Based**: Uses standard HTTP requests instead of vendor-specific SDKs
- **Multi-API Compatible**: Works with any OpenAI-compatible API endpoint
- **No Vendor Lock-in**: Supports custom, proxy, and self-hosted APIs

### 📈 Comprehensive Evaluation Metrics
- **Traditional Metrics**: BLEU, ROUGE-L, Exact Match, Keyword Overlap
- **LLM-Based Evaluation**: Semantic understanding via LLM scoring
- **Combined Scoring**: Weighted combination of multiple metrics
- **Detailed Reports**: Comprehensive evaluation reports with visualizations
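To make the retry-with-exponential-backoff behaviour concrete, here is a minimal sketch. The `call_api` callable and the function name are illustrative assumptions, not the project's actual implementation; only the 2^n delay strategy and the `MAX_API_RETRIES`/`RETRY_DELAY` semantics come from this README.

```python
import random
import time

def call_with_retry(call_api, max_retries=3, retry_delay=1.0):
    """Call `call_api()` and retry failures with 2^n exponential backoff.

    Hypothetical helper: `call_api` is any zero-argument callable that
    raises an exception on failure.
    """
    for attempt in range(max_retries + 1):
        try:
            return call_api()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # 2^n backoff plus a little jitter to avoid synchronized retries
            delay = retry_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```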
## 📦 Installation

### Prerequisites
- Python 3.7 or higher
- pip (Python package manager)

### Install Dependencies

```bash
# Clone or download the repository
cd YG_LLM_Tester

# Install required packages
pip install -r requirements.txt
```
### Manual Installation

If you prefer to install packages individually:

```bash
pip install numpy nltk jieba pandas tqdm requests
```

**Note**: Some NLTK data will be downloaded automatically on first use.

## Quick Start

### 1. Install dependencies

```bash
# Install the core dependencies
pip install -r requirements.txt

# Or install manually
pip install numpy pandas openpyxl nltk jieba tqdm xlrd

# To use a real LLM API, install the OpenAI library
pip install openai
```

### 2. Run the evaluation (simulation mode, the default)

```bash
python model_evaluation.py
```

- ✅ Works immediately, no configuration required
- ✅ Intelligent scoring based on traditional metrics
- ✅ Fast to run

### 3. Enable real LLM evaluation

```bash
# Set the API key
export OPENAI_API_KEY="your-api-key"
export API_BASE_URL="http://10.10.10.122:1234/v1"  # optional

# Edit llm_config.py and set USE_REAL_LLM = True
# Run the evaluation
python model_evaluation.py
```

### 4. Single-evaluation test

```bash
python -c "from model_evaluation import test_single_evaluation; test_single_evaluation()"
```
## File Structure

```
LLMS_FT_Test/
├── model_evaluation.py   # Main file (all evaluation features consolidated here)
├── llm_config.py         # Configuration file (API key, evaluation mode, etc.)
├── requirements.txt      # Dependency list
└── README.md             # This document
```

## Output Files

- `{filename}_evaluation.xlsx` - evaluation results for a single file
- `evaluation_summary.xlsx` - summary report
- `all_detailed_results.xlsx` - detailed results

## ⚙️ Configuration

### Basic Configuration (llm_config.py)

```python
# Concurrent processing
MAX_CONCURRENT_WORKERS = 4     # Number of concurrent threads
SHOW_DETAILED_PROGRESS = True  # Show detailed progress bars

# API retry settings
MAX_API_RETRIES = 3  # Maximum retry attempts
RETRY_DELAY = 1.0    # Initial retry delay in seconds

# API configuration
USE_REAL_LLM = False  # True for real LLM API, False for simulation
OPENAI_CONFIG = {
    "api_key": "your-api-key",
    "api_base": "https://api.openai.com/v1",
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60
}
```
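Since the system is HTTP-based and `OPENAI_CONFIG` carries the endpoint fields, a raw call to an OpenAI-compatible chat-completions endpoint might look like the sketch below, using the `requests` dependency. The helper and the config values are illustrative; this is not the project's actual client code, only the standard OpenAI-compatible request shape.

```python
import requests

# Illustrative values mirroring OPENAI_CONFIG above; replace with real ones.
config = {
    "api_key": "your-api-key",
    "api_base": "https://api.openai.com/v1",
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60,
}

def chat_completion(prompt: str) -> str:
    """Send one chat-completion request to an OpenAI-compatible endpoint."""
    response = requests.post(
        f"{config['api_base']}/chat/completions",
        headers={"Authorization": f"Bearer {config['api_key']}"},
        json={
            "model": config["model"],
            "temperature": config["temperature"],
            "max_tokens": config["max_tokens"],
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=config["timeout"],
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```

Because only the base URL changes, the same sketch works against self-hosted or proxy endpoints such as the `http://10.10.10.122:1234/v1` example shown earlier.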
## Evaluation Metrics

### Traditional metrics (no LLM API required)
- **BLEU score**: n-gram based precision
- **ROUGE-L score**: F1 score over the longest common subsequence
- **Character overlap rate**: degree of character-level overlap
- **Length similarity**: similarity of the answer lengths
- **Exact match rate**: proportion of exact matches
- **Keyword overlap rate**: degree of keyword overlap

(A computation sketch for the first two metrics follows below.)

### LLM evaluation (API required)
- **Intelligent score**: a 1-10 score based on semantic understanding
- **Evaluation rationale**: detailed scoring justification and improvement suggestions
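For readers who want to see how BLEU and ROUGE-L can be computed, here is a small sketch using the `nltk` and `jieba` dependencies listed in this README. The function names are illustrative assumptions; the project's `ModelEvaluator` may implement these differently.

```python
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def tokenize(text: str) -> list:
    """Tokenize with jieba so Chinese text splits into words, not characters."""
    return [tok for tok in jieba.cut(text) if tok.strip()]

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU with smoothing for short answers."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([tokenize(reference)], tokenize(candidate),
                         smoothing_function=smooth)

def rouge_l(reference: str, candidate: str) -> float:
    """F1 over the longest common subsequence of the two token lists."""
    ref, cand = tokenize(reference), tokenize(candidate)
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if r == c else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("机器学习是人工智能的一个分支", "机器学习是一种让计算机学习的方法"))
```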
## Supported APIs

- **OpenAI API** (GPT-3.5, GPT-4, GPT-4-turbo)
- **Services compatible with the OpenAI API** (such as custom-deployed APIs)
- **Other providers can be added** (Anthropic Claude, Tongyi Qianwen, etc.)

## Data Format

Both `.json` and `.jsonl` formats are supported:

**JSON format (an array):**
```json
[
  {
    "question": "The question text",
    "answer": "The reference answer",
    "output": "The model output"
  }
]
```

**JSONL format (one JSON object per line):**
```jsonl
{"question": "Question 1", "answer": "Answer 1", "output": "Output 1"}
{"question": "Question 2", "answer": "Answer 2", "output": "Output 2"}
```
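A dataset in either format can be loaded with a few lines of standard-library code. This loader is an illustrative sketch, not a function the project ships:

```python
import json
from pathlib import Path

def load_dataset(path: str) -> list:
    """Load evaluation records from a .json array or a .jsonl file."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".jsonl":
        # One JSON object per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)  # .json: a single array of objects

# data = load_dataset("eval_data.jsonl")
```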
### Environment Variables

You can also configure via environment variables:

```bash
export OPENAI_API_KEY="your-api-key"
export API_BASE_URL="https://your-api-endpoint/v1"
export USE_REAL_LLM="true"
```

## Configuration Options

Edit the `llm_config.py` file:

```python
# Use simulated evaluation (the default, no configuration required)
USE_REAL_LLM = False

# Or use a real LLM API (requires an API key)
USE_REAL_LLM = True
```

Run the following command to see all configuration options:

```bash
python llm_config.py
```
## 🎯 Quick Start

### Basic Usage

```python
from model_evaluation import evaluate_dataset_parallel, ModelEvaluator

# Create the evaluator
evaluator = ModelEvaluator()

# Prepare your data
data = [
    {
        'question': 'What is machine learning?',
        'output': 'Machine learning is a technology that enables computers to learn from data',
        'answer': 'Machine learning is a branch of artificial intelligence that allows computers to learn patterns from data'
    },
    # Add more data...
]

# Run evaluation (simulation mode)
results, metrics = evaluate_dataset_parallel(
    data=data,
    evaluator=evaluator,
    use_real_llm=False,  # Use simulation
    max_workers=2        # Optional: override the default worker count
)

# Print results
print(f"Evaluation Results: {results}")
print(f"Overall Metrics: {metrics}")
```

## Custom Evaluation

```python
from model_evaluation import ModelEvaluator
from llm_config import USE_REAL_LLM

evaluator = ModelEvaluator()

# Prepare the data
question = "What is machine learning?"
reference = "Machine learning is a branch of artificial intelligence..."
candidate = "Machine learning is a method that lets computers learn..."

# Get the LLM evaluation
prompt = evaluator.get_llm_evaluation_prompt(reference, candidate, question)

if USE_REAL_LLM:
    score, reason = evaluator.call_llm_for_evaluation(prompt)
else:
    score, reason = evaluator.get_mock_llm_evaluation(reference, candidate, question)

# Compute the traditional metrics
metrics = evaluator.evaluate_all_metrics(reference, candidate)
```
## Notes

1. **Default mode**: the system uses simulated evaluation by default (no API needed)
2. **API calls**: real LLM evaluation requires a valid API key
3. **Concurrency limits**: keep the concurrency modest during bulk evaluation to avoid API rate limits
4. **Data backup**: back up the original data before evaluating
5. **Result validation**: validate the data format in simulation mode first

### Real LLM API Usage

```python
# Enable the real LLM API (requires an API key to be configured)
results, metrics = evaluate_dataset_parallel(
    data=data,
    evaluator=evaluator,
    use_real_llm=True,  # Use the real LLM API
    max_workers=4       # Recommended: 4-8 for real APIs
)

# API calls will automatically retry on failure
# using the settings from llm_config.py
```
### Custom Retry Configuration

```python
# Get the evaluation prompt
prompt = evaluator.get_llm_evaluation_prompt(
    reference="Reference answer",
    candidate="Model output",
    question="Question"
)

# Use custom retry settings
score, reason = evaluator.call_llm_for_evaluation(
    prompt,
    max_retries=5,   # Custom retry count
    retry_delay=2.0  # Custom retry delay
)
```

## FAQ

**Q: How do I switch to simulated evaluation mode?**
A: Set `USE_REAL_LLM = False` in `llm_config.py`.

**Q: How are the weights used by simulated evaluation adjusted?**
A: Modify the `EVALUATION_WEIGHTS` dictionary in `llm_config.py`.

**Q: How do I add a new evaluation metric?**
A: Add a new computation method to the `ModelEvaluator` class and call it from `evaluate_all_metrics`; a sketch follows below.

**Q: Which LLM providers are supported?**
A: Currently the OpenAI GPT series; other compatible APIs can be added.
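To make the last two answers concrete, here is a minimal sketch of a new metric plus a weighted combination. The metric, the weight keys, and the `combined_score` helper are illustrative assumptions; they are not the project's actual `EVALUATION_WEIGHTS` contents or `ModelEvaluator` internals.

```python
def length_ratio(reference: str, candidate: str) -> float:
    """Example of a new metric: shorter length divided by longer length."""
    if not reference or not candidate:
        return 0.0
    a, b = len(reference), len(candidate)
    return min(a, b) / max(a, b)

# Shape assumed for EVALUATION_WEIGHTS in llm_config.py (keys illustrative)
EVALUATION_WEIGHTS = {
    "bleu_score": 0.3,
    "rouge_l_score": 0.3,
    "keyword_overlap_rate": 0.2,
    "length_ratio": 0.2,  # weight for the newly added metric
}

def combined_score(metrics: dict) -> float:
    """Weighted sum over whichever metrics are present."""
    return sum(EVALUATION_WEIGHTS.get(name, 0.0) * value
               for name, value in metrics.items())
```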
## Changelog

### v2.0 (consolidated)
- ✅ Consolidated all functionality into `model_evaluation.py`
- ✅ Added the `test_single_evaluation()` test function
- ✅ Removed the standalone test and example files
- ✅ Kept `llm_config.py` as the configuration file
- ✅ Improved the code structure and import relationships
- ✅ Streamlined the project layout, reducing duplicate files
- ✅ Merged all documentation into README.md

### v1.0 (original)
- Basic evaluation functionality
- Simulated evaluation mode
- Real LLM evaluation mode

## License

This project is licensed under the MIT License.

## Contributing

Issues and pull requests to improve the project are welcome.

## Contact

If you have questions or suggestions, please open an Issue on the project page.

## 📊 Understanding the Output

### Progress Display

When running evaluations, you'll see progress bars:

```
总进度: 50%|█████ | 3/6 [00:00<00:00, 26.25it/s]
批次2-并发1: 任务3: 0%| | 0/1 [00:00<?, ?it/s]
批次2-并发2: 任务4: 0%| | 0/1 [00:00<?, ?it/s]
```

(The labels read "overall progress", "batch 2 - worker 1: task 3", and "batch 2 - worker 2: task 4".)

### Evaluation Results

```python
results = [
    {
        'index': 1,
        'Input': 'What is AI?',
        'Output': 'AI is artificial intelligence...',
        'Answer': 'Artificial intelligence is...',
        'bleu_score': 0.85,
        'rouge_l_score': 0.90,
        'exact_match_rate': 0.75,
        'keyword_overlap_rate': 0.80,
        'llm_score': 8,
        'llm_reason': 'The answer is accurate and well-structured...'
    }
]

metrics = {
    'bleu_score': 0.85,
    'rouge_l_score': 0.90,
    'character_overlap_rate': 0.75,
    'length_similarity': 0.80,
    'exact_match_rate': 0.75,
    'keyword_overlap_rate': 0.80,
    'llm_score': 8.0
}
```
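Since pandas and openpyxl are listed as dependencies and the Output Files section names three Excel reports, per-item results shaped like the list above could plausibly be exported along these lines. A hedged sketch only; `export_reports` is a hypothetical helper, not a function the project provides.

```python
import pandas as pd  # requires openpyxl for .xlsx output

def export_reports(results: list) -> None:
    """Write per-item results and a numeric summary to Excel files."""
    df = pd.DataFrame(results)
    df.to_excel("all_detailed_results.xlsx", index=False)
    # Mean of every numeric column as a one-row summary
    df.select_dtypes("number").mean().to_frame("mean").to_excel(
        "evaluation_summary.xlsx")
```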
## 🧪 Testing

Run the included test scripts:

```bash
# Test batch progress bars
python quick_batch_test.py

# Test HTTP API functionality
python test_http_api.py

# Test retry mechanism
python test_retry_simple.py

# Test retry configuration
python test_retry_config.py

# Run comprehensive tests
python final_test.py
```
## 📖 Documentation

### Core Components

- **[ModelEvaluator](model_evaluation.py)**: Main evaluation class
- **[Configuration](llm_config.py)**: All configuration parameters
- **[Batch Processing Guide](BATCH_PROGRESS_GUIDE.md)**: Detailed batch progress bar documentation
- **[Retry Mechanism Guide](RETRY_MECHANISM_GUIDE.md)**: Automatic retry mechanism documentation
- **[Retry Configuration Guide](RETRY_CONFIG_README.md)**: Configuration management guide

### Key Features Documentation

- **Concurrent Processing**: [Complete Implementation Summary](COMPLETE_IMPLEMENTATION_SUMMARY.md)
- **Batch Progress Bars**: [Batch Progress Guide](BATCH_PROGRESS_GUIDE.md)
- **HTTP API Migration**: [API Migration Report](HTTP_API_MIGRATION_REPORT.md)
- **Retry Mechanism**: [Retry Mechanism Guide](RETRY_MECHANISM_GUIDE.md)
## 🎛️ Advanced Configuration

### Concurrent Processing Settings

```python
# llm_config.py
MAX_CONCURRENT_WORKERS = None  # Auto-detect CPU cores
# or
MAX_CONCURRENT_WORKERS = 8     # Manual setting
```

**Recommendations**:
- Simulation mode: use all CPU cores
- Real API mode: 4-8 workers (avoid rate limits); a thread-pool sketch follows below
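As a picture of what the worker setting controls, here is a minimal thread-pool sketch using Python's standard library. The `evaluate_item` stand-in and the `evaluate_batch` helper are illustrative assumptions; the project's `evaluate_dataset_parallel` internals may differ, though the README confirms it is built on `ThreadPoolExecutor`.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def evaluate_item(item: dict) -> dict:
    """Stand-in for the real per-item evaluation."""
    return {"question": item["question"], "score": 1.0}

def evaluate_batch(data: list, max_workers=None) -> list:
    """Map evaluate_item over the data with a thread pool.

    max_workers=None mirrors the 'auto-detect CPU cores' setting above.
    """
    workers = max_workers or os.cpu_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_item, data))

# results = evaluate_batch(data, max_workers=8)
```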
### Progress Bar Settings

```python
# llm_config.py
SHOW_DETAILED_PROGRESS = True  # Show per-batch progress bars
```

**Recommendations**:
- Small datasets (< 20 items): enable
- Large datasets (> 100 items): disable for cleaner output; a tqdm sketch follows below
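The per-batch bars described under Features (created only for the current batch, then closed) can be approximated with `tqdm` like this. A sketch only: the function and the batch naming (mirroring the sample output shown earlier) are assumptions, not the project's code.

```python
from tqdm import tqdm

def run_in_batches(data, batch_size, show_detail=True):
    """Process items batch by batch; only the current batch owns a detail bar."""
    overall = tqdm(total=len(data), desc="总进度")
    for batch_no, start in enumerate(range(0, len(data), batch_size), 1):
        batch = data[start:start + batch_size]
        detail = (tqdm(total=len(batch), desc=f"批次{batch_no}", leave=False)
                  if show_detail else None)
        for item in batch:
            # ... evaluate the item here ...
            if detail:
                detail.update(1)
            overall.update(1)
        if detail:
            detail.close()  # bar is closed as soon as its batch finishes
    overall.close()
```

Closing each bar before the next batch starts is what keeps memory constant regardless of the total batch count.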
### Retry Mechanism Settings

```python
# llm_config.py
MAX_API_RETRIES = 3  # Number of retry attempts
RETRY_DELAY = 1.0    # Initial delay in seconds
```

**Recommendations**:
- Stable network: 1-2 retries, 0.5 s delay
- Standard environment: 3 retries, 1.0 s delay
- Unstable network: 5 retries, 2.0 s delay
### API Configuration

```python
# llm_config.py
import os

OPENAI_CONFIG = {
    "api_key": os.environ.get("OPENAI_API_KEY", "your-key"),
    "api_base": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60
}
```
## 🔧 Troubleshooting

### Common Issues

#### 1. Import Errors
```
ImportError: No module named 'jieba'
```
**Solution**: install the missing dependencies:
```bash
pip install -r requirements.txt
```

#### 2. NLTK Data Issues
```
LookupError: Resource 'tokenizers/punkt' not found
```
**Solution**: NLTK downloads its data automatically on first use. If issues persist:
```python
import nltk
nltk.download('punkt')
```

#### 3. API Connection Errors
```
requests.exceptions.ConnectionError
```
**Solution**:
- Check the API endpoint URL
- Verify the API key is correct
- Check network connectivity
- Adjust the retry settings for unstable networks

#### 4. Memory Issues with Large Datasets
```
MemoryError
```
**Solution**:
- Disable detailed progress bars: `SHOW_DETAILED_PROGRESS = False`
- Reduce the concurrent workers: `MAX_CONCURRENT_WORKERS = 2`
- Process the data in smaller chunks

#### 5. Slow Performance
**Solutions**:
- Increase `MAX_CONCURRENT_WORKERS` for simulation mode
- Disable detailed progress bars for large datasets
- Use simulation mode (`use_real_llm=False`) for testing
### Performance Tuning

#### For Large Datasets (10K+ items)
```python
MAX_CONCURRENT_WORKERS = None   # Use all CPU cores
SHOW_DETAILED_PROGRESS = False  # Disable detailed bars
MAX_API_RETRIES = 2             # Reduce retries
```

#### For High-Throughput Testing
```python
MAX_CONCURRENT_WORKERS = 8      # More workers
USE_REAL_LLM = False            # Use simulation
SHOW_DETAILED_PROGRESS = False  # Disable detailed bars
```

#### For Production with Real APIs
```python
MAX_CONCURRENT_WORKERS = 4  # Balanced for API limits
USE_REAL_LLM = True         # Use the real API
MAX_API_RETRIES = 3         # Ensure reliability
RETRY_DELAY = 1.0           # Standard delay
```
## 🤝 Contributing

We welcome contributions! Please feel free to submit issues and enhancement requests.

### Development Setup

1. Fork the repository
2. Create a virtual environment
3. Install development dependencies
4. Make your changes
5. Run tests to ensure everything works
6. Submit a pull request

### Code Style

- Follow PEP 8 guidelines
- Add docstrings to new functions
- Include type hints where applicable
- Write tests for new features

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- NLTK team for natural language processing tools
- jieba team for Chinese text segmentation
- tqdm team for progress bar functionality
- OpenAI for LLM API inspiration

## 📞 Support

For support, please:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review the documentation files
3. Run the test scripts to verify your setup
4. Open an issue with detailed error information

## 🗺️ Roadmap

- [ ] Support for more evaluation metrics (BERTScore, METEOR)
- [ ] Integration with wandb for experiment tracking
- [ ] Web-based dashboard for result visualization
- [ ] Support for multilingual evaluation
- [ ] Batch evaluation with different models
- [ ] Export results to various formats (JSON, CSV, Excel)

## 📈 Version History

- **v5.2** (Current): Added configurable retry mechanism
- **v5.1**: Implemented HTTP API with retry logic
- **v5.0**: Batch progress tracking with dynamic bars
- **v4.0**: Concurrent processing implementation
- **v3.0**: LLM evaluation integration
- **v2.0**: Traditional metric evaluation
- **v1.0**: Initial release

---

⚠️ **Important**: using a real LLM API incurs costs; test the full pipeline in simulation mode first.

**Made with ❤️ for the AI community**
BIN  __pycache__/example_concurrent_usage.cpython-313.pyc (new file, binary file not shown)
BIN  __pycache__/llm_config.cpython-311.pyc (new file, binary file not shown)
BIN  __pycache__/llm_config.cpython-313.pyc (new file, binary file not shown)
@@ -1,34 +1,4 @@
[
{
"input": "依据《项目施工合同价款(进度款)结算(成本列账)管理标准》,施工进度款的合同结算审批环节的责任人具体权责事项是什么?",
"answer": "<think>用户的问题是询问施工进度款合同结算审批环节责任人的具体权责事项。首先,需在表格中找到二级流程为“合同结算审批”的条目,确定责任人后提取其权责事项。\n查看表格,该条目责任人是业务分管领导,具体权责事项为“1.负责根据《项目施工进度款支付报审表》审核、审查意见,按照合同结算审批权限,完成合同结算审批”。\n确认该条目与问题相关,提取上述信息。答案需简洁,忠于原文。\n总结:筛选对应条目,提取责任人的具体权责事项。</think>\n依据《项目施工合同价款(进度款)结算(成本列账)管理标准》,在施工进度款的合同结算审批环节,责任人(业务分管领导)的具体权责事项为:负责根据《项目施工进度款支付报审表》审核、审查意见,按照合同结算审批权限,完成合同结算审批。",
"output": "依据《项目施工合同价款(进度款)结算(成本列账)管理标准》,在施工进度款的合同结算审批环节,责任人的具体权责事项是:1. 负责根据《项目施工进度款支付报审表》审核、审查意见,按照合同结算审批权限,完成合同结算审批。"
},
{
"input": "在《项目施工预付款支付报审表》中,承包单位能否使用财务专用章进行签章?",
"answer": "<think>用户的问题是想知道承包单位是否可以用财务专用章签章,需要从表单合规性标准中寻找答案。\n首先,明确问题意图是询问特定印章的使用权限,应关注“表单合规性标准”中关于签章的条款。\n查看“表单合规性标准”第(8)条:“承包单位签章必须为单位公章或项目部专用章或合同专用章,不得使用财务专用章等其他类签章替代”。其中明确不允许使用财务专用章,与问题直接相关,其他条款未涉及,属于相关信息。\n提取该内容后,简洁回答。\n总结:问题针对能否使用财务专用章签章,对应合规性标准第(8)条,提取相关内容即可。</think>\n在《项目施工预付款支付报审表》中,承包单位不能使用财务专用章进行签章,其签章必须为单位公章或项目部专用章或合同专用章。",
"output": "不能使用财务专用章进行签章。承包单位应使用单位公章或项目部专用章进行签章。"
},
{
"input": "依据施工结算款的内容,资金支付申请环节(项目实施部门专责)的流程载体是什么?",
"answer": "<think>用户的问题是询问施工结算款中资金支付申请环节(项目实施部门专责)的流程载体。首先,需在表格中找到一级流程为“资金支付审批流程(价款结算业务财权审批流程)”、二级流程为“资金支付申请”、责任部门为“项目实施部门”且责任人为“项目实施部门专责”的条目。\n查看表格内容,该条目下的“流程载体”部分有明确说明。\n提取该内容,确认与问题中的环节、责任部门和责任人匹配,其他条目不涉及此情况。\n答案需直接引用原文,简洁明了。\n总结:找到对应条目后提取流程载体即可。</think>\n依据《项目施工合同价款(结算款)结算(成本列账)管理标准》,在资金支付申请环节,项目实施部门专责的流程载体是工程业务支付申请单。",
"output": "依据施工结算款的内容,在资金支付申请环节(项目实施部门专责),流程载体是工程业务支付申请单。"
},
{
"input": "在水电(燃料动力)费结算款中,资金支付审核阶段的价款结算类型是什么?",
"answer": "<think>用户的问题是询问在水电(燃料动力)费结算款的资金支付审核阶段的价款结算类型。首先,需要找到二级流程为资金支付审核的条目。\n可能关联的内容模块是第三个和第四个条目,这两个条目二级流程均为资金支付审核,价款结算类型均为水电(燃料动力)费结算款。\n提取该价款结算类型信息,与问题相关,没有无关内容。\n生成答案时,直接陈述价款结算类型,语气简洁。\n总结:找到对应条目后,提取价款结算类型信息即为答案。</think>\n在水电(燃料动力)费结算款中,资金支付审核阶段的价款结算类型是水电(燃料动力)费结算款。",
"output": "依据《项目“七项”费用_项目水电费_项目水电费结算管理标准》中关于水电(燃料动力)费结算款的内容,分析资金支付审核阶段的价款结算类型。\n相关内容:该阶段的价款结算类型是“水电(燃料动力)费”。\n生成答案:在水电(燃料动力)费结算款中,资金支付审核阶段的价款结算类型是“水电(燃料动力)费”。"
},
{
"input": "依据水电(燃料动力)费结算款的内容,会计核算环节的审批依据有哪些?",
"answer": "<think>用户的问题是依据水电(燃料动力)费结算款的内容,询问会计核算环节的审批依据。首先,定位到二级流程为“会计核算(生成及传递付款指令)”的条目。\n查看表格,第七个对象符合条件,其“审批(发起、审查、审核)依据”为“1. 工程费用报销单2.发票3. 水、电费结算单(燃料动力购买清单)4. 《业务手册》规定的其他报销资料”。\n该内容直接对应问题,其他条目无相关信息,因此是答案来源。\n答案需列出审批依据,保持原文。\n总结:找到对应条目,提取审批依据即可。</think>\n依据水电(燃料动力)费结算款的内容,会计核算环节的审批依据有:\n1. 工程费用报销单;\n2. 发票;\n3. 水、电费结算单(燃料动力购买清单);\n4. 《业务手册》规定的其他报销资料。",
"output": "依据《项目“水电费”结算(成本列账)管理标准》\n<think>用户的问题是询问在水电(燃料动力)费结算款的会计核算环节的审批依据。首先,找到二级流程为“会计核算(生成及传递付款指令)”的条目。\n表格中第七个条目符合,其审批(发起、审查、审核)依据为“1. 工程费用报销单2.发票3. 水、电费结算单(燃料动力购买清单)4. 《业务手册》规定的其他报销资料”。\n该内容直接回答问题,因此确定为答案。\n答案需列出审批依据,引用原文。\n总结:找到对应条目,提取审批依据即可。</think>\n依据水电(燃料动力)费结算款的内容,在会计核算环节的审批依据如下:\n1. 工程费用报销单\n2. 发票\n3. 水、电费结算单(燃料动力购买清单)\n4. 《业务手册》规定的其他报销资料"
},
{
"input": "依据水电(燃料动力)费结算款,资金支付审批环节的流程载体、节点描述以及责任部门(单位)是什么?",
"answer": "<think>用户的问题是询问水电(燃料动力)费结算款在资金支付审批环节的流程载体、节点描述和责任部门(单位)。首先,需要从表格中找到一级流程为“资金支付审批流程”、二级流程为“资金支付审批”的条目。\n查看表格内容,第五个对象符合要求,其一级流程是“资金支付审批流程”,二级流程是“资金支付审批”。\n该条目下的“流程载体”“节点描述(工作步骤)”“责任部门(单位)”即为所需内容,其他条目因流程不同,不相关。\n提取这些信息,按要求组织答案,确保忠于原文,逻辑清晰。\n总结:找到对应条目,提取相关字段信息作为答案。</think>\n依据水电(燃料动力)费结算款,资金支付审批环节的相关情况如下:\n1. **流程载体**:工程费用报销单。\n2. **节点描述**:审批资金支付流程。\n3. **责任部门(单位)**:财务分管领导、主要负责人。",
"output": "依据水电(燃料动力)费结算款,资金支付审批环节的相关情况如下:\n1. 流程载体:工程费用报销单\n2. 节点描述:审批资金支付流程\n3. 责任部门(单位):财务分管领导、主要负责人所属部门"
},
{
"input": "在施工预付款的会计核算环节,节点描述、经办人角色及主要风险点是什么?",
"answer": "<think>用户的问题是询问施工预付款中会计核算环节的节点描述、经办人角色及主要风险点。首先,需要从表格中找到二级流程为“会计核算”的条目。\n查看表格内容,流程节点100的条目符合条件,其一级流程是“资金支付审批流程(价款结算业务财权审批流程)”,二级流程是“会计核算”。\n该条目对应的节点描述(工作步骤)是“付款单据制证”,经办人是“发起”,主要风险点描述是“1.违规结算:核算“银行存款金额”与经审批的单据不一致”。其他条目因二级流程不符,不属于相关内容。\n答案需准确提取这些信息,分点说明。\n总结:通过筛选特定二级流程的条目,提取出节点描述、经办人角色及主要风险点。</think>\n根据《项目施工合同价款(预付款)结算(成本列账)管理标准》,在施工预付款的会计核算环节,相关信息如下:\n- **节点描述(工作步骤)**:付款单据制证\n- **经办人角色**:发起\n- **主要风险点**:违规结算:核算“银行存款金额”与经审批的单据不一致",
llm_config.py
@@ -54,6 +54,25 @@ MAX_CONCURRENT_WORKERS = 5  # can be set manually to a specific number, e.g. 8

# Note: when the dataset is large, set this to False to avoid flooding the screen with output
SHOW_DETAILED_PROGRESS = True

# =============================================================================
# API retry configuration
# =============================================================================

# Maximum number of retries when an API call fails
# Recommended settings:
# - Stable network environments: 1-2 retries
# - Unstable network environments: 3-5 retries
# - 0 = disable retrying
MAX_API_RETRIES = 3

# Retry interval (seconds)
# Delay before the first retry; later retries use an exponential backoff strategy
# Recommended settings:
# - Fast-response scenarios: 0.5-1 s
# - Standard scenarios: 1-2 s
# - Overload-avoidance scenarios: 2-3 s
RETRY_DELAY = 1.0


# =============================================================================
# API configuration
@@ -214,6 +233,16 @@ def print_config_info():
        print("  Note: when enabled, a progress bar is shown for every concurrent task")
    else:
        print("  Note: when disabled, only the overall progress bar is shown")

    # Show the retry configuration
    print(f"API retry count: {MAX_API_RETRIES}")
    print("  Source: MAX_API_RETRIES in llm_config.py")
    if MAX_API_RETRIES > 0:
        print(f"Retry delay: {RETRY_DELAY} seconds")
        print("  Source: RETRY_DELAY in llm_config.py")
        print("  Note: uses an exponential backoff strategy (2^n delay)")
    else:
        print("  Note: retrying is disabled")
    print("-" * 60)

    if USE_REAL_LLM:
model_evaluation.py
@@ -33,6 +33,8 @@ from llm_config import (
    USE_REAL_LLM,
    MAX_CONCURRENT_WORKERS,
    SHOW_DETAILED_PROGRESS,
    MAX_API_RETRIES,
    RETRY_DELAY,
    OPENAI_CONFIG,
    EVALUATION_WEIGHTS,
    PROMPT_TEMPLATE,

@@ -377,7 +379,12 @@ def evaluate_single_item(args):
    if use_real_llm:
        try:
            prompt = evaluator.get_llm_evaluation_prompt(answer_text, output_text, input_text)
            llm_score, llm_reason = evaluator.call_llm_for_evaluation(prompt)
            # Use the retry parameters from the configuration file
            llm_score, llm_reason = evaluator.call_llm_for_evaluation(
                prompt,
                max_retries=MAX_API_RETRIES,
                retry_delay=RETRY_DELAY
            )
        except Exception as e:
            # Silently swallow the error and fall back to the simulated evaluation result
            llm_score, llm_reason = evaluator.get_mock_llm_evaluation(answer_text, output_text, input_text)
requirements.txt
@@ -1,15 +1,32 @@
# Core dependencies
numpy>=1.21.0
pandas>=1.3.0
openpyxl>=3.0.0
nltk>=3.8
jieba>=0.42.1
tqdm>=4.64.0
xlrd>=2.0.0

# LLM API support (optional)
openai>=1.0.0

# Other optional dependencies
# requests>=2.28.0   # for HTTP requests
# matplotlib>=3.5.0  # for result visualization (if needed)

# Core dependencies for LLM Model Evaluation System
# Python 3.7+ required

# Numerical computing
numpy>=1.19.0

# Natural Language Processing
nltk>=3.6.0
jieba>=0.42.1

# Data manipulation and analysis
pandas>=1.2.0

# Progress bars and utilities
tqdm>=4.60.0

# HTTP requests
requests>=2.25.0

# Optional: For enhanced visualization (uncomment if needed)
# matplotlib>=3.3.0
# seaborn>=0.11.0

# Optional: For advanced metrics (uncomment if needed)
# scikit-learn>=0.24.0
# scipy>=1.6.0

# Development dependencies (uncomment for development)
# pytest>=6.0.0
# pytest-cov>=2.10.0
# black>=21.0.0
# flake8>=3.8.0