# LLM Model Evaluation System
[Python 3.7+](https://www.python.org/downloads/) · [MIT License](LICENSE)

A comprehensive evaluation system for Large Language Models (LLMs) with concurrent processing, batch progress tracking, and automatic retry mechanisms.

## ✨ Features

### 🚀 High-Performance Concurrent Processing

- **True Concurrency**: Uses `ThreadPoolExecutor` for genuinely concurrent execution
- **Configurable Workers**: Set the number of worker threads via configuration
- **Auto CPU Detection**: Uses all available CPU cores by default
- **Batch Processing**: Processes data in batches for efficient resource utilization (see the sketch below)

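As a rough illustration of how batched, concurrent execution can be wired together, here is a minimal sketch using `ThreadPoolExecutor`. It is not the code in `model_evaluation.py`; `evaluate_item`, the batch size, and the return value are placeholders.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def evaluate_item(item):
    # Placeholder for a single evaluation task.
    return {"question": item["question"], "score": len(item["output"])}

def evaluate_in_batches(data, max_workers=None, batch_size=4):
    """Split data into batches and evaluate each batch concurrently."""
    workers = max_workers or os.cpu_count()  # auto-detect CPU cores when unset
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # executor.map preserves the input order within each batch
            results.extend(executor.map(evaluate_item, batch))
    return results
```
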
### 📊 Intelligent Batch Progress Tracking

- **Dynamic Progress Bars**: Creates progress bars only for the current batch (see the sketch below)
- **Memory Efficient**: Constant memory usage regardless of batch count
- **Scalable**: Supports 100K+ batches without performance degradation
- **Auto Cleanup**: Automatically closes progress bars after batch completion

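A minimal sketch of the per-batch progress bar idea using `tqdm`, reusing the placeholder `evaluate_item` from the previous sketch; the bar labels and batch size are illustrative, not taken from the project.

```python
from tqdm import tqdm

def process_batches(data, batch_size=4):
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    overall = tqdm(total=len(data), desc="Overall progress")
    for batch_no, batch in enumerate(batches, start=1):
        # A bar exists only for the batch currently being processed, so memory
        # use stays constant no matter how many batches there are.
        with tqdm(total=len(batch), desc=f"Batch {batch_no}", leave=False) as bar:
            for item in batch:
                evaluate_item(item)   # placeholder for the real per-item evaluation
                bar.update(1)
                overall.update(1)
    overall.close()
```
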
### 🔄 Robust API Retry Mechanism

- **Automatic Retry**: Failed API calls are retried automatically
- **Exponential Backoff**: The delay doubles with each attempt (a 2^n strategy) to avoid overloading the API (see the sketch below)
- **Configurable**: Set the retry count and initial delay via the configuration file
- **Smart Error Handling**: Distinguishes retryable from non-retryable errors

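A generic sketch of the retry-with-exponential-backoff pattern described above, wrapped around an arbitrary callable; the exact set of retryable errors and the internals of the project's `call_llm_for_evaluation` may differ.

```python
import time
import requests

def call_with_retries(call, max_retries=3, retry_delay=1.0):
    """Run `call()` once, then retry up to `max_retries` times with 2^n backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except (requests.ConnectionError, requests.Timeout):
            # Retryable network errors; anything else propagates immediately.
            if attempt == max_retries:
                raise
            time.sleep(retry_delay * (2 ** attempt))  # 1s, 2s, 4s, ... by default

# Example:
# result = call_with_retries(lambda: requests.get("https://example.com", timeout=5))
```
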
### 🌐 Flexible API Support

- **HTTP-Based**: Uses standard HTTP requests instead of vendor-specific SDKs (see the sketch below)
- **Multi-API Compatible**: Works with any OpenAI-compatible API endpoint
- **No Vendor Lock-in**: Supports custom, proxy, and self-hosted APIs

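To illustrate the HTTP-based approach, a plain `requests` call against an OpenAI-compatible `/chat/completions` endpoint could look like the sketch below. The config keys mirror `OPENAI_CONFIG` from the Configuration section; the function itself is illustrative, not the project's client.

```python
import requests

def chat_completion(prompt, config):
    """POST a chat completion request to any OpenAI-compatible endpoint."""
    response = requests.post(
        f"{config['api_base']}/chat/completions",
        headers={"Authorization": f"Bearer {config['api_key']}"},
        json={
            "model": config["model"],
            "messages": [{"role": "user", "content": prompt}],
            "temperature": config["temperature"],
            "max_tokens": config["max_tokens"],
        },
        timeout=config["timeout"],
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Example (assuming OPENAI_CONFIG from llm_config.py):
# answer = chat_completion("What is machine learning?", OPENAI_CONFIG)
```
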
### 📈 Comprehensive Evaluation Metrics

- **Traditional Metrics**: BLEU, ROUGE-L, exact match, keyword overlap
- **LLM-Based Evaluation**: Semantic understanding via LLM scoring
- **Combined Scoring**: Weighted combination of multiple metrics (see the sketch below)
- **Detailed Reports**: Comprehensive evaluation reports with visualizations

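Conceptually, the combined score is just a weighted average of the individual metrics. The weights and the 0-10 normalization of the LLM score below are illustrative assumptions, not the values used by `ModelEvaluator`.

```python
def combined_score(metrics, weights=None):
    """Weighted combination of per-item metrics; non-LLM metrics are assumed to be in [0, 1]."""
    weights = weights or {
        "bleu_score": 0.2,
        "rouge_l_score": 0.2,
        "keyword_overlap_rate": 0.2,
        "llm_score": 0.4,   # LLM judge score, assumed to be on a 0-10 scale
    }
    total_weight = sum(weights.values())
    weighted = sum(
        w * (metrics[name] / 10 if name == "llm_score" else metrics[name])
        for name, w in weights.items()
    )
    return weighted / total_weight

# With the sample numbers from the "Evaluation Results" section:
# combined_score({'bleu_score': 0.85, 'rouge_l_score': 0.90,
#                 'keyword_overlap_rate': 0.80, 'llm_score': 8})  ->  0.83
```
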
## 📦 Installation

### Prerequisites

- Python 3.7 or higher
- pip (Python package manager)

### Install Dependencies

```bash
# Clone or download the repository
cd YG_LLM_Tester

# Install required packages
pip install -r requirements.txt
```

### Manual Installation

If you prefer to install packages individually:

```bash
pip install numpy nltk jieba pandas tqdm requests
```

**Note**: Some NLTK data will be downloaded automatically on first use.

## ⚙️ Configuration

### Basic Configuration (llm_config.py)

```python
# Concurrent Processing
MAX_CONCURRENT_WORKERS = 4      # Number of concurrent threads
SHOW_DETAILED_PROGRESS = True   # Show detailed progress bars

# API Retry Settings
MAX_API_RETRIES = 3             # Maximum retry attempts
RETRY_DELAY = 1.0               # Initial retry delay in seconds

# API Configuration
USE_REAL_LLM = False            # True for real LLM API, False for simulation
OPENAI_CONFIG = {
    "api_key": "your-api-key",
    "api_base": "https://api.openai.com/v1",
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60
}
```

### Environment Variables

You can also configure via environment variables:

```bash
export OPENAI_API_KEY="your-api-key"
export API_BASE_URL="https://your-api-endpoint/v1"
export USE_REAL_LLM="true"
```

## 🎯 Quick Start

### Basic Usage

```python
from model_evaluation import evaluate_dataset_parallel, ModelEvaluator

# Create evaluator
evaluator = ModelEvaluator()

# Prepare your data
data = [
    {
        'question': 'What is machine learning?',
        'output': 'Machine learning is a technology that enables computers to learn from data',
        'answer': 'Machine learning is a branch of artificial intelligence that allows computers to learn patterns from data'
    },
    # Add more data...
]

# Run evaluation (simulation mode)
results, metrics = evaluate_dataset_parallel(
    data=data,
    evaluator=evaluator,
    use_real_llm=False,  # Use simulation
    max_workers=2        # Optional: override default workers
)

# Print results
print(f"Evaluation Results: {results}")
print(f"Overall Metrics: {metrics}")
```

### Real LLM API Usage

```python
# Enable real LLM API (requires API key configuration)
results, metrics = evaluate_dataset_parallel(
    data=data,
    evaluator=evaluator,
    use_real_llm=True,  # Use real LLM API
    max_workers=4       # Recommended: 4-8 for real APIs
)

# API calls will automatically retry on failure
# using settings from llm_config.py
```

### Custom Retry Configuration

```python
# Get evaluation prompt
prompt = evaluator.get_llm_evaluation_prompt(
    reference="Reference answer",
    candidate="Model output",
    question="Question"
)

# Use custom retry settings
score, reason = evaluator.call_llm_for_evaluation(
    prompt,
    max_retries=5,    # Custom retry count
    retry_delay=2.0   # Custom retry delay
)
```

## 📊 Understanding the Output

### Progress Display

When running evaluations, you'll see progress bars like these (总进度 = overall progress, 批次 = batch, 并发 = worker, 任务 = task):

```
总进度: 50%|█████ | 3/6 [00:00<00:00, 26.25it/s]
批次2-并发1: 任务3: 0%| | 0/1 [00:00<?, ?it/s]
批次2-并发2: 任务4: 0%| | 0/1 [00:00<?, ?it/s]
```

### Evaluation Results

```python
results = [
    {
        'index': 1,
        'Input': 'What is AI?',
        'Output': 'AI is artificial intelligence...',
        'Answer': 'Artificial intelligence is...',
        'bleu_score': 0.85,
        'rouge_l_score': 0.90,
        'exact_match_rate': 0.75,
        'keyword_overlap_rate': 0.80,
        'llm_score': 8,
        'llm_reason': 'The answer is accurate and well-structured...'
    }
]

metrics = {
    'bleu_score': 0.85,
    'rouge_l_score': 0.90,
    'character_overlap_rate': 0.75,
    'length_similarity': 0.80,
    'exact_match_rate': 0.75,
    'keyword_overlap_rate': 0.80,
    'llm_score': 8.0
}
```

## 🧪 Testing

Run the included test scripts:

```bash
# Test batch progress bars
python quick_batch_test.py

# Test HTTP API functionality
python test_http_api.py

# Test retry mechanism
python test_retry_simple.py

# Test retry configuration
python test_retry_config.py

# Run comprehensive tests
python final_test.py
```

## 📖 Documentation

### Core Components

- **[ModelEvaluator](model_evaluation.py)**: Main evaluation class
- **[Configuration](llm_config.py)**: All configuration parameters
- **[Batch Processing Guide](BATCH_PROGRESS_GUIDE.md)**: Detailed batch progress bar documentation
- **[Retry Mechanism Guide](RETRY_MECHANISM_GUIDE.md)**: Automatic retry mechanism documentation
- **[Retry Configuration Guide](RETRY_CONFIG_README.md)**: Configuration management guide

### Key Features Documentation

- **Concurrent Processing**: [Complete Implementation Summary](COMPLETE_IMPLEMENTATION_SUMMARY.md)
- **Batch Progress Bars**: [Batch Progress Guide](BATCH_PROGRESS_GUIDE.md)
- **HTTP API Migration**: [API Migration Report](HTTP_API_MIGRATION_REPORT.md)
- **Retry Mechanism**: [Retry Mechanism Guide](RETRY_MECHANISM_GUIDE.md)

## 🎛️ Advanced Configuration

### Concurrent Processing Settings

```python
# llm_config.py
MAX_CONCURRENT_WORKERS = None  # Auto-detect CPU cores
# or
MAX_CONCURRENT_WORKERS = 8     # Manual setting
```

**Recommendations**:
- Simulation mode: Use all CPU cores (with `None`, the worker count falls back to the CPU count; see the sketch below)
- Real API mode: 4-8 workers (avoid rate limits)

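With `None`, the worker count can be resolved at runtime by falling back to the CPU count. A short sketch of that resolution (the import path follows `llm_config.py` as documented above; the project's actual lookup may differ):

```python
import os
from llm_config import MAX_CONCURRENT_WORKERS

# None falls back to the number of CPU cores
workers = MAX_CONCURRENT_WORKERS or os.cpu_count()
```
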
### Progress Bar Settings

```python
# llm_config.py
SHOW_DETAILED_PROGRESS = True  # Show per-batch progress bars
```

**Recommendations**:
- Small datasets (< 20 items): Enable
- Large datasets (> 100 items): Disable for cleaner output

### Retry Mechanism Settings

```python
# llm_config.py
MAX_API_RETRIES = 3  # Number of retry attempts
RETRY_DELAY = 1.0    # Initial delay in seconds
```

**Recommendations**:
- Stable network: 1-2 retries, 0.5s delay
- Standard environment: 3 retries, 1.0s delay
- Unstable network: 5 retries, 2.0s delay

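As a worked example of the 2^n backoff described in the Features section: assuming the delay doubles from `RETRY_DELAY` on each attempt, `MAX_API_RETRIES = 3` with `RETRY_DELAY = 1.0` waits roughly 1 s, 2 s, and 4 s between attempts, about 7 s of backoff in total before the call is reported as failed.
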
### API Configuration

```python
# llm_config.py
import os

OPENAI_CONFIG = {
    "api_key": os.environ.get("OPENAI_API_KEY", "your-key"),
    "api_base": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60
}
```

## 🔧 Troubleshooting

### Common Issues

#### 1. Import Errors

```
ImportError: No module named 'jieba'
```

**Solution**: Install missing dependencies:

```bash
pip install -r requirements.txt
```

#### 2. NLTK Data Issues

```
LookupError: Resource 'tokenizers/punkt' not found
```

**Solution**: NLTK will download data automatically on first use. If issues persist:

```python
import nltk
nltk.download('punkt')
```

#### 3. API Connection Errors

```
requests.exceptions.ConnectionError
```

**Solution**:
- Check the API endpoint URL
- Verify the API key is correct
- Check network connectivity
- Adjust retry settings for unstable networks

#### 4. Memory Issues with Large Datasets

```
OutOfMemoryError
```

**Solution**:
- Disable detailed progress bars: `SHOW_DETAILED_PROGRESS = False`
- Reduce concurrent workers: `MAX_CONCURRENT_WORKERS = 2`
- Process data in smaller chunks (see the sketch below)

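A minimal chunking sketch, assuming `data` and `evaluator` are prepared as in the Quick Start example; `chunked` and the chunk size are illustrative, and only each chunk's results are kept in memory.

```python
from model_evaluation import evaluate_dataset_parallel

def chunked(items, size=1000):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

all_results = []
for chunk in chunked(data, size=1000):
    results, metrics = evaluate_dataset_parallel(
        data=chunk,
        evaluator=evaluator,
        use_real_llm=False,
    )
    all_results.extend(results)  # keep the per-item results, discard intermediates
```
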
#### 5. Slow Performance

**Solutions**:
- Increase `MAX_CONCURRENT_WORKERS` for simulation mode
- Disable detailed progress bars for large datasets
- Use simulation mode (`use_real_llm=False`) for testing

### Performance Tuning

#### For Large Datasets (10K+ items)

```python
MAX_CONCURRENT_WORKERS = None   # Use all CPU cores
SHOW_DETAILED_PROGRESS = False  # Disable detailed bars
MAX_API_RETRIES = 2             # Reduce retries
```

#### For High-Throughput Testing

```python
MAX_CONCURRENT_WORKERS = 8      # More workers
USE_REAL_LLM = False            # Use simulation
SHOW_DETAILED_PROGRESS = False  # Disable detailed bars
```

#### For Production with Real APIs

```python
MAX_CONCURRENT_WORKERS = 4  # Balanced for API limits
USE_REAL_LLM = True         # Use real API
MAX_API_RETRIES = 3         # Ensure reliability
RETRY_DELAY = 1.0           # Standard delay
```

## 🤝 Contributing

We welcome contributions! Please feel free to submit issues and enhancement requests.

### Development Setup

1. Fork the repository
2. Create a virtual environment
3. Install development dependencies
4. Make your changes
5. Run tests to ensure everything works
6. Submit a pull request

### Code Style

- Follow PEP 8 guidelines
- Add docstrings to new functions
- Include type hints where applicable
- Write tests for new features

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- NLTK team for natural language processing tools
- jieba team for Chinese text segmentation
- tqdm team for progress bar functionality
- OpenAI for LLM API inspiration

## 📞 Support

For support, please:

1. Check the [Troubleshooting](#troubleshooting) section
2. Review the documentation files
3. Run the test scripts to verify your setup
4. Open an issue with detailed error information

## 🗺️ Roadmap

- [ ] Support for more evaluation metrics (BERTScore, METEOR)
- [ ] Integration with wandb for experiment tracking
- [ ] Web-based dashboard for result visualization
- [ ] Support for multilingual evaluation
- [ ] Batch evaluation with different models
- [ ] Export results to various formats (JSON, CSV, Excel)

## 📈 Version History

- **v5.2** (Current): Added configurable retry mechanism
- **v5.1**: Implemented HTTP API with retry logic
- **v5.0**: Batch progress tracking with dynamic bars
- **v4.0**: Concurrent processing implementation
- **v3.0**: LLM evaluation integration
- **v2.0**: Traditional metric evaluation
- **v1.0**: Initial release

---

**Made with ❤️ for the AI community**