LLM Evaluation System
A comprehensive evaluation system for Large Language Models (LLMs) with concurrent processing, batch progress tracking, and automatic retry mechanisms.
✨ Features
🚀 High-Performance Concurrent Processing
- Concurrent Execution: Uses ThreadPoolExecutor to run evaluation tasks in parallel worker threads, well suited to I/O-bound API calls
- Configurable Workers: Set concurrent thread count via configuration
- Auto CPU Detection: Automatically uses all CPU cores by default
- Batch Processing: Processes data in batches for efficient resource utilization
📊 Intelligent Batch Progress Tracking
- Dynamic Progress Bars: Creates progress bars only for current batch
- Memory Efficient: Constant memory usage regardless of batch count
- Scalable: Supports 100K+ batches without performance degradation
- Auto Cleanup: Automatically closes progress bars after batch completion
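As a rough illustration of this pattern, the sketch below evaluates one batch with tqdm and ThreadPoolExecutor, creating a progress bar for the current batch only and closing it when the batch finishes (the helper name run_batch is illustrative, not the actual internal API):
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
def run_batch(batch_items, evaluate_one, max_workers=4):
    # Illustrative only: one short-lived progress bar per batch
    bar = tqdm(total=len(batch_items), desc="Current batch")
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate_one, item) for item in batch_items]
        for future in as_completed(futures):
            results.append(future.result())
            bar.update(1)  # advance as each item completes
    bar.close()  # close the bar so only the current batch's bar stays alive
    return results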
🔄 Robust API Retry Mechanism
- Automatic Retry: Automatically retries failed API calls
- Exponential Backoff: Uses 2^n delay strategy to avoid API overload
- Configurable: Set retry count and delay via configuration file
- Smart Error Handling: Distinguishes retryable vs non-retryable errors
🌐 Flexible API Support
- HTTP-Based: Uses standard HTTP requests instead of vendor-specific SDKs
- Multi-API Compatible: Works with any OpenAI-compatible API endpoint
- No Vendor Lock-in: Supports custom, proxy, and self-hosted APIs
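To illustrate what an HTTP-based call looks like, here is a minimal sketch that posts a chat-completion request with the requests library, assuming an OpenAI-compatible /chat/completions endpoint; the config keys mirror OPENAI_CONFIG from the Configuration section, and the helper name is illustrative rather than the project's internal code:
import requests
def chat_completion(prompt, config):
    # Sketch of a single OpenAI-compatible chat completion request
    url = f"{config['api_base']}/chat/completions"
    headers = {"Authorization": f"Bearer {config['api_key']}"}
    payload = {
        "model": config["model"],
        "messages": [{"role": "user", "content": prompt}],
        "temperature": config["temperature"],
        "max_tokens": config["max_tokens"],
    }
    response = requests.post(url, headers=headers, json=payload, timeout=config["timeout"])
    response.raise_for_status()  # surface HTTP errors to the retry layer
    return response.json()["choices"][0]["message"]["content"]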
📈 Comprehensive Evaluation Metrics
- Traditional Metrics: BLEU, ROUGE-L, Exact Match, Keyword Overlap
- LLM-Based Evaluation: Semantic understanding via LLM scoring
- Combined Scoring: Weighted combination of multiple metrics
- Detailed Reports: Comprehensive evaluation reports with visualizations
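As a concrete example of combined scoring, the sketch below mixes the per-item metrics with illustrative weights (the actual weighting used by the evaluator may differ):
def combined_score(item, weights=None):
    # Illustrative weighted mix of traditional metrics and the LLM score
    weights = weights or {"bleu_score": 0.2, "rouge_l_score": 0.2, "keyword_overlap_rate": 0.2, "llm_score": 0.4}
    llm_normalized = item["llm_score"] / 10.0  # LLM score is on a 0-10 scale
    return (weights["bleu_score"] * item["bleu_score"]
            + weights["rouge_l_score"] * item["rouge_l_score"]
            + weights["keyword_overlap_rate"] * item["keyword_overlap_rate"]
            + weights["llm_score"] * llm_normalized)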
📦 Installation
Prerequisites
- Python 3.7 or higher
- pip (Python package manager)
Install Dependencies
# Clone or download the repository
cd YG_LLM_Tester
# Install required packages
pip install -r requirements.txt
Manual Installation
If you prefer to install packages individually:
pip install numpy nltk jieba pandas tqdm requests
Note: Some NLTK data will be downloaded automatically on first use.
⚙️ Configuration
Basic Configuration (llm_config.py)
# Concurrent Processing
MAX_CONCURRENT_WORKERS = 4 # Number of concurrent threads
SHOW_DETAILED_PROGRESS = True # Show detailed progress bars
# API Retry Settings
MAX_API_RETRIES = 3 # Maximum retry attempts
RETRY_DELAY = 1.0 # Initial retry delay in seconds
# API Configuration
USE_REAL_LLM = False # True for real LLM API, False for simulation
OPENAI_CONFIG = {
"api_key": "your-api-key",
"api_base": "https://api.openai.com/v1",
"model": "gpt-3.5-turbo",
"temperature": 0,
"max_tokens": 500,
"timeout": 60
}
Environment Variables
You can also configure via environment variables:
export OPENAI_API_KEY="your-api-key"
export API_BASE_URL="https://your-api-endpoint/v1"
export USE_REAL_LLM="true"
🎯 Quick Start
Basic Usage
from model_evaluation import evaluate_dataset_parallel, ModelEvaluator
# Create evaluator
evaluator = ModelEvaluator()
# Prepare your data
data = [
{
'question': 'What is machine learning?',
'output': 'Machine learning is a technology that enables computers to learn from data',
'answer': 'Machine learning is a branch of artificial intelligence that allows computers to learn patterns from data'
},
# Add more data...
]
# Run evaluation (simulation mode)
results, metrics = evaluate_dataset_parallel(
data=data,
evaluator=evaluator,
use_real_llm=False, # Use simulation
max_workers=2 # Optional: override default workers
)
# Print results
print(f"Evaluation Results: {results}")
print(f"Overall Metrics: {metrics}")
Real LLM API Usage
# Enable real LLM API (requires API key configuration)
results, metrics = evaluate_dataset_parallel(
data=data,
evaluator=evaluator,
use_real_llm=True, # Use real LLM API
max_workers=4 # Recommended: 4-8 for real APIs
)
# API calls will automatically retry on failure
# using settings from llm_config.py
Custom Retry Configuration
# Get evaluation prompt
prompt = evaluator.get_llm_evaluation_prompt(
reference="Reference answer",
candidate="Model output",
question="Question"
)
# Use custom retry settings
score, reason = evaluator.call_llm_for_evaluation(
prompt,
max_retries=5, # Custom retry count
retry_delay=2.0 # Custom retry delay
)
📊 Understanding the Output
Progress Display
When running evaluations, you'll see progress bars:
Overall progress: 50%|█████     | 3/6 [00:00<00:00, 26.25it/s]
Batch 2 - Worker 1: Task 3:   0%|          | 0/1 [00:00<?, ?it/s]
Batch 2 - Worker 2: Task 4:   0%|          | 0/1 [00:00<?, ?it/s]
Evaluation Results
results = [
{
'index': 1,
'Input': 'What is AI?',
'Output': 'AI is artificial intelligence...',
'Answer': 'Artificial intelligence is...',
'bleu_score': 0.85,
'rouge_l_score': 0.90,
'exact_match_rate': 0.75,
'keyword_overlap_rate': 0.80,
'llm_score': 8,
'llm_reason': 'The answer is accurate and well-structured...'
}
]
metrics = {
'bleu_score': 0.85,
'rouge_l_score': 0.90,
'character_overlap_rate': 0.75,
'length_similarity': 0.80,
'exact_match_rate': 0.75,
'keyword_overlap_rate': 0.80,
'llm_score': 8.0
}
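Because results is a plain list of dictionaries, it is easy to tabulate or export with pandas (already a dependency); a small sketch, with an illustrative output path:
import pandas as pd
df = pd.DataFrame(results)  # one row per evaluated item
print(df[["index", "bleu_score", "rouge_l_score", "llm_score"]].describe())
df.to_csv("evaluation_results.csv", index=False)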
🧪 Testing
Run the included test scripts:
# Test batch progress bars
python quick_batch_test.py
# Test HTTP API functionality
python test_http_api.py
# Test retry mechanism
python test_retry_simple.py
# Test retry configuration
python test_retry_config.py
# Run comprehensive tests
python final_test.py
📖 Documentation
Core Components
- ModelEvaluator: Main evaluation class
- Configuration: All configuration parameters
- Batch Processing Guide: Detailed batch progress bar documentation
- Retry Mechanism Guide: Automatic retry mechanism documentation
- Retry Configuration Guide: Configuration management guide
Key Features Documentation
- Concurrent Processing: Complete Implementation Summary
- Batch Progress Bars: Batch Progress Guide
- HTTP API Migration: API Migration Report
- Retry Mechanism: Retry Mechanism Guide
🎛️ Advanced Configuration
Concurrent Processing Settings
# llm_config.py
MAX_CONCURRENT_WORKERS = None # Auto-detect CPU cores
# or
MAX_CONCURRENT_WORKERS = 8 # Manual setting
Recommendations:
- Simulation mode: Use all CPU cores
- Real API mode: 4-8 workers (avoid rate limits)
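When MAX_CONCURRENT_WORKERS is None, the worker count is resolved from the machine's core count, roughly as in this sketch (illustrative, not the exact internal code):
import os
from concurrent.futures import ThreadPoolExecutor
import llm_config
# Fall back to the number of CPU cores when no explicit value is configured
workers = llm_config.MAX_CONCURRENT_WORKERS or os.cpu_count()
executor = ThreadPoolExecutor(max_workers=workers)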
Progress Bar Settings
# llm_config.py
SHOW_DETAILED_PROGRESS = True # Show per-batch progress bars
Recommendations:
- Small datasets (< 20 items): Enable
- Large datasets (> 100 items): Disable for cleaner output
Retry Mechanism Settings
# llm_config.py
MAX_API_RETRIES = 3 # Number of retry attempts
RETRY_DELAY = 1.0 # Initial delay in seconds
Recommendations:
- Stable network: 1-2 retries, 0.5s delay
- Standard environment: 3 retries, 1.0s delay
- Unstable network: 5 retries, 2.0s delay
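To see how these two settings interact, here is a sketch of the exponential backoff loop; with MAX_API_RETRIES = 3 and RETRY_DELAY = 1.0 the waits between attempts are 1s, 2s, and 4s (the function and parameter names are illustrative):
import time
import requests
def call_with_retries(do_request, max_retries=3, retry_delay=1.0):
    # Illustrative retry loop with 2^n backoff on retryable network errors
    for attempt in range(max_retries + 1):
        try:
            return do_request()
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_retries:
                raise  # out of retries: let the error propagate
            time.sleep(retry_delay * (2 ** attempt))  # 1s, 2s, 4s, ...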
API Configuration
# llm_config.py
OPENAI_CONFIG = {
"api_key": os.environ.get("OPENAI_API_KEY", "your-key"),
"api_base": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
"model": "gpt-3.5-turbo",
"temperature": 0,
"max_tokens": 500,
"timeout": 60
}
🔧 Troubleshooting
Common Issues
1. Import Errors
ImportError: No module named 'jieba'
Solution: Install missing dependencies
pip install -r requirements.txt
2. NLTK Data Issues
LookupError: Resource 'tokenizers/punkt' not found
Solution: NLTK will download data automatically on first use. If issues persist:
import nltk
nltk.download('punkt')
3. API Connection Errors
requests.exceptions.ConnectionError
Solution:
- Check API endpoint URL
- Verify API key is correct
- Check network connectivity
- Adjust retry settings for unstable networks
4. Memory Issues with Large Datasets
OutOfMemoryError
Solution:
- Disable detailed progress bars: SHOW_DETAILED_PROGRESS = False
- Reduce concurrent workers: MAX_CONCURRENT_WORKERS = 2
- Process data in smaller chunks (see the sketch below)
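A sketch of the chunking approach mentioned above, reusing evaluate_dataset_parallel on one slice of the data at a time (the chunk size is illustrative):
chunk_size = 500
all_results = []
for start in range(0, len(data), chunk_size):
    chunk = data[start:start + chunk_size]
    chunk_results, _ = evaluate_dataset_parallel(data=chunk, evaluator=evaluator, use_real_llm=False)
    all_results.extend(chunk_results)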
5. Slow Performance
Solutions:
- Increase MAX_CONCURRENT_WORKERS for simulation mode
- Disable detailed progress bars for large datasets
- Use simulation mode (use_real_llm=False) for testing
Performance Tuning
For Large Datasets (10K+ items)
MAX_CONCURRENT_WORKERS = None # Use all CPU cores
SHOW_DETAILED_PROGRESS = False # Disable detailed bars
MAX_API_RETRIES = 2 # Reduce retries
For High-Throughput Testing
MAX_CONCURRENT_WORKERS = 8 # More workers
USE_REAL_LLM = False # Use simulation
SHOW_DETAILED_PROGRESS = False # Disable detailed bars
For Production with Real APIs
MAX_CONCURRENT_WORKERS = 4 # Balanced for API limits
USE_REAL_LLM = True # Use real API
MAX_API_RETRIES = 3 # Ensure reliability
RETRY_DELAY = 1.0 # Standard delay
🤝 Contributing
We welcome contributions! Please feel free to submit issues and enhancement requests.
Development Setup
- Fork the repository
- Create a virtual environment
- Install development dependencies
- Make your changes
- Run tests to ensure everything works
- Submit a pull request
Code Style
- Follow PEP 8 guidelines
- Add docstrings to new functions
- Include type hints where applicable
- Write tests for new features
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- NLTK team for natural language processing tools
- jieba team for Chinese text segmentation
- tqdm team for progress bar functionality
- OpenAI for LLM API inspiration
📞 Support
For support, please:
- Check the Troubleshooting section
- Review the documentation files
- Run the test scripts to verify your setup
- Open an issue with detailed error information
🗺️ Roadmap
- Support for more evaluation metrics (BERTScore, METEOR)
- Integration with wandb for experiment tracking
- Web-based dashboard for result visualization
- Support for multilingual evaluation
- Batch evaluation with different models
- Export results to various formats (JSON, CSV, Excel)
📈 Version History
- v5.2 (Current): Added configurable retry mechanism
- v5.1: Implemented HTTP API with retry logic
- v5.0: Batch progress tracking with dynamic bars
- v4.0: Concurrent processing implementation
- v3.0: LLM evaluation integration
- v2.0: Traditional metric evaluation
- v1.0: Initial release
Made with ❤️ for the AI community