# LLM Model Evaluation System

[![Python Version](https://img.shields.io/badge/python-3.7%2B-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A comprehensive evaluation system for Large Language Models (LLMs) with concurrent processing, batch progress tracking, and automatic retry mechanisms.

## ✨ Features

### 🚀 High-Performance Concurrent Processing
- **True Concurrency**: Utilizes ThreadPoolExecutor for real concurrent execution
- **Configurable Workers**: Set the concurrent thread count via configuration
- **Auto CPU Detection**: Automatically uses all CPU cores by default
- **Batch Processing**: Processes data in batches for efficient resource utilization

### 📊 Intelligent Batch Progress Tracking
- **Dynamic Progress Bars**: Creates progress bars only for the current batch
- **Memory Efficient**: Constant memory usage regardless of batch count
- **Scalable**: Supports 100K+ batches without performance degradation
- **Auto Cleanup**: Automatically closes progress bars after batch completion

### 🔄 Robust API Retry Mechanism
- **Automatic Retry**: Automatically retries failed API calls
- **Exponential Backoff**: Uses a 2^n delay strategy to avoid overloading the API
- **Configurable**: Set retry count and delay via the configuration file
- **Smart Error Handling**: Distinguishes retryable from non-retryable errors

### 🌐 Flexible API Support
- **HTTP-Based**: Uses standard HTTP requests instead of vendor-specific SDKs
- **Multi-API Compatible**: Works with any OpenAI-compatible API endpoint
- **No Vendor Lock-in**: Supports custom, proxy, and self-hosted APIs

### 📈 Comprehensive Evaluation Metrics
- **Traditional Metrics**: BLEU, ROUGE-L, Exact Match, Keyword Overlap
- **LLM-Based Evaluation**: Semantic understanding via LLM scoring
- **Combined Scoring**: Weighted combination of multiple metrics
- **Detailed Reports**: Comprehensive evaluation reports with visualizations

## 📦 Installation

### Prerequisites
- Python 3.7 or higher
- pip (Python package manager)

### Install Dependencies
```bash
# Clone or download the repository
cd YG_LLM_Tester

# Install required packages
pip install -r requirements.txt
```

### Manual Installation
If you prefer to install packages individually:
```bash
pip install numpy nltk jieba pandas tqdm requests
```

**Note**: Some NLTK data will be downloaded automatically on first use.
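The automatic retry feature described above uses a 2^n exponential-backoff delay between attempts, controlled by the `MAX_API_RETRIES` and `RETRY_DELAY` settings covered in the Configuration section below. The following is only a minimal sketch of that pattern, not the project's actual implementation; the `call_with_retries` helper and its arguments are hypothetical:

```python
import time
import requests

def call_with_retries(url, payload, headers, max_retries=3, retry_delay=1.0):
    """Illustrative only: retry an HTTP API call with 2^n exponential backoff.

    This is a hypothetical helper, not part of this project's API; it merely
    shows the behavior that MAX_API_RETRIES and RETRY_DELAY are meant to tune.
    """
    for attempt in range(max_retries + 1):
        try:
            response = requests.post(url, json=payload, headers=headers, timeout=60)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff: 1x, 2x, 4x, ... times the initial delay
            time.sleep(retry_delay * (2 ** attempt))
```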
## ⚙️ Configuration

### Basic Configuration (llm_config.py)

```python
# Concurrent Processing
MAX_CONCURRENT_WORKERS = 4      # Number of concurrent threads
SHOW_DETAILED_PROGRESS = True   # Show detailed progress bars

# API Retry Settings
MAX_API_RETRIES = 3             # Maximum retry attempts
RETRY_DELAY = 1.0               # Initial retry delay in seconds

# API Configuration
USE_REAL_LLM = False            # True for real LLM API, False for simulation
OPENAI_CONFIG = {
    "api_key": "your-api-key",
    "api_base": "https://api.openai.com/v1",
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60
}
```

### Environment Variables
You can also configure via environment variables:

```bash
export OPENAI_API_KEY="your-api-key"
export API_BASE_URL="https://your-api-endpoint/v1"
export USE_REAL_LLM="true"
```

## 🎯 Quick Start

### Basic Usage

```python
from model_evaluation import evaluate_dataset_parallel, ModelEvaluator

# Create evaluator
evaluator = ModelEvaluator()

# Prepare your data
data = [
    {
        'question': 'What is machine learning?',
        'output': 'Machine learning is a technology that enables computers to learn from data',
        'answer': 'Machine learning is a branch of artificial intelligence that allows computers to learn patterns from data'
    },
    # Add more data...
]

# Run evaluation (simulation mode)
results, metrics = evaluate_dataset_parallel(
    data=data,
    evaluator=evaluator,
    use_real_llm=False,  # Use simulation
    max_workers=2        # Optional: override default workers
)

# Print results
print(f"Evaluation Results: {results}")
print(f"Overall Metrics: {metrics}")
```

### Real LLM API Usage

```python
# Enable the real LLM API (requires API key configuration)
results, metrics = evaluate_dataset_parallel(
    data=data,
    evaluator=evaluator,
    use_real_llm=True,  # Use real LLM API
    max_workers=4       # Recommended: 4-8 for real APIs
)

# API calls will automatically retry on failure
# using settings from llm_config.py
```

### Custom Retry Configuration

```python
# Get evaluation prompt
prompt = evaluator.get_llm_evaluation_prompt(
    reference="Reference answer",
    candidate="Model output",
    question="Question"
)

# Use custom retry settings
score, reason = evaluator.call_llm_for_evaluation(
    prompt,
    max_retries=5,    # Custom retry count
    retry_delay=2.0   # Custom retry delay
)
```

## 📊 Understanding the Output

### Progress Display
When running evaluations, you'll see progress bars like these (the labels are Chinese: 总进度 = overall progress, 批次2-并发1 = batch 2 / worker 1, 任务3 = task 3):

```
总进度:  50%|█████     | 3/6 [00:00<00:00, 26.25it/s]
批次2-并发1: 任务3:   0%|          | 0/1 [00:00<?, ?it/s]
```

### Progress Display Settings
**Recommendations**:
- Large datasets (> 100 items): Disable for cleaner output (`SHOW_DETAILED_PROGRESS = False`)

### Retry Mechanism Settings

```python
# llm_config.py
MAX_API_RETRIES = 3   # Number of retry attempts
RETRY_DELAY = 1.0     # Initial delay in seconds
```

**Recommendations**:
- Stable network: 1-2 retries, 0.5s delay
- Standard environment: 3 retries, 1.0s delay
- Unstable network: 5 retries, 2.0s delay

### API Configuration

```python
# llm_config.py
OPENAI_CONFIG = {
    "api_key": os.environ.get("OPENAI_API_KEY", "your-key"),
    "api_base": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
    "model": "gpt-3.5-turbo",
    "temperature": 0,
    "max_tokens": 500,
    "timeout": 60
}
```

## 🔧 Troubleshooting

### Common Issues

#### 1. Import Errors
```
ImportError: No module named 'jieba'
```
**Solution**: Install the missing dependencies
```bash
pip install -r requirements.txt
```

#### 2. NLTK Data Issues
```
LookupError: Resource 'tokenizers/punkt' not found
```
**Solution**: NLTK will download the data automatically on first use. If the issue persists:
```python
import nltk
nltk.download('punkt')
```
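To guard against this error up front rather than waiting for the first evaluation run, you can check whether the tokenizer data is present and download it only if missing. This is a generic NLTK snippet (standard `nltk.data.find` / `nltk.download` calls), not something this project requires:

```python
import nltk

# Download the punkt tokenizer data only if it is not already installed
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
```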
#### 3. API Connection Errors
```
requests.exceptions.ConnectionError
```
**Solution**:
- Check API endpoint URL
- Verify API key is correct
- Check network connectivity
- Adjust retry settings for unstable networks

#### 4. Memory Issues with Large Datasets
```
OutOfMemoryError
```
**Solution**:
- Disable detailed progress bars: `SHOW_DETAILED_PROGRESS = False`
- Reduce concurrent workers: `MAX_CONCURRENT_WORKERS = 2`
- Process data in smaller chunks

#### 5. Slow Performance
**Solutions**:
- Increase `MAX_CONCURRENT_WORKERS` for simulation mode
- Disable detailed progress bars for large datasets
- Use simulation mode (`use_real_llm=False`) for testing

### Performance Tuning

#### For Large Datasets (10K+ items)
```python
MAX_CONCURRENT_WORKERS = None   # Use all CPU cores
SHOW_DETAILED_PROGRESS = False  # Disable detailed bars
MAX_API_RETRIES = 2             # Reduce retries
```

#### For High-Throughput Testing
```python
MAX_CONCURRENT_WORKERS = 8      # More workers
USE_REAL_LLM = False            # Use simulation
SHOW_DETAILED_PROGRESS = False  # Disable detailed bars
```

#### For Production with Real APIs
```python
MAX_CONCURRENT_WORKERS = 4      # Balanced for API limits
USE_REAL_LLM = True             # Use real API
MAX_API_RETRIES = 3             # Ensure reliability
RETRY_DELAY = 1.0               # Standard delay
```

## 🤝 Contributing

We welcome contributions! Please feel free to submit issues and enhancement requests.

### Development Setup
1. Fork the repository
2. Create a virtual environment
3. Install development dependencies
4. Make your changes
5. Run tests to ensure everything works
6. Submit a pull request

### Code Style
- Follow PEP 8 guidelines
- Add docstrings to new functions
- Include type hints where applicable
- Write tests for new features

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- NLTK team for natural language processing tools
- jieba team for Chinese text segmentation
- tqdm team for progress bar functionality
- OpenAI for LLM API inspiration

## 📞 Support

For support, please:
1. Check the [Troubleshooting](#troubleshooting) section
2. Review the documentation files
3. Run the test scripts to verify your setup
4. Open an issue with detailed error information

## 🗺️ Roadmap

- [ ] Support for more evaluation metrics (BERTScore, METEOR)
- [ ] Integration with wandb for experiment tracking
- [ ] Web-based dashboard for result visualization
- [ ] Support for multilingual evaluation
- [ ] Batch evaluation with different models
- [ ] Export results to various formats (JSON, CSV, Excel)

## 📈 Version History

- **v5.2** (Current): Added configurable retry mechanism
- **v5.1**: Implemented HTTP API with retry logic
- **v5.0**: Batch progress tracking with dynamic bars
- **v4.0**: Concurrent processing implementation
- **v3.0**: LLM evaluation integration
- **v2.0**: Traditional metric evaluation
- **v1.0**: Initial release

---

**Made with ❤️ for the AI community**