LLM Evaluation Framework
Enterprise-Grade Python Framework for Large Language Model Evaluation & Testing
Built with production-ready standards • Type-safe • Comprehensive testing • Full CLI support
Documentation • Quick Start • Examples • Report Issues
What Makes This Special?
- Production Ready
- High Performance
- Developer Friendly
- Rich Analytics
Quick Installation
# Install from PyPI (Recommended)
pip install LLMEvaluationFramework
# Or install from source for latest features
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
pip install -e .
Requirements: Python 3.8+ • No external dependencies for core functionality
Quick Start
Python API (Recommended)
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    TestDatasetGenerator,
)

# 1. Set up the registry and register your model
registry = ModelRegistry()
registry.register_model("gpt-3.5-turbo", {
    "provider": "openai",
    "api_cost_input": 0.0015,
    "api_cost_output": 0.002,
    "capabilities": ["reasoning", "creativity", "coding"]
})

# 2. Generate test cases
generator = TestDatasetGenerator()
test_cases = generator.generate_test_cases(
    use_case={"domain": "general", "required_capabilities": ["reasoning"]},
    count=10
)

# 3. Run the evaluation
engine = ModelInferenceEngine(registry)
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)

# 4. Analyze the results
print(f"Accuracy: {results['aggregate_metrics']['accuracy']:.1%}")
print(f"Total Cost: ${results['aggregate_metrics']['total_cost']:.4f}")
print(f"Total Time: {results['aggregate_metrics']['total_time']:.2f}s")
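The per-token rates registered in the example above can be turned into a per-call cost estimate. A minimal sketch, assuming (as is common, but not confirmed by this README) that `api_cost_input` and `api_cost_output` are USD per 1,000 tokens; check the framework's documentation for the exact unit:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  cost_input: float = 0.0015,
                  cost_output: float = 0.002) -> float:
    """Estimate one call's cost, assuming rates are USD per 1K tokens."""
    return (input_tokens / 1000) * cost_input + (output_tokens / 1000) * cost_output

# Example: a 500-token prompt with a 250-token completion at gpt-3.5-turbo rates
print(f"${estimate_cost(500, 250):.4f}")
```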
Command Line Interface
# Evaluate a model with specific capabilities
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10 --capability reasoning
# Generate a custom test dataset
llm-eval generate --capability coding --count 20 --output my_dataset.json
# Score predictions against references
llm-eval score --predictions "Hello world" "Good morning" \
--references "Hello world" "Good evening" \
--metric accuracy
# List available capabilities and models
llm-eval list
Core Architecture
graph TB
    CLI[CLI Interface<br/>llm-eval] --> Engine[Inference Engine<br/>ModelInferenceEngine]
    Engine --> Registry[Model Registry<br/>ModelRegistry]
    Engine --> Generator[Dataset Generator<br/>TestDatasetGenerator]
    Engine --> Scoring[Scoring Strategies<br/>AccuracyScoringStrategy]
    Registry --> Models[(Models<br/>gpt-3.5-turbo, gpt-4, etc.)]
    Engine --> Storage[Persistence Layer]
    Storage --> JSON[JSON Store]
    Storage --> SQLite[SQLite Store]
    Engine --> Utils[Utilities]
    Utils --> Logger[Advanced Logging]
    Utils --> ErrorHandler[Error Handling]
    Utils --> AutoSuggest[Auto Suggestions]
Core Components
| Component | Description | Key Features |
|---|---|---|
| Inference Engine | Execute and evaluate LLM inferences | Async processing, cost tracking, batch operations |
| Model Registry | Centralized model management | Multi-provider support, configuration management |
| Dataset Generator | Create synthetic test cases | Capability-based generation, domain-specific tests |
| Scoring Strategies | Multiple evaluation metrics | Accuracy, F1-score, custom metrics |
| Persistence Layer | Dual storage backends | JSON files, SQLite database with querying |
| Error Handling | Robust error management | Custom exceptions, retry mechanisms |
| Logging System | Advanced logging capabilities | File rotation, structured logging |
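The scoring row above follows the classic strategy pattern: any object exposing calculate_score() can be plugged into a scoring context. The names AccuracyScoringStrategy and ScoringContext appear elsewhere in this README; the bodies below are an illustrative sketch, not the framework's actual implementation:

```python
class AccuracyScoringStrategy:
    """Exact-match accuracy: fraction of predictions equal to their reference."""
    def calculate_score(self, predictions, references):
        if not references:
            return 0.0
        matches = sum(p == r for p, r in zip(predictions, references))
        return matches / len(references)

class ScoringContext:
    """Delegates scoring to any strategy object exposing calculate_score()."""
    def __init__(self, strategy):
        self.strategy = strategy

    def evaluate(self, predictions, references):
        return self.strategy.calculate_score(predictions, references)

context = ScoringContext(AccuracyScoringStrategy())
print(context.evaluate(["Hello world", "Good morning"],
                       ["Hello world", "Good evening"]))  # 0.5
```

Swapping in a different metric is then a matter of passing another strategy object to the context, with no changes to calling code.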
Feature Highlights
What You Can Do
- Research & Benchmarking
- Enterprise Integration
- Cost Management
Supported Capabilities
# Available evaluation capabilities
CAPABILITIES = [
    "reasoning",    # Logical reasoning and problem-solving
    "creativity",   # Creative writing and ideation
    "factual",      # Factual accuracy and knowledge
    "instruction",  # Instruction following
    "coding",       # Code generation and debugging
]
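These capability tags can drive model selection. A minimal pure-Python sketch of capability-based filtering (the registry dict and helper function below are hypothetical, for illustration only):

```python
MODELS = {
    "gpt-3.5-turbo": {"capabilities": ["reasoning", "creativity", "coding"]},
    "gpt-4": {"capabilities": ["reasoning", "creativity", "coding", "factual"]},
    "claude-3": {"capabilities": ["reasoning", "instruction"]},
}

def models_with(required, registry=MODELS):
    """Return names of models whose capability list covers all required tags."""
    return [name for name, cfg in registry.items()
            if set(required) <= set(cfg["capabilities"])]

print(models_with(["reasoning", "coding"]))  # ['gpt-3.5-turbo', 'gpt-4']
```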
Advanced Usage Examples
Batch Evaluation with Multiple Models
from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine
from llm_evaluation_framework.persistence import JSONStore

# Set up multiple models
registry = ModelRegistry()
models = {
    "gpt-3.5-turbo": {"provider": "openai", "cost_input": 0.0015},
    "gpt-4": {"provider": "openai", "cost_input": 0.03},
    "claude-3": {"provider": "anthropic", "cost_input": 0.015}
}
for name, config in models.items():
    registry.register_model(name, config)

# Run a comparative evaluation
engine = ModelInferenceEngine(registry)
results = {}
for model_name in models:
    print(f"Evaluating {model_name}...")
    result = engine.evaluate_model(model_name, test_cases)
    results[model_name] = result

    # Save the results for this model
    store = JSONStore(f"results_{model_name}.json")
    store.save_evaluation_result(result)

# Compare results
for model, result in results.items():
    accuracy = result['aggregate_metrics']['accuracy']
    cost = result['aggregate_metrics']['total_cost']
    print(f"{model}: {accuracy:.1%} accuracy, ${cost:.4f} cost")
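Once comparative results are in hand, a simple cost-aware ranking can be computed from the aggregate metrics. A sketch using the metrics shape shown above, with made-up numbers standing in for real evaluation output:

```python
results = {
    "gpt-3.5-turbo": {"aggregate_metrics": {"accuracy": 0.82, "total_cost": 0.0150}},
    "gpt-4":         {"aggregate_metrics": {"accuracy": 0.91, "total_cost": 0.3000}},
    "claude-3":      {"aggregate_metrics": {"accuracy": 0.88, "total_cost": 0.1500}},
}

def rank_by_value(results):
    """Sort models by accuracy per dollar (higher is better)."""
    def value(item):
        m = item[1]["aggregate_metrics"]
        return m["accuracy"] / m["total_cost"]
    return sorted(results.items(), key=value, reverse=True)

for model, r in rank_by_value(results):
    m = r["aggregate_metrics"]
    print(f"{model}: {m['accuracy']:.1%} accuracy, ${m['total_cost']:.4f}")
```

Accuracy-per-dollar is just one possible figure of merit; a latency term or a hard accuracy floor may suit other deployments better.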
Custom Scoring Strategy
from llm_evaluation_framework.evaluation.scoring_strategies import ScoringContext

class CustomCosineSimilarityStrategy:
    """Custom scoring using TF-IDF cosine similarity (requires scikit-learn)."""

    def calculate_score(self, predictions, references):
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(predictions + references)
        pred_vectors = vectors[:len(predictions)]
        ref_vectors = vectors[len(predictions):]
        similarities = cosine_similarity(pred_vectors, ref_vectors)
        return similarities.diagonal().mean()

# Use the custom strategy (predictions/references are lists of strings)
custom_strategy = CustomCosineSimilarityStrategy()
context = ScoringContext(custom_strategy)
score = context.evaluate(predictions, references)
print(f"Custom similarity score: {score:.3f}")
Async Evaluation Pipeline
import asyncio
from llm_evaluation_framework.engines.async_inference_engine import AsyncInferenceEngine

async def run_async_evaluation():
    """Run multiple evaluations concurrently."""
    async_engine = AsyncInferenceEngine(registry)

    # Define one evaluation task per capability
    capabilities = ["reasoning", "creativity", "coding"]
    tasks = [
        async_engine.evaluate_async(
            model_name="gpt-3.5-turbo",
            test_cases=test_cases,
            capability=capability
        )
        for capability in capabilities
    ]

    # Run all evaluations concurrently
    results = await asyncio.gather(*tasks)

    # Report per-capability accuracy
    for capability, result in zip(capabilities, results):
        accuracy = result['aggregate_metrics']['accuracy']
        print(f"{capability}: {accuracy:.1%}")

# Run the async evaluation
asyncio.run(run_async_evaluation())
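When many evaluations run concurrently, provider rate limits usually call for a concurrency cap. A minimal sketch using asyncio.Semaphore, with a stub coroutine standing in for the real evaluation call:

```python
import asyncio

async def evaluate_stub(capability: str) -> str:
    """Stand-in for a real evaluation call; sleeps briefly to simulate I/O."""
    await asyncio.sleep(0.01)
    return f"{capability}: done"

async def bounded_gather(capabilities, limit=2):
    """Run evaluations concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(cap):
        async with sem:
            return await evaluate_stub(cap)

    return await asyncio.gather(*(run(c) for c in capabilities))

results = asyncio.run(bounded_gather(["reasoning", "creativity", "coding"]))
print(results)  # ['reasoning: done', 'creativity: done', 'coding: done']
```

asyncio.gather preserves input order, so results line up with the capability list regardless of completion order.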
Documentation & Resources
| Section | Description | Link |
|---|---|---|
| Getting Started | Installation, quick start, and basic concepts | View Guide |
| Core Concepts | Understanding the framework architecture | Learn More |
| CLI Usage | Complete command-line interface documentation | CLI Guide |
| API Reference | Detailed API documentation with examples | API Docs |
| Examples | Practical examples and tutorials | View Examples |
| Developer Guide | Contributing guidelines and development setup | Dev Guide |
Testing & Quality
A high-quality codebase with comprehensive testing.
Run Tests Locally
# Run all tests
pytest
# Run with detailed coverage report
pytest --cov=llm_evaluation_framework --cov-report=html
# Run specific test categories
pytest tests/test_model_inference_engine_comprehensive.py # Core engine tests
pytest tests/test_cli_comprehensive.py # CLI tests
pytest tests/test_persistence_comprehensive.py # Storage tests
# View coverage report
open htmlcov/index.html
Test Categories
| Test Type | Count | Description |
|---|---|---|
| Unit Tests | 150+ | Individual component testing |
| Integration Tests | 40+ | Component interaction testing |
| Edge Case Tests | 20+ | Error conditions and boundaries |
| Performance Tests | 10+ | Speed and memory optimization |
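A sketch of the style of unit test in the suite, written against a hypothetical exact-match accuracy helper (plain asserts, so it runs with or without pytest; the framework's real test files and APIs may differ):

```python
def accuracy(predictions, references):
    """Exact-match accuracy; defined as 0.0 for empty input."""
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def test_accuracy_perfect_match():
    assert accuracy(["a", "b"], ["a", "b"]) == 1.0

def test_accuracy_partial_match():
    assert accuracy(["a", "x"], ["a", "b"]) == 0.5

def test_accuracy_empty_input():
    assert accuracy([], []) == 0.0

for t in (test_accuracy_perfect_match,
          test_accuracy_partial_match,
          test_accuracy_empty_input):
    t()
print("all tests passed")
```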
Contributing
Development Setup
# 1. Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install in development mode
pip install -e ".[dev]"

# 4. Run the tests to ensure everything works
pytest

# 5. Install pre-commit hooks (optional but recommended)
pre-commit install
Contribution Guidelines
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Write tests for your changes
4. Run the test suite (pytest)
5. Commit your changes (git commit -m 'Add amazing feature')
6. Push to the branch (git push origin feature/amazing-feature)
7. Open a Pull Request
What We're Looking For
- Bug fixes and improvements
- Documentation enhancements
- New features and capabilities
- Additional test cases
- UI/UX improvements for the CLI
- Performance optimizations
Requirements & Compatibility
Python Version Support
| Python Version | Status | Notes |
|---|---|---|
| Python 3.8 | Supported | Minimum required version |
| Python 3.9 | Supported | Fully tested |
| Python 3.10 | Supported | Recommended |
| Python 3.11 | Supported | Latest features |
| Python 3.12+ | Supported | Future-ready |
Dependencies
# Core dependencies (automatically installed)
REQUIRED = [
    # No external dependencies for core functionality!
    # The framework uses only the Python standard library.
]

# Optional development dependencies
DEVELOPMENT = [
    "pytest>=7.0.0",       # Testing framework
    "pytest-cov>=4.0.0",   # Coverage reporting
    "black>=22.0.0",       # Code formatting
    "flake8>=5.0.0",       # Code linting
    "mypy>=1.0.0",         # Type checking
    "pre-commit>=2.20.0",  # Git hooks
]
Platform Support
- Linux (Ubuntu, CentOS, RHEL)
- macOS (Intel & Apple Silicon)
- Windows (10, 11)
- Docker containers
- CI/CD environments (GitHub Actions, Jenkins, etc.)
License
This project is licensed under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.
Acknowledgments & Credits
Built with Love and Open Source
- Inspiration: Born from the need for standardized, reliable LLM evaluation tools
- Architecture: Built with modern Python best practices and enterprise standards
- Testing: Comprehensive test coverage ensuring production reliability
- Community: Driven by developers, researchers, and AI practitioners
- Documentation: Extensive documentation for developers at all levels
Technology Stack
| Technology | Purpose | Why We Chose It |
|---|---|---|
| Python 3.8+ | Core language | Wide adoption, excellent ecosystem |
| Type hints | Code safety | Better IDE support, fewer runtime errors |
| Pytest | Testing framework | Industry standard, excellent plugin ecosystem |
| SQLite | Database storage | Lightweight, serverless, reliable |
| MkDocs | Documentation | Beautiful docs, Markdown-based |
| Rich CLI | User interface | Modern, intuitive command-line experience |
Support & Community
Getting Support
| Type | Where to Go | Response Time |
|---|---|---|
| Bug Reports | GitHub Issues | 24-48 hours |
| Questions | GitHub Discussions | Community-driven |
| Documentation | Online Docs | Always available |
| Feature Requests | GitHub Issues | Weekly review |
Important Links
| Resource | Link | Description |
|---|---|---|
| PyPI Package | pypi.org/project/llm-evaluation-framework | Install via pip |
| Documentation | isathish.github.io/LLMEvaluationFramework | Complete documentation |
| Source Code | github.com/isathish/LLMEvaluationFramework | View source & contribute |
| Issue Tracker | github.com/.../issues | Report bugs & request features |
| Discussions | github.com/.../discussions | Community discussion |
Thank You for Using the LLM Evaluation Framework!
Made with ❤️ by Sathish Kumar N
If you find this project useful, please consider giving it a ⭐️
Ready to Get Started?
pip install LLMEvaluationFramework
Read the Documentation • View Examples • Join Discussions
Built for developers, researchers, and AI practitioners who demand reliable, production-ready LLM evaluation tools.
File details
Details for the file llmevaluationframework-0.0.21.tar.gz.
File metadata
- Download URL: llmevaluationframework-0.0.21.tar.gz
- Upload date:
- Size: 56.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50d5b511583c542e18ad7fd38f6cb3a4e3bf59843f6983e53a042d294f73d997 |
| MD5 | 4bb4ef559e3e11e365307728b7c11b3b |
| BLAKE2b-256 | 5c0083aa30d97bec6b9c6c46c4512daa1144be2df6062698019d1a29b5121b42 |
File details
Details for the file llmevaluationframework-0.0.21-py3-none-any.whl.
File metadata
- Download URL: llmevaluationframework-0.0.21-py3-none-any.whl
- Upload date:
- Size: 62.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 301023518b09a60d14e7e32c37453395db3ed7f8814fb01cc682466c29381558 |
| MD5 | 49399842de2f95bcb02bfde58cf5a186 |
| BLAKE2b-256 | c5425fe278972d1c3c80ab250ca05208dfff8b36cc37dca24129eb10439080cf |