
🚀 LLM Evaluation Framework


🌟 Enterprise-Grade Python Framework for Large Language Model Evaluation & Testing 🌟

Built with production-ready standards • Type-safe • Comprehensive testing • Full CLI support

📚 Documentation • ⚡ Quick Start • 💡 Examples • 🐛 Report Issues


🌟 What Makes This Special?

🎯 Production Ready

  • 212 comprehensive tests with 89% coverage
  • Complete type hints throughout codebase
  • Robust error handling with custom exceptions
  • Enterprise-grade logging and monitoring

⚡ High Performance

  • Async inference engine for concurrent evaluations
  • Batch processing capabilities
  • Cost optimization and tracking
  • Memory-efficient data handling
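
The batch-processing idea is framework-independent: split work into fixed-size chunks so only one batch is held in memory at a time. A minimal sketch (the `chunked` helper below is illustrative, not part of the framework's API):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches from any iterable."""
    it = iter(items)
    # islice consumes the shared iterator, so each call picks up
    # where the previous batch left off
    while batch := list(islice(it, size)):
        yield batch

# Process 10 test cases in batches of 4 -> batch sizes 4, 4, 2
batches = list(chunked(range(10), 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Because `chunked` is a generator, the full dataset is never materialized at once, which is the property the bullet above is describing.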

๏ฟฝ๏ธ Developer Friendly

  • Intuitive CLI interface for all operations
  • Comprehensive documentation with examples
  • Modular architecture for easy extension
  • Multiple storage backends (JSON, SQLite)

📊 Rich Analytics

  • Multiple scoring strategies (Accuracy, F1, Custom)
  • Detailed performance metrics
  • Cost analysis and optimization
  • Exportable evaluation reports

📦 Quick Installation

# Install from PyPI (Recommended)
pip install LLMEvaluationFramework

# Or install from source for latest features
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
pip install -e .

Requirements: Python 3.8+ • No external dependencies for core functionality


⚡ Quick Start

🐍 Python API (Recommended)

from llm_evaluation_framework import (
    ModelRegistry, 
    ModelInferenceEngine, 
    TestDatasetGenerator
)

# 1๏ธโƒฃ Setup the registry and register your model
registry = ModelRegistry()
registry.register_model("gpt-3.5-turbo", {
    "provider": "openai",
    "api_cost_input": 0.0015,
    "api_cost_output": 0.002,
    "capabilities": ["reasoning", "creativity", "coding"]
})

# 2๏ธโƒฃ Generate test cases
generator = TestDatasetGenerator()
test_cases = generator.generate_test_cases(
    use_case={"domain": "general", "required_capabilities": ["reasoning"]},
    count=10
)

# 3๏ธโƒฃ Run evaluation
engine = ModelInferenceEngine(registry)
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)

# 4๏ธโƒฃ Analyze results
print(f"โœ… Accuracy: {results['aggregate_metrics']['accuracy']:.1%}")
print(f"๐Ÿ’ฐ Total Cost: ${results['aggregate_metrics']['total_cost']:.4f}")
print(f"โฑ๏ธ  Total Time: {results['aggregate_metrics']['total_time']:.2f}s")

๐Ÿ–ฅ๏ธ Command Line Interface

# Evaluate a model with specific capabilities
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10 --capability reasoning

# Generate a custom test dataset
llm-eval generate --capability coding --count 20 --output my_dataset.json

# Score predictions against references
llm-eval score --predictions "Hello world" "Good morning" \
               --references "Hello world" "Good evening" \
               --metric accuracy

# List available capabilities and models
llm-eval list

๏ฟฝ๏ธ Core Architecture

graph TB
    CLI[๐Ÿ–ฅ๏ธ CLI Interface<br/>llm-eval] --> Engine[โš™๏ธ Inference Engine<br/>ModelInferenceEngine]
    
    Engine --> Registry[๐Ÿ—„๏ธ Model Registry<br/>ModelRegistry]
    Engine --> Generator[๐Ÿงช Dataset Generator<br/>TestDatasetGenerator]
    Engine --> Scoring[๐Ÿ“Š Scoring Strategies<br/>AccuracyScoringStrategy]
    
    Registry --> Models[(๐Ÿค– Models<br/>gpt-3.5-turbo, gpt-4, etc.)]
    
    Engine --> Storage[๐Ÿ’พ Persistence Layer]
    Storage --> JSON[๐Ÿ“„ JSON Store]
    Storage --> SQLite[๐Ÿ—ƒ๏ธ SQLite Store]
    
    Engine --> Utils[๐Ÿ› ๏ธ Utilities]
    Utils --> Logger[๐Ÿ“ Advanced Logging]
    Utils --> ErrorHandler[๐Ÿ›ก๏ธ Error Handling]
    Utils --> AutoSuggest[๐Ÿ’ก Auto Suggestions]

🎯 Core Components

Component Description Key Features
🔥 Inference Engine Execute and evaluate LLM inferences Async processing, cost tracking, batch operations
🗄️ Model Registry Centralized model management Multi-provider support, configuration management
🧪 Dataset Generator Create synthetic test cases Capability-based generation, domain-specific tests
📊 Scoring Strategies Multiple evaluation metrics Accuracy, F1-score, custom metrics
💾 Persistence Layer Dual storage backends JSON files, SQLite database with querying
🛡️ Error Handling Robust error management Custom exceptions, retry mechanisms
📝 Logging System Advanced logging capabilities File rotation, structured logging

🎯 Feature Highlights

🚀 What You Can Do

🔬 Research & Benchmarking

  • Compare multiple LLM providers
  • Standardized evaluation metrics
  • Reproducible experiments
  • Performance benchmarking

๐Ÿข Enterprise Integration

  • CI/CD pipeline integration
  • Automated regression testing
  • Cost optimization analysis
  • Quality assurance workflows

💰 Cost Management

  • Real-time cost tracking
  • Provider cost comparison
  • Budget optimization
  • ROI analysis
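
Cost tracking ultimately reduces to simple arithmetic over token counts. The sketch below assumes the `api_cost_input`/`api_cost_output` figures registered in the Quick Start are USD per 1K tokens; the helper is illustrative, not part of the framework's API:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   cost_in_per_1k: float, cost_out_per_1k: float) -> float:
    """Cost of a single call given per-1K-token rates."""
    return (input_tokens / 1000) * cost_in_per_1k \
         + (output_tokens / 1000) * cost_out_per_1k

# gpt-3.5-turbo rates from the Quick Start config (assumed USD per 1K tokens):
# 1200 input tokens * 0.0015 + 400 output tokens * 0.002
cost = inference_cost(1200, 400, 0.0015, 0.002)
print(f"${cost:.4f}")  # → $0.0026
```

Summing this quantity across calls is what "real-time cost tracking" amounts to, and swapping in another provider's rates gives the provider comparison.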

📊 Supported Capabilities

# Available evaluation capabilities
CAPABILITIES = [
    "reasoning",      # Logical reasoning and problem-solving
    "creativity",     # Creative writing and ideation
    "factual",        # Factual accuracy and knowledge
    "instruction",    # Instruction following
    "coding"          # Code generation and debugging
]
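
Capability tags like these can drive model selection. The following sketch filters a hypothetical registry dict by required capability; the `models_with` helper and the example configs are illustrative, not framework APIs:

```python
CAPABILITIES = ["reasoning", "creativity", "factual", "instruction", "coding"]

# Hypothetical model configs, shaped like the Quick Start registration dicts
models = {
    "gpt-3.5-turbo": {"capabilities": ["reasoning", "creativity", "coding"]},
    "claude-3": {"capabilities": ["reasoning", "factual", "instruction"]},
}

def models_with(capability: str, registry: dict) -> list:
    """Return model names whose config lists the given capability."""
    if capability not in CAPABILITIES:
        raise ValueError(f"unknown capability: {capability}")
    return [name for name, cfg in registry.items()
            if capability in cfg["capabilities"]]

print(models_with("coding", models))     # → ['gpt-3.5-turbo']
print(models_with("reasoning", models))  # → ['gpt-3.5-turbo', 'claude-3']
```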

🎮 Interactive Examples

🔍 Click to see Advanced Usage Examples

📈 Batch Evaluation with Multiple Models

from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine
from llm_evaluation_framework.persistence import JSONStore

# Setup multiple models
registry = ModelRegistry()
models = {
    "gpt-3.5-turbo": {"provider": "openai", "cost_input": 0.0015},
    "gpt-4": {"provider": "openai", "cost_input": 0.03},
    "claude-3": {"provider": "anthropic", "cost_input": 0.015}
}

for name, config in models.items():
    registry.register_model(name, config)

# Run comparative evaluation (reuses `test_cases` from the Quick Start example)
engine = ModelInferenceEngine(registry)
results = {}

for model_name in models.keys():
    print(f"🚀 Evaluating {model_name}...")
    result = engine.evaluate_model(model_name, test_cases)
    results[model_name] = result
    
    # Save results
    store = JSONStore(f"results_{model_name}.json")
    store.save_evaluation_result(result)

# Compare results
for model, result in results.items():
    accuracy = result['aggregate_metrics']['accuracy']
    cost = result['aggregate_metrics']['total_cost']
    print(f"📊 {model}: {accuracy:.1%} accuracy, ${cost:.4f} cost")

🎯 Custom Scoring Strategy

from llm_evaluation_framework.evaluation.scoring_strategies import ScoringContext

# Requires scikit-learn: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CustomCosineSimilarityStrategy:
    """Custom scoring using TF-IDF cosine similarity."""

    def calculate_score(self, predictions, references):
        # Vectorize predictions and references in a shared TF-IDF space
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(predictions + references)

        pred_vectors = vectors[:len(predictions)]
        ref_vectors = vectors[len(predictions):]

        # The diagonal pairs each prediction with its own reference
        similarities = cosine_similarity(pred_vectors, ref_vectors)
        return similarities.diagonal().mean()

# Use custom strategy on a small sample
predictions = ["The sky is blue", "Paris is the capital of France"]
references = ["The sky appears blue", "Paris is France's capital"]

custom_strategy = CustomCosineSimilarityStrategy()
context = ScoringContext(custom_strategy)
score = context.evaluate(predictions, references)
print(f"🎯 Custom similarity score: {score:.3f}")

🔄 Async Evaluation Pipeline

import asyncio
from llm_evaluation_framework.engines.async_inference_engine import AsyncInferenceEngine

# Assumes `registry` and `test_cases` from the earlier examples
async def run_async_evaluation():
    """Run multiple evaluations concurrently."""

    async_engine = AsyncInferenceEngine(registry)
    
    # Define multiple evaluation tasks
    tasks = []
    for capability in ["reasoning", "creativity", "coding"]:
        task = async_engine.evaluate_async(
            model_name="gpt-3.5-turbo",
            test_cases=test_cases,
            capability=capability
        )
        tasks.append(task)
    
    # Run all evaluations concurrently
    results = await asyncio.gather(*tasks)
    
    # Process results
    # Process results, pairing each result with its capability
    for capability, result in zip(["reasoning", "creativity", "coding"], results):
        accuracy = result['aggregate_metrics']['accuracy']
        print(f"✅ {capability}: {accuracy:.1%}")

# Run async evaluation
asyncio.run(run_async_evaluation())

📚 Documentation & Resources

📖 Comprehensive Documentation Available

Section Description Link
🚀 Getting Started Installation, quick start, and basic concepts View Guide
🧠 Core Concepts Understanding the framework architecture Learn More
🖥️ CLI Usage Complete command-line interface documentation CLI Guide
📊 API Reference Detailed API documentation with examples API Docs
💡 Examples Practical examples and tutorials View Examples
🛠️ Developer Guide Contributing guidelines and development setup Dev Guide

🧪 Testing & Quality

🏆 High-Quality Codebase with Comprehensive Testing

📈 Test Coverage
89%
Comprehensive test coverage

✅ Total Tests
212
All tests passing

🔧 Test Files
10+
Modular test structure

⚡ Test Types
4+
Unit, Integration, Edge Cases

🚀 Run Tests Locally

# Run all tests
pytest

# Run with detailed coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Run specific test categories
pytest tests/test_model_inference_engine_comprehensive.py  # Core engine tests
pytest tests/test_cli_comprehensive.py                     # CLI tests
pytest tests/test_persistence_comprehensive.py            # Storage tests

# View coverage report
open htmlcov/index.html

📊 Test Categories

Test Type Count Description
🔧 Unit Tests 150+ Individual component testing
🔗 Integration Tests 40+ Component interaction testing
🎯 Edge Case Tests 20+ Error conditions and boundaries
⚡ Performance Tests 10+ Speed and memory optimization

๐Ÿค Contributing

๐ŸŒŸ We Welcome Contributors!


๐Ÿ› ๏ธ Development Setup

# 1๏ธโƒฃ Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2๏ธโƒฃ Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3๏ธโƒฃ Install in development mode
pip install -e ".[dev]"

# 4๏ธโƒฃ Run tests to ensure everything works
pytest

# 5๏ธโƒฃ Install pre-commit hooks (optional but recommended)
pre-commit install

๐Ÿ“ Contribution Guidelines

  1. ๐Ÿด Fork the repository
  2. ๐ŸŒฟ Create a feature branch (git checkout -b feature/amazing-feature)
  3. โœ… Write tests for your changes
  4. ๐Ÿงช Run the test suite (pytest)
  5. ๐Ÿ“ Commit your changes (git commit -m 'Add amazing feature')
  6. ๐Ÿš€ Push to the branch (git push origin feature/amazing-feature)
  7. ๐Ÿ”€ Open a Pull Request

🎯 What We're Looking For

  • 🐛 Bug fixes and improvements
  • 📚 Documentation enhancements
  • ✨ New features and capabilities
  • 🧪 Additional test cases
  • 🎨 UI/UX improvements for CLI
  • 🔧 Performance optimizations

📋 Requirements & Compatibility

🐍 Python Version Support

Python Version Status Notes
Python 3.8 ✅ Supported Minimum required version
Python 3.9 ✅ Supported Fully tested
Python 3.10 ✅ Supported Recommended
Python 3.11 ✅ Supported Latest features
Python 3.12+ ✅ Supported Future-ready

📦 Dependencies

# Core dependencies (automatically installed)
REQUIRED = [
    # No external dependencies for core functionality!
    # Framework uses only Python standard library
]

# Optional development dependencies
DEVELOPMENT = [
    "pytest>=7.0.0",           # Testing framework
    "pytest-cov>=4.0.0",      # Coverage reporting
    "black>=22.0.0",           # Code formatting
    "flake8>=5.0.0",           # Code linting
    "mypy>=1.0.0",             # Type checking
    "pre-commit>=2.20.0",      # Git hooks
]

๐ŸŒ Platform Support

  • โœ… Linux (Ubuntu, CentOS, RHEL)
  • โœ… macOS (Intel & Apple Silicon)
  • โœ… Windows (10, 11)
  • โœ… Docker containers
  • โœ… CI/CD environments (GitHub Actions, Jenkins, etc.)

📄 License

This project is licensed under the MIT License.

You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.

📜 Read the full license


๐Ÿ™ Acknowledgments & Credits

๐ŸŒŸ Built with Love and Open Source

  • ๐Ÿš€ Inspiration: Born from the need for standardized, reliable LLM evaluation tools
  • ๐Ÿ—๏ธ Architecture: Built with modern Python best practices and enterprise standards
  • ๐Ÿงช Testing: Comprehensive test coverage ensuring production reliability
  • ๐Ÿ‘ฅ Community: Driven by developers, researchers, and AI practitioners
  • ๐Ÿ“š Documentation: Extensive documentation for developers at all levels

🔧 Technology Stack

Technology Purpose Why We Chose It
🐍 Python 3.8+ Core Language Wide adoption, excellent ecosystem
📋 Type Hints Code Safety Better IDE support, fewer runtime errors
🧪 Pytest Testing Framework Industry standard, excellent plugin ecosystem
📊 SQLite Database Storage Lightweight, serverless, reliable
📝 MkDocs Documentation Beautiful docs, Markdown-based
🎨 Rich CLI User Interface Modern, intuitive command-line experience

📞 Support & Community

💬 Get Help & Connect

🆘 Getting Support

Type Where to Go Response Time
🐛 Bug Reports GitHub Issues 24-48 hours
❓ Questions GitHub Discussions Community-driven
📚 Documentation Online Docs Always available
💡 Feature Requests GitHub Issues Weekly review



🔗 Important Links

🌐 Quick Access

Resource Link Description
📦 PyPI Package pypi.org/project/llm-evaluation-framework Install via pip
📚 Documentation isathish.github.io/LLMEvaluationFramework Complete documentation
💻 Source Code github.com/isathish/LLMEvaluationFramework View source & contribute
🐛 Issue Tracker github.com/.../issues Report bugs & request features
💬 Discussions github.com/.../discussions Community discussion

🎉 Thank You for Using LLM Evaluation Framework!

Made with ❤️ by Sathish Kumar N

If you find this project useful, please consider giving it a ⭐️

🚀 Ready to Get Started?

pip install LLMEvaluationFramework

📚 Read the Documentation • 🚀 View Examples • 💬 Join Discussions


Built for developers, researchers, and AI practitioners who demand reliable, production-ready LLM evaluation tools.
