
🚀 LLM Evaluation Framework


🌟 Enterprise-Grade Python Framework for Large Language Model Evaluation & Testing 🌟

Built with production-ready standards • Type-safe • Comprehensive testing • Full CLI support

📚 Documentation • ⚡ Quick Start • 💡 Examples • 🐛 Report Issues


🌟 What Makes This Special?

🎯 Production Ready

  • 212 comprehensive tests with 89% coverage
  • Complete type hints throughout codebase
  • Robust error handling with custom exceptions
  • Enterprise-grade logging and monitoring

⚡ High Performance

  • Async inference engine for concurrent evaluations
  • Batch processing capabilities
  • Cost optimization and tracking
  • Memory-efficient data handling
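
The batch-processing idea is framework-independent: split work into fixed-size chunks so only one batch is held in memory at a time. A minimal sketch (the `chunked` helper below is illustrative, not part of the framework's API):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches from any iterable."""
    it = iter(items)
    # islice consumes the shared iterator, so each call picks up
    # where the previous batch left off
    while batch := list(islice(it, size)):
        yield batch

# Process 10 test cases in batches of 4 -> batch sizes 4, 4, 2
batches = list(chunked(range(10), 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Because `chunked` is a generator, the full dataset is never materialized at once, which is the property the bullet above is describing.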

๏ฟฝ๏ธ Developer Friendly

  • Intuitive CLI interface for all operations
  • Comprehensive documentation with examples
  • Modular architecture for easy extension
  • Multiple storage backends (JSON, SQLite)

📊 Rich Analytics

  • Multiple scoring strategies (Accuracy, F1, Custom)
  • Detailed performance metrics
  • Cost analysis and optimization
  • Exportable evaluation reports

📦 Quick Installation

# Install from PyPI (Recommended)
pip install LLMEvaluationFramework

# Or install from source for latest features
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
pip install -e .

Requirements: Python 3.8+ • No external dependencies for core functionality


⚡ Quick Start

🐍 Python API (Recommended)

from llm_evaluation_framework import (
    ModelRegistry, 
    ModelInferenceEngine, 
    TestDatasetGenerator
)

# 1๏ธโƒฃ Setup the registry and register your model
registry = ModelRegistry()
registry.register_model("gpt-3.5-turbo", {
    "provider": "openai",
    "api_cost_input": 0.0015,
    "api_cost_output": 0.002,
    "capabilities": ["reasoning", "creativity", "coding"]
})

# 2๏ธโƒฃ Generate test cases
generator = TestDatasetGenerator()
test_cases = generator.generate_test_cases(
    use_case={"domain": "general", "required_capabilities": ["reasoning"]},
    count=10
)

# 3๏ธโƒฃ Run evaluation
engine = ModelInferenceEngine(registry)
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)

# 4๏ธโƒฃ Analyze results
print(f"โœ… Accuracy: {results['aggregate_metrics']['accuracy']:.1%}")
print(f"๐Ÿ’ฐ Total Cost: ${results['aggregate_metrics']['total_cost']:.4f}")
print(f"โฑ๏ธ  Total Time: {results['aggregate_metrics']['total_time']:.2f}s")

๐Ÿ–ฅ๏ธ Command Line Interface

# Evaluate a model with specific capabilities
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10 --capability reasoning

# Generate a custom test dataset
llm-eval generate --capability coding --count 20 --output my_dataset.json

# Score predictions against references
llm-eval score --predictions "Hello world" "Good morning" \
               --references "Hello world" "Good evening" \
               --metric accuracy

# List available capabilities and models
llm-eval list

๏ฟฝ๏ธ Core Architecture

graph TB
    CLI[๐Ÿ–ฅ๏ธ CLI Interface<br/>llm-eval] --> Engine[โš™๏ธ Inference Engine<br/>ModelInferenceEngine]
    
    Engine --> Registry[๐Ÿ—„๏ธ Model Registry<br/>ModelRegistry]
    Engine --> Generator[๐Ÿงช Dataset Generator<br/>TestDatasetGenerator]
    Engine --> Scoring[๐Ÿ“Š Scoring Strategies<br/>AccuracyScoringStrategy]
    
    Registry --> Models[(๐Ÿค– Models<br/>gpt-3.5-turbo, gpt-4, etc.)]
    
    Engine --> Storage[๐Ÿ’พ Persistence Layer]
    Storage --> JSON[๐Ÿ“„ JSON Store]
    Storage --> SQLite[๐Ÿ—ƒ๏ธ SQLite Store]
    
    Engine --> Utils[๐Ÿ› ๏ธ Utilities]
    Utils --> Logger[๐Ÿ“ Advanced Logging]
    Utils --> ErrorHandler[๐Ÿ›ก๏ธ Error Handling]
    Utils --> AutoSuggest[๐Ÿ’ก Auto Suggestions]

🎯 Core Components

Component Description Key Features
🔥 Inference Engine Execute and evaluate LLM inferences Async processing, cost tracking, batch operations
🗄️ Model Registry Centralized model management Multi-provider support, configuration management
🧪 Dataset Generator Create synthetic test cases Capability-based generation, domain-specific tests
📊 Scoring Strategies Multiple evaluation metrics Accuracy, F1-score, custom metrics
💾 Persistence Layer Dual storage backends JSON files, SQLite database with querying
🛡️ Error Handling Robust error management Custom exceptions, retry mechanisms
📝 Logging System Advanced logging capabilities File rotation, structured logging

🎯 Feature Highlights

🚀 What You Can Do

🔬 Research & Benchmarking

  • Compare multiple LLM providers
  • Standardized evaluation metrics
  • Reproducible experiments
  • Performance benchmarking

๐Ÿข Enterprise Integration

  • CI/CD pipeline integration
  • Automated regression testing
  • Cost optimization analysis
  • Quality assurance workflows

💰 Cost Management

  • Real-time cost tracking
  • Provider cost comparison
  • Budget optimization
  • ROI analysis
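
Cost tracking ultimately reduces to simple arithmetic over token counts. The sketch below assumes the `api_cost_input`/`api_cost_output` figures registered in the Quick Start are USD per 1K tokens; the helper is illustrative, not part of the framework's API:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   cost_in_per_1k: float, cost_out_per_1k: float) -> float:
    """Cost of a single call given per-1K-token rates."""
    return (input_tokens / 1000) * cost_in_per_1k \
         + (output_tokens / 1000) * cost_out_per_1k

# gpt-3.5-turbo rates from the Quick Start config (assumed USD per 1K tokens):
# 1200 input tokens * 0.0015 + 400 output tokens * 0.002
cost = inference_cost(1200, 400, 0.0015, 0.002)
print(f"${cost:.4f}")  # → $0.0026
```

Summing this quantity across calls is what "real-time cost tracking" amounts to, and swapping in another provider's rates gives the provider comparison.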

📊 Supported Capabilities

# Available evaluation capabilities
CAPABILITIES = [
    "reasoning",      # Logical reasoning and problem-solving
    "creativity",     # Creative writing and ideation
    "factual",        # Factual accuracy and knowledge
    "instruction",    # Instruction following
    "coding"          # Code generation and debugging
]
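
Capability tags like these can drive model selection. The following sketch filters a hypothetical registry dict by required capability; the `models_with` helper and the example configs are illustrative, not framework APIs:

```python
CAPABILITIES = ["reasoning", "creativity", "factual", "instruction", "coding"]

# Hypothetical model configs, shaped like the Quick Start registration dicts
models = {
    "gpt-3.5-turbo": {"capabilities": ["reasoning", "creativity", "coding"]},
    "claude-3": {"capabilities": ["reasoning", "factual", "instruction"]},
}

def models_with(capability: str, registry: dict) -> list:
    """Return model names whose config lists the given capability."""
    if capability not in CAPABILITIES:
        raise ValueError(f"unknown capability: {capability}")
    return [name for name, cfg in registry.items()
            if capability in cfg["capabilities"]]

print(models_with("coding", models))     # → ['gpt-3.5-turbo']
print(models_with("reasoning", models))  # → ['gpt-3.5-turbo', 'claude-3']
```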

🎮 Interactive Examples

🔍 Click to see Advanced Usage Examples

📈 Batch Evaluation with Multiple Models

from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine
from llm_evaluation_framework.persistence import JSONStore

# Setup multiple models
registry = ModelRegistry()
models = {
    "gpt-3.5-turbo": {"provider": "openai", "cost_input": 0.0015},
    "gpt-4": {"provider": "openai", "cost_input": 0.03},
    "claude-3": {"provider": "anthropic", "cost_input": 0.015}
}

for name, config in models.items():
    registry.register_model(name, config)

# Run comparative evaluation (reuses `test_cases` from the Quick Start example)
engine = ModelInferenceEngine(registry)
results = {}

for model_name in models.keys():
    print(f"🚀 Evaluating {model_name}...")
    result = engine.evaluate_model(model_name, test_cases)
    results[model_name] = result
    
    # Save results
    store = JSONStore(f"results_{model_name}.json")
    store.save_evaluation_result(result)

# Compare results
for model, result in results.items():
    accuracy = result['aggregate_metrics']['accuracy']
    cost = result['aggregate_metrics']['total_cost']
    print(f"📊 {model}: {accuracy:.1%} accuracy, ${cost:.4f} cost")

🎯 Custom Scoring Strategy

from llm_evaluation_framework.evaluation.scoring_strategies import ScoringContext

# Requires scikit-learn: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CustomCosineSimilarityStrategy:
    """Custom scoring using TF-IDF cosine similarity."""

    def calculate_score(self, predictions, references):
        # Vectorize predictions and references in a shared TF-IDF space
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(predictions + references)

        pred_vectors = vectors[:len(predictions)]
        ref_vectors = vectors[len(predictions):]

        # The diagonal pairs each prediction with its own reference
        similarities = cosine_similarity(pred_vectors, ref_vectors)
        return similarities.diagonal().mean()

# Use custom strategy on a small sample
predictions = ["The sky is blue", "Paris is the capital of France"]
references = ["The sky appears blue", "Paris is France's capital"]

custom_strategy = CustomCosineSimilarityStrategy()
context = ScoringContext(custom_strategy)
score = context.evaluate(predictions, references)
print(f"🎯 Custom similarity score: {score:.3f}")

🔄 Async Evaluation Pipeline

import asyncio
from llm_evaluation_framework.engines.async_inference_engine import AsyncInferenceEngine

# Assumes `registry` and `test_cases` from the earlier examples
async def run_async_evaluation():
    """Run multiple evaluations concurrently."""

    async_engine = AsyncInferenceEngine(registry)
    
    # Define multiple evaluation tasks
    tasks = []
    for capability in ["reasoning", "creativity", "coding"]:
        task = async_engine.evaluate_async(
            model_name="gpt-3.5-turbo",
            test_cases=test_cases,
            capability=capability
        )
        tasks.append(task)
    
    # Run all evaluations concurrently
    results = await asyncio.gather(*tasks)
    
    # Process results
    # Process results, pairing each result with its capability
    for capability, result in zip(["reasoning", "creativity", "coding"], results):
        accuracy = result['aggregate_metrics']['accuracy']
        print(f"✅ {capability}: {accuracy:.1%}")

# Run async evaluation
asyncio.run(run_async_evaluation())

📚 Documentation & Resources

📖 Comprehensive Documentation Available

Section Description Link
🚀 Getting Started Installation, quick start, and basic concepts View Guide
🧠 Core Concepts Understanding the framework architecture Learn More
🖥️ CLI Usage Complete command-line interface documentation CLI Guide
📊 API Reference Detailed API documentation with examples API Docs
💡 Examples Practical examples and tutorials View Examples
🛠️ Developer Guide Contributing guidelines and development setup Dev Guide

🧪 Testing & Quality

🏆 High-Quality Codebase with Comprehensive Testing

📈 Test Coverage
89%
Comprehensive test coverage

✅ Total Tests
212
All tests passing

🔧 Test Files
10+
Modular test structure

⚡ Test Types
4+
Unit, Integration, Edge Cases

🚀 Run Tests Locally

# Run all tests
pytest

# Run with detailed coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Run specific test categories
pytest tests/test_model_inference_engine_comprehensive.py  # Core engine tests
pytest tests/test_cli_comprehensive.py                     # CLI tests
pytest tests/test_persistence_comprehensive.py            # Storage tests

# View coverage report
open htmlcov/index.html

📊 Test Categories

Test Type Count Description
🔧 Unit Tests 150+ Individual component testing
🔗 Integration Tests 40+ Component interaction testing
🎯 Edge Case Tests 20+ Error conditions and boundaries
⚡ Performance Tests 10+ Speed and memory optimization

๐Ÿค Contributing

๐ŸŒŸ We Welcome Contributors!


๐Ÿ› ๏ธ Development Setup

# 1๏ธโƒฃ Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2๏ธโƒฃ Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3๏ธโƒฃ Install in development mode
pip install -e ".[dev]"

# 4๏ธโƒฃ Run tests to ensure everything works
pytest

# 5๏ธโƒฃ Install pre-commit hooks (optional but recommended)
pre-commit install

๐Ÿ“ Contribution Guidelines

  1. ๐Ÿด Fork the repository
  2. ๐ŸŒฟ Create a feature branch (git checkout -b feature/amazing-feature)
  3. โœ… Write tests for your changes
  4. ๐Ÿงช Run the test suite (pytest)
  5. ๐Ÿ“ Commit your changes (git commit -m 'Add amazing feature')
  6. ๐Ÿš€ Push to the branch (git push origin feature/amazing-feature)
  7. ๐Ÿ”€ Open a Pull Request

🎯 What We're Looking For

  • 🐛 Bug fixes and improvements
  • 📚 Documentation enhancements
  • ✨ New features and capabilities
  • 🧪 Additional test cases
  • 🎨 UI/UX improvements for CLI
  • 🔧 Performance optimizations

📋 Requirements & Compatibility

🐍 Python Version Support

Python Version Status Notes
Python 3.8 ✅ Supported Minimum required version
Python 3.9 ✅ Supported Fully tested
Python 3.10 ✅ Supported Recommended
Python 3.11 ✅ Supported Latest features
Python 3.12+ ✅ Supported Future-ready

📦 Dependencies

# Core dependencies (automatically installed)
REQUIRED = [
    # No external dependencies for core functionality!
    # Framework uses only Python standard library
]

# Optional development dependencies
DEVELOPMENT = [
    "pytest>=7.0.0",           # Testing framework
    "pytest-cov>=4.0.0",      # Coverage reporting
    "black>=22.0.0",           # Code formatting
    "flake8>=5.0.0",           # Code linting
    "mypy>=1.0.0",             # Type checking
    "pre-commit>=2.20.0",      # Git hooks
]

๐ŸŒ Platform Support

  • โœ… Linux (Ubuntu, CentOS, RHEL)
  • โœ… macOS (Intel & Apple Silicon)
  • โœ… Windows (10, 11)
  • โœ… Docker containers
  • โœ… CI/CD environments (GitHub Actions, Jenkins, etc.)

📄 License

This project is licensed under the MIT License.

You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.

📜 Read the full license


๐Ÿ™ Acknowledgments & Credits

๐ŸŒŸ Built with Love and Open Source

  • ๐Ÿš€ Inspiration: Born from the need for standardized, reliable LLM evaluation tools
  • ๐Ÿ—๏ธ Architecture: Built with modern Python best practices and enterprise standards
  • ๐Ÿงช Testing: Comprehensive test coverage ensuring production reliability
  • ๐Ÿ‘ฅ Community: Driven by developers, researchers, and AI practitioners
  • ๐Ÿ“š Documentation: Extensive documentation for developers at all levels

🔧 Technology Stack

Technology Purpose Why We Chose It
🐍 Python 3.8+ Core Language Wide adoption, excellent ecosystem
📋 Type Hints Code Safety Better IDE support, fewer runtime errors
🧪 Pytest Testing Framework Industry standard, excellent plugin ecosystem
📊 SQLite Database Storage Lightweight, serverless, reliable
📝 MkDocs Documentation Beautiful docs, Markdown-based
🎨 Rich CLI User Interface Modern, intuitive command-line experience

📞 Support & Community

💬 Get Help & Connect

🆘 Getting Support

Type Where to Go Response Time
🐛 Bug Reports GitHub Issues 24-48 hours
❓ Questions GitHub Discussions Community-driven
📚 Documentation Online Docs Always available
💡 Feature Requests GitHub Issues Weekly review



🔗 Important Links

🌐 Quick Access

Resource Link Description
📦 PyPI Package pypi.org/project/llm-evaluation-framework Install via pip
📚 Documentation isathish.github.io/LLMEvaluationFramework Complete documentation
💻 Source Code github.com/isathish/LLMEvaluationFramework View source & contribute
🐛 Issue Tracker github.com/.../issues Report bugs & request features
💬 Discussions github.com/.../discussions Community discussion

🎉 Thank You for Using LLM Evaluation Framework!

Made with ❤️ by Sathish Kumar N

If you find this project useful, please consider giving it a ⭐️

🚀 Ready to Get Started?

pip install LLMEvaluationFramework

📚 Read the Documentation • 🚀 View Examples • 💬 Join Discussions


Built for developers, researchers, and AI practitioners who demand reliable, production-ready LLM evaluation tools.
