Fully automated LLM evaluator

AutoEvaluator: LLM-Based Evaluation Framework

AutoEvaluator is a Python library that automates quality control of LLM output. It uses an LLM as the evaluator and exposes a small, transparent API that identifies True Positives (TP), False Positives (FP), and False Negatives (FN) in generated content relative to a ground truth.

🚀 Features

  • Automated Evaluation: Compare LLM outputs against ground truth with precision
  • Multi-Provider Support: Works with AWS Bedrock, OpenAI, Anthropic, and Google Gemini
  • Comprehensive Metrics: Automatically calculates Precision, Recall, and F1 Score
  • Async-First Design: Built for high-performance concurrent evaluations
  • Structured Outputs: Leverages Instructor for type-safe, validated responses
  • Sentence-Level Granularity: Evaluates claims at the sentence level for detailed insights

🔧 Installation

Requirements

  • Python 3.9 or higher
  • An API key for at least one supported LLM provider

Install via pip

pip install autoevaluator

Install from source

git clone https://github.com/yourusername/autoevaluator.git
cd autoevaluator
pip install -e .

⚡ Quick Start

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def main():
    # Setup client for your preferred provider
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")
    
    # Define the claim to evaluate
    claim = "Feynman was born in 1918 in Malaysia"
    
    # Define the ground truth
    ground_truth = "Feynman was born in 1918 in America."
    
    # Evaluate the claim
    result = await evaluate(
        claim=claim,
        ground_truth=ground_truth,
        client=client,
        model_name="gpt-4o-mini"
    )
    
    print(result)

# Run the async function
asyncio.run(main())

Output:

{
    'TP': ['Feynman was born in 1918.'],
    'FP': ['Feynman was born in Malaysia.'],
    'FN': ['Feynman was born in America.'],
    'precision': 0.5,
    'recall': 0.5,
    'f1_score': 0.5
}
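
The three scores in this output follow from the sizes of the TP, FP, and FN lists. The sketch below applies the standard precision, recall, and F1 definitions to those lists, assuming one count per sentence. The helper name compute_metrics is hypothetical and is not part of the AutoEvaluator API; how the library handles empty lists internally is not documented here, so the zero fallback is also an assumption.

```python
from typing import Dict, List


def compute_metrics(tp: List[str], fp: List[str], fn: List[str]) -> Dict[str, float]:
    """Standard precision/recall/F1 from sentence lists (hypothetical helper)."""
    n_tp, n_fp, n_fn = len(tp), len(fp), len(fn)
    # Fall back to 0.0 when a denominator would be zero (an assumption).
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


metrics = compute_metrics(
    tp=["Feynman was born in 1918."],
    fp=["Feynman was born in Malaysia."],
    fn=["Feynman was born in America."],
)
print(metrics)  # {'precision': 0.5, 'recall': 0.5, 'f1_score': 0.5}
```

With one sentence in each list, precision is 1/(1+1) and recall is 1/(1+1), which gives the 0.5 values shown in the Quick Start output.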

🔌 Supported Providers

AutoEvaluator supports multiple LLM providers out of the box:

Provider        Models                      Environment Variables
AWS Bedrock     Claude Sonnet 4.5           AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
OpenAI          GPT-4o, GPT-4o-mini, etc.   OPENAI_API_KEY
Anthropic       Claude Sonnet 4, etc.       ANTHROPIC_API_KEY
Google Gemini   Gemini 2.0 Flash, etc.      GOOGLE_API_KEY

⚙️ Configuration

Environment Variables

Create a .env file in your project root:

# OpenAI
OPENAI_API_KEY=your_openai_api_key

# AWS Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=ap-southeast-1

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_api_key

# Google Gemini
GOOGLE_API_KEY=your_google_api_key

Python Configuration

import os

# Set environment variables programmatically
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_key"
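
Before creating a client, it can help to check that the variables for the chosen provider are actually set. The following is a minimal sketch using only the standard library. The variable names come from the Supported Providers table; the helper missing_env_vars and the REQUIRED_ENV_VARS mapping are hypothetical and not part of AutoEvaluator, and the library itself may accept credentials in other ways (for example through the api_key argument).

```python
import os
from typing import Dict, List, Mapping

# Variable names taken from the Supported Providers table.
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GOOGLE_API_KEY"],
}


def missing_env_vars(provider: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variables that are unset or empty for a provider (hypothetical helper)."""
    return [name for name in REQUIRED_ENV_VARS[provider] if not env.get(name)]


# A plain dict stands in for os.environ so the example is reproducible.
example_env = {"OPENAI_API_KEY": "your_openai_api_key", "AWS_REGION": "ap-southeast-1"}
print(missing_env_vars("openai", example_env))   # []
print(missing_env_vars("bedrock", example_env))  # ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```

Calling missing_env_vars("openai") without the second argument checks the real process environment.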

💡 Usage Examples

Example 1: Using OpenAI

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_openai():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")
    
    claim = "The Earth is flat and the moon landing was in 1969."
    ground_truth = "The Earth is round. The moon landing was in 1969."
    
    result = await evaluate(claim, ground_truth, client=client, model_name="gpt-4o-mini")
    
    print(f"True Positives: {result['TP']}")
    print(f"False Positives: {result['FP']}")
    print(f"False Negatives: {result['FN']}")
    print(f"Precision: {result['precision']:.2f}")
    print(f"Recall: {result['recall']:.2f}")
    print(f"F1 Score: {result['f1_score']:.2f}")

asyncio.run(evaluate_with_openai())

Example 2: Using AWS Bedrock

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_bedrock():
    client = get_instructor_client(provider="bedrock")
    
    claim = "Python was created by Guido van Rossum in 1991."
    ground_truth = "Python was created by Guido van Rossum in 1991."
    
    result = await evaluate(claim, ground_truth, client=client, model_name="bedrock-claude")
    return result

result = asyncio.run(evaluate_with_bedrock())
print(f"Perfect match! F1 Score: {result['f1_score']}")

Example 3: Using Anthropic

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_anthropic():
    client = get_instructor_client(
        provider="anthropic",
        model="claude-sonnet-4-20250514"
    )
    
    claim = "Water boils at 100°C at sea level."
    ground_truth = "Water boils at 100°C at sea level."
    
    result = await evaluate(claim, ground_truth, client=client, model_name="claude-sonnet-4-20250514")
    return result

result = asyncio.run(evaluate_with_anthropic())

Example 4: Batch Evaluation

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def batch_evaluate():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")
    
    test_cases = [
        {
            "claim": "Einstein developed the theory of relativity.",
            "ground_truth": "Einstein developed the theory of relativity."
        },
        {
            "claim": "The capital of France is London.",
            "ground_truth": "The capital of France is Paris."
        },
        {
            "claim": "Water is composed of hydrogen and oxygen.",
            "ground_truth": "Water is composed of hydrogen and oxygen."
        }
    ]
    
    tasks = [
        evaluate(tc["claim"], tc["ground_truth"], client=client, model_name="gpt-4o-mini")
        for tc in test_cases
    ]
    
    results = await asyncio.gather(*tasks)
    
    for i, result in enumerate(results, 1):
        print(f"\n--- Test Case {i} ---")
        print(f"F1 Score: {result['f1_score']:.2f}")
        print(f"Precision: {result['precision']:.2f}")
        print(f"Recall: {result['recall']:.2f}")

asyncio.run(batch_evaluate())
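
For larger batches, a summary is often easier to read than per-case output. The sketch below macro-averages the per-case scores and lists the cases that fall below a threshold. It relies only on the precision, recall, and f1_score keys documented for the result dictionary; the summarize helper is hypothetical, and the sample values are illustrative rather than real evaluator output.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]], threshold: float = 1.0) -> Dict[str, Any]:
    """Macro-average per-case scores and list cases below a threshold (hypothetical helper)."""
    return {
        "mean_precision": mean(r["precision"] for r in results),
        "mean_recall": mean(r["recall"] for r in results),
        "mean_f1": mean(r["f1_score"] for r in results),
        # 1-based indices, matching the "Test Case i" labels printed by the batch example.
        "below_threshold": [i for i, r in enumerate(results, 1) if r["f1_score"] < threshold],
    }


# Illustrative values only; real scores depend on the evaluator model.
sample_results = [
    {"precision": 1.0, "recall": 1.0, "f1_score": 1.0},
    {"precision": 0.0, "recall": 0.0, "f1_score": 0.0},
    {"precision": 1.0, "recall": 1.0, "f1_score": 1.0},
]
summary = summarize(sample_results)
print(summary["below_threshold"])  # [2]
```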

📚 API Reference

evaluate()

Evaluates a claim against ground truth and returns detailed metrics.

async def evaluate(
    claim: str,
    ground_truth: str,
    client: instructor.AsyncInstructor,
    model_name: str = "gpt-4o-mini"
) -> Dict[str, Any]

Parameters:

  • claim (str): The text to be evaluated
  • ground_truth (str): The reference text to compare against
  • client (instructor.AsyncInstructor): Instructor-wrapped async client
  • model_name (str): Model identifier to use

Returns:

Dictionary containing:

  • TP (List[str]): List of true positive sentences
  • FP (List[str]): List of false positive sentences
  • FN (List[str]): List of false negative sentences
  • precision (float): Precision score (0.0 to 1.0)
  • recall (float): Recall score (0.0 to 1.0)
  • f1_score (float): F1 score (0.0 to 1.0)

get_instructor_client()

Creates an Instructor-wrapped client for the specified LLM provider.

def get_instructor_client(
    provider: Literal["bedrock", "openai", "anthropic", "gemini"] = "bedrock",
    model: Optional[str] = None,
    api_key: Optional[str] = None,
    mode: instructor.Mode = instructor.Mode.JSON,
    **kwargs
) -> instructor.AsyncInstructor

Parameters:

  • provider (str): LLM provider to use ("bedrock", "openai", "anthropic", "gemini")
  • model (Optional[str]): Model name (uses provider default if None)
  • api_key (Optional[str]): API key (falls back to environment variables)
  • mode (instructor.Mode): Instructor parsing mode
  • **kwargs: Additional provider-specific arguments

Returns:

An Instructor-wrapped async client ready for use.

text_simplifier()

Breaks down complex text into simple, single-clause sentences.

async def text_simplifier(
    text: str,
    model_name: str,
    client: instructor.AsyncInstructor
) -> TextSimplify

🔍 How It Works

AutoEvaluator evaluates claims in five steps:

  1. Text Simplification: Complex sentences are broken down into simple, atomic claims
  2. Question Generation: Each simplified sentence is converted into a fact-checking question
  3. Bidirectional Verification: Questions are checked against both the claim and ground truth
  4. Classification: Sentences are classified as TP, FP, or FN based on verification results
  5. Metrics Calculation: Precision, Recall, and F1 scores are computed from the classifications

Architecture

Input Claim & Ground Truth
         ↓
   Text Simplifier (breaks into atomic sentences)
         ↓
   Question Generator (creates fact-check questions)
         ↓
   Question Checker (verifies against the claim and the ground truth)
         ↓
   Classification (TP/FP/FN assignment)
         ↓
   Metrics Calculation (Precision, Recall, F1)
         ↓
   Structured Output
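
The classification step can be illustrated with a toy sketch that needs no LLM. It is not the library's implementation: splitting on full stops stands in for the text simplifier, exact string matching stands in for question generation and LLM verification, and the function names are hypothetical. The TP, FP, and FN rules are inferred from the Quick Start output: claim sentences are checked against the ground truth, and ground-truth sentences are checked against the claim.

```python
from typing import Dict, List


def simplify(text: str) -> List[str]:
    """Stand-in for the text simplifier: split on full stops."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def classify(claim: str, ground_truth: str) -> Dict[str, List[str]]:
    """Toy classification: exact sentence matching replaces LLM verification."""
    claim_sents = simplify(claim)
    truth_sents = simplify(ground_truth)
    return {
        # Claim sentences supported by the ground truth.
        "TP": [s for s in claim_sents if s in truth_sents],
        # Claim sentences not supported by the ground truth.
        "FP": [s for s in claim_sents if s not in truth_sents],
        # Ground-truth sentences missing from the claim.
        "FN": [s for s in truth_sents if s not in claim_sents],
    }


result = classify(
    claim="Feynman was born in 1918. Feynman was born in Malaysia.",
    ground_truth="Feynman was born in 1918. Feynman was born in America.",
)
print(result["TP"])  # ['Feynman was born in 1918.']
print(result["FP"])  # ['Feynman was born in Malaysia.']
print(result["FN"])  # ['Feynman was born in America.']
```

Exact matching breaks as soon as two sentences phrase the same fact differently, which is the gap the question-based LLM verification is meant to close.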

🎯 Advanced Usage

Custom Text Simplification

import asyncio
from autoevaluator import text_simplifier, get_instructor_client

async def simplify_text():
    client = get_instructor_client(provider="openai")
    
    complex_text = """Although the weather was bad and it was raining heavily, 
                      we decided to go hiking because we had planned it for weeks."""
    
    result = await text_simplifier(
        text=complex_text,
        model_name="gpt-4o-mini",
        client=client
    )
    
    print("Simplified sentences:")
    for sentence in result.simplified_sentences:
        print(f"- {sentence}")

asyncio.run(simplify_text())

Using Provider-Specific Convenience Functions

from autoevaluator.client import (
    get_openai_instructor_client,
    get_bedrock_instructor_client,
    get_anthropic_instructor_client,
    get_gemini_instructor_client
)

# OpenAI
openai_client = get_openai_instructor_client(model="gpt-4o")

# Bedrock
bedrock_client = get_bedrock_instructor_client()

# Anthropic
anthropic_client = get_anthropic_instructor_client()

# Gemini
gemini_client = get_gemini_instructor_client(model="gemini-2.0-flash")

Error Handling

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def safe_evaluate():
    try:
        client = get_instructor_client(provider="openai")
        result = await evaluate(
            claim="Some claim",
            ground_truth="Some truth",
            client=client,
            model_name="gpt-4o-mini"
        )
        return result
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Evaluation error: {e}")

asyncio.run(safe_evaluate())
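
When many evaluations run together, one failure should not discard the rest. A minimal sketch of that pattern uses asyncio.gather with return_exceptions=True, which returns exceptions as values instead of raising the first one. The gather_safely helper and the fake_evaluate stand-in are hypothetical; with the real library, each awaitable would be a call to evaluate().

```python
import asyncio
from typing import Any, Awaitable, Dict, Iterable, List, Tuple


async def gather_safely(
    coros: Iterable[Awaitable[Any]],
) -> Tuple[List[Tuple[int, Any]], List[Tuple[int, BaseException]]]:
    """Run awaitables concurrently and split outcomes into successes and failures."""
    outcomes = await asyncio.gather(*coros, return_exceptions=True)
    successes = [(i, o) for i, o in enumerate(outcomes) if not isinstance(o, BaseException)]
    failures = [(i, o) for i, o in enumerate(outcomes) if isinstance(o, BaseException)]
    return successes, failures


async def fake_evaluate(claim: str) -> Dict[str, float]:
    """Stand-in for evaluate(); fails on empty input to simulate an error."""
    if not claim:
        raise ValueError("claim must not be empty")
    return {"f1_score": 1.0}


async def demo():
    return await gather_safely(
        [fake_evaluate("first"), fake_evaluate(""), fake_evaluate("third")]
    )


succeeded, failed = asyncio.run(demo())
print([i for i, _ in succeeded])         # [0, 2]
print([(i, str(e)) for i, e in failed])  # [(1, 'claim must not be empty')]
```

The indices refer to positions in the input list, so failed cases can be retried or reported individually.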

📊 Performance Considerations

  • Async by Default: evaluate() and text_simplifier() are coroutines, so multiple evaluations can run concurrently on one event loop
  • Batch Processing: Use asyncio.gather() for concurrent evaluations
  • Rate Limiting: Be mindful of provider rate limits when running batch evaluations
  • Caching: Consider caching results for repeated evaluations
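
The rate-limiting and caching points can be combined in a small wrapper. The sketch below caps concurrency with asyncio.Semaphore and evaluates each distinct (claim, ground_truth) pair only once. The evaluate_all helper is hypothetical and not part of AutoEvaluator, and fake_evaluate is a stand-in that records call counts instead of contacting a provider.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Pair = Tuple[str, str]


async def evaluate_all(
    pairs: List[Pair],
    evaluate_fn: Callable[[str, str], Awaitable[Any]],
    max_concurrency: int = 2,
) -> List[Any]:
    """Evaluate pairs with a concurrency cap, reusing results for duplicates."""
    # Created inside the running loop so it also works on Python 3.9.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(pair: Pair) -> Any:
        async with semaphore:
            return await evaluate_fn(*pair)

    unique = list(dict.fromkeys(pairs))  # de-duplicate, preserving order
    results = await asyncio.gather(*(run_one(p) for p in unique))
    cache: Dict[Pair, Any] = dict(zip(unique, results))
    return [cache[p] for p in pairs]


stats = {"calls": 0, "active": 0, "peak": 0}


async def fake_evaluate(claim: str, ground_truth: str) -> Dict[str, float]:
    """Stand-in for evaluate(); tracks calls and peak concurrency."""
    stats["calls"] += 1
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["active"] -= 1
    return {"f1_score": 1.0 if claim == ground_truth else 0.0}


pairs = [("a", "a"), ("b", "c"), ("a", "a"), ("d", "d")]
results = asyncio.run(evaluate_all(pairs, fake_evaluate, max_concurrency=2))
print(stats["calls"], stats["peak"])  # 3 2
```

With the real library, evaluate_fn could be a small wrapper that forwards the client and model_name arguments to evaluate(). A suitable max_concurrency depends on the rate limits of the chosen provider.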

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

  • Built with Instructor for structured outputs
  • Supports multiple LLM providers through unified interfaces
  • Inspired by the need for automated, reliable LLM evaluation

📧 Contact

Darveen Vijayan

📈 Changelog

Version 1.1.0

  • Multi-provider support (OpenAI, Bedrock, Anthropic, Gemini)
  • Async-first architecture
  • Improved text simplification
  • Enhanced error handling

Made with ❤️ by Darveen Vijayan
