Fully automated LLM evaluator

AutoEvaluator: LLM-Based Evaluation Framework

AutoEvaluator is a Python library for automated quality control of LLM output. It uses an LLM as the evaluator and exposes a small, async API that classifies the sentences of generated content as True Positives (TP), False Positives (FP), or False Negatives (FN) relative to a ground truth.

🚀 Features

Automated Evaluation: Compare LLM outputs against ground truth with precision
Multi-Provider Support: Works with AWS Bedrock, OpenAI, Anthropic, and Google Gemini
Comprehensive Metrics: Automatically calculates Precision, Recall, and F1 Score
Async-First Design: Built for high-performance concurrent evaluations
Structured Outputs: Leverages Instructor for type-safe, validated responses
Sentence-Level Granularity: Evaluates claims at the sentence level for detailed insights

🔧 Installation

Requirements

Python 3.9 or higher
An API key for at least one supported LLM provider

Install via pip

pip install autoevaluator

Install from source

git clone https://github.com/yourusername/autoevaluator.git
cd autoevaluator
pip install -e .

⚡ Quick Start

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def main():
    # Setup client for your preferred provider
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")
    
    # Define the claim to evaluate
    claim = "Feynman was born in 1918 in Malaysia"
    
    # Define the ground truth
    ground_truth = "Feynman was born in 1918 in America."
    
    # Evaluate the claim
    result = await evaluate(
        claim=claim,
        ground_truth=ground_truth,
        client=client,
        model_name="gpt-4o-mini"
    )
    
    print(result)

# Run the async function
asyncio.run(main())

Output:

{
    'TP': ['Feynman was born in 1918.'],
    'FP': ['Feynman was born in Malaysia.'],
    'FN': ['Feynman was born in America.'],
    'precision': 0.5,
    'recall': 0.5,
    'f1_score': 0.5
}
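
The three scores in this output follow from the lengths of the TP, FP, and FN lists, assuming the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The helper below is a sketch for illustration only. `metrics_from_lists` is a hypothetical name, not part of the autoevaluator API, and returning 0.0 on a zero denominator is an assumption.

```python
from typing import Dict, List


def metrics_from_lists(tp: List[str], fp: List[str], fn: List[str]) -> Dict[str, float]:
    """Compute precision, recall, and F1 from TP/FP/FN sentence lists.

    Hypothetical helper; zero denominators yield 0.0 by assumption.
    """
    n_tp, n_fp, n_fn = len(tp), len(fp), len(fn)
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# The lists from the Quick Start output: one TP, one FP, one FN.
quick_start = metrics_from_lists(
    tp=["Feynman was born in 1918."],
    fp=["Feynman was born in Malaysia."],
    fn=["Feynman was born in America."],
)
print(quick_start)  # {'precision': 0.5, 'recall': 0.5, 'f1_score': 0.5}
```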

🔌 Supported Providers

AutoEvaluator supports multiple LLM providers out of the box:

Provider	Models	Environment Variables
AWS Bedrock	Claude Sonnet 4.5	`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`
OpenAI	GPT-4o, GPT-4o-mini, etc.	`OPENAI_API_KEY`
Anthropic	Claude Sonnet 4, etc.	`ANTHROPIC_API_KEY`
Google Gemini	Gemini 2.0 Flash, etc.	`GOOGLE_API_KEY`

⚙️ Configuration

Environment Variables

Create a .env file in your project root:

# OpenAI
OPENAI_API_KEY=your_openai_api_key

# AWS Bedrock
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=ap-southeast-1

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_api_key

# Google Gemini
GOOGLE_API_KEY=your_google_api_key

Python Configuration

import os

# Set environment variables programmatically (only the providers you use)
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_key"
os.environ["AWS_REGION"] = "ap-southeast-1"
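
Before creating a client it can help to check that the variables for the chosen provider are actually set. The sketch below uses only the standard library. The variable names come from the Supported Providers table and the provider keys from the `get_instructor_client()` signature, while `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names that are not part of autoevaluator. Whether the library needs every listed variable in every setup is an assumption.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names taken from the Supported Providers table (assumed complete).
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GOOGLE_API_KEY"],
}


def missing_env_vars(provider: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    # Report the status of each provider for the current environment.
    for provider in REQUIRED_ENV_VARS:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else f"missing {', '.join(missing)}"
        print(f"{provider}: {status}")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.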

💡 Usage Examples

Example 1: Using OpenAI

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_openai():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")
    
    claim = "The Earth is flat and the moon landing was in 1969."
    ground_truth = "The Earth is round. The moon landing was in 1969."
    
    result = await evaluate(claim, ground_truth, client=client, model_name="gpt-4o-mini")
    
    print(f"True Positives: {result['TP']}")
    print(f"False Positives: {result['FP']}")
    print(f"False Negatives: {result['FN']}")
    print(f"Precision: {result['precision']:.2f}")
    print(f"Recall: {result['recall']:.2f}")
    print(f"F1 Score: {result['f1_score']:.2f}")

asyncio.run(evaluate_with_openai())

Example 2: Using AWS Bedrock

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_bedrock():
    client = get_instructor_client(provider="bedrock")
    
    claim = "Python was created by Guido van Rossum in 1991."
    ground_truth = "Python was created by Guido van Rossum in 1991."
    
    result = await evaluate(claim, ground_truth, client=client, model_name="bedrock-claude")
    return result

result = asyncio.run(evaluate_with_bedrock())
print(f"Perfect match! F1 Score: {result['f1_score']}")

Example 3: Using Anthropic

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def evaluate_with_anthropic():
    client = get_instructor_client(
        provider="anthropic",
        model="claude-sonnet-4-20250514"
    )
    
    claim = "Water boils at 100°C at sea level."
    ground_truth = "Water boils at 100°C at sea level."
    
    result = await evaluate(claim, ground_truth, client=client, model_name="claude-sonnet-4-20250514")
    return result

result = asyncio.run(evaluate_with_anthropic())

Example 4: Batch Evaluation

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def batch_evaluate():
    client = get_instructor_client(provider="openai", model="gpt-4o-mini")
    
    test_cases = [
        {
            "claim": "Einstein developed the theory of relativity.",
            "ground_truth": "Einstein developed the theory of relativity."
        },
        {
            "claim": "The capital of France is London.",
            "ground_truth": "The capital of France is Paris."
        },
        {
            "claim": "Water is composed of hydrogen and oxygen.",
            "ground_truth": "Water is composed of hydrogen and oxygen."
        }
    ]
    
    tasks = [
        evaluate(tc["claim"], tc["ground_truth"], client=client, model_name="gpt-4o-mini")
        for tc in test_cases
    ]
    
    results = await asyncio.gather(*tasks)
    
    for i, result in enumerate(results, 1):
        print(f"\n--- Test Case {i} ---")
        print(f"F1 Score: {result['f1_score']:.2f}")
        print(f"Precision: {result['precision']:.2f}")
        print(f"Recall: {result['recall']:.2f}")

asyncio.run(batch_evaluate())
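
To summarise a batch with a single score, one option is to micro-average: pool the TP, FP, and FN counts over all results and compute the metrics once. The sketch below works on dictionaries shaped like the Quick Start output. `micro_average` is a hypothetical helper, not part of autoevaluator, and the sample results are hand-written stand-ins rather than real model output.

```python
from typing import Any, Dict, Iterable


def micro_average(results: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Pool TP/FP/FN counts across results, then compute the metrics once."""
    tp = fp = fn = 0
    for result in results:
        tp += len(result["TP"])
        fp += len(result["FP"])
        fn += len(result["FN"])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Hand-written stand-ins for evaluate() results; not real model output.
sample_results = [
    {"TP": ["Einstein developed the theory of relativity."], "FP": [], "FN": []},
    {
        "TP": [],
        "FP": ["The capital of France is London."],
        "FN": ["The capital of France is Paris."],
    },
    {"TP": ["Water is composed of hydrogen and oxygen."], "FP": [], "FN": []},
]
summary = micro_average(sample_results)
print(summary)
```

Micro-averaging weights every sentence equally. Averaging the per-case F1 scores instead would weight every test case equally, which can give a different number when cases differ in length.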

📚 API Reference

`evaluate()`

Evaluates a claim against ground truth and returns detailed metrics.

async def evaluate(
    claim: str,
    ground_truth: str,
    client: instructor.AsyncInstructor,
    model_name: str = "gpt-4o-mini"
) -> Dict[str, Any]

Parameters:

claim (str): The text to be evaluated
ground_truth (str): The reference text to compare against
client (instructor.AsyncInstructor): Instructor-wrapped async client
model_name (str): Model identifier to use

Returns:

Dictionary containing:

TP (List[str]): List of true positive sentences
FP (List[str]): List of false positive sentences
FN (List[str]): List of false negative sentences
precision (float): Precision score (0.0 to 1.0)
recall (float): Recall score (0.0 to 1.0)
f1_score (float): F1 score (0.0 to 1.0)

`get_instructor_client()`

Creates an Instructor-wrapped client for the specified LLM provider.

def get_instructor_client(
    provider: Literal["bedrock", "openai", "anthropic", "gemini"] = "bedrock",
    model: Optional[str] = None,
    api_key: Optional[str] = None,
    mode: instructor.Mode = instructor.Mode.JSON,
    **kwargs
) -> instructor.AsyncInstructor

Parameters:

provider (str): LLM provider to use ("bedrock", "openai", "anthropic", "gemini")
model (Optional[str]): Model name (uses provider default if None)
api_key (Optional[str]): API key (falls back to environment variables)
mode (instructor.Mode): Instructor parsing mode
**kwargs: Additional provider-specific arguments

Returns:

An Instructor-wrapped async client ready for use.

`text_simplifier()`

Breaks down complex text into simple, single-clause sentences.

async def text_simplifier(
    text: str,
    model_name: str,
    client: instructor.AsyncInstructor
) -> TextSimplify

🔍 How It Works

AutoEvaluator evaluates a claim in five steps:

Text Simplification: Complex sentences are broken down into simple, atomic claims
Question Generation: Each simplified sentence is converted into a fact-checking question
Bidirectional Verification: Questions are checked against both the claim and ground truth
Classification: Sentences are classified as TP, FP, or FN based on verification results
Metrics Calculation: Precision, Recall, and F1 scores are computed from the classifications

Architecture

Input Claim & Ground Truth
         ↓
   Text Simplifier (breaks into atomic sentences)
         ↓
   Question Generator (creates fact-check questions)
         ↓
   Question Checker (verifies against claim and ground truth)
         ↓
   Classification (TP/FP/FN assignment)
         ↓
   Metrics Calculation (Precision, Recall, F1)
         ↓
   Structured Output
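
The classification step can be sketched without any LLM calls. The rule used here is an assumption about what bidirectional verification implies, not the library's confirmed implementation: a claim sentence supported by the ground truth is a TP, an unsupported claim sentence is an FP, and a ground-truth sentence not supported by the claim is an FN. `classify` and `exact_match` are hypothetical names, and `exact_match` is a toy stand-in for the LLM-based question checker.

```python
from typing import Callable, Dict, List


def classify(
    claim_sentences: List[str],
    truth_sentences: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> Dict[str, List[str]]:
    """Assign simplified sentences to TP, FP, or FN (assumed rule, see above)."""
    tp = [s for s in claim_sentences if is_supported(s, truth_sentences)]
    fp = [s for s in claim_sentences if not is_supported(s, truth_sentences)]
    fn = [s for s in truth_sentences if not is_supported(s, claim_sentences)]
    return {"TP": tp, "FP": fp, "FN": fn}


def exact_match(sentence: str, reference: List[str]) -> bool:
    """Toy stand-in for the LLM checker: case-insensitive exact match."""
    normalized = sentence.strip().lower()
    return any(normalized == r.strip().lower() for r in reference)


# Simplified sentences corresponding to the Quick Start example.
labels = classify(
    claim_sentences=["Feynman was born in 1918.", "Feynman was born in Malaysia."],
    truth_sentences=["Feynman was born in 1918.", "Feynman was born in America."],
    is_supported=exact_match,
)
print(labels)
```

With these inputs the sketch reproduces the TP, FP, and FN lists shown in the Quick Start output.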

🎯 Advanced Usage

Custom Text Simplification

import asyncio
from autoevaluator import text_simplifier, get_instructor_client

async def simplify_text():
    client = get_instructor_client(provider="openai")
    
    complex_text = """Although the weather was bad and it was raining heavily, 
                      we decided to go hiking because we had planned it for weeks."""
    
    result = await text_simplifier(
        text=complex_text,
        model_name="gpt-4o-mini",
        client=client
    )
    
    print("Simplified sentences:")
    for sentence in result.simplified_sentences:
        print(f"- {sentence}")

asyncio.run(simplify_text())

Using Provider-Specific Convenience Functions

from autoevaluator.client import (
    get_openai_instructor_client,
    get_bedrock_instructor_client,
    get_anthropic_instructor_client,
    get_gemini_instructor_client
)

# OpenAI
openai_client = get_openai_instructor_client(model="gpt-4o")

# Bedrock
bedrock_client = get_bedrock_instructor_client()

# Anthropic
anthropic_client = get_anthropic_instructor_client()

# Gemini
gemini_client = get_gemini_instructor_client(model="gemini-2.0-flash")

Error Handling

import asyncio
from autoevaluator import evaluate, get_instructor_client

async def safe_evaluate():
    try:
        client = get_instructor_client(provider="openai")
        result = await evaluate(
            claim="Some claim",
            ground_truth="Some truth",
            client=client,
            model_name="gpt-4o-mini"
        )
        return result
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Evaluation error: {e}")

asyncio.run(safe_evaluate())
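
Transient provider failures are often worth retrying rather than just printing. The sketch below retries an async call with exponential backoff, using only the standard library. `with_retries` is a hypothetical helper, not part of autoevaluator, and which exception types a given provider raises on transient errors is an assumption, so `retry_on` should be adjusted to match.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Await make_call(), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


async def _demo() -> str:
    calls = {"count": 0}

    async def flaky() -> str:
        # Stand-in for an evaluate() call that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated transient failure")
        return f"succeeded on call {calls['count']}"

    return await with_retries(flaky, attempts=3, base_delay=0.01)


demo_result = asyncio.run(_demo())
print(demo_result)  # succeeded on call 3
```

With the real library, the call could be wrapped as `with_retries(lambda: evaluate(claim, ground_truth, client=client, model_name="gpt-4o-mini"))`. A `ValueError` is deliberately not retried here, since the example above treats it as a configuration error.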

📊 Performance Considerations

Async by Default: evaluate() and text_simplifier() are coroutines, so independent evaluations can run concurrently
Batch Processing: Use asyncio.gather() for concurrent evaluations
Rate Limiting: Be mindful of provider rate limits when running batch evaluations
Caching: Consider caching results for repeated evaluations
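
The rate-limiting and caching points above can be combined in one small wrapper. The sketch below caps the number of in-flight calls with `asyncio.Semaphore` and evaluates each distinct (claim, ground truth) pair only once per batch. `evaluate_all` and `fake_evaluate` are hypothetical names. The fake stands in for `evaluate()` so the example runs without network access, and the default limit of 5 is an arbitrary assumption, not a provider recommendation.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Result = Dict[str, object]
EvaluateFn = Callable[[str, str], Awaitable[Result]]


async def evaluate_all(
    pairs: List[Tuple[str, str]],
    evaluate_fn: EvaluateFn,
    max_concurrency: int = 5,
) -> List[Result]:
    """Evaluate (claim, ground_truth) pairs with bounded concurrency and de-duplication."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(claim: str, ground_truth: str) -> Result:
        async with semaphore:  # at most max_concurrency calls in flight
            return await evaluate_fn(claim, ground_truth)

    unique = list(dict.fromkeys(pairs))  # drop repeats, keep first-seen order
    results = await asyncio.gather(*(limited(c, g) for c, g in unique))
    cache = dict(zip(unique, results))
    return [cache[pair] for pair in pairs]  # one result per input pair, in input order


async def _demo() -> Tuple[int, int, int]:
    in_flight = 0
    peak = 0
    calls = 0

    async def fake_evaluate(claim: str, ground_truth: str) -> Result:
        # Stand-in for autoevaluator.evaluate(); no network access.
        nonlocal in_flight, peak, calls
        calls += 1
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return {"match": claim == ground_truth}

    pairs = [(f"claim {i % 4}", "claim 0") for i in range(8)]  # 8 pairs, 4 distinct
    results = await evaluate_all(pairs, fake_evaluate, max_concurrency=2)
    return len(results), calls, peak


demo_stats = asyncio.run(_demo())
print(demo_stats)  # (number of results, number of calls made, peak concurrency)
```

With the real library, `evaluate_fn` could be `lambda c, g: evaluate(c, g, client=client, model_name="gpt-4o-mini")`. The cache here lives for one batch only. A persistent cache would also need to key on the model name, since different models may classify the same pair differently.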

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

Built with Instructor for structured outputs
Supports multiple LLM providers through unified interfaces
Inspired by the need for automated, reliable LLM evaluation

📧 Contact

Darveen Vijayan

📈 Changelog

Version 1.1.0

Multi-provider support (OpenAI, Bedrock, Anthropic, Gemini)
Async-first architecture
Improved text simplification
Enhanced error handling

Made with ❤️ by Darveen Vijayan

Release history

Version	Release date
1.1.3	Jan 17, 2026
1.1.2	Jan 16, 2026
1.1.1	Jan 16, 2026
1.1.0 (this version)	Jan 16, 2026
1.0.3	Oct 14, 2024
1.0.2	Oct 14, 2024
1.0.1	Oct 13, 2024
1.0.0	Oct 12, 2024
0.2.6	Oct 7, 2024
0.2.5	Oct 7, 2024
0.2.4	Oct 7, 2024
0.2.3	Oct 7, 2024
0.2.2	Oct 7, 2024
0.2.1	Oct 7, 2024
0.2.0	Oct 4, 2024
0.1.5	Oct 4, 2024
0.1.4	Oct 4, 2024
0.1.3	Oct 4, 2024
0.1.2	Oct 4, 2024
0.1.1	Oct 4, 2024
0.1.0	Oct 3, 2024
