Statistical Model Evaluator


A robust, production-ready framework for statistically rigorous evaluation of language models, implementing the methodology described in "A Statistical Approach to Model Evaluations" (2024).

🚀 Features

  • Statistical Robustness: Leverages the Central Limit Theorem for reliable metrics
  • Clustered Standard Errors: Handles non-independent question groups (see the sketch after this list)
  • Variance Reduction: Multiple sampling strategies and parallel processing
  • Paired Difference Analysis: Sophisticated model comparison tools
  • Power Analysis: Sample size determination for meaningful comparisons
  • Production Ready:
    • Comprehensive logging
    • Type hints throughout
    • Error handling
    • Result caching
    • Parallel processing
    • Modular design
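
For intuition, here is a minimal sketch of how a cluster-robust standard error can be computed from per-question scores grouped by cluster. It is only an illustration of the general idea, not the library's internal implementation.

import numpy as np

def clustered_sem(scores, cluster_ids):
    """Illustrative cluster-robust standard error of the mean score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(cluster_ids)
    clusters = np.unique(labels)
    # Aggregate to per-cluster means, then treat clusters (not individual
    # questions) as the independent sampling units.
    cluster_means = np.array([scores[labels == c].mean() for c in clusters])
    return cluster_means.std(ddof=1) / np.sqrt(len(clusters))

# Example: two passages with two questions each
print(clustered_sem([1.0, 0.0, 1.0, 1.0], ["p1", "p1", "p2", "p2"]))  # ~0.25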

📋 Installation

pip3 install -U evalops

Usage

import os

from dotenv import load_dotenv
from swarm_models import OpenAIChat
from swarms import Agent

from evalops import StatisticalModelEvaluator

load_dotenv()

# Get the OpenAI API key from the environment variable
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

# Create instances of the OpenAIChat class with different models
model_gpt4o = OpenAIChat(
    openai_api_key=api_key, model_name="gpt-4o", temperature=0.1
)

model_gpt4o_mini = OpenAIChat(
    openai_api_key=api_key, model_name="gpt-4o-mini", temperature=0.1
)

# Initialize a general knowledge agent for each model
agent_gpt4o = Agent(
    agent_name="General-Knowledge-Agent-GPT4o",
    system_prompt="You are a helpful assistant that answers general knowledge questions accurately and concisely.",
    llm=model_gpt4o,
    max_loops=1,
    dynamic_temperature_enabled=True,
    saved_state_path="general_agent_gpt4o.json",
    user_name="swarms_corp",
    context_length=200000,
    return_step_meta=False,
    output_type="string",
)

agent_gpt4o_mini = Agent(
    agent_name="General-Knowledge-Agent-GPT4o-mini",
    system_prompt="You are a helpful assistant that answers general knowledge questions accurately and concisely.",
    llm=model_gpt4o_mini,
    max_loops=1,
    dynamic_temperature_enabled=True,
    saved_state_path="general_agent_gpt4o_mini.json",
    user_name="swarms_corp",
    context_length=200000,
    return_step_meta=False,
    output_type="string",
)

evaluator = StatisticalModelEvaluator(cache_dir="./eval_cache")

# General knowledge test cases
general_questions = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is the largest planet in our solar system?",
    "What is the chemical symbol for gold?",
    "Who painted the Mona Lisa?",
]

general_answers = [
    "Paris",
    "William Shakespeare",
    "Jupiter",
    "Au",
    "Leonardo da Vinci",
]

# Evaluate each agent on the general knowledge questions
result_gpt4o = evaluator.evaluate_model(
    model=agent_gpt4o,
    questions=general_questions,
    correct_answers=general_answers,
    num_samples=5,
)

result_gpt4o_mini = evaluator.evaluate_model(
    model=agent_gpt4o_mini,
    questions=general_questions,
    correct_answers=general_answers,
    num_samples=5,
)

# Compare model performance
comparison = evaluator.compare_models(result_gpt4o, result_gpt4o_mini)

# Print results
print(f"GPT-4o Mean Score: {result_gpt4o.mean_score:.3f}")
print(f"GPT-4o-mini Mean Score: {result_gpt4o_mini.mean_score:.3f}")
print(
    f"Significant Difference: {comparison['significant_difference']}"
)
print(f"P-value: {comparison['p_value']:.3f}")

📖 Detailed Usage

Basic Model Evaluation

class MyLanguageModel:
    def run(self, task: str) -> str:
        # Your model implementation
        return "model response"

evaluator = StatisticalModelEvaluator(
    cache_dir="./eval_cache",
    log_level="INFO",
    random_seed=42
)

# Prepare your evaluation data
questions = ["Question 1", "Question 2", ...]
answers = ["Answer 1", "Answer 2", ...]

# Run evaluation
result = evaluator.evaluate_model(
    model=MyLanguageModel(),
    questions=questions,
    correct_answers=answers,
    num_samples=3,  # Number of times to sample each question
    batch_size=32,  # Batch size for parallel processing
    cache_key="model_v1"  # Optional caching key
)

# Access results
print(f"Mean Score: {result.mean_score:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")

Handling Clustered Questions

# For questions that are grouped (e.g., multiple questions about the same passage)
cluster_ids = ["passage1", "passage1", "passage2", "passage2", ...]

result = evaluator.evaluate_model(
    model=MyLanguageModel(),
    questions=questions,
    correct_answers=answers,
    cluster_ids=cluster_ids
)

Comparing Models

# Evaluate two models
result_a = evaluator.evaluate_model(model=ModelA(), ...)
result_b = evaluator.evaluate_model(model=ModelB(), ...)

# Compare results
comparison = evaluator.compare_models(result_a, result_b)

print(f"Mean Difference: {comparison['mean_difference']:.3f}")
print(f"P-value: {comparison['p_value']:.4f}")
print(f"Significant Difference: {comparison['significant_difference']}")

Power Analysis

required_samples = evaluator.calculate_required_samples(
    effect_size=0.05,  # Minimum difference to detect
    baseline_variance=0.1,  # Estimated variance in scores
    power=0.8,  # Desired statistical power
    alpha=0.05  # Significance level
)

print(f"Required number of samples: {required_samples}")

🎛️ Configuration Options

  • cache_dir: directory for caching results (default: None)
  • log_level: logging verbosity ("DEBUG", "INFO", etc.; default: "INFO")
  • random_seed: seed for reproducibility (default: None)
  • batch_size: batch size for parallel processing (default: 32)
  • num_samples: samples drawn per question (default: 1)
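
Putting these options together, a typical setup might look like the following. The parameter names mirror the list above and the model and data reuse the MyLanguageModel example from earlier; the specific values are only illustrative.

evaluator = StatisticalModelEvaluator(
    cache_dir="./eval_cache",   # persist scores between runs
    log_level="DEBUG",          # verbose logging while debugging
    random_seed=42,             # reproducible sampling
)

result = evaluator.evaluate_model(
    model=MyLanguageModel(),
    questions=questions,
    correct_answers=answers,
    batch_size=16,   # questions evaluated in parallel per batch
    num_samples=3,   # samples drawn per question
)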

📊 Output Formats

EvalResult Object

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EvalResult:
    mean_score: float        # Average score across questions
    sem: float               # Standard error of the mean
    ci_lower: float          # Lower bound of 95% CI
    ci_upper: float          # Upper bound of 95% CI
    raw_scores: List[float]  # Individual question scores
    metadata: Dict           # Additional evaluation metadata
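
The confidence interval bounds follow from the standard error via the usual normal approximation. A tiny worked example of that relationship, assuming a two-sided 95% interval (the library's exact computation may differ in detail):

mean_score, sem = 0.84, 0.02  # example values, not real output
z = 1.96                      # normal critical value for a 95% interval
ci_lower, ci_upper = mean_score - z * sem, mean_score + z * sem
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")  # [0.801, 0.879]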

Comparison Output

{
    "mean_difference": float,    # Difference between means
    "correlation": float,        # Score correlation
    "t_statistic": float,       # T-test statistic
    "p_value": float,           # Statistical significance
    "significant_difference": bool  # True if p < 0.05
}

🔍 Best Practices

  1. Sample Size: Use power analysis to determine an appropriate sample size
  2. Clustering: Always specify cluster IDs when questions are grouped
  3. Caching: Enable caching for expensive evaluations
  4. Error Handling: Monitor the logs for evaluation failures
  5. Reproducibility: Set a random seed for consistent results (see the combined sketch below)
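
A minimal sketch that combines these practices in one run. The parameter names are taken from the examples above; the model class, questions, answers, and cluster IDs are placeholders.

evaluator = StatisticalModelEvaluator(
    cache_dir="./eval_cache",  # 3. cache expensive evaluations
    log_level="INFO",          # 4. monitor logs for failures
    random_seed=42,            # 5. reproducible results
)

# 1. size the evaluation with a power analysis before collecting data
required = evaluator.calculate_required_samples(
    effect_size=0.05, baseline_variance=0.1, power=0.8, alpha=0.05
)
print(f"Plan for at least {required} samples")

result = evaluator.evaluate_model(
    model=MyLanguageModel(),
    questions=questions,
    correct_answers=answers,
    cluster_ids=cluster_ids,   # 2. declare grouped questions
    num_samples=3,
    cache_key="model_v1",
)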

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙋‍♂️ Support

🙏 Acknowledgments

  • Thanks to all contributors
  • Inspired by the paper "A Statistical Approach to Model Evaluations" (2024)

