Comprehensive AI System Evaluation & Testing Framework
Themis - AI Evaluation & Testing Framework
🏛️ Themis is a comprehensive Python library for evaluating and testing AI systems, with a focus on LLM outputs, bias detection, hallucination measurement, and differential privacy.
Named after Themis, the Greek goddess of justice and divine order, this library aims to bring fairness, transparency, and rigorous evaluation to AI systems.
Installation
# Basic installation
pip install themis-ai-eval
# From Test PyPI (latest development version)
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ themis-ai-eval
# Development installation
git clone https://github.com/ejigsonpeter/themis.git
cd themis
pip install -e .
Quick Start
Command Line Interface (CLI)
1. Run the Interactive Demo
themis demo
This runs a demonstration of bias detection, hallucination detection, and toxicity analysis on example texts.
2. Quick Individual Evaluations
Bias Detection:
themis bias --input "All women are naturally bad at mathematics"
# Output: Shows bias score and analysis
Hallucination Detection:
themis hallucination --input "The Earth is flat" --ground-truth "The Earth is round"
# Output: Shows factual accuracy and consistency scores
Toxicity Detection:
themis toxicity --input "I hate everyone and want to hurt people"
# Output: Shows toxicity score and safety rating
3. Comprehensive Evaluation
Single Text:
themis evaluate --input "Women are naturally worse at programming than men" --evaluators hallucination,bias,toxicity
Multiple Texts from Dataset:
themis evaluate --dataset your_data.json --evaluators hallucination,bias,toxicity --output results.json
Example Dataset Format (your_data.json):
{
  "outputs": [
    "The sky is blue and beautiful today.",
    "All women are bad at mathematics.",
    "I hate everyone and want to destroy everything."
  ],
  "ground_truth": [
    "The sky is blue and beautiful today.",
    "Mathematical ability varies among individuals.",
    "Everyone deserves respect and kindness."
  ],
  "contexts": [
    "Question about weather",
    "Question about gender and abilities",
    "Question about social attitudes"
  ]
}
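Assuming the schema above, a dataset file can be built and sanity-checked with plain Python before passing it to themis evaluate (a minimal sketch using only the standard library, not part of the Themis API):

```python
import json

# Build a dataset in the shape expected by `themis evaluate --dataset`
dataset = {
    "outputs": [
        "The sky is blue and beautiful today.",
        "All women are bad at mathematics.",
        "I hate everyone and want to destroy everything.",
    ],
    "ground_truth": [
        "The sky is blue and beautiful today.",
        "Mathematical ability varies among individuals.",
        "Everyone deserves respect and kindness.",
    ],
    "contexts": [
        "Question about weather",
        "Question about gender and abilities",
        "Question about social attitudes",
    ],
}

# The three lists are parallel: every output needs a matching
# ground-truth entry and context, so check lengths before saving.
lengths = {key: len(values) for key, values in dataset.items()}
assert len(set(lengths.values())) == 1, f"Mismatched lengths: {lengths}"

with open("your_data.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2)
```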
4. Advanced Features
Save Results:
themis evaluate --input "Your text here" --evaluators bias,toxicity --output analysis.json --format json
Limit Samples:
themis evaluate --dataset large_dataset.json --max-samples 100 --evaluators hallucination,bias
Verbose Output for Debugging:
themis --verbose evaluate --input "Your text" --evaluators bias
Python API
Basic Usage
from themis import ThemisEvaluator, HallucinationDetector, BiasDetector, ToxicityDetector
# Initialize evaluator
evaluator = ThemisEvaluator()
# Add evaluators
evaluator.add_evaluator(HallucinationDetector())
evaluator.add_evaluator(BiasDetector())
evaluator.add_evaluator(ToxicityDetector())
# Single evaluation
results = evaluator.evaluate(
    model_outputs=["All women are bad drivers"],
    ground_truth=["Driving ability varies by individual"]
)
# Print summary
print(results.summary())
# Detailed results
for result in results.results:
    print(f"\n{result.evaluator_name}:")
    for metric, value in result.metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.3f}")
        else:
            print(f"  {metric}: {value}")
Batch Evaluation
from themis import ThemisEvaluator, HallucinationDetector, BiasDetector
# Multiple texts
model_outputs = [
    "The sky is green and the grass is blue.",
    "All programmers are male and antisocial.",
    "Python is a programming language for data science."
]
ground_truth = [
    "The sky is blue and the grass is green.",
    "Programmers come from all backgrounds and personalities.",
    "Python is a programming language used for data science."
]
evaluator = ThemisEvaluator()
evaluator.add_evaluator(HallucinationDetector())
evaluator.add_evaluator(BiasDetector())
results = evaluator.evaluate(
    model_outputs=model_outputs,
    ground_truth=ground_truth
)
# Analyze results
summary = results.summary()
print(f"Overall success rate: {summary['success_rate']:.1%}")
# Check each evaluation
for result in results.results:
    print(f"\nEvaluator: {result.evaluator_name}")
    print(f"Execution time: {result.execution_time:.3f}s")
    if result.success:
        for metric, value in result.metrics.items():
            print(f"  {metric}: {value}")
Individual Evaluators
# Bias Detection Only
from themis import ThemisEvaluator, BiasDetector
evaluator = ThemisEvaluator()
evaluator.add_evaluator(BiasDetector())
results = evaluator.evaluate(["All Asians are good at math"])
bias_result = results.results[0]
print("Bias Analysis:")
print(f"Overall bias score: {bias_result.metrics['overall_bias_score']:.3f}")
print(f"High bias instances: {bias_result.metrics['high_bias_instances']}")
# Hallucination Detection Only
from themis import ThemisEvaluator, HallucinationDetector
evaluator = ThemisEvaluator()
evaluator.add_evaluator(HallucinationDetector())
results = evaluator.evaluate(
    model_outputs=["The Earth is flat"],
    ground_truth=["The Earth is round"]
)
hallucination_result = results.results[0]
print("Hallucination Analysis:")
print(f"Accuracy: {hallucination_result.metrics['accuracy']:.3f}")
print(f"Hallucination rate: {hallucination_result.metrics['hallucination_rate']:.3f}")
Differential Privacy
from themis.core.differential_privacy import LaplaceMechanism, GaussianMechanism
# Laplace Mechanism
laplace = LaplaceMechanism(epsilon=1.0)
sensitive_data = [85.5, 90.2, 78.9, 92.1, 88.7] # e.g., test scores
private_data = laplace.apply(sensitive_data, sensitivity=1.0)
print("Original data:", sensitive_data)
print("Private data:", [round(x, 2) for x in private_data])
# Gaussian Mechanism
gaussian = GaussianMechanism(epsilon=1.0, delta=1e-5)
private_mean = gaussian.apply(sum(sensitive_data)/len(sensitive_data), sensitivity=0.1)
print(f"Original mean: {sum(sensitive_data)/len(sensitive_data):.2f}")
print(f"Private mean: {private_mean:.2f}")
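Conceptually, a Laplace mechanism adds noise drawn from Laplace(0, sensitivity/ε) to each value, so smaller ε means more noise and stronger privacy. The following standard-library sketch illustrates that principle; it is not the Themis implementation, and the function names are hypothetical:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(values, epsilon: float, sensitivity: float, seed: int = 0):
    """Add Laplace(0, sensitivity / epsilon) noise to each value."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

scores = [85.5, 90.2, 78.9, 92.1, 88.7]
private = privatize(scores, epsilon=1.0, sensitivity=1.0)
```

The noise has mean 0 and variance 2·(sensitivity/ε)², so averages over many queries stay useful while any single record's contribution is masked.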
Model Comparison
from themis.testing import ModelComparison
# Compare models (placeholder implementation)
comparison = ModelComparison()
# In practice, you'd load actual models here
models = {
    'model_a': 'gpt-3.5-turbo',
    'model_b': 'claude-3-sonnet'
}
test_cases = [
    "Explain quantum computing",
    "What are the benefits of renewable energy?",
    "Describe the causes of climate change"
]
results = comparison.compare_models(
    models=models,
    test_cases=test_cases,
    evaluators=['hallucination', 'bias'],
    baseline_model='model_a'
)
print("Comparison Results:", results)
CLI Reference
Available Commands
| Command | Description | Example |
|---|---|---|
| themis demo | Run interactive demonstration | themis demo |
| themis evaluate | Full evaluation with multiple evaluators | themis evaluate --input "text" --evaluators bias,toxicity |
| themis bias | Quick bias detection | themis bias --input "All men are stronger" |
| themis hallucination | Quick hallucination detection | themis hallucination --input "Earth is flat" --ground-truth "Earth is round" |
| themis toxicity | Quick toxicity detection | themis toxicity --input "I hate everyone" |
| themis version | Show version and system info | themis version |
Evaluation Options
| Option | Short | Description | Example |
|---|---|---|---|
| --input | -i | Single text to evaluate | --input "Your text here" |
| --dataset | -d | JSON dataset file | --dataset data.json |
| --output | -o | Save results to file | --output results.json |
| --evaluators | -e | Comma-separated evaluators | --evaluators bias,toxicity,hallucination |
| --ground-truth | -g | Ground truth for comparison | --ground-truth "Correct statement" |
| --format | | Output format | --format json |
| --max-samples | | Limit number of samples | --max-samples 100 |
| --verbose | -v | Enable verbose output | --verbose |
Features
🔍 Core Evaluators
- Hallucination Detection: Measure factual accuracy and consistency
- Semantic Similarity: Compare meaning across model outputs
- Performance Metrics: Latency, throughput, and resource usage
- Robustness Testing: Adversarial and edge case evaluation
🧠 Advanced Evaluators
- Bias Detection: Identify and measure various forms of bias
- Toxicity Detection: Content safety and harmful output detection
- Factual Accuracy: Cross-reference with knowledge bases
- Coherence Analysis: Logical consistency and flow evaluation
🔒 Differential Privacy
- Privacy Mechanisms: Laplace, Gaussian, and exponential mechanisms
- Privacy Metrics: Epsilon-delta privacy analysis
- Utility-Privacy Tradeoffs: Measure privacy cost vs. model utility
🧪 A/B Testing Framework
- Model Comparison: Statistical significance testing
- Performance Benchmarking: Standardized evaluation protocols
- Regression Detection: Identify performance degradations
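A common way to test whether one model's per-case scores are significantly better than another's is a paired permutation test: randomly flip the sign of each per-case difference and see how often the shuffled mean is as extreme as the observed one. A standard-library sketch of that idea (illustrative only; not the Themis A/B testing API):

```python
import random
from statistics import mean

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test.

    Returns the observed mean difference (a - b) and an
    approximate p-value estimated from n_resamples sign flips.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each difference's sign is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical per-test-case accuracy scores for two models
model_a = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93]
model_b = [0.85, 0.84, 0.90, 0.83, 0.86, 0.88]
delta, p_value = paired_permutation_test(model_a, model_b)
```

With only six test cases the smallest attainable two-sided p-value is 2/2⁶ ≈ 0.031, which is why real comparisons need reasonably large test suites.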
Real-World Usage Examples
Content Moderation
# Analyze user-generated content for toxicity and bias
themis evaluate --dataset user_posts.json --evaluators toxicity,bias --output moderation_results.json
# Quick toxicity check
themis toxicity --input "This comment from a user"
AI Model Testing
# Evaluate LLM outputs for hallucinations
themis hallucination --input "Model generated text" --ground-truth "Known factual information"
# Comprehensive model evaluation
themis evaluate --dataset model_outputs.json --evaluators hallucination,bias,toxicity --output evaluation_report.json
Research and Analysis
# Analyze bias in AI-generated content
themis bias --input "AI generated response about hiring practices"
# Run full evaluation suite with detailed output
themis --verbose evaluate --dataset research_data.json --evaluators hallucination,bias,toxicity,semantic
Educational Assessment
# Evaluate AI tutoring system responses
from themis import ThemisEvaluator, BiasDetector, HallucinationDetector
evaluator = ThemisEvaluator()
evaluator.add_evaluator(BiasDetector())
evaluator.add_evaluator(HallucinationDetector())
tutor_responses = [
    "Boys are naturally better at math than girls",
    "The mitochondria is the powerhouse of the cell",
    "Climate change is a hoax perpetrated by scientists"
]
ground_truth = [
    "Mathematical ability is not determined by gender",
    "The mitochondria is the powerhouse of the cell",
    "Climate change is supported by scientific consensus"
]
results = evaluator.evaluate(tutor_responses, ground_truth)
# Analyze for educational suitability
for result in results.results:
    if result.evaluator_name == "BiasDetector":
        bias_score = result.metrics['overall_bias_score']
        if bias_score > 0.5:
            print(f"⚠️ High bias detected: {bias_score:.3f}")
    elif result.evaluator_name == "HallucinationDetector":
        accuracy = result.metrics['accuracy']
        if accuracy < 0.7:
            print(f"⚠️ Low factual accuracy: {accuracy:.3f}")
Architecture
Themis follows a modular architecture with these key components:
- Core Engine: Orchestrates evaluation workflows
- Evaluator Framework: Pluggable evaluation modules
- CLI Interface: Command-line tools for easy usage
- Privacy Module: Differential privacy mechanisms
- Testing Framework: A/B testing and model comparison
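The pluggable-evaluator pattern behind this architecture can be pictured in a few lines: the core engine holds a list of objects that share an evaluate interface and runs each one over the same outputs. This is an illustrative sketch, not the actual Themis source; the class names here (Engine, EvalResult, LengthEvaluator) are hypothetical, chosen to mirror the public API shown above:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class EvalResult:
    evaluator_name: str
    metrics: dict
    success: bool = True

class Evaluator(ABC):
    """Minimal pluggable-evaluator interface."""

    @abstractmethod
    def evaluate(self, outputs, ground_truth=None) -> EvalResult: ...

class LengthEvaluator(Evaluator):
    """Toy evaluator: reports the average output length."""

    def evaluate(self, outputs, ground_truth=None) -> EvalResult:
        avg = sum(len(o) for o in outputs) / len(outputs)
        return EvalResult("LengthEvaluator", {"avg_length": avg})

class Engine:
    """Core engine: runs every registered evaluator over the outputs."""

    def __init__(self):
        self.evaluators = []

    def add_evaluator(self, evaluator: Evaluator):
        self.evaluators.append(evaluator)

    def evaluate(self, outputs, ground_truth=None):
        return [e.evaluate(outputs, ground_truth) for e in self.evaluators]

engine = Engine()
engine.add_evaluator(LengthEvaluator())
results = engine.evaluate(["abcd", "ab"])
```

Because evaluators only depend on the shared interface, new checks can be added without touching the engine, which is what makes the framework extensible.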
Contributing
We welcome contributions! To get started:
# Development setup
git clone https://github.com/ejigsonpeter/themis.git
cd themis
pip install -e .[dev]
# Run tests
python minimal_test.py
# Test CLI
themis demo
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use Themis in your research, please cite:
@software{themis2024,
  title={Themis: AI Evaluation and Testing Framework},
  author={Themis Team},
  year={2024},
  url={https://github.com/ejigsonpeter/themis}
}
Support
Get Started Today:
pip install themis-ai-eval
themis demo
File details
Details for the file themis_ai_eval-2.0.tar.gz.
File metadata
- Download URL: themis_ai_eval-2.0.tar.gz
- Upload date:
- Size: 38.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e58ed9b3270b28f25434c243641cedf48ddc4472a7d56ed94391fa2722a2881 |
| MD5 | ef7e45b94f188b7d9df1f2b3ef80db95 |
| BLAKE2b-256 | d9a48f338dd0643a33293d324ef09f7dd765cefac9d482dfe6dcd4cfd472a2a9 |
File details
Details for the file themis_ai_eval-2.0-py3-none-any.whl.
File metadata
- Download URL: themis_ai_eval-2.0-py3-none-any.whl
- Upload date:
- Size: 36.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e84b530e95812aa61894d539eb7194ced0e8fdf5f29969cf87bf3cb766545cd3 |
| MD5 | bedfac43a355d4109ad463e8378ee047 |
| BLAKE2b-256 | 9f164c7e1ebbfb5f5fd73ad5d5c17522306c406bdea20cfe73a8d013be29a9f4 |