
Comprehensive AI System Evaluation & Testing Framework


Themis - AI Evaluation & Testing Framework

🏛️ Themis is a comprehensive Python library for evaluating and testing AI systems, with a focus on LLM outputs, bias detection, hallucination measurement, and differential privacy.

Python 3.8+ · MIT License

Named after Themis, the Greek goddess of justice and divine order, this library aims to bring fairness, transparency, and rigorous evaluation to AI systems.

Installation

# Basic installation
pip install themis-ai-eval

# From Test PyPI (latest development version)
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ themis-ai-eval

# Development installation
git clone https://github.com/ejigsonpeter/themis.git
cd themis
pip install -e .

Quick Start

Command Line Interface (CLI)

1. Run the Interactive Demo

themis demo

This will run a comprehensive demonstration showing bias detection, hallucination detection, and toxicity analysis with example texts.

2. Quick Individual Evaluations

Bias Detection:

themis bias --input "All women are naturally bad at mathematics"
# Output: Shows bias score and analysis

Hallucination Detection:

themis hallucination --input "The Earth is flat" --ground-truth "The Earth is round"
# Output: Shows factual accuracy and consistency scores

Toxicity Detection:

themis toxicity --input "I hate everyone and want to hurt people"
# Output: Shows toxicity score and safety rating

3. Comprehensive Evaluation

Single Text:

themis evaluate --input "Women are naturally worse at programming than men" --evaluators hallucination,bias,toxicity

Multiple Texts from Dataset:

themis evaluate --dataset your_data.json --evaluators hallucination,bias,toxicity --output results.json

Example Dataset Format (your_data.json):

{
  "outputs": [
    "The sky is blue and beautiful today.",
    "All women are bad at mathematics.",
    "I hate everyone and want to destroy everything."
  ],
  "ground_truth": [
    "The sky is blue and beautiful today.",
    "Mathematical ability varies among individuals.",
    "Everyone deserves respect and kindness."
  ],
  "contexts": [
    "Question about weather",
    "Question about gender and abilities", 
    "Question about social attitudes"
  ]
}
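
The same dataset can be driven from the Python API instead of the CLI. This is a minimal sketch, assuming the "outputs" and "ground_truth" keys map directly onto the evaluate() arguments shown later in this page; how "contexts" is consumed is not shown here.

# Minimal sketch: load a dataset file in the format above and evaluate it
# with the Python API. Assumes "outputs"/"ground_truth" map onto evaluate().
import json

from themis import ThemisEvaluator, HallucinationDetector, BiasDetector, ToxicityDetector

with open("your_data.json") as f:
    data = json.load(f)

evaluator = ThemisEvaluator()
evaluator.add_evaluator(HallucinationDetector())
evaluator.add_evaluator(BiasDetector())
evaluator.add_evaluator(ToxicityDetector())

results = evaluator.evaluate(
    model_outputs=data["outputs"],
    ground_truth=data["ground_truth"],
)
print(results.summary())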

4. Advanced Features

Save Results:

themis evaluate --input "Your text here" --evaluators bias,toxicity --output analysis.json --format json

Limit Samples:

themis evaluate --dataset large_dataset.json --max-samples 100 --evaluators hallucination,bias

Verbose Output for Debugging:

themis --verbose evaluate --input "Your text" --evaluators bias

Python API

Basic Usage

from themis import ThemisEvaluator, HallucinationDetector, BiasDetector, ToxicityDetector

# Initialize evaluator
evaluator = ThemisEvaluator()

# Add evaluators
evaluator.add_evaluator(HallucinationDetector())
evaluator.add_evaluator(BiasDetector())
evaluator.add_evaluator(ToxicityDetector())

# Single evaluation
results = evaluator.evaluate(
    model_outputs=["All women are bad drivers"],
    ground_truth=["Driving ability varies by individual"]
)

# Print summary
print(results.summary())

# Detailed results
for result in results.results:
    print(f"\n{result.evaluator_name}:")
    for metric, value in result.metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.3f}")
        else:
            print(f"  {metric}: {value}")

Batch Evaluation

from themis import ThemisEvaluator, HallucinationDetector, BiasDetector

# Multiple texts
model_outputs = [
    "The sky is green and the grass is blue.",
    "All programmers are male and antisocial.",
    "Python is a programming language for data science."
]

ground_truth = [
    "The sky is blue and the grass is green.",
    "Programmers come from all backgrounds and personalities.",
    "Python is a programming language used for data science."
]

evaluator = ThemisEvaluator()
evaluator.add_evaluator(HallucinationDetector())
evaluator.add_evaluator(BiasDetector())

results = evaluator.evaluate(
    model_outputs=model_outputs,
    ground_truth=ground_truth
)

# Analyze results
summary = results.summary()
print(f"Overall success rate: {summary['success_rate']:.1%}")

# Check each evaluation
for i, result in enumerate(results.results):
    print(f"\nEvaluator: {result.evaluator_name}")
    print(f"Execution time: {result.execution_time:.3f}s")
    
    if result.success:
        for metric, value in result.metrics.items():
            print(f"  {metric}: {value}")

Individual Evaluators

# Bias Detection Only
from themis import ThemisEvaluator, BiasDetector

evaluator = ThemisEvaluator()
evaluator.add_evaluator(BiasDetector())

results = evaluator.evaluate(["All Asians are good at math"])
bias_result = results.results[0]

print("Bias Analysis:")
print(f"Overall bias score: {bias_result.metrics['overall_bias_score']:.3f}")
print(f"High bias instances: {bias_result.metrics['high_bias_instances']}")

# Hallucination Detection Only
from themis import ThemisEvaluator, HallucinationDetector

evaluator = ThemisEvaluator()
evaluator.add_evaluator(HallucinationDetector())

results = evaluator.evaluate(
    model_outputs=["The Earth is flat"],
    ground_truth=["The Earth is round"]
)

hallucination_result = results.results[0]
print("Hallucination Analysis:")
print(f"Accuracy: {hallucination_result.metrics['accuracy']:.3f}")
print(f"Hallucination rate: {hallucination_result.metrics['hallucination_rate']:.3f}")

Differential Privacy

from themis.core.differential_privacy import LaplaceMechanism, GaussianMechanism

# Laplace Mechanism
laplace = LaplaceMechanism(epsilon=1.0)
sensitive_data = [85.5, 90.2, 78.9, 92.1, 88.7]  # e.g., test scores
private_data = laplace.apply(sensitive_data, sensitivity=1.0)

print("Original data:", sensitive_data)
print("Private data:", [round(x, 2) for x in private_data])

# Gaussian Mechanism  
gaussian = GaussianMechanism(epsilon=1.0, delta=1e-5)
private_mean = gaussian.apply(sum(sensitive_data)/len(sensitive_data), sensitivity=0.1)

print(f"Original mean: {sum(sensitive_data)/len(sensitive_data):.2f}")
print(f"Private mean: {private_mean:.2f}")

Model Comparison

from themis.testing import ModelComparison

# Compare models (placeholder implementation)
comparison = ModelComparison()

# In practice, you'd load actual models here
models = {
    'model_a': 'gpt-3.5-turbo',  
    'model_b': 'claude-3-sonnet'
}

test_cases = [
    "Explain quantum computing",
    "What are the benefits of renewable energy?",
    "Describe the causes of climate change"
]

results = comparison.compare_models(
    models=models,
    test_cases=test_cases,
    evaluators=['hallucination', 'bias'],
    baseline_model='model_a'
)

print("Comparison Results:", results)

CLI Reference

Available Commands

Command | Description | Example
themis demo | Run interactive demonstration | themis demo
themis evaluate | Full evaluation with multiple evaluators | themis evaluate --input "text" --evaluators bias,toxicity
themis bias | Quick bias detection | themis bias --input "All men are stronger"
themis hallucination | Quick hallucination detection | themis hallucination --input "Earth is flat" --ground-truth "Earth is round"
themis toxicity | Quick toxicity detection | themis toxicity --input "I hate everyone"
themis version | Show version and system info | themis version

Evaluation Options

Option | Short | Description | Example
--input | -i | Single text to evaluate | --input "Your text here"
--dataset | -d | JSON dataset file | --dataset data.json
--output | -o | Save results to file | --output results.json
--evaluators | -e | Comma-separated evaluators | --evaluators bias,toxicity,hallucination
--ground-truth | -g | Ground truth for comparison | --ground-truth "Correct statement"
--format | | Output format | --format json
--max-samples | | Limit number of samples | --max-samples 100
--verbose | -v | Enable verbose output | --verbose

Features

🔍 Core Evaluators

  • Hallucination Detection: Measure factual accuracy and consistency
  • Semantic Similarity: Compare meaning across model outputs (a generic illustration follows this list)
  • Performance Metrics: Latency, throughput, and resource usage
  • Robustness Testing: Adversarial and edge case evaluation
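
The semantic-similarity idea can be illustrated outside the library as a cosine similarity between text vectors. This is a generic sketch of the technique, not Themis's implementation; the TF-IDF representation is an assumption, and the library's evaluator may use a different one.

# Generic illustration of semantic similarity via cosine similarity over
# TF-IDF vectors; Themis's semantic-similarity evaluator may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

output = "The Earth orbits the Sun once a year."
reference = "Each year, the Earth completes one orbit around the Sun."

vectors = TfidfVectorizer().fit_transform([output, reference])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"similarity: {score:.3f}")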

🧠 Advanced Evaluators

  • Bias Detection: Identify and measure various forms of bias
  • Toxicity Detection: Content safety and harmful output detection
  • Factual Accuracy: Cross-reference with knowledge bases
  • Coherence Analysis: Logical consistency and flow evaluation

🔒 Differential Privacy

  • Privacy Mechanisms: Laplace, Gaussian, and exponential mechanisms (the Laplace idea is sketched after this list)
  • Privacy Metrics: Epsilon-delta privacy analysis
  • Utility-Privacy Tradeoffs: Measure privacy cost vs. model utility
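
Conceptually, the Laplace mechanism adds noise drawn from a Laplace distribution with scale sensitivity/ε, so a smaller ε means more noise and stronger privacy. The sketch below shows that idea with NumPy; it is not Themis's internal implementation, just the standard textbook mechanism.

# Conceptual sketch of the Laplace mechanism: noise scale b = sensitivity / epsilon.
# Smaller epsilon => more noise => stronger privacy. Not Themis's internal code.
import numpy as np

def laplace_noise(values, sensitivity=1.0, epsilon=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return [v + rng.laplace(0.0, scale) for v in values]

scores = [85.5, 90.2, 78.9, 92.1, 88.7]
print(laplace_noise(scores, sensitivity=1.0, epsilon=1.0))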

🧪 A/B Testing Framework

  • Model Comparison: Statistical significance testing (see the sketch after this list)
  • Performance Benchmarking: Standardized evaluation protocols
  • Regression Detection: Identify performance degradations
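
The statistical-significance idea can be sketched as a paired t-test over per-prompt scores from two models. This is a generic illustration, not the ModelComparison API; the score arrays and the 5% threshold are assumptions.

# Generic paired t-test over per-prompt scores from two models.
# Illustrative only; Themis's ModelComparison may use a different test.
from scipy import stats

model_a_scores = [0.82, 0.91, 0.77, 0.88, 0.95, 0.70]
model_b_scores = [0.85, 0.93, 0.80, 0.86, 0.97, 0.78]

t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level")
else:
    print("No significant difference detected")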

Real-World Usage Examples

Content Moderation

# Analyze user-generated content for toxicity and bias
themis evaluate --dataset user_posts.json --evaluators toxicity,bias --output moderation_results.json

# Quick toxicity check
themis toxicity --input "This comment from a user"

AI Model Testing

# Evaluate LLM outputs for hallucinations
themis hallucination --input "Model generated text" --ground-truth "Known factual information"

# Comprehensive model evaluation
themis evaluate --dataset model_outputs.json --evaluators hallucination,bias,toxicity --output evaluation_report.json

Research and Analysis

# Analyze bias in AI-generated content
themis bias --input "AI generated response about hiring practices"

# Run full evaluation suite with detailed output
themis --verbose evaluate --dataset research_data.json --evaluators hallucination,bias,toxicity,semantic

Educational Assessment

# Evaluate AI tutoring system responses
from themis import ThemisEvaluator, BiasDetector, HallucinationDetector

evaluator = ThemisEvaluator()
evaluator.add_evaluator(BiasDetector())
evaluator.add_evaluator(HallucinationDetector())

tutor_responses = [
    "Boys are naturally better at math than girls",
    "The mitochondria is the powerhouse of the cell",
    "Climate change is a hoax perpetrated by scientists"
]

ground_truth = [
    "Mathematical ability is not determined by gender",
    "The mitochondria is the powerhouse of the cell", 
    "Climate change is supported by scientific consensus"
]

results = evaluator.evaluate(tutor_responses, ground_truth)

# Analyze for educational suitability
for result in results.results:
    if result.evaluator_name == "BiasDetector":
        bias_score = result.metrics['overall_bias_score']
        if bias_score > 0.5:
            print(f"⚠️ High bias detected: {bias_score:.3f}")
    
    elif result.evaluator_name == "HallucinationDetector":
        accuracy = result.metrics['accuracy']
        if accuracy < 0.7:
            print(f"⚠️ Low factual accuracy: {accuracy:.3f}")

Architecture

Themis follows a modular architecture with these key components:

  • Core Engine: Orchestrates evaluation workflows
  • Evaluator Framework: Pluggable evaluation modules (a hypothetical custom evaluator is sketched after this list)
  • CLI Interface: Command-line tools for easy usage
  • Privacy Module: Differential privacy mechanisms
  • Testing Framework: A/B testing and model comparison
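
Because evaluators are pluggable, a custom check can in principle be registered alongside the built-in ones. The sketch below is hypothetical: the class name, result fields, and method signature are assumptions rather than the documented Themis interface, so check the evaluator base class in the source before writing your own.

# Hypothetical custom evaluator; names and signature are assumptions, not the
# documented Themis interface. The idea: score outputs, return metrics, and
# register the object with ThemisEvaluator.add_evaluator().
class AllCapsDetector:
    name = "AllCapsDetector"

    def evaluate(self, model_outputs, ground_truth=None):
        # Flag outputs written entirely in capital letters ("shouting").
        flagged = [text for text in model_outputs if text.isupper()]
        return {
            "all_caps_rate": len(flagged) / max(len(model_outputs), 1),
            "flagged_count": len(flagged),
        }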

Contributing

We welcome contributions! To get started:

# Development setup
git clone https://github.com/ejigsonpeter/themis.git
cd themis
pip install -e ".[dev]"  # quotes keep shells like zsh from expanding the brackets

# Run tests
python minimal_test.py

# Test CLI
themis demo

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use Themis in your research, please cite:

@software{themis2024,
  title={Themis: AI Evaluation and Testing Framework},
  author={Themis Team},
  year={2024},
  url={https://github.com/ejigsonpeter/themis}
}

Support


Get Started Today:

pip install themis-ai-eval
themis demo
