Comprehensive evaluation library for Gen AI applications using AWS Bedrock

📊 EvalMeter

Measure AI Quality with Precision using AWS Bedrock

A comprehensive evaluation framework for Gen AI applications, powered by AWS Bedrock. EvalMeter provides 11 evaluation metrics across heuristic, statistical, and LLM-as-judge methods to help you measure and improve your AI systems.

🎬 Demo

Quick demo showing project tracking, experiment comparison, and metrics visualization

Key Features Shown:

📁 Projects - Group related experiments and track progress
📊 Dashboard - Overview with key statistics
📈 Progress Charts - Visualize improvements over time
⚖️ Compare - Side-by-side experiment comparison
💬 Comments - Document changes and insights

✨ Key Features

🎯 11 Evaluation Metrics - Heuristic, Statistical, and LLM-as-Judge evaluators
🤖 AWS Bedrock Powered - Claude Sonnet 4.5 and Titan Embeddings V2
📊 Multiple Data Formats - CSV, JSONL, JSON, Parquet support
💾 Local SQLite Storage - Track experiments without external dependencies
🎨 Modern Web UI - React dashboard with real-time visualization
📁 Project Tracking - Group experiments and monitor progress over time
⚡ Simple CLI - One-line commands to run evaluations
🔌 REST API - FastAPI backend for programmatic access
📈 Progress Charts - Visualize improvement trends
🔍 Detailed Metrics - Comprehensive scoring and metadata

📦 Installation

pip install evalmeter

Prerequisites

Python 3.9 or higher
AWS account with Bedrock access
AWS credentials configured

AWS Setup

# Configure AWS credentials
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

Required IAM Permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
        "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-*"
      ]
    }
  ]
}

🚀 Quick Start

1. Prepare Your Data

Create a CSV file with your test cases:

input,output,expected
"What is 2+2?","4","4"
"Capital of France?","Paris","Paris"
"Explain photosynthesis","Plants use sunlight to make food","Photosynthesis is how plants convert light energy into chemical energy"
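If the test cases are produced by a script rather than typed by hand, a small helper can write the file and check the header before an evaluation run. The sketch below uses only the standard library; the three column names come from the example above, while `write_test_cases` and `missing_columns` are hypothetical helper names and not part of EvalMeter.

```python
import csv
import io

# Column names taken from the CSV example above.
REQUIRED_COLUMNS = ("input", "output", "expected")


def write_test_cases(rows, fileobj):
    """Write test cases as CSV with an input/output/expected header."""
    writer = csv.DictWriter(fileobj, fieldnames=REQUIRED_COLUMNS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)


def missing_columns(fileobj):
    """Return the required columns that are absent from a CSV header."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in header]


rows = [
    {"input": "What is 2+2?", "output": "4", "expected": "4"},
    {"input": "Capital of France?", "output": "Paris", "expected": "Paris"},
]
buffer = io.StringIO()
write_test_cases(rows, buffer)
buffer.seek(0)
print(missing_columns(buffer))  # prints [] because all three columns are present
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open("test.csv", "w", newline="")`.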

2. Run Evaluation

# Basic evaluation
evalmeter run --data test.csv --evals "exact_match,bleu,rouge"

# With project tracking
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "baseline" \
  --comments "Initial baseline test" \
  --evals "factuality,relevance,coherence"

# Comprehensive evaluation (all 11 metrics)
evalmeter run --data test.csv \
  --experiment "comprehensive" \
  --evals "exact_match,fuzzy_match,contains,bleu,rouge,levenshtein,cosine_similarity,factuality,relevance,coherence,completeness"

3. View Results in Web UI

# Launch the web UI
./start-ui.sh

# This starts:
# - API server on http://localhost:8000
# - React UI on http://localhost:5173 (opens automatically)

The web UI provides:

📊 Dashboard - Overview of all experiments
📁 Projects - Group related experiments and track progress
📈 Progress Charts - Visualize improvements over time
🔍 Detailed Results - View scores, metrics, and sample-level data
⚖️ Compare - Side-by-side experiment comparison
💬 Comments - Document changes and insights for each experiment

CLI Alternative:

# List experiments
evalmeter list

# Show details
evalmeter show <experiment-id>

📊 Available Evaluators (11 Total)

🎯 Heuristic Evaluators (4)

Evaluator	Description	Use Case
`exact_match`	Binary exact string match	Classification, short answers
`fuzzy_match`	Similarity ratio (0.0-1.0)	Typo tolerance, spelling variations
`contains`	Substring matching	Long answers, key phrase detection
`regex_match`	Pattern matching	Format validation (emails, dates)
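To make the four heuristic checks concrete, here is a minimal standard-library sketch of what each one measures. It is an illustration under assumptions, not EvalMeter's implementation: the function signatures are hypothetical, `fuzzy_match` is approximated with `difflib.SequenceMatcher`, and whether the library normalizes case or whitespace before comparing is not stated in this README.

```python
import re
from difflib import SequenceMatcher


def exact_match(output, expected):
    """1.0 only when the two strings are identical."""
    return 1.0 if output == expected else 0.0


def fuzzy_match(output, expected):
    """Similarity ratio in [0.0, 1.0]; tolerant of typos."""
    return SequenceMatcher(None, output, expected).ratio()


def contains(output, expected):
    """1.0 when the expected phrase appears anywhere in the output."""
    return 1.0 if expected in output else 0.0


def regex_match(output, pattern):
    """1.0 when the output matches the given regular expression."""
    return 1.0 if re.search(pattern, output) else 0.0


print(exact_match("Paris", "Paris"))
print(round(fuzzy_match("Pariss", "Paris"), 2))
print(contains("The capital of France is Paris.", "Paris"))
print(regex_match("2025-11-28", r"^\d{4}-\d{2}-\d{2}$"))
```

The typo example shows why `fuzzy_match` is useful: `exact_match` would score "Pariss" as 0.0, while the ratio stays close to 1.0.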

📈 Statistical Evaluators (4)

Evaluator	Description	Use Case
`bleu`	N-gram precision	Translation, text generation
`rouge`	Recall-oriented matching	Summarization
`levenshtein`	Edit distance similarity	Text similarity
`cosine_similarity`	Semantic similarity via embeddings	Meaning comparison
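The ideas behind these metrics can be sketched in a few lines of plain Python. The code below is a simplified illustration and not what EvalMeter runs: real BLEU combines several n-gram orders with a brevity penalty, ROUGE has multiple variants, and in EvalMeter the vectors for `cosine_similarity` come from Titan embeddings rather than hand-written lists. All function names are hypothetical, and normalizing edit distance by the longer string is one common convention assumed here.

```python
import math
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of word overlap: BLEU leans on the first, ROUGE on the second."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


def levenshtein_distance(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map edit distance to [0, 1] by dividing by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(unigram_overlap("the cat sat", "the cat sat on the mat"))  # (1.0, 0.5)
print(levenshtein_distance("kitten", "sitting"))  # 3
```

The first call shows the precision/recall split: every word of the short candidate appears in the reference (precision 1.0), but it covers only half of the reference (recall 0.5).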

🤖 LLM-as-Judge Evaluators (4)

Evaluator	Description	Use Case
`factuality`	Factual correctness	Accuracy verification
`relevance`	Answer relevance	Relevance checking
`coherence`	Response structure	Quality assessment
`completeness`	Answer coverage	Thoroughness verification
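An LLM-as-judge evaluator generally has two parts around the model call: building a grading prompt and turning the model's reply into a number. The sketch below shows only those two parts, with no Bedrock call. Everything in it is an assumption made for illustration: the prompt wording, the JSON reply format, the 0 to 1 score range, and the function names are not taken from EvalMeter.

```python
import json
import re

# Hypothetical grading prompt; doubled braces are literal braces for str.format.
JUDGE_TEMPLATE = (
    "You are grading an AI answer for {criterion}.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reference: {reference}\n"
    'Reply with JSON only, for example {{"score": 0.8, "reason": "..."}}, '
    "where score is between 0 and 1."
)


def build_judge_prompt(criterion, question, answer, reference):
    """Fill the grading template for one test case."""
    return JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer, reference=reference
    )


def parse_judge_reply(reply):
    """Extract a score clamped to [0, 1] from the reply, or None if unusable."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = payload.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return min(max(float(score), 0.0), 1.0)


prompt = build_judge_prompt("factuality", "Capital of France?", "Paris", "Paris")
print(prompt)
print(parse_judge_reply('Here is my verdict: {"score": 0.75, "reason": "matches"}'))
```

Returning `None` for unparseable replies, rather than 0.0, keeps a formatting failure from being counted as a bad answer.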

💻 Python API

from evalmeter import Evaluator

# Initialize
evaluator = Evaluator(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    aws_region="us-east-1"
)

# Run evaluation
results = evaluator.run(
    data_path="test.csv",
    experiment_name="my-eval",
    project_id="chatbot-v2",
    comments="Testing new prompts",
    evaluators=["factuality", "relevance", "cosine_similarity"]
)

# Print summary
print(results.summary())

# Access metrics
print(f"Factuality: {results.metrics['factuality_mean']:.2f}")
print(f"Relevance: {results.metrics['relevance_mean']:.2f}")

# Iterate results
for result in results:
    print(f"Input: {result['input']}")
    print(f"Scores: {result['scores']}")

🎨 Web UI - Visualize Your Results

Launch the interactive dashboard to view and analyze your evaluation results:

./start-ui.sh

This opens the React UI at http://localhost:5173

Dashboard Pages

📊 Dashboard

Overview of all experiments with key statistics, metrics, and recent activity.

📁 Projects - Track Progress Over Time

Group related experiments to visualize improvements

Create projects to organize experiments and track progress across iterations:

# Baseline
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "baseline" \
  --comments "Initial baseline with default prompts" \
  --evals "factuality,relevance,coherence,completeness"

# After improvements
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "improved-prompts" \
  --comments "Updated system prompts for better accuracy" \
  --evals "factuality,relevance,coherence,completeness"

# With RAG
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "with-rag" \
  --comments "Added RAG with vector database" \
  --evals "factuality,relevance,coherence,completeness"

In the UI:

Navigate to Projects → chatbot-v2
See all experiments in chronological order
View progress chart showing metric improvements over time
Read comments to understand what changed between versions

🔬 Experiments

Browse all evaluation runs, filter by project/status/date, and view summary metrics.

📈 Experiment Details

Click any experiment to see:

Metrics Summary - Mean, min, max for all evaluators
Sample Results - Individual input/output/expected with scores
Comments - Your notes about this experiment
Metadata - Model used, dataset, timestamps
Configuration - Evaluators used and parameters
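The mean, min, and max figures can be reproduced from per-sample scores with a short aggregation. The sketch below assumes each result carries a `scores` mapping from evaluator name to a number, as the Python API example suggests; the `_mean` suffix matches the keys shown there, while `_min` and `_max` and the `summarize` helper itself are hypothetical.

```python
from statistics import mean


def summarize(results):
    """Aggregate per-sample scores into <evaluator>_mean/_min/_max entries."""
    by_evaluator = {}
    for result in results:
        for name, score in result["scores"].items():
            by_evaluator.setdefault(name, []).append(score)

    summary = {}
    for name, scores in by_evaluator.items():
        summary[f"{name}_mean"] = mean(scores)
        summary[f"{name}_min"] = min(scores)
        summary[f"{name}_max"] = max(scores)
    return summary


samples = [
    {"input": "What is 2+2?", "scores": {"factuality": 1.0, "relevance": 0.5}},
    {"input": "Capital of France?", "scores": {"factuality": 0.5, "relevance": 1.0}},
]
print(summarize(samples)["factuality_mean"])  # 0.75
```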

⚖️ Compare

Select two experiments to compare side-by-side, view metric differences, and identify improvements or regressions.
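The same comparison can be scripted once the metrics of two experiments are available as dictionaries. This is a minimal sketch, assuming higher scores are better for every metric; `compare_metrics` and the `tolerance` threshold are hypothetical and not part of EvalMeter.

```python
def compare_metrics(baseline, candidate, tolerance=0.01):
    """Return {metric: (delta, verdict)} for metrics present in both runs."""
    report = {}
    for name in sorted(set(baseline) & set(candidate)):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[name] = (delta, verdict)
    return report


baseline = {"factuality_mean": 0.70, "coherence_mean": 0.90}
with_rag = {"factuality_mean": 0.80, "coherence_mean": 0.60}
for metric, (delta, verdict) in compare_metrics(baseline, with_rag).items():
    print(f"{metric}: {delta:+.2f} ({verdict})")
```

The tolerance keeps small run-to-run noise from being reported as a change; the right value depends on how stable the evaluators are on your data.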

💬 Comments & Documentation

Document your experiments for better tracking

Add a comment to each experiment explaining what changed, why, and what you observed:

evalmeter run --data test.csv \
  --project "qa-bot" \
  --experiment "test-5" \
  --comments "Increased temperature to 0.7 for more creative responses. Added context window of 3 previous messages. Results show better coherence but slightly lower factuality."

View these comments in the UI to understand your experimentation history and make informed decisions!

See docs/PROJECT_TRACKING.md for the complete guide.

Screenshots

Dashboard with experiment overview and statistics
Project tracking with progress charts
Experiment list with filtering and metrics
Detailed metric visualization and trends
Side-by-side experiment comparison with detailed metrics

📖 CLI Reference

Run Evaluation

evalmeter run [OPTIONS]

Options:
  -d, --data PATH       Path to data file (required)
  -e, --experiment TEXT Experiment name
  -p, --project TEXT    Project ID for grouping
  -c, --comments TEXT   Experiment notes
  --evals TEXT          Comma-separated evaluators
  --model TEXT          Bedrock model ID
  --region TEXT         AWS region (default: us-east-1)
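When several runs are launched from a script, it can help to assemble the argument list in one place. The sketch below builds an `evalmeter run` command from the long options listed above; `build_run_command` is a hypothetical helper, and the sketch only constructs and prints the command rather than executing it.

```python
import shlex


def build_run_command(data, evals, experiment=None, project=None,
                      comments=None, model=None, region=None):
    """Build the argv list for `evalmeter run` from the documented options."""
    argv = ["evalmeter", "run", "--data", data, "--evals", ",".join(evals)]
    optional = [
        ("--experiment", experiment),
        ("--project", project),
        ("--comments", comments),
        ("--model", model),
        ("--region", region),
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, value]
    return argv


command = build_run_command(
    "test.csv",
    ["factuality", "relevance"],
    experiment="baseline",
    project="chatbot-v2",
    comments="Initial baseline test",
)
print(shlex.join(command))  # shell-quoted form, safe to paste into a terminal
```

Passing the list to `subprocess.run(command, check=True)` would execute it without going through a shell, which avoids quoting problems in long `--comments` text.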

List Experiments

evalmeter list [OPTIONS]

Options:
  -n, --limit INTEGER   Number to show (default: 10)

Show Details

evalmeter show EXPERIMENT_ID

List Evaluators

evalmeter evaluators

Start API Server

evalmeter-api

🎯 Use Cases

Question Answering

evalmeter run --data qa.csv \
  --evals "cosine_similarity,factuality,relevance,completeness"

Text Generation

evalmeter run --data generation.csv \
  --evals "bleu,rouge,cosine_similarity,coherence"

Summarization

evalmeter run --data summaries.csv \
  --evals "rouge,cosine_similarity,coherence"

💰 Cost Considerations

Evaluator Type	Cost	Speed
Heuristic	Free	⚡⚡⚡ Instant
Statistical	Free	⚡⚡⚡ Instant
Cosine Similarity	AWS Bedrock (Titan Embeddings)	⚡⚡ Fast
LLM-as-Judge	AWS Bedrock (Claude)	⚡ Moderate

Pricing: See AWS Bedrock Pricing for current rates.

Recommendation: Start with the free metrics, then add cosine similarity for semantic understanding, and use LLM judges for final validation.
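Before a large run, it can be useful to check which of the requested evaluators will call Bedrock. The sketch below groups evaluator names by the cost tiers in the table above, using the names from the evaluator tables; the `split_by_cost` helper and the tier labels are hypothetical.

```python
# Evaluator names from the tables in this README, grouped by the cost table.
FREE = {"exact_match", "fuzzy_match", "contains", "regex_match",
        "bleu", "rouge", "levenshtein"}
EMBEDDING = {"cosine_similarity"}  # Bedrock, Titan Embeddings
LLM_JUDGE = {"factuality", "relevance", "coherence", "completeness"}  # Bedrock, Claude


def split_by_cost(evals):
    """Group requested evaluator names into cost tiers."""
    tiers = {"free": [], "embedding": [], "llm_judge": [], "unknown": []}
    for name in evals:
        if name in FREE:
            tiers["free"].append(name)
        elif name in EMBEDDING:
            tiers["embedding"].append(name)
        elif name in LLM_JUDGE:
            tiers["llm_judge"].append(name)
        else:
            tiers["unknown"].append(name)
    return tiers


requested = "bleu,rouge,cosine_similarity,coherence".split(",")
for tier, names in split_by_cost(requested).items():
    print(f"{tier}: {names}")
```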

📂 Project Structure

evalmeter/
├── evalmeter/           # Main package
│   ├── core/           # Core evaluation logic
│   │   ├── evaluators/ # All evaluator implementations
│   │   ├── data_loader.py
│   │   └── evaluator.py
│   ├── storage/        # Database and models
│   ├── api/            # FastAPI server
│   ├── utils/          # Configuration and utilities
│   └── cli.py          # CLI interface
├── ui/                 # React web interface
├── examples/           # Example data and scripts
├── docs/               # Documentation
└── tests/              # Test suite

🗄️ Data Storage

EvalMeter uses SQLite for local storage:

Location: ~/.evalmeter/evalmeter.db
Tables: experiments, results, metrics
Capacity: Millions of records
No external dependencies
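Because the store is a plain SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below opens the file read-only and counts rows per table. It discovers table names from `sqlite_master` instead of assuming a schema, since the column layout is not documented here; `table_row_counts` is a hypothetical helper.

```python
import sqlite3
from pathlib import Path

# Location stated above.
DEFAULT_DB = Path.home() / ".evalmeter" / "evalmeter.db"


def table_row_counts(db_path=DEFAULT_DB):
    """Return {table_name: row_count} for every table in the SQLite file."""
    # mode=ro opens the database read-only, so inspection cannot modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

Calling `table_row_counts()` after at least one evaluation run should list the tables named above; if the file does not exist yet, `sqlite3.connect` raises `sqlite3.OperationalError`.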

📚 Documentation

Quick Start: This README
Evaluators Guide: See docs/EVALUATORS.md
Project Tracking: See docs/PROJECT_TRACKING.md
Examples: See examples/ directory

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

AWS Bedrock - For providing Claude and Titan models
Anthropic - For Claude Sonnet 4.5
Amazon - For Titan Embeddings V2
NLTK, Rouge, Levenshtein - For statistical metrics

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions

🌟 Star History

If you find EvalMeter useful, please consider giving it a star on GitHub!

Made with ❤️ for the AI community

Release history

This version: 0.1.0 (Nov 28, 2025)

Download files

Source Distribution

evalmeter-0.1.0.tar.gz (5.9 MB)

Uploaded Nov 28, 2025 Source

Built Distribution

evalmeter-0.1.0-py3-none-any.whl (31.5 kB)

Uploaded Nov 28, 2025 Python 3

File details

Details for the file evalmeter-0.1.0.tar.gz.

File metadata

Download URL: evalmeter-0.1.0.tar.gz
Upload date: Nov 28, 2025
Size: 5.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for evalmeter-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`16e613c06c7c92872ca353faf049989751ebfe00f8073d64eb7c786e2cf31e1e`
MD5	`e857ae9b876926b59c017c63e581a931`
BLAKE2b-256	`e0f955981622456f26ae8b8bf8a76f49cea3337a73b36bd037112171adbfa0e4`

File details

Details for the file evalmeter-0.1.0-py3-none-any.whl.

File metadata

Download URL: evalmeter-0.1.0-py3-none-any.whl
Upload date: Nov 28, 2025
Size: 31.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for evalmeter-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`db68a8bae35d3fcdf4044c13244d26ff0dcd22fb2988b796f2256db8a4dfa98d`
MD5	`b51a937ac07d0ff89376857b15d23c50`
BLAKE2b-256	`9199c57b059e22e2dbfed0df2e63283a55d16bdbfd2901131f8da1f90e05cefe`
