Comprehensive evaluation library for Gen AI applications using AWS Bedrock

📊 EvalMeter

Measure AI Quality with Precision using AWS Bedrock

A comprehensive evaluation framework for Gen AI applications, powered by AWS Bedrock. EvalMeter provides 11 evaluation metrics across heuristic, statistical, and LLM-as-judge methods to help you measure and improve your AI systems.

Python 3.9+ | License: MIT


🎬 Demo

Quick demo showing project tracking, experiment comparison, and metrics visualization

Key Features Shown:

  • ๐Ÿ“ Projects - Group related experiments and track progress
  • ๐Ÿ“Š Dashboard - Overview with key statistics
  • ๐Ÿ“ˆ Progress Charts - Visualize improvements over time
  • โš–๏ธ Compare - Side-by-side experiment comparison
  • ๐Ÿ’ฌ Comments - Document changes and insights

✨ Key Features

  • 🎯 11 Evaluation Metrics - Heuristic, Statistical, and LLM-as-Judge evaluators
  • 🤖 AWS Bedrock Powered - Claude Sonnet 4.5 and Titan Embeddings V2
  • 📊 Multiple Data Formats - CSV, JSONL, JSON, Parquet support
  • 💾 Local SQLite Storage - Track experiments without external dependencies
  • 🎨 Modern Web UI - React dashboard with real-time visualization
  • 📁 Project Tracking - Group experiments and monitor progress over time
  • ⚡ Simple CLI - One-line commands to run evaluations
  • 🔌 REST API - FastAPI backend for programmatic access
  • 📈 Progress Charts - Visualize improvement trends
  • 🔍 Detailed Metrics - Comprehensive scoring and metadata

📦 Installation

pip install evalmeter

Prerequisites

  • Python 3.9 or higher
  • AWS account with Bedrock access
  • AWS credentials configured

AWS Setup

# Configure AWS credentials
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

Required IAM Permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
        "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-*"
      ]
    }
  ]
}

🚀 Quick Start

1. Prepare Your Data

Create a CSV file with your test cases:

input,output,expected
"What is 2+2?","4","4"
"Capital of France?","Paris","Paris"
"Explain photosynthesis","Plants use sunlight to make food","Photosynthesis is how plants convert light energy into chemical energy"

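If the test cases come from a script rather than a hand-edited file, Python's csv module takes care of quoting commas and embedded quotes. The following is a minimal sketch, assuming the three column names from the example above; the helper name write_test_cases and the sample rows are illustrative and not part of EvalMeter:

```python
import csv
import io

# Column names follow the example above: input, output, expected.
FIELDNAMES = ["input", "output", "expected"]


def write_test_cases(rows, handle):
    """Write rows (dicts keyed by FIELDNAMES) as CSV, quoting every field."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample rows; the second one contains commas and quotes.
rows = [
    {"input": "What is 2+2?", "output": "4", "expected": "4"},
    {
        "input": 'Define "RAG", briefly',
        "output": "Retrieval, then generation",
        "expected": "Retrieval-augmented generation",
    },
]

buffer = io.StringIO()
write_test_cases(rows, buffer)
csv_text = buffer.getvalue()

# Round-trip check: parsing the text gives back the original rows.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

To produce a file instead of an in-memory string, pass a handle opened with open("test.csv", "w", newline="", encoding="utf-8").
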
2. Run Evaluation

# Basic evaluation
evalmeter run --data test.csv --evals "exact_match,bleu,rouge"

# With project tracking
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "baseline" \
  --comments "Initial baseline test" \
  --evals "factuality,relevance,coherence"

# Comprehensive evaluation (11 metrics; regex_match is not included in this list)
evalmeter run --data test.csv \
  --experiment "comprehensive" \
  --evals "exact_match,fuzzy_match,contains,bleu,rouge,levenshtein,cosine_similarity,factuality,relevance,coherence,completeness"

3. View Results in Web UI

# Launch the web UI
./start-ui.sh

# This starts:
# - API server on http://localhost:8000
# - React UI on http://localhost:5173 (opens automatically)

The web UI provides:

  • 📊 Dashboard - Overview of all experiments
  • 📁 Projects - Group related experiments and track progress
  • 📈 Progress Charts - Visualize improvements over time
  • 🔍 Detailed Results - View scores, metrics, and sample-level data
  • ⚖️ Compare - Side-by-side experiment comparison
  • 💬 Comments - Document changes and insights for each experiment

CLI Alternative:

# List experiments
evalmeter list

# Show details
evalmeter show <experiment-id>

📊 Available Evaluators (11 Total)

🎯 Heuristic Evaluators (4)

| Evaluator | Description | Use Case |
| --- | --- | --- |
| exact_match | Binary exact string match | Classification, short answers |
| fuzzy_match | Similarity ratio (0.0-1.0) | Typo tolerance, spelling variations |
| contains | Substring matching | Long answers, key phrase detection |
| regex_match | Pattern matching | Format validation (emails, dates) |
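
To make the four heuristic checks concrete, here is a standard-library sketch of what each one measures. It is not EvalMeter's implementation: the function signatures, the case-sensitive comparison, the direction of the substring test (expected inside output), and the use of difflib for the similarity ratio are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> float:
    """1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if output == expected else 0.0


def fuzzy_match(output: str, expected: str) -> float:
    """Similarity ratio in [0.0, 1.0], computed with difflib (an assumption)."""
    return SequenceMatcher(None, output, expected).ratio()


def contains(output: str, expected: str) -> float:
    """1.0 when the expected text appears somewhere in the output."""
    return 1.0 if expected in output else 0.0


def regex_match(output: str, pattern: str) -> float:
    """1.0 when the whole output matches the regular expression."""
    return 1.0 if re.fullmatch(pattern, output) else 0.0


if __name__ == "__main__":
    print(exact_match("Paris", "Paris"))
    print(round(fuzzy_match("Pariss", "Paris"), 3))
    print(contains("The capital is Paris.", "Paris"))
    print(regex_match("2024-01-31", r"\d{4}-\d{2}-\d{2}"))
```

Because none of these need a model call, they suit a first pass over a dataset before any Bedrock-backed metric is run.
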

📈 Statistical Evaluators (4)

| Evaluator | Description | Use Case |
| --- | --- | --- |
| bleu | N-gram precision | Translation, text generation |
| rouge | Recall-oriented matching | Summarization |
| levenshtein | Edit distance similarity | Text similarity |
| cosine_similarity | Semantic similarity via embeddings | Meaning comparison |
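
Two of these metrics are short enough to sketch in plain Python. The code below illustrates the underlying formulas only: how EvalMeter normalizes edit distance is not documented here, so dividing by the longer string's length is an assumption, and the vectors passed to cosine_similarity stand in for embeddings that would normally come from a model.

```python
import math


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0.0, 1.0]; normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Cosine similarity depends only on direction, so two embeddings that differ by a scale factor score as identical.
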

🤖 LLM-as-Judge Evaluators (4)

| Evaluator | Description | Use Case |
| --- | --- | --- |
| factuality | Factual correctness | Accuracy verification |
| relevance | Answer relevance | Relevance checking |
| coherence | Response structure | Quality assessment |
| completeness | Answer coverage | Thoroughness verification |
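
The general LLM-as-judge pattern is to ask a model to grade an answer against a criterion and then parse a numeric score out of its reply. The sketch below shows only that pattern and makes no Bedrock call. The prompt wording, the 0-to-1 scale, the JSON reply format, and both function names are assumptions, not EvalMeter's actual prompts or parsing logic.

```python
import json
import re

# Hypothetical judge prompt; EvalMeter's real prompts may differ.
JUDGE_TEMPLATE = (
    "You are grading an answer for {criterion}.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reference: {reference}\n"
    'Reply with JSON only, for example {{"score": 0.8, "reason": "..."}}. '
    "The score must be between 0 and 1."
)


def build_judge_prompt(criterion: str, question: str, answer: str, reference: str) -> str:
    """Fill the template for one test case."""
    return JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer, reference=reference
    )


def parse_judge_reply(reply: str) -> float:
    """Extract the score from a judge reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    score = float(json.loads(match.group(0))["score"])
    return min(1.0, max(0.0, score))
```

Clamping and strict parsing matter in practice because a judge model can wrap its JSON in extra text or return a value outside the requested range.
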

💻 Python API

from evalmeter import Evaluator

# Initialize
evaluator = Evaluator(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    aws_region="us-east-1"
)

# Run evaluation
results = evaluator.run(
    data_path="test.csv",
    experiment_name="my-eval",
    project_id="chatbot-v2",
    comments="Testing new prompts",
    evaluators=["factuality", "relevance", "cosine_similarity"]
)

# Print summary
print(results.summary())

# Access metrics
print(f"Factuality: {results.metrics['factuality_mean']:.2f}")
print(f"Relevance: {results.metrics['relevance_mean']:.2f}")

# Iterate results
for result in results:
    print(f"Input: {result['input']}")
    print(f"Scores: {result['scores']}")
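
Summary keys such as factuality_mean can be pictured as simple aggregates over the per-sample scores. The sketch below assumes each result's scores entry is a dict mapping evaluator name to a numeric score; the summarize_scores helper and the _min and _max key names are hypothetical, since only the _mean keys appear in the example above.

```python
from statistics import mean


def summarize_scores(results):
    """Aggregate per-sample scores into mean, min, and max per evaluator."""
    per_metric = {}
    for result in results:
        for name, score in result["scores"].items():
            per_metric.setdefault(name, []).append(score)

    summary = {}
    for name, values in per_metric.items():
        summary[f"{name}_mean"] = mean(values)
        summary[f"{name}_min"] = min(values)  # key name is an assumption
        summary[f"{name}_max"] = max(values)  # key name is an assumption
    return summary


# Hypothetical per-sample results, shaped like the loop above.
sample_results = [
    {"input": "What is 2+2?", "scores": {"factuality": 1.0, "relevance": 1.0}},
    {"input": "Capital of France?", "scores": {"factuality": 0.5, "relevance": 1.0}},
]

if __name__ == "__main__":
    for key, value in sorted(summarize_scores(sample_results).items()):
        print(f"{key}: {value:.2f}")
```
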

🎨 Web UI - Visualize Your Results

Launch the interactive dashboard to view and analyze your evaluation results:

./start-ui.sh

This opens the React UI at http://localhost:5173

Dashboard Pages

📊 Dashboard

Overview of all experiments with key statistics, metrics, and recent activity.

๐Ÿ“ Projects - Track Progress Over Time

Group related experiments to visualize improvements

Create projects to organize experiments and track progress across iterations:

# Baseline
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "baseline" \
  --comments "Initial baseline with default prompts" \
  --evals "factuality,relevance,coherence,completeness"

# After improvements
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "improved-prompts" \
  --comments "Updated system prompts for better accuracy" \
  --evals "factuality,relevance,coherence,completeness"

# With RAG
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "with-rag" \
  --comments "Added RAG with vector database" \
  --evals "factuality,relevance,coherence,completeness"

In the UI:

  1. Navigate to Projects → chatbot-v2
  2. See all experiments in chronological order
  3. View progress chart showing metric improvements over time
  4. Read comments to understand what changed between versions

🔬 Experiments

Browse all evaluation runs, filter by project/status/date, and view summary metrics.

📈 Experiment Details

Click any experiment to see:

  • Metrics Summary - Mean, min, max for all evaluators
  • Sample Results - Individual input/output/expected with scores
  • Comments - Your notes about this experiment
  • Metadata - Model used, dataset, timestamps
  • Configuration - Evaluators used and parameters

โš–๏ธ Compare

Select two experiments to compare side-by-side, view metric differences, and identify improvements or regressions.
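
The same comparison can be done in a script by diffing two metric dictionaries. This is a minimal sketch, assuming metrics are available as name-to-value dicts like results.metrics in the Python API section; the compare_metrics helper, the tolerance threshold, and the numbers are hypothetical and chosen only for illustration.

```python
def compare_metrics(baseline, candidate, tolerance=0.01):
    """Return (name, baseline, candidate, delta, verdict) for shared metrics."""
    rows = []
    for name in sorted(set(baseline) & set(candidate)):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        rows.append((name, baseline[name], candidate[name], delta, verdict))
    return rows


# Hypothetical metric values for two experiments.
baseline = {"factuality_mean": 0.70, "coherence_mean": 0.80}
candidate = {"factuality_mean": 0.82, "coherence_mean": 0.795}

if __name__ == "__main__":
    for name, old, new, delta, verdict in compare_metrics(baseline, candidate):
        print(f"{name}: {old:.3f} -> {new:.3f} ({delta:+.3f}, {verdict})")
```

The tolerance keeps small fluctuations from being reported as improvements or regressions, which is useful when LLM-judge scores vary between runs.
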

💬 Comments & Documentation

Document your experiments for better tracking

Add comments to every experiment explaining what changed, why it changed, and what you observed:

evalmeter run --data test.csv \
  --project "qa-bot" \
  --experiment "test-5" \
  --comments "Increased temperature to 0.7 for more creative responses. Added context window of 3 previous messages. Results show better coherence but slightly lower factuality."

View these comments in the UI to understand your experimentation history and make informed decisions!

See docs/PROJECT_TRACKING.md for the complete guide.

Screenshots

  • Dashboard - Dashboard with experiment overview and statistics
  • Projects - Project tracking with progress charts
  • Experiments - Experiment list with filtering and metrics
  • Metric Graphs - Detailed metric visualization and trends
  • Visual Comparison - Side-by-side experiment comparison with detailed metrics


📖 CLI Reference

Run Evaluation

evalmeter run [OPTIONS]

Options:
  -d, --data PATH       Path to data file (required)
  -e, --experiment TEXT Experiment name
  -p, --project TEXT    Project ID for grouping
  -c, --comments TEXT   Experiment notes
  --evals TEXT          Comma-separated evaluators
  --model TEXT          Bedrock model ID
  --region TEXT         AWS region (default: us-east-1)

List Experiments

evalmeter list [OPTIONS]

Options:
  -n, --limit INTEGER   Number to show (default: 10)

Show Details

evalmeter show EXPERIMENT_ID

List Evaluators

evalmeter evaluators

Start API Server

evalmeter-api

🎯 Use Cases

Question Answering

evalmeter run --data qa.csv \
  --evals "cosine_similarity,factuality,relevance,completeness"

Text Generation

evalmeter run --data generation.csv \
  --evals "bleu,rouge,cosine_similarity,coherence"

Summarization

evalmeter run --data summaries.csv \
  --evals "rouge,cosine_similarity,coherence"

💰 Cost Considerations

| Evaluator Type | Cost | Speed |
| --- | --- | --- |
| Heuristic | Free | ⚡⚡⚡ Instant |
| Statistical | Free | ⚡⚡⚡ Instant |
| Cosine Similarity | AWS Bedrock (Titan Embeddings) | ⚡⚡ Fast |
| LLM-as-Judge | AWS Bedrock (Claude) | ⚡ Moderate |

Pricing: See AWS Bedrock Pricing for current rates.

Recommendation: Start with free metrics, add cosine similarity for semantic understanding, use LLM judges for final validation.


📂 Project Structure

evalmeter/
├── evalmeter/           # Main package
│   ├── core/           # Core evaluation logic
│   │   ├── evaluators/ # All evaluator implementations
│   │   ├── data_loader.py
│   │   └── evaluator.py
│   ├── storage/        # Database and models
│   ├── api/            # FastAPI server
│   ├── utils/          # Configuration and utilities
│   └── cli.py          # CLI interface
├── ui/                 # React web interface
├── examples/           # Example data and scripts
├── docs/               # Documentation
└── tests/              # Test suite

๐Ÿ—„๏ธ Data Storage

EvalMeter uses SQLite for local storage:

  • Location: ~/.evalmeter/evalmeter.db
  • Tables: experiments, results, metrics
  • Capacity: Millions of records
  • No external dependencies
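
Because the store is an ordinary SQLite file, it can be inspected with Python's built-in sqlite3 module. The sketch below assumes the default location listed above and reads only table names and row counts, since the column layout is not documented here; the list_tables helper is illustrative and not part of EvalMeter.

```python
import sqlite3
from pathlib import Path

# Default location taken from the list above.
DEFAULT_DB = Path.home() / ".evalmeter" / "evalmeter.db"


def list_tables(db_path):
    """Return {table_name: row_count} for a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()


if DEFAULT_DB.exists():
    for table, count in list_tables(DEFAULT_DB).items():
        print(f"{table}: {count} rows")
```

Opening the file with mode=ro avoids creating an empty database when the path is wrong and prevents accidental writes to experiment history.
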

📚 Documentation

  • Quick Start: This README
  • Evaluators Guide: See docs/EVALUATORS.md
  • Project Tracking: See docs/PROJECT_TRACKING.md
  • Examples: See examples/ directory

๐Ÿค Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

  • AWS Bedrock - For providing Claude and Titan models
  • Anthropic - For Claude Sonnet 4.5
  • Amazon - For Titan Embeddings V2
  • NLTK, Rouge, Levenshtein - For statistical metrics

🌟 Star History

If you find EvalMeter useful, please consider giving it a star on GitHub!


Made with ❤️ for the AI community
