Comprehensive evaluation library for Gen AI applications using AWS Bedrock
EvalMeter
Measure AI Quality with Precision using AWS Bedrock
A comprehensive evaluation framework for Gen AI applications, powered by AWS Bedrock. EvalMeter provides 11 evaluation metrics across heuristic, statistical, and LLM-as-judge methods to help you measure and improve your AI systems.
Demo
Quick demo showing project tracking, experiment comparison, and metrics visualization
Key Features Shown:
- Projects - Group related experiments and track progress
- Dashboard - Overview with key statistics
- Progress Charts - Visualize improvements over time
- Compare - Side-by-side experiment comparison
- Comments - Document changes and insights
Key Features
- 11 Evaluation Metrics - Heuristic, Statistical, and LLM-as-Judge evaluators
- AWS Bedrock Powered - Claude Sonnet 4.5 and Titan Embeddings V2
- Multiple Data Formats - CSV, JSONL, JSON, Parquet support
- Local SQLite Storage - Track experiments without external dependencies
- Modern Web UI - React dashboard with real-time visualization
- Project Tracking - Group experiments and monitor progress over time
- Simple CLI - One-line commands to run evaluations
- REST API - FastAPI backend for programmatic access
- Progress Charts - Visualize improvement trends
- Detailed Metrics - Comprehensive scoring and metadata
Installation
pip install evalmeter
Prerequisites
- Python 3.9 or higher
- AWS account with Bedrock access
- AWS credentials configured
AWS Setup
# Configure AWS credentials
aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
Required IAM Permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
        "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-*"
      ]
    }
  ]
}
Quick Start
1. Prepare Your Data
Create a CSV file with your test cases:
input,output,expected
"What is 2+2?","4","4"
"Capital of France?","Paris","Paris"
"Explain photosynthesis","Plants use sunlight to make food","Photosynthesis is how plants convert light energy into chemical energy"
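The same file can also be produced from Python, which takes care of quoting when a field contains commas or quotes. This is a minimal sketch using only the standard library; `write_test_cases` and `read_test_cases` are hypothetical helper names, not part of EvalMeter, and the column names are taken from the example above.

```python
import csv
from pathlib import Path

# Column names taken from the CSV example above.
FIELDNAMES = ["input", "output", "expected"]


def write_test_cases(path, cases):
    """Write test cases (dicts keyed by FIELDNAMES) to a CSV file."""
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(cases)


def read_test_cases(path):
    """Read test cases back, checking that the expected columns exist."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = set(FIELDNAMES) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return list(reader)


if __name__ == "__main__":
    write_test_cases("test.csv", [
        {"input": "What is 2+2?", "output": "4", "expected": "4"},
        {"input": "Capital of France?", "output": "Paris", "expected": "Paris"},
    ])
    print(len(read_test_cases("test.csv")), "test cases written")
```

Reading the file back before running an evaluation is a cheap way to catch a missing column early.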
2. Run Evaluation
# Basic evaluation
evalmeter run --data test.csv --evals "exact_match,bleu,rouge"
# With project tracking
evalmeter run --data test.csv \
--project "chatbot-v2" \
--experiment "baseline" \
--comments "Initial baseline test" \
--evals "factuality,relevance,coherence"
# Comprehensive evaluation (11 metrics; regex_match is not included)
evalmeter run --data test.csv \
--experiment "comprehensive" \
--evals "exact_match,fuzzy_match,contains,bleu,rouge,levenshtein,cosine_similarity,factuality,relevance,coherence,completeness"
3. View Results in Web UI
# Launch the web UI
./start-ui.sh
# This starts:
# - API server on http://localhost:8000
# - React UI on http://localhost:5173 (opens automatically)
The web UI provides:
- Dashboard - Overview of all experiments
- Projects - Group related experiments and track progress
- Progress Charts - Visualize improvements over time
- Detailed Results - View scores, metrics, and sample-level data
- Compare - Side-by-side experiment comparison
- Comments - Document changes and insights for each experiment
CLI Alternative:
# List experiments
evalmeter list
# Show details
evalmeter show <experiment-id>
Available Evaluators (11 Total)
Heuristic Evaluators (4)
| Evaluator | Description | Use Case |
|---|---|---|
| exact_match | Binary exact string match | Classification, short answers |
| fuzzy_match | Similarity ratio (0.0-1.0) | Typo tolerance, spelling variations |
| contains | Substring matching | Long answers, key phrase detection |
| regex_match | Pattern matching | Format validation (emails, dates) |
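To make the four heuristic checks concrete, here is a minimal standard-library sketch of what each one computes. It is an illustration, not EvalMeter's implementation: the function signatures are assumptions, and the README does not say whether the library normalizes case or whitespace, or which similarity ratio `fuzzy_match` uses (difflib's `SequenceMatcher` is swapped in here).

```python
import re
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> float:
    """1.0 if the two strings are identical, else 0.0."""
    return 1.0 if output == expected else 0.0


def fuzzy_match(output: str, expected: str) -> float:
    """Similarity ratio in [0.0, 1.0] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, output, expected).ratio()


def contains(output: str, expected: str) -> float:
    """1.0 if the expected text appears as a substring of the output."""
    return 1.0 if expected in output else 0.0


def regex_match(output: str, pattern: str) -> float:
    """1.0 if the whole output matches the regular expression."""
    return 1.0 if re.fullmatch(pattern, output) else 0.0


if __name__ == "__main__":
    print(exact_match("Paris", "Paris"))
    print(fuzzy_match("Pariss", "Paris"))
    print(contains("The capital is Paris.", "Paris"))
    print(regex_match("2024-01-31", r"\d{4}-\d{2}-\d{2}"))
```

All four run locally without any model call, which is why this group is listed as free and instant in the cost table.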
Statistical Evaluators (4)
| Evaluator | Description | Use Case |
|---|---|---|
| bleu | N-gram precision | Translation, text generation |
| rouge | Recall-oriented matching | Summarization |
| levenshtein | Edit distance similarity | Text similarity |
| cosine_similarity | Semantic similarity via embeddings | Meaning comparison |
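The ideas behind the statistical metrics in the table above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not what EvalMeter runs: full BLEU also combines several n-gram orders and applies a brevity penalty, ROUGE has several variants, and in EvalMeter the cosine score is computed over Titan embeddings rather than hand-written vectors. All function names here are hypothetical.

```python
import math
from collections import Counter


def ngram_precision(output: str, expected: str, n: int = 1) -> float:
    """Clipped n-gram precision, the building block of BLEU."""
    out_tokens, exp_tokens = output.split(), expected.split()
    out_ngrams = Counter(zip(*(out_tokens[i:] for i in range(n))))
    exp_ngrams = Counter(zip(*(exp_tokens[i:] for i in range(n))))
    total = sum(out_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, exp_ngrams[g]) for g, c in out_ngrams.items())
    return overlap / total


def unigram_recall(output: str, expected: str) -> float:
    """Share of reference words found in the output (ROUGE-1 style recall)."""
    out_counts, exp_counts = Counter(output.split()), Counter(expected.split())
    total = sum(exp_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, out_counts[t]) for t, c in exp_counts.items())
    return overlap / total


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / longest_length, so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Precision penalizes extra words in the output, while recall penalizes missing reference words, which is why the table pairs BLEU with generation and ROUGE with summarization.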
LLM-as-Judge Evaluators (4)
| Evaluator | Description | Use Case |
|---|---|---|
| factuality | Factual correctness | Accuracy verification |
| relevance | Answer relevance | Relevance checking |
| coherence | Response structure | Quality assessment |
| completeness | Answer coverage | Thoroughness verification |
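The general LLM-as-judge pattern is to build a grading prompt, send it to a model, and parse a numeric score from the reply. The sketch below shows that pattern only; the prompt wording, the `SCORE:` reply format, and every function name are assumptions for illustration, since the README does not show the prompts EvalMeter actually sends to Claude. The model call is passed in as a plain callable so the example runs without AWS access.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; EvalMeter's real prompts are not documented here.
JUDGE_TEMPLATE = """You are grading an AI answer for {criterion}.
Question: {input}
Answer: {output}
Reference: {expected}
Reply with a single line of the form "SCORE: <number between 0 and 1>"."""


def build_judge_prompt(criterion: str, input_text: str, output: str, expected: str) -> str:
    """Fill the grading template for one sample."""
    return JUDGE_TEMPLATE.format(
        criterion=criterion, input=input_text, output=output, expected=expected
    )


def parse_score(reply: str) -> Optional[float]:
    """Extract the score from the judge's reply, clamped to [0, 1]."""
    match = re.search(r"SCORE:\s*([0-9]*\.?[0-9]+)", reply)
    if match is None:
        return None  # the judge did not follow the reply format
    return max(0.0, min(1.0, float(match.group(1))))


def judge(criterion: str, input_text: str, output: str, expected: str,
          invoke: Callable[[str], str]) -> Optional[float]:
    """Score one sample; `invoke` sends a prompt to a model and returns its text."""
    prompt = build_judge_prompt(criterion, input_text, output, expected)
    return parse_score(invoke(prompt))


if __name__ == "__main__":
    # A stand-in for a real model call, used only to exercise the flow.
    fake_model = lambda prompt: "SCORE: 1.0"
    print(judge("factuality", "Capital of France?", "Paris", "Paris", fake_model))
```

Returning `None` for an unparseable reply keeps malformed judge output from being silently counted as a zero score.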
Python API
from evalmeter import Evaluator
# Initialize
evaluator = Evaluator(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    aws_region="us-east-1"
)
# Run evaluation
results = evaluator.run(
    data_path="test.csv",
    experiment_name="my-eval",
    project_id="chatbot-v2",
    comments="Testing new prompts",
    evaluators=["factuality", "relevance", "cosine_similarity"]
)
# Print summary
print(results.summary())
# Access metrics
print(f"Factuality: {results.metrics['factuality_mean']:.2f}")
print(f"Relevance: {results.metrics['relevance_mean']:.2f}")
# Iterate results
for result in results:
    print(f"Input: {result['input']}")
    print(f"Scores: {result['scores']}")
Web UI - Visualize Your Results
Launch the interactive dashboard to view and analyze your evaluation results:
./start-ui.sh
This opens the React UI at http://localhost:5173
Dashboard Pages
Dashboard
Overview of all experiments with key statistics, metrics, and recent activity.
Projects - Track Progress Over Time
Group related experiments to visualize improvements
Create projects to organize experiments and track progress across iterations:
# Baseline
evalmeter run --data test.csv \
--project "chatbot-v2" \
--experiment "baseline" \
--comments "Initial baseline with default prompts" \
--evals "factuality,relevance,coherence,completeness"
# After improvements
evalmeter run --data test.csv \
--project "chatbot-v2" \
--experiment "improved-prompts" \
--comments "Updated system prompts for better accuracy" \
--evals "factuality,relevance,coherence,completeness"
# With RAG
evalmeter run --data test.csv \
--project "chatbot-v2" \
--experiment "with-rag" \
--comments "Added RAG with vector database" \
--evals "factuality,relevance,coherence,completeness"
In the UI:
- Navigate to Projects → chatbot-v2
- See all experiments in chronological order
- View progress chart showing metric improvements over time
- Read comments to understand what changed between versions
Experiments
Browse all evaluation runs, filter by project/status/date, and view summary metrics.
Experiment Details
Click any experiment to see:
- Metrics Summary - Mean, min, max for all evaluators
- Sample Results - Individual input/output/expected with scores
- Comments - Your notes about this experiment
- Metadata - Model used, dataset, timestamps
- Configuration - Evaluators used and parameters
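The mean, min, and max shown in the metrics summary are simple aggregates over per-sample scores. The sketch below reproduces that aggregation, assuming each result is a dict whose `scores` entry maps evaluator names to numbers; the `scores` key appears in the Python API example, but its exact shape is an assumption, and `summarize_scores` is a hypothetical helper.

```python
from statistics import mean


def summarize_scores(results):
    """Aggregate per-sample scores into mean/min/max per evaluator."""
    by_evaluator = {}
    for result in results:
        # Assumed shape: result["scores"] == {"factuality": 0.9, ...}
        for name, score in result["scores"].items():
            by_evaluator.setdefault(name, []).append(score)
    return {
        name: {"mean": mean(scores), "min": min(scores), "max": max(scores)}
        for name, scores in by_evaluator.items()
    }
```

Looking at min alongside mean helps spot a single badly failing sample that an average would hide.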
Compare
Select two experiments to compare side-by-side, view metric differences, and identify improvements or regressions.
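The same comparison can be scripted once two runs' metrics are available as dicts, for example from `results.metrics` in the Python API. This is a sketch, not the UI's logic: `compare_metrics` is a hypothetical helper, it assumes higher scores are better, and the `tolerance` threshold for treating a change as noise is an arbitrary illustrative choice.

```python
def compare_metrics(baseline, candidate, tolerance=0.01):
    """Return per-metric deltas and a verdict for metrics present in both runs."""
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[name] = {
            "baseline": baseline[name],
            "candidate": candidate[name],
            "delta": delta,
            "verdict": verdict,
        }
    return report
```

Metrics that exist in only one of the two runs are skipped, so experiments run with different evaluator sets can still be compared on their overlap.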
Comments & Documentation
Document your experiments for better tracking
Add comments to every experiment explaining what changed, why, and observations:
evalmeter run --data test.csv \
--project "qa-bot" \
--experiment "test-5" \
--comments "Increased temperature to 0.7 for more creative responses. Added context window of 3 previous messages. Results show better coherence but slightly lower factuality."
View these comments in the UI to understand your experimentation history and make informed decisions!
See docs/PROJECT_TRACKING.md for the complete guide.
Screenshots
- Dashboard with experiment overview and statistics
- Project tracking with progress charts
- Experiment list with filtering and metrics
- Detailed metric visualization and trends
- Side-by-side experiment comparison with detailed metrics
CLI Reference
Run Evaluation
evalmeter run [OPTIONS]
Options:
  -d, --data PATH         Path to data file (required)
  -e, --experiment TEXT   Experiment name
  -p, --project TEXT      Project ID for grouping
  -c, --comments TEXT     Experiment notes
  --evals TEXT            Comma-separated evaluators
  --model TEXT            Bedrock model ID
  --region TEXT           AWS region (default: us-east-1)
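When evaluations are launched from a script or CI job, it can help to assemble the command as an argument list rather than a shell string, so that comments containing spaces or quotes need no manual escaping. The sketch below uses only the flags listed above; `build_run_command` is a hypothetical helper, not part of EvalMeter.

```python
import shlex


def build_run_command(data, evals, experiment=None, project=None,
                      comments=None, model=None, region=None):
    """Build the argv list for `evalmeter run` from the documented options."""
    argv = ["evalmeter", "run", "--data", str(data), "--evals", ",".join(evals)]
    optional = {
        "--experiment": experiment,
        "--project": project,
        "--comments": comments,
        "--model": model,
        "--region": region,
    }
    for flag, value in optional.items():
        if value is not None:
            argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    argv = build_run_command(
        "test.csv",
        ["factuality", "relevance"],
        project="chatbot-v2",
        comments="Initial baseline test",
    )
    # shlex.join shows the equivalent, safely quoted shell command.
    print(shlex.join(argv))
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`.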
List Experiments
evalmeter list [OPTIONS]
Options:
  -n, --limit INTEGER     Number to show (default: 10)
Show Details
evalmeter show EXPERIMENT_ID
List Evaluators
evalmeter evaluators
Start API Server
evalmeter-api
Use Cases
Question Answering
evalmeter run --data qa.csv \
--evals "cosine_similarity,factuality,relevance,completeness"
Text Generation
evalmeter run --data generation.csv \
--evals "bleu,rouge,cosine_similarity,coherence"
Summarization
evalmeter run --data summaries.csv \
--evals "rouge,cosine_similarity,coherence"
Cost Considerations
| Evaluator Type | Cost | Speed |
|---|---|---|
| Heuristic | Free | ⚡⚡⚡ Instant |
| Statistical | Free | ⚡⚡⚡ Instant |
| Cosine Similarity | AWS Bedrock (Titan Embeddings) | ⚡⚡ Fast |
| LLM-as-Judge | AWS Bedrock (Claude) | ⚡ Moderate |
Pricing: See AWS Bedrock Pricing for current rates.
Recommendation: Start with free metrics, add cosine similarity for semantic understanding, use LLM judges for final validation.
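One way to apply this recommendation in a script is to group a requested evaluator list by cost tier and run the tiers in order. The mapping below is derived from the evaluator and cost tables in this README; the tier labels and the `split_by_cost` helper are hypothetical names chosen for illustration.

```python
# Tier assignment follows the cost table: heuristic and statistical metrics are
# free, cosine_similarity calls Titan Embeddings, LLM judges call Claude.
COST_TIER = {
    "exact_match": "free",
    "fuzzy_match": "free",
    "contains": "free",
    "regex_match": "free",
    "bleu": "free",
    "rouge": "free",
    "levenshtein": "free",
    "cosine_similarity": "embeddings",
    "factuality": "llm",
    "relevance": "llm",
    "coherence": "llm",
    "completeness": "llm",
}


def split_by_cost(evaluators):
    """Group evaluator names into free, embeddings, and llm tiers."""
    tiers = {"free": [], "embeddings": [], "llm": []}
    for name in evaluators:
        if name not in COST_TIER:
            raise ValueError(f"unknown evaluator: {name}")
        tiers[COST_TIER[name]].append(name)
    return tiers
```

Running the free tier first gives quick feedback while iterating, and the Bedrock-backed tiers can be reserved for runs worth paying for.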
Project Structure
evalmeter/
├── evalmeter/              # Main package
│   ├── core/               # Core evaluation logic
│   │   ├── evaluators/     # All evaluator implementations
│   │   ├── data_loader.py
│   │   └── evaluator.py
│   ├── storage/            # Database and models
│   ├── api/                # FastAPI server
│   ├── utils/              # Configuration and utilities
│   └── cli.py              # CLI interface
├── ui/                     # React web interface
├── examples/               # Example data and scripts
├── docs/                   # Documentation
└── tests/                  # Test suite
Data Storage
EvalMeter uses SQLite for local storage:
- Location: ~/.evalmeter/evalmeter.db
- Tables: experiments, results, metrics
- Capacity: Millions of records
- No external dependencies
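Because the store is a plain SQLite file, it can be inspected with Python's built-in `sqlite3` module. The README does not document the column layout, so this sketch stays schema-agnostic: it discovers the tables and counts their rows. `table_row_counts` is a hypothetical helper, and the database is opened read-only so that inspecting it cannot modify experiment data.

```python
import sqlite3
from pathlib import Path

# Location stated in the list above.
DEFAULT_DB = Path.home() / ".evalmeter" / "evalmeter.db"


def table_row_counts(db_path=DEFAULT_DB):
    """Return {table_name: row_count} for every user table in the database."""
    # mode=ro opens the file read-only and fails if it does not exist.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for table, count in table_row_counts().items():
        print(f"{table}: {count} rows")
```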
Documentation
- Quick Start: This README
- Evaluators Guide: See docs/EVALUATORS.md
- Project Tracking: See docs/PROJECT_TRACKING.md
- Examples: See the examples/ directory
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
See CONTRIBUTING.md for detailed guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- AWS Bedrock - For providing Claude and Titan models
- Anthropic - For Claude Sonnet 4.5
- Amazon - For Titan Embeddings V2
- NLTK, Rouge, Levenshtein - For statistical metrics
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Star History
If you find EvalMeter useful, please consider giving it a star on GitHub!
Made with ❤️ for the AI community