Lightweight, modular library for evaluating LLM outputs with built-in metrics, custom extensions, and agentic workflows.


EvalBench

EvalBench is a plug-and-play Python package for evaluating outputs of large language models (LLMs) across a variety of metrics, from response quality and retrieval accuracy to hallucination detection and prompt alignment.

It now includes agentic workflows: just describe what you want to understand or improve about your LLM outputs, and EvalBench will plan and execute a tailored sequence of evaluation, interpretation, and recommendation steps — automatically!

🚀 Key Features:

  • 18+ built-in metrics covering coherence, relevance, hallucination, BLEU, ROUGE, MRR, and more
  • User-defined custom metrics with a simple decorator-based API (see the sketch after this list)
  • Modular architecture to group related metrics and share inputs
  • Agentic execution: EvalBench can reason about your goal and execute the necessary steps (evaluate → interpret → recommend)
  • Batch support, configurable output (print/save), and JSON-compatible results
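
For illustration, registering a custom metric might look like the sketch below. This is hypothetical: the `custom_metric` decorator name and import path are assumptions, not the confirmed EvalBench API (the notebook linked under Usage documents the real interface).

```python
# Hypothetical sketch of a decorator-registered custom metric.
# `evalbench.custom_metric` is an assumed name, not the confirmed API;
# see the usage notebook for the actual registration interface.
import evalbench

@evalbench.custom_metric(name="exclamation_penalty")
def exclamation_penalty(response: str) -> float:
    """Toy metric: penalize responses that overuse exclamation marks."""
    return max(0.0, 1.0 - 0.1 * response.count("!"))
```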

📊 Modules and Metric Categories:

  • response_quality: conciseness_score, coherence_score, factuality_score
  • reference_based: bleu_score, rouge_score, meteor_score, semantic_similarity_score, bert_score
  • contextual_generation: faithfulness_score, hallucination_score, groundedness_score
  • retrieval: recall_at_k_score, precision_at_k_score, ndcg_at_k_score, mrr_score
  • query_alignment: context_relevance_score
  • response_alignment: response_relevance_score, response_helpfulness_score
  • user-defined module: user-registered custom metrics
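
As a rough sketch of how a built-in metric from these modules might be called (the import path and keyword arguments are assumptions; the notebook shows the real call signatures):

```python
# Hypothetical sketch: scoring a response against a reference text.
# Function names mirror the metric names listed above, but the
# import path and signatures are assumed, not confirmed.
from evalbench import bleu_score, rouge_score

reference = "The Eiffel Tower is in Paris."
candidate = "The Eiffel Tower stands in Paris, France."

print(bleu_score(reference=reference, candidate=candidate))
print(rouge_score(reference=reference, candidate=candidate))
```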

🧠 Agentic Workflow:

EvalBench follows a three-step agentic pipeline, automatically triggered based on user instructions:

  1. Evaluation – Runs relevant metrics to score model outputs. EvalBench intelligently selects which metrics to use if not explicitly specified.
  2. Interpretation – Analyzes the evaluation results and highlights potential issues with model behavior.
  3. Recommendation – Suggests improvements to prompts, model setup, data inputs, or evaluation strategy.

Just write your request in plain language — EvalBench will take care of the rest.
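
To make this concrete, a plain-language request to the agentic pipeline might look like the hypothetical sketch below; `run_agent` and its parameters are assumed names, not the confirmed API.

```python
# Hypothetical sketch of the agentic workflow: state a goal in plain
# language and let EvalBench plan evaluate -> interpret -> recommend.
# `evalbench.run_agent` and its parameters are assumptions.
import evalbench

report = evalbench.run_agent(
    instruction=(
        "Check my chatbot's answers for hallucinations and "
        "suggest how to improve my prompt."
    ),
    responses=["The moon landing was in 1969."],  # model outputs to evaluate
    contexts=["Apollo 11 landed on the moon on July 20, 1969."],
)
print(report)  # evaluation scores, interpretation, and recommendations
```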


🚀 Usage

pip install evalbench
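
After installing, a batch evaluation run with JSON results saved to disk might look like the following hypothetical sketch (the `evaluate` entry point, `metrics=`, and `output=` are assumptions; the notebook below shows the real usage):

```python
# Hypothetical sketch: batch evaluation with JSON-compatible output.
# `evalbench.evaluate` and its parameters are assumed names,
# not the confirmed API; see the linked notebook for real examples.
import json
import evalbench

results = evalbench.evaluate(
    responses=["Answer one.", "Answer two."],     # batch of model outputs
    metrics=["coherence_score", "conciseness_score"],
    output="return",                              # vs. print/save per the feature list
)
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```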

All usage examples, including how to write your own custom metrics and how to use the agentic pipeline in practice, are available in this Jupyter notebook:

👉 View the Notebook


💡 Use Cases

EvalBench is ideal for:

  • Evaluating LLM apps like summarizers, chatbots, and search agents using built-in metrics
  • Integrating custom, domain-specific metrics into the EvalBench ecosystem
  • Getting automatic eval → interpret → recommend pipelines from natural language instructions
  • Rapidly iterating on model outputs, prompts, and evaluation strategies

🚧 Coming Soon

  • Dataset evaluation integration
  • Ecosystem integration: langchain/llama_index hooks
  • CLI support
