# EvalBench

Lightweight, modular Python library for evaluating LLM outputs with built-in metrics, custom extensions, and agentic workflows.
EvalBench is a plug-and-play Python package for evaluating the outputs of large language models (LLMs) across a variety of metrics, from response quality and retrieval accuracy to hallucination and prompt alignment.
It also includes agentic workflows: describe what you want to understand or improve about your LLM outputs, and EvalBench plans and executes a tailored sequence of evaluation, interpretation, and recommendation steps automatically.
🚀 Key Features:
- 18+ built-in metrics covering coherence, relevance, hallucination, BLEU, ROUGE, MRR, and more
- User-defined custom metrics with a simple decorator-based API
- Modular architecture to group related metrics and share inputs
- Agentic execution: EvalBench can reason about your goal and execute the necessary steps (evaluate → interpret → recommend)
- Batch support, configurable output (print/save), and JSON-compatible results
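The decorator-based registration mentioned above can be pictured with a minimal, self-contained sketch. Note that `register_metric`, `METRICS`, and `run_metrics` are illustrative names chosen for this example, not EvalBench's actual API:

```python
# Minimal sketch of a decorator-based metric registry.
# All names here are illustrative, not EvalBench's real interface.

METRICS = {}

def register_metric(name):
    """Register a scoring function under a metric name."""
    def decorator(fn):
        METRICS[name] = fn
        return fn
    return decorator

@register_metric("word_overlap_score")
def word_overlap_score(output: str, reference: str) -> float:
    """Fraction of unique reference words that appear in the output."""
    ref_words = set(reference.lower().split())
    out_words = set(output.lower().split())
    return len(ref_words & out_words) / len(ref_words) if ref_words else 0.0

def run_metrics(output: str, reference: str) -> dict:
    """Apply every registered metric and collect JSON-compatible results."""
    return {name: fn(output, reference) for name, fn in METRICS.items()}

print(run_metrics("The cat sat on the mat", "A cat sat on a mat"))
# → {'word_overlap_score': 0.8}
```

The registry pattern is what makes custom metrics composable: once a function is registered, batch runners and result serializers can treat it exactly like a built-in metric.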
📊 Modules and Metric Categories:
| Module | Metrics |
|---|---|
| response_quality | conciseness_score, coherence_score, factuality_score |
| reference_based | bleu_score, rouge_score, meteor_score, semantic_similarity_score, bert_score |
| contextual_generation | faithfulness_score, hallucination_score, groundedness_score |
| retrieval | recall_at_k_score, precision_at_k_score, ndcg_at_k_score, mrr_score |
| query_alignment | context_relevance_score |
| response_alignment | response_relevance_score, response_helpfulness_score |
| user-defined module | User-registered custom metrics |
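As a point of reference, the retrieval metrics in the table follow standard definitions that fit in a few lines. The sketch below implements the textbook formulas and is independent of EvalBench's own implementations:

```python
# Textbook retrieval metrics over a ranked list of document IDs.

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}
print(recall_at_k(ranked, gold, 2))  # 0.5: only d1 appears in the top 2
print(mrr(ranked, gold))             # 0.5: first relevant hit is at rank 2
```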
🧠 Agentic Workflow:
EvalBench follows a three-step agentic pipeline, automatically triggered based on user instructions:
- Evaluation – Runs relevant metrics to score model outputs. EvalBench intelligently selects which metrics to use if not explicitly specified.
- Interpretation – Analyzes the evaluation results and highlights potential issues with model behavior.
- Recommendation – Suggests improvements to prompts, model setup, data inputs, or evaluation strategy.
Just write your request in plain language; EvalBench takes care of the rest.
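The three steps can be pictured as a chain of stages, each consuming the previous step's output. This is an illustrative toy only; the function bodies, names, and the 0.5 threshold are assumptions for the sketch, and EvalBench's real planner selects metrics and phrasing dynamically:

```python
# Illustrative evaluate -> interpret -> recommend chain (not EvalBench's code).

def evaluate(output: str, reference: str) -> dict:
    """Stand-in for metric selection and scoring: a crude overlap score."""
    ref_words = set(reference.split())
    overlap = len(set(output.split()) & ref_words)
    return {"word_overlap": overlap / max(len(ref_words), 1)}

def interpret(scores: dict) -> list:
    """Flag metrics whose scores fall below a fixed threshold."""
    return [name for name, value in scores.items() if value < 0.5]

def recommend(issues: list) -> list:
    """Map flagged issues to improvement suggestions."""
    hints = {"word_overlap": "Ground the prompt with reference material."}
    return [hints[i] for i in issues if i in hints]

scores = evaluate("Paris is in Germany", "Paris is the capital of France")
issues = interpret(scores)
print(recommend(issues))
# → ['Ground the prompt with reference material.']
```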
🚀 Usage
`pip install evalbench`
All usage examples, including how to write your own custom metrics and how to use the agentic pipeline in practice, are available in the project's Jupyter notebook.
💡 Use Cases
EvalBench is ideal for:
- Evaluating LLM apps like summarizers, chatbots, and search agents using built-in metrics
- Integrating custom, domain-specific metrics into the EvalBench ecosystem
- Getting automatic eval → interpret → recommend pipelines from natural language instructions
- Rapidly iterating on model outputs, prompts, and evaluation strategies
🚧 Coming Soon
- Dataset evaluation integration
- Ecosystem integrations: langchain / llama_index hooks
- CLI support
File details
Details for the file evalbench-1.0.1.tar.gz.
File metadata
- Download URL: evalbench-1.0.1.tar.gz
- Size: 22.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 226633a3c645eaefc847e2e07e51a13395cfae752e93922f8170f720dc0689d6 |
| MD5 | 886e6fdaf5385d50b17d5282d2dd7b18 |
| BLAKE2b-256 | 7fa774e47807eff152af33612f92340b172bfa26f59baf5d36aa7f464150d689 |
File details
Details for the file evalbench-1.0.1-py3-none-any.whl.
File metadata
- Download URL: evalbench-1.0.1-py3-none-any.whl
- Size: 30.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3aaf07d9defa96b34d85813a60a7be995a9951814fa7f24742a525c342d7622e |
| MD5 | 8d0600bd4f0e8462abb30828cad325e6 |
| BLAKE2b-256 | 6df369ad44785038ec2b34813f79d45d9cbb6615bc9a643ff33b4871c6ada030 |