Lightweight, modular library for evaluating LLM outputs with built-in metrics, custom extensions, and agentic workflows.


EvalBench

EvalBench is a plug-and-play Python package for evaluating outputs of large language models (LLMs) across a variety of metrics, from response quality and retrieval accuracy to hallucination detection and prompt alignment.

It now includes agentic workflows: just describe what you want to understand or improve about your LLM outputs, and EvalBench will plan and execute a tailored sequence of evaluation, interpretation, and recommendation steps — automatically!

🚀 Key Features:

  • 18+ built-in metrics covering coherence, relevance, hallucination, BLEU, ROUGE, MRR, and more
  • User-defined custom metrics with a simple decorator-based API (see the sketch after this list)
  • Modular architecture to group related metrics and share inputs
  • Agentic execution: EvalBench can reason about your goal and execute the necessary steps (evaluate → interpret → recommend)
  • Batch support, configurable output (print/save), and JSON-compatible results
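
For illustration, registering a custom metric might look like the sketch below. This is hypothetical: the `custom_metric` decorator name and import path are assumptions, not the confirmed EvalBench API (the notebook linked under Usage documents the real interface).

```python
# Hypothetical sketch of a decorator-registered custom metric.
# `evalbench.custom_metric` is an assumed name, not the confirmed API;
# see the usage notebook for the actual registration interface.
import evalbench

@evalbench.custom_metric(name="exclamation_penalty")
def exclamation_penalty(response: str) -> float:
    """Toy metric: penalize responses that overuse exclamation marks."""
    return max(0.0, 1.0 - 0.1 * response.count("!"))
```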

📊 Modules and Metric Categories:

  • response_quality: conciseness_score, coherence_score, factuality_score
  • reference_based: bleu_score, rouge_score, meteor_score, semantic_similarity_score, bert_score
  • contextual_generation: faithfulness_score, hallucination_score, groundedness_score
  • retrieval: recall_at_k_score, precision_at_k_score, ndcg_at_k_score, mrr_score
  • query_alignment: context_relevance_score
  • response_alignment: response_relevance_score, response_helpfulness_score
  • user-defined module: user-registered custom metrics
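
As a rough sketch of how a built-in metric from these modules might be called (the import path and keyword arguments are assumptions; the notebook shows the real call signatures):

```python
# Hypothetical sketch: scoring a response against a reference text.
# Function names mirror the metric names listed above, but the
# import path and signatures are assumed, not confirmed.
from evalbench import bleu_score, rouge_score

reference = "The Eiffel Tower is in Paris."
candidate = "The Eiffel Tower stands in Paris, France."

print(bleu_score(reference=reference, candidate=candidate))
print(rouge_score(reference=reference, candidate=candidate))
```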

🧠 Agentic Workflow:

EvalBench follows a three-step agentic pipeline, automatically triggered based on user instructions:

  1. Evaluation – Runs relevant metrics to score model outputs. EvalBench intelligently selects which metrics to use if not explicitly specified.
  2. Interpretation – Analyzes the evaluation results and highlights potential issues with model behavior.
  3. Recommendation – Suggests improvements to prompts, model setup, data inputs, or evaluation strategy.

Just write your request in plain language — EvalBench will take care of the rest.
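
To make this concrete, a plain-language request to the agentic pipeline might look like the hypothetical sketch below; `run_agent` and its parameters are assumed names, not the confirmed API.

```python
# Hypothetical sketch of the agentic workflow: state a goal in plain
# language and let EvalBench plan evaluate -> interpret -> recommend.
# `evalbench.run_agent` and its parameters are assumptions.
import evalbench

report = evalbench.run_agent(
    instruction=(
        "Check my chatbot's answers for hallucinations and "
        "suggest how to improve my prompt."
    ),
    responses=["The moon landing was in 1969."],  # model outputs to evaluate
    contexts=["Apollo 11 landed on the moon on July 20, 1969."],
)
print(report)  # evaluation scores, interpretation, and recommendations
```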


🚀 Usage

pip install evalbench
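
After installing, a batch evaluation run with JSON results saved to disk might look like the following hypothetical sketch (the `evaluate` entry point, `metrics=`, and `output=` are assumptions; the notebook below shows the real usage):

```python
# Hypothetical sketch: batch evaluation with JSON-compatible output.
# `evalbench.evaluate` and its parameters are assumed names,
# not the confirmed API; see the linked notebook for real examples.
import json
import evalbench

results = evalbench.evaluate(
    responses=["Answer one.", "Answer two."],     # batch of model outputs
    metrics=["coherence_score", "conciseness_score"],
    output="return",                              # vs. print/save per the feature list
)
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```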

All usage examples, including how to write your own custom metrics and how to use the agentic pipeline in practice, are available in this Jupyter notebook:

👉 View the Notebook


💡 Use Cases

EvalBench is ideal for:

  • Evaluating LLM apps like summarizers, chatbots, and search agents using built-in metrics
  • Integrating custom, domain-specific metrics into the EvalBench ecosystem
  • Getting automatic eval → interpret → recommend pipelines from natural language instructions
  • Rapidly iterating on model outputs, prompts, and evaluation strategies

🚧 Coming Soon

  • Dataset evaluation integration
  • Ecosystem integration: langchain/llama_index hooks
  • CLI support
