
EvalBench 🧪📊

A lightweight, extensible, open-source evaluation framework for LLM applications, covering ~18 predefined metrics and supporting user-defined custom metrics.


🔍 About EvalBench

EvalBench is a plug-and-play Python package for evaluating outputs of large language models (LLMs) across a variety of metrics.

Supports:

  • Predefined metrics (faithfulness, coherence, BLEU, hallucination, etc.)
  • Easy registration of custom user-defined metrics
  • Evaluation across different modalities/use cases
  • Flexible input/output options (print/save) with support for list-based batching

Modules and Metric Categories:

| Module | Metrics | Required Arguments | Argument Types |
|---|---|---|---|
| response_quality | conciseness_score, coherence_score, factuality_score | response | List[str] |
| reference_based | bleu_score, rouge_score, meteor_score, semantic_similarity_score, bert_score | reference, generated | List[str], List[str] |
| contextual_generation | faithfulness_score, hallucination_score, groundedness_score | context, generated | List[List[str]], List[str] |
| retrieval | recall_at_k, precision_at_k, ndcg_at_k, mrr_score | relevant_docs, retrieved_docs, k | List[List[str]], List[List[str]], int |
| query_alignment | context_relevance_score | query, context | List[str], List[str] |
| response_alignment | answer_relevance_score, helpfulness_score | query, response | List[str], List[str] |
| user-defined module | User-registered custom metrics | Varies (user-defined) | Varies (user-defined) |
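For intuition on the retrieval module, here is a plain-Python sketch of the standard textbook definitions of recall@k, precision@k, and MRR for a single query. This is an illustration of what these metrics measure, not EvalBench's internal implementation (which operates on batches, i.e. List[List[str]]).

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of all relevant docs that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant) if relevant else 0.0

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in set(relevant))
    return hits / k

def mrr(relevant, retrieved):
    """Reciprocal rank of the first relevant doc (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in set(relevant):
            return 1.0 / rank
    return 0.0

relevant = ["d1", "d3"]
retrieved = ["d2", "d1", "d4", "d3"]
print(recall_at_k(relevant, retrieved, 2))     # 0.5 (d1 is in the top-2)
print(precision_at_k(relevant, retrieved, 2))  # 0.5
print(mrr(relevant, retrieved))                # 0.5 (first hit at rank 2)
```

EvalBench's retrieval functions take one relevant/retrieved list per query, which is why their argument types are List[List[str]].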

EvalBench is especially useful when you're:

  • Building small-scale LLM pipelines
  • Comparing different prompts or model outputs
  • Rapidly iterating on metrics

🚀 Usage

Installation

pip install evalbench

Initialize Configuration

import evalbench as eb

# Create and apply evaluation configuration
config = eb.EvalConfig(groq_api_key="", output_mode="print")
eb.set_config(config)

Usage Examples

1. Evaluate a single predefined metric

response = ["A binary search algorithm reduces the time complexity to O(log n)."]
context = [["Binary search works on sorted arrays and is faster than linear search."]]

eb.faithfulness_score(context=context, generated=response)

# Scores are returned as a list, so they can also be captured and printed
result = eb.faithfulness_score(context=context, generated=response)
print(result)  # [0.44]

2. Evaluate all metrics in a predefined module

eb.evaluate_module(
    module=["contextual_generation"],
    context=context,
    generated=response,
)

3. Register custom metrics

Create a custom metric in a file (e.g. custom.py):

from evalbench import register_metric, handle_output

@register_metric(name="len_metric", module="custom", required_args=["response", "reference"])
@handle_output()
def my_custom_metric(response, reference):
    # Inputs are batched lists, so return one score per (response, reference) pair
    return [len(r) / len(ref) for r, ref in zip(response, reference)]

4. Load custom metric file and evaluate

eb.load_custom_metrics("custom.py")

response = ["A binary search algorithm reduces the time complexity to O(log n).", "The Eiffel Tower is located in Berlin and was built in the 1800s."]
reference = ["Binary search works on sorted arrays and is faster than linear search.", "In Python, a generator yields items one at a time using the 'yield' keyword."]

# Evaluate custom metric directly
eb.my_custom_metric(response, reference)

# Evaluate all metrics in the 'custom' module
eb.evaluate_module(
    module=["custom"],
    response=response,
    reference=reference,
)

💡 Use Cases

EvalBench is built for fast feedback loops while developing:

  • LLM Applications: Chatbots, assistants, summarization tools
  • Prompt Engineering: Compare prompt variations using faithfulness, conciseness, or BLEU
  • Model Evaluation: Benchmark outputs from different model runs
  • Custom Evaluation Design: Rapidly prototype domain-specific metrics

🚧 Coming Soon

  • Basic visualizations: Histograms and score distributions to quickly interpret results
  • Batch mode via JSONL/Pandas: Evaluate and export results from structured files
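Until JSONL batching ships, a plain-Python stopgap is to collect per-record fields from a JSONL string into the column lists that EvalBench metrics expect, then run any metric over the whole batch. The `evaluate_jsonl` helper and `toy_length_ratio` metric below are hypothetical illustrations, not part of the EvalBench API.

```python
import json

def evaluate_jsonl(jsonl_text, metric_fn, fields):
    """Parse JSONL records, gather the named fields into column lists,
    and call metric_fn once over the whole batch."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    columns = {f: [r[f] for r in records] for f in fields}
    return metric_fn(**columns)

# Toy metric standing in for a real batched metric such as a length ratio
def toy_length_ratio(query, response):
    return [len(r) / len(q) for q, r in zip(query, response)]

data = "\n".join([
    json.dumps({"query": "ab", "response": "abcd"}),
    json.dumps({"query": "abcd", "response": "ab"}),
])
print(evaluate_jsonl(data, toy_length_ratio, ["query", "response"]))  # [2.0, 0.5]
```

Any metric that accepts keyword arguments matching the listed fields can be dropped in for `metric_fn`.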
