
EvalBench 🧪📊

A lightweight, extensible, open-source evaluation framework for LLM applications, covering ~18 predefined metrics and supporting user-defined custom metrics.


🔍 About EvalBench

EvalBench is a plug-and-play Python package for evaluating outputs of large language models (LLMs) across a variety of metrics.

Supports:

  • Predefined metrics (faithfulness, coherence, BLEU, hallucination, etc.)
  • Easy registration of custom user-defined metrics
  • Evaluation across different modalities/use cases
  • Flexible input/output options (print/save) with support for list-based batching

Modules and Metric Categories:

| Module | Metrics | Required Arguments | Argument Types |
|---|---|---|---|
| response_quality | conciseness_score, coherence_score, factuality_score | response | List[str] |
| reference_based | bleu_score, rouge_score, meteor_score, semantic_similarity_score, bert_score | reference, generated | List[str], List[str] |
| contextual_generation | faithfulness_score, hallucination_score, groundedness_score | context, generated | List[List[str]], List[str] |
| retrieval | recall_at_k, precision_at_k, ndcg_at_k, mrr_score | relevant_docs, retrieved_docs, k | List[List[str]], List[List[str]], int |
| query_alignment | context_relevance_score | query, context | List[str], List[str] |
| response_alignment | answer_relevance_score, helpfulness_score | query, response | List[str], List[str] |
| user-defined module | User-registered custom metrics | Varies (user-defined) | Varies (user-defined) |
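For intuition on the retrieval module, here is a plain-Python sketch of the standard textbook definitions of recall@k, precision@k, and MRR for a single query. This is an illustration of what these metrics measure, not EvalBench's internal implementation (which operates on batches, i.e. List[List[str]]).

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of all relevant docs that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant) if relevant else 0.0

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in set(relevant))
    return hits / k

def mrr(relevant, retrieved):
    """Reciprocal rank of the first relevant doc (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in set(relevant):
            return 1.0 / rank
    return 0.0

relevant = ["d1", "d3"]
retrieved = ["d2", "d1", "d4", "d3"]
print(recall_at_k(relevant, retrieved, 2))     # 0.5 (d1 is in the top-2)
print(precision_at_k(relevant, retrieved, 2))  # 0.5
print(mrr(relevant, retrieved))                # 0.5 (first hit at rank 2)
```

EvalBench's retrieval functions take one relevant/retrieved list per query, which is why their argument types are List[List[str]].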

EvalBench is especially useful when you're:

  • Building small-scale LLM pipelines
  • Comparing different prompts or model outputs
  • Rapidly iterating on metrics

🚀 Usage

Installation

pip install evalbench

Initialize Configuration

import evalbench as eb

# Create and apply evaluation configuration
config = eb.EvalConfig(groq_api_key="", output_mode="print")
eb.set_config(config)

Usage Examples

1. Evaluate a single predefined metric

response = ["A binary search algorithm reduces the time complexity to O(log n)."]
context = [["Binary search works on sorted arrays and is faster than linear search."]]

eb.faithfulness_score(context=context, generated=response)

# Scores are returned as a list, so they can also be captured and printed
result = eb.faithfulness_score(context=context, generated=response)
print(result)  # [0.44]

2. Evaluate all metrics in a predefined module

eb.evaluate_module(
    module=["contextual_generation"],
    context=context,
    generated=response,
)

3. Register custom metrics

Create a custom metric in a file (e.g. custom.py):

from evalbench import register_metric, handle_output

@register_metric(name="len_metric", module="custom", required_args=["response", "reference"])
@handle_output()
def my_custom_metric(response, reference):
    # Inputs are batched lists, so return one score per (response, reference) pair
    return [len(r) / len(ref) for r, ref in zip(response, reference)]

4. Load custom metric file and evaluate

eb.load_custom_metrics("custom.py")

response = ["A binary search algorithm reduces the time complexity to O(log n).", "The Eiffel Tower is located in Berlin and was built in the 1800s."]
reference = ["Binary search works on sorted arrays and is faster than linear search.", "In Python, a generator yields items one at a time using the 'yield' keyword."]

# Evaluate custom metric directly
eb.my_custom_metric(response, reference)

# Evaluate all metrics in the 'custom' module
eb.evaluate_module(
    module=["custom"],
    response=response,
    reference=reference,
)

💡 Use Cases

EvalBench is built for fast feedback loops while developing:

  • LLM Applications: Chatbots, assistants, summarization tools
  • Prompt Engineering: Compare prompt variations using faithfulness, conciseness, or BLEU
  • Model Evaluation: Benchmark outputs from different model runs
  • Custom Evaluation Design: Rapidly prototype domain-specific metrics

🚧 Coming Soon

  • Basic visualizations: Histograms and score distributions to quickly interpret results
  • Batch mode via JSONL/Pandas: Evaluate and export results from structured files
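Until JSONL batching ships, a plain-Python stopgap is to collect per-record fields from a JSONL string into the column lists that EvalBench metrics expect, then run any metric over the whole batch. The `evaluate_jsonl` helper and `toy_length_ratio` metric below are hypothetical illustrations, not part of the EvalBench API.

```python
import json

def evaluate_jsonl(jsonl_text, metric_fn, fields):
    """Parse JSONL records, gather the named fields into column lists,
    and call metric_fn once over the whole batch."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    columns = {f: [r[f] for r in records] for f in fields}
    return metric_fn(**columns)

# Toy metric standing in for a real batched metric such as a length ratio
def toy_length_ratio(query, response):
    return [len(r) / len(q) for q, r in zip(query, response)]

data = "\n".join([
    json.dumps({"query": "ab", "response": "abcd"}),
    json.dumps({"query": "abcd", "response": "ab"}),
])
print(evaluate_jsonl(data, toy_length_ratio, ["query", "response"]))  # [2.0, 0.5]
```

Any metric that accepts keyword arguments matching the listed fields can be dropped in for `metric_fn`.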
