A collection of evaluation metrics for LLM tasks
EvalBench 🧪📊
A lightweight, extensible, open-source evaluation framework for LLM applications, covering roughly 18 built-in metrics and supporting user-defined custom metrics
🔍 About EvalBench
EvalBench is a plug-and-play Python package for evaluating outputs of large language models (LLMs) across a variety of metrics.
Supports:
- Predefined metrics (faithfulness, coherence, BLEU, hallucination, etc.)
- Easy registration of custom user-defined metrics
- Evaluation across different modalities/use cases
- Flexible input/output options (print/save) with support for list-based batching
Modules and Metric Categories:
| Module | Metrics | Required Arguments | Argument Types |
|---|---|---|---|
| response_quality | conciseness_score, coherence_score, factuality_score | response | List[str] |
| reference_based | bleu_score, rouge_score, meteor_score, semantic_similarity_score, bert_score | reference, generated | List[str], List[str] |
| contextual_generation | faithfulness_score, hallucination_score, groundedness_score | context, generated | List[List[str]], List[str] |
| retrieval | recall_at_k, precision_at_k, ndcg_at_k, mrr_score | relevant_docs, retrieved_docs, k | List[List[str]], List[List[str]], int |
| query_alignment | context_relevance_score | query, context | List[str], List[str] |
| response_alignment | answer_relevance_score, helpfulness_score | query, response | List[str], List[str] |
| user-defined module | User-registered custom metrics | Varies (user-defined) | Varies (user-defined) |
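Each metric in the table is exposed as a top-level function that takes its required arguments as keyword arguments, mirroring the faithfulness_score call shown under Usage. A minimal sketch (after the configuration step below has been applied) of one reference-based and one retrieval metric; the texts and document IDs are illustrative placeholders, not values from the package:

import evalbench as eb

# reference_based: compare generated text against a reference text
eb.bleu_score(
    reference=["Binary search works on sorted arrays and is faster than linear search."],
    generated=["A binary search algorithm reduces the time complexity to O(log n)."],
)

# retrieval: score the retrieved documents against the relevant set at cutoff k
# (doc IDs are hypothetical placeholders)
eb.recall_at_k(
    relevant_docs=[["doc_3", "doc_7"]],
    retrieved_docs=[["doc_3", "doc_1", "doc_7"]],
    k=3,
)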
EvalBench is especially useful when you're:
- Building small-scale LLM pipelines
- Comparing different prompts or model outputs
- Rapidly iterating on metrics
🚀 Usage
Installation
pip install evalbench
Initialize Configuration
import evalbench as eb
# Create and apply evaluation configuration
config = eb.EvalConfig(groq_api_key="", output_mode="print")
eb.set_config(config)
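The package also advertises a save option for output (see the feature list above); a minimal sketch, assuming "save" is the literal value EvalConfig accepts for output_mode:

# Assumption: "save" writes results out instead of printing them;
# check the package docs for the exact accepted values and output location.
config = eb.EvalConfig(groq_api_key="", output_mode="save")
eb.set_config(config)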
Usage Examples
1. Evaluate a single predefined metric
response = ["A binary search algorithm reduces the time complexity to O(log n)."]
context = [["Binary search works on sorted arrays and is faster than linear search."]]
eb.faithfulness_score(context=context, generated=response)
# scores are also returned as a list, so they can be captured and printed
result = eb.faithfulness_score(context=context, generated=response)
print(result) # [0.44]
2. Evaluate all metrics in a predefined module
eb.evaluate_module(
    module=["contextual_generation"],
    context=context,
    generated=response,
)
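Because module takes a list, it appears several modules can be evaluated in one call when all of their required arguments are supplied; a hedged sketch (this combination is an assumption, not documented behavior):

# Assumption: listing several modules runs every metric in each,
# provided all required arguments (context, generated, response) are passed.
eb.evaluate_module(
    module=["contextual_generation", "response_quality"],
    context=context,
    generated=response,
    response=response,
)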
3. Register custom metrics
Create a custom metric in a file (e.g., custom.py):
from evalbench import register_metric, handle_output
@register_metric(name="len_metric", module="custom", required_args=["response", "reference"])
@handle_output()
def my_custom_metric(response, reference):
    # one length ratio per (response, reference) pair, matching list-based batching
    return [len(resp) / len(ref) for resp, ref in zip(response, reference)]
4. Load custom metric file and evaluate
eb.load_custom_metrics("custom.py")
response = ["A binary search algorithm reduces the time complexity to O(log n).", "The Eiffel Tower is located in Berlin and was built in the 1800s."]
reference = ["Binary search works on sorted arrays and is faster than linear search.", "In Python, a generator yields items one at a time using the 'yield' keyword."]
# Evaluate custom metric directly
eb.my_custom_metric(response, reference)
# Evaluate all metrics in the 'custom' module
eb.evaluate_module(
    module=["custom"],
    response=response,
    reference=reference,
)
💡 Use Cases
EvalBench is built for fast feedback loops while developing:
- LLM Applications: Chatbots, assistants, summarization tools
- Prompt Engineering: Compare prompt variations using faithfulness, conciseness, or BLEU (see the sketch after this list)
- Model Evaluation: Benchmark outputs from different model runs
- Custom Evaluation Design: Rapidly prototype domain-specific metrics
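As one sketch of the prompt-comparison loop, the same reference can be scored against the outputs of two prompt variants; the texts below are illustrative and the keyword names follow the module table above:

reference = ["Binary search works on sorted arrays and is faster than linear search."]
prompt_a_output = ["Binary search halves the search space on each step over a sorted array."]
prompt_b_output = ["Binary search is an algorithm."]

# higher scores indicate the better-performing prompt variant
eb.bleu_score(reference=reference, generated=prompt_a_output)
eb.bleu_score(reference=reference, generated=prompt_b_output)
eb.conciseness_score(response=prompt_a_output)
eb.conciseness_score(response=prompt_b_output)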
🚧 Coming Soon
- Basic visualizations: Histograms and score distributions to quickly interpret results
- Batch mode via JSONL/Pandas: Evaluate and export results from structured files
Download files
File details
Details for the file evalbench-0.1.4.tar.gz.
File metadata
- Download URL: evalbench-0.1.4.tar.gz
- Upload date:
- Size: 14.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7fdc3ec41dda4602b40d2c7af89b89e80053194ec11701ed4a5340de0dad765c |
| MD5 | dbbbd8274f298463c557cb32cbe6f457 |
| BLAKE2b-256 | d1be09aaf308bb1d85bdaf9f8e3cf7448b7c4c2f350c53fbabf8c18a5e451ac9 |
File details
Details for the file evalbench-0.1.4-py3-none-any.whl.
File metadata
- Download URL: evalbench-0.1.4-py3-none-any.whl
- Upload date:
- Size: 19.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d87990e310a1399b2b21ca8aee63940eda2eccd865cfdd451abad89407c6bd60 |
| MD5 | debe670d7b55d72a33fe3944be965937 |
| BLAKE2b-256 | acda80108f5de052c28be29e51c065f020370df2d89aedc4f640879c9bd7fac1 |