# Evaluator Benchmarker Pack
A pack for quickly computing benchmark results for your own LLM evaluator on an evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:

- `LabelledEvaluatorDataset` for single-grading evaluations
- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations

These llama-datasets can be downloaded from llama-hub.
## CLI Usage
You can download llamapacks directly using `llamaindex-cli`, which comes installed with the `llama-index` python package:

```bash
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
```
You can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!
## Code Usage
You can also download the pack to the `./evaluator_benchmarker_pack` directory through python code. The sample script below demonstrates how to construct `EvaluatorBenchmarkerPack` using a `LabelledPairwiseEvaluatorDataset` downloaded from llama-hub and a `PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note, though, that this pack can also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs single-grading evaluation; in that case, the usage flow remains the same.
```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a LabelledPairwiseEvaluatorDataset from llama-hub
pairwise_dataset = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)

# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# perform the evaluation
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```
Output:

```
number_examples                1689
inconclusives                   140
ties                            379
agreement_rate_with_ties        0.657844
agreement_rate_without_ties     0.828205
```
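The two agreement rates in this sample output are arithmetically consistent with the following reading, sketched below as an assumption rather than the pack's documented formula: both rates exclude the inconclusive examples from the denominator, and the without-ties rate additionally drops the tied examples:

```python
# Counts from the sample output above
number_examples = 1689
inconclusives = 140
ties = 379

conclusive = number_examples - inconclusives  # 1549 examples actually graded
non_tied = conclusive - ties                  # 1170 examples with a clear winner

# Agreement counts implied by the reported rates (rounded to whole examples)
agreements_with_ties = round(0.657844 * conclusive)    # ~1019
agreements_without_ties = round(0.828205 * non_tied)   # ~969

print(round(agreements_with_ties / conclusive, 6))     # 0.657844
print(round(agreements_without_ties / non_tied, 6))    # 0.828205
```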
Note that `evaluator_benchmarker_pack.run()` will also save `benchmark_df` as a CSV file in the same directory:

```
.
└── benchmark.csv
```
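Because `benchmark_df` is a pandas DataFrame, the saved `benchmark.csv` can be reloaded later, e.g. to compare runs across evaluators. A minimal, self-contained sketch: the frame below simply mirrors the sample output above (in practice you would read the file `run()` wrote, whose exact column layout may differ):

```python
import pandas as pd

# A small frame mirroring the sample benchmark output above
benchmark_df = pd.DataFrame(
    {
        "number_examples": [1689],
        "inconclusives": [140],
        "ties": [379],
        "agreement_rate_with_ties": [0.657844],
        "agreement_rate_without_ties": [0.828205],
    }
)

# Round-trip through CSV, analogous to the benchmark.csv written by run()
benchmark_df.to_csv("benchmark.csv", index=False)
reloaded = pd.read_csv("benchmark.csv")

print(reloaded["agreement_rate_without_ties"].iloc[0])  # 0.828205
```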