# Evaluator Benchmarker Pack
A pack for quickly computing benchmark results for your own LLM evaluator on an evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:

- `LabelledEvaluatorDataset` for single-grading evaluations
- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations

These llama-datasets can be downloaded from llama-hub.
## CLI Usage
You can download llamapacks directly using `llamaindex-cli`, which comes installed with the `llama-index` python package:

```bash
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
```
You can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!
## Code Usage
You can also download the pack to the `./evaluator_benchmarker_pack` directory through python code. The sample script below demonstrates how to construct `EvaluatorBenchmarkerPack` using a `LabelledPairwiseEvaluatorDataset` downloaded from llama-hub and a `PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note, though, that this pack can also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs single-grading evaluation; in that case, the usage flow remains the same.
```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a LabelledPairwiseEvaluatorDataset from llama-hub
pairwise_dataset = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)

# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# perform the evaluation
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```
Output:

```
number_examples                1689
inconclusives                   140
ties                            379
agreement_rate_with_ties        0.657844
agreement_rate_without_ties     0.828205
```
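The two agreement rates in this sample output are arithmetically consistent with the following reading, sketched below as an assumption rather than the pack's documented formula: both rates exclude the inconclusive examples from the denominator, and the without-ties rate additionally drops the tied examples:

```python
# Counts from the sample output above
number_examples = 1689
inconclusives = 140
ties = 379

conclusive = number_examples - inconclusives  # 1549 examples actually graded
non_tied = conclusive - ties                  # 1170 examples with a clear winner

# Agreement counts implied by the reported rates (rounded to whole examples)
agreements_with_ties = round(0.657844 * conclusive)    # ~1019
agreements_without_ties = round(0.828205 * non_tied)   # ~969

print(round(agreements_with_ties / conclusive, 6))     # 0.657844
print(round(agreements_without_ties / non_tied, 6))    # 0.828205
```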
Note that `evaluator_benchmarker_pack.run()` will also save `benchmark_df` as a CSV file in the same directory:

```
.
└── benchmark.csv
```
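Because `benchmark_df` is a pandas DataFrame, the saved `benchmark.csv` can be reloaded later, e.g. to compare runs across evaluators. A minimal, self-contained sketch: the frame below simply mirrors the sample output above (in practice you would read the file `run()` wrote, whose exact column layout may differ):

```python
import pandas as pd

# A small frame mirroring the sample benchmark output above
benchmark_df = pd.DataFrame(
    {
        "number_examples": [1689],
        "inconclusives": [140],
        "ties": [379],
        "agreement_rate_with_ties": [0.657844],
        "agreement_rate_without_ties": [0.828205],
    }
)

# Round-trip through CSV, analogous to the benchmark.csv written by run()
benchmark_df.to_csv("benchmark.csv", index=False)
reloaded = pd.read_csv("benchmark.csv")

print(reloaded["agreement_rate_without_ties"].iloc[0])  # 0.828205
```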