# Evaluator Benchmarker Pack
A pack for quick computation of benchmark results of your own LLM evaluator on an evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:

- `LabelledEvaluatorDataset` for single-grading evaluations
- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations

These llama-datasets can be downloaded from llama-hub.
## CLI Usage

You can download llamapacks directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```bash
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
```

You can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!
## Code Usage

You can also download the pack to the `./evaluator_benchmarker_pack` directory through Python code. The sample script below demonstrates how to construct an `EvaluatorBenchmarkerPack` using a `LabelledPairwiseEvaluatorDataset` downloaded from llama-hub and a `PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note, though, that this pack can also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs single-grading evaluation; in this case, the usage flow remains the same.
```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core import ServiceContext
from llama_index.llms.openai import OpenAI

# download a LabelledPairwiseEvaluatorDataset from llama-hub
# (download_llama_dataset returns the dataset and its source documents)
pairwise_dataset, _ = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)

# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# perform evaluation
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```
Output:

```text
number_examples                1689
inconclusives                   140
ties                            379
agreement_rate_with_ties       0.657844
agreement_rate_without_ties    0.828205
```
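The two agreement rates are consistent with the counts above under one natural reading (an assumption on our part, not stated in the pack's docs): `agreement_rate_with_ties` divides agreements (tie verdicts included) by the conclusive examples, while `agreement_rate_without_ties` drops tied examples from both numerator and denominator. A quick arithmetic check, where the two agreement counts are back-solved from the reported rates purely for illustration:

```python
# Hypothetical reconstruction of the two rates from the summary counts.
number_examples = 1689
inconclusives = 140
ties = 379

conclusive = number_examples - inconclusives   # examples with a usable verdict
agreements_with_ties = 1019                    # assumed, back-solved from 0.657844
agreements_without_ties = 969                  # assumed, back-solved from 0.828205

rate_with_ties = agreements_with_ties / conclusive
rate_without_ties = agreements_without_ties / (conclusive - ties)

print(round(rate_with_ties, 6))     # 0.657844
print(round(rate_without_ties, 6))  # 0.828205
```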
Note that `evaluator_benchmarker_pack.run()` will also save the `benchmark_df` to a CSV file in the same directory:

```text
.
└── benchmark.csv
```
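The async `arun()` variant mentioned above is awaited like any coroutine. A minimal sketch of the call pattern, using a stand-in class (running the real pack requires an OpenAI key; only the method name `arun` is taken from the pack's docs, the rest is illustrative):

```python
import asyncio

class StubBenchmarkerPack:
    """Stand-in for EvaluatorBenchmarkerPack to show the await pattern."""

    async def arun(self):
        # the real method evaluates examples concurrently and returns
        # a benchmark DataFrame; a plain dict stands in for it here
        return {"number_examples": 1689, "inconclusives": 140}

pack = StubBenchmarkerPack()
benchmark = asyncio.run(pack.arun())
print(benchmark["number_examples"])  # 1689
```

In a notebook (where an event loop is already running) you would `await pack.arun()` directly instead of calling `asyncio.run`.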