# llama-index-packs-evaluator-benchmarker

LlamaIndex packs `evaluator_benchmarker` integration.

## Evaluator Benchmarker Pack
A pack for quick computation of benchmark results for your own LLM evaluator on an evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:

- `LabelledEvaluatorDataset` for single-grading evaluations
- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations

These llama-datasets can be downloaded from llama-hub.
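For orientation, both dataset classes should be importable from `llama_index.core.llama_dataset` (an assumption based on the standard llama-index 0.10+ module layout; the script in the Code Usage section below downloads such a dataset directly):

```python
# the two supported dataset types, assuming the standard
# llama_index.core.llama_dataset exports
from llama_index.core.llama_dataset import (
    LabelledEvaluatorDataset,  # single-grading evaluations
    LabelledPairwiseEvaluatorDataset,  # pairwise-grading evaluations
)
```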
### CLI Usage
You can download llamapacks directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```bash
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
```
You can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!
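Alternatively, the pack is published on PyPI as `llama-index-packs-evaluator-benchmarker` and can be installed with pip; the import below assumes the standard llama-index packs namespace convention:

```python
# after `pip install llama-index-packs-evaluator-benchmarker`:
from llama_index.packs.evaluator_benchmarker import EvaluatorBenchmarkerPack
```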
### Code Usage
You can download the pack to the `./evaluator_benchmarker_pack` directory through Python code as well. The sample script below demonstrates how to construct `EvaluatorBenchmarkerPack` using a `LabelledPairwiseEvaluatorDataset` downloaded from llama-hub and a `PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note, though, that this pack can also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs single-grading evaluation; in that case, the usage flow remains the same.
```python
from llama_index.core import ServiceContext
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.llms.openai import OpenAI

# download a LabelledPairwiseEvaluatorDataset from llama-hub
# (download_llama_dataset returns a (dataset, documents) tuple)
pairwise_dataset, _ = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)

# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# perform the evaluation
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```
Output:

```text
number_examples                  1689
inconclusives                     140
ties                              379
agreement_rate_with_ties     0.657844
agreement_rate_without_ties  0.828205
```
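Reading the numbers above (an interpretation inferred from the arithmetic, not from documented pack behavior): inconclusive judgments appear to be excluded from both agreement rates, while tied examples are additionally excluded from the without-ties rate. A small sanity-check sketch:

```python
# Inferred relationship between the counts and the two agreement rates above.
# The agreement counts (1019 and 969) are back-computed from the reported
# rates; they are not produced by the pack itself.
number_examples = 1689
inconclusives = 140
ties = 379

conclusive = number_examples - inconclusives  # 1549 usable examples
print(1019 / conclusive)                      # ~0.657844 (with ties)
print(969 / (conclusive - ties))              # ~0.828205 (without ties)
```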
Note that `evaluator_benchmarker_pack.run()` will also save the `benchmark_df` to a CSV file (`benchmark.csv`) in the same directory:

```text
.
└── benchmark.csv
```
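Since the results are persisted to `benchmark.csv`, a later session can reload them without re-running the benchmark, and the `arun()` entry point noted in the script's comment can be driven with `asyncio`. A minimal sketch, assuming the objects from the script above are in scope:

```python
import asyncio

import pandas as pd

# async entry point, equivalent to evaluator_benchmarker_pack.run()
benchmark_df = asyncio.run(evaluator_benchmarker_pack.arun())

# reload previously saved results from disk
benchmark_df = pd.read_csv("./benchmark.csv")
```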
## Download files
### Source Distribution
Hashes for `llama_index_packs_evaluator_benchmarker-0.1.3.tar.gz`:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 18983d399be8ad4a3c75b72e3a44a858a40b218f0a7d4ba3574e65cb198b692a |
| MD5 | f696d3272da454df034701c11f4c8d31 |
| BLAKE2b-256 | a0246fa237307c8cfae0ad45b45b6fc6e205b579f7e49a591202ab54a96f2803 |
### Built Distribution

Hashes for `llama_index_packs_evaluator_benchmarker-0.1.3-py3-none-any.whl`:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 428dd83b2aa45b59a75234c0352845251186f5f8f05f90107450a8f66c7ea1c1 |
| MD5 | 4b255db619956b53d4a73c376193df12 |
| BLAKE2b-256 | d61a8dd97cbe330ec44b4bff2e6a1198489efc0bb9ceeb6ae44766c354c4308b |