llama-index packs evaluator_benchmarker integration
Evaluator Benchmarker Pack
A pack for quick computation of benchmark results of your own LLM evaluator on an Evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:

- LabelledEvaluatorDataset for single-grading evaluations
- LabelledPairwiseEvaluatorDataset for pairwise-grading evaluations

These llama-datasets can be downloaded from llama-hub.
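If you already have one of these datasets on disk, you can also load it directly; a minimal sketch (the json file name and path are assumptions about where your download landed):

from llama_index.core.llama_dataset import LabelledPairwiseEvaluatorDataset

# load a previously downloaded llama-dataset from disk (path is an assumption)
pairwise_dataset = LabelledPairwiseEvaluatorDataset.from_json(
    "./data/pairwise_evaluator_dataset.json"
)

# inspect the examples as a DataFrame
print(pairwise_dataset.to_pandas().head())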
CLI Usage
You can download llamapacks directly using llamaindex-cli, which comes installed with the llama-index python package:
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
You can then inspect the files at ./evaluator_benchmarker_pack and use them as a template for your own project!
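Since a downloaded pack is ordinary source code, you can also import the pack class straight from the downloaded copy; a minimal sketch (the base module layout is assumed from the standard llama-pack template):

# assumes you run python from the directory containing the download;
# the evaluator_benchmarker_pack/base.py layout is an assumption
from evaluator_benchmarker_pack.base import EvaluatorBenchmarkerPack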
Code Usage
You can download the pack to the ./evaluator_benchmarker_pack directory through python code as well. The sample script below demonstrates how to construct EvaluatorBenchmarkerPack using a LabelledPairwiseEvaluatorDataset downloaded from llama-hub and a PairwiseComparisonEvaluator that uses GPT-4 as the LLM. Note though that this pack can also be used on a LabelledEvaluatorDataset with a BaseEvaluator that performs single-grading evaluation; in this case the usage flow remains the same, as the sketch right below illustrates.
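A minimal sketch of the single-grading ingredients (the dataset name and the choice of CorrectnessEvaluator here are illustrative assumptions, not prescribed by this pack):

from llama_index.core import ServiceContext
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.llms.openai import OpenAI

# illustrative single-grading dataset name (an assumption); any
# LabelledEvaluatorDataset from llama-hub works the same way
single_grading_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./single_grading_data"
)

# CorrectnessEvaluator is one example of a BaseEvaluator that
# performs single-grading evaluation
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = CorrectnessEvaluator(service_context=gpt_4_context)

# from here, pass evaluator and single_grading_dataset to
# EvaluatorBenchmarkerPack exactly as in the pairwise script below

The full pairwise sample script: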
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a LabelledPairwiseEvaluatorDataset from llama-hub;
# download_llama_dataset returns the dataset along with its source documents
pairwise_dataset, _ = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator (GPT-4 as the judge LLM)
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)

# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# PERFORM EVALUATION
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
Output:
number_examples 1689
inconclusives 140
ties 379
agreement_rate_with_ties 0.657844
agreement_rate_without_ties 0.828205
Note that evaluator_benchmarker_pack.run() will also save the benchmark_df to a csv file in the same directory:

.
└── benchmark.csv
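Since the results are persisted as a plain csv, you can reload them later without re-running the benchmark; a minimal sketch, assuming the default ./benchmark.csv path shown above:

import pandas as pd

# reload the persisted benchmark results (path assumed from the tree above)
benchmark_df = pd.read_csv("./benchmark.csv", index_col=0)
print(benchmark_df)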