llama-index packs evaluator_benchmarker integration
Project description
Evaluator Benchmarker Pack
A pack for quick computation of benchmark results of your own LLM evaluator on an Evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:
- LabelledEvaluatorDataset for single-grading evaluations
- LabelledPairwiseEvaluatorDataset for pairwise-grading evaluations
These llama-datasets can be downloaded from llama-hub.
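If you are unsure which of the two you are working with, a quick check like the following can help. This is a minimal sketch: the import paths for the two dataset classes are assumed to be exposed from llama_index.core.llama_dataset, and the dataset name is the one used in the Code Usage example below.

from llama_index.core.llama_dataset import (
    LabelledEvaluatorDataset,
    LabelledPairwiseEvaluatorDataset,
    download_llama_dataset,
)

# download a dataset and inspect which evaluation style it targets
dataset, _ = download_llama_dataset("MiniMtBenchHumanJudgementDataset", "./data")
if isinstance(dataset, LabelledPairwiseEvaluatorDataset):
    print(f"pairwise-grading dataset with {len(dataset.examples)} examples")
elif isinstance(dataset, LabelledEvaluatorDataset):
    print(f"single-grading dataset with {len(dataset.examples)} examples")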
CLI Usage
You can download llamapacks directly using llamaindex-cli, which comes installed with the llama-index python package:
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
You can then inspect the files at ./evaluator_benchmarker_pack and use them as a template for your own project!
Code Usage
You can download the pack to the ./evaluator_benchmarker_pack directory through Python code as well. The sample script below demonstrates how to construct EvaluatorBenchmarkerPack using a LabelledPairwiseEvaluatorDataset downloaded from llama-hub and a PairwiseComparisonEvaluator that uses GPT-4 as the LLM. Note, though, that this pack can also be used on a LabelledEvaluatorDataset with a BaseEvaluator that performs single-grading evaluation; in that case the usage flow remains the same (a sketch of that variant appears after the example output below).
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
# download a LabelledPairwiseEvaluatorDataset from llama-hub
pairwise_dataset, _ = download_llama_dataset(
"MiniMtBenchHumanJudgementDataset", "./data"
)
# define your evaluator
gpt_4_llm = OpenAI(temperature=0, model="gpt-4")
evaluator = PairwiseComparisonEvaluator(llm=gpt_4_llm)
# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
"EvaluatorBenchmarkerPack", "./pack"
)
# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
evaluator=evaluator,
eval_dataset=pairwise_dataset,
show_progress=True,
)
# PERFORM EVALUATION
benchmark_df = evaluator_benchmarker_pack.run() # async arun() also supported
print(benchmark_df)
Output:
number_examples 1689
inconclusives 140
ties 379
agreement_rate_with_ties 0.657844
agreement_rate_without_ties 0.828205
Note that evaluator_benchmarker_pack.run() will also save the benchmark_df to a CSV file in the same directory:
.
└── benchmark.csv
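For the single-grading case mentioned earlier, the flow is the same; here is a minimal sketch. The MiniMtBenchSingleGradingDataset name and the choice of CorrectnessEvaluator are illustrative assumptions (any LabelledEvaluatorDataset paired with a single-grading BaseEvaluator works), and EvaluatorBenchmarkerPack is the pack class downloaded above.

from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# assumed llama-dataset name; any LabelledEvaluatorDataset works here
single_grading_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./single_grading_data"
)

# a single-grading evaluator in place of PairwiseComparisonEvaluator
evaluator = CorrectnessEvaluator(llm=OpenAI(temperature=0, model="gpt-4"))

# construct and run the pack exactly as in the pairwise example
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=single_grading_dataset,
    show_progress=True,
)
benchmark_df = evaluator_benchmarker_pack.run()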
File details
Details for the file llama_index_packs_evaluator_benchmarker-0.3.0.tar.gz.
File metadata
- Download URL: llama_index_packs_evaluator_benchmarker-0.3.0.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.10 Darwin/22.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | bdc246bc5bef3ce47ce6d7018535ab73f5472df1925749606de9189b6e6ddf04
MD5 | 9f24dea3b9fe801122379ada79991e5c
BLAKE2b-256 | 62058d14c4ba9ed6df9c75fb3d76c26bbdb27d11c621c61cfcf974605d7ab6e7
File details
Details for the file llama_index_packs_evaluator_benchmarker-0.3.0-py3-none-any.whl.
File metadata
- Download URL: llama_index_packs_evaluator_benchmarker-0.3.0-py3-none-any.whl
- Upload date:
- Size: 4.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.10 Darwin/22.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | c729fd578eb4ec36dc6ac3b21c835c20a29341d15fe3ffe89d41bffc70bfee10
MD5 | c6b8d5b52054bbcc72958a278b01ac21
BLAKE2b-256 | f4fdfe4f0199dbc028034cbffd930083627b2790e2527c5c5b7079952440cf61