
Evaluator Benchmarker Pack

A pack for quickly computing benchmark results for your own LLM evaluator on an evaluation llama-dataset. Specifically, this pack supports benchmarking an appropriate evaluator on the following llama-datasets:

  • LabelledEvaluatorDataset for single-grading evaluations
  • LabelledPairwiseEvaluatorDataset for pairwise-grading evaluations

These llama-datasets can be downloaded from llama-hub.
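
Both dataset classes live in llama_index.core.llama_dataset. As a minimal sketch (the local JSON paths below are hypothetical placeholders, and it assumes the from_json loader that llama-datasets expose), a previously saved copy of either dataset can also be loaded directly from disk instead of downloaded:

from llama_index.core.llama_dataset import (
    LabelledEvaluatorDataset,
    LabelledPairwiseEvaluatorDataset,
)

# load previously saved llama-datasets from their JSON files
single_grading_dataset = LabelledEvaluatorDataset.from_json(
    "./single_grading_dataset.json"  # hypothetical local path
)
pairwise_dataset = LabelledPairwiseEvaluatorDataset.from_json(
    "./pairwise_dataset.json"  # hypothetical local path
)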

CLI Usage

You can download llama-packs directly using llamaindex-cli, which is installed alongside the llama-index Python package:

llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack

You can then inspect the files at ./evaluator_benchmarker_pack and use them as a template for your own project!
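
Since the pack is also published to PyPI (see the file details below), it can alternatively be installed as a regular package instead of being downloaded as template code. A minimal sketch, assuming the usual llama_index.packs import convention for installed packs:

pip install llama-index-packs-evaluator-benchmarker

After installation, the pack class can be imported directly:

from llama_index.packs.evaluator_benchmarker import EvaluatorBenchmarkerPack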

Code Usage

You can download the pack to the ./evaluator_benchmarker_pack directory through Python code as well. The sample script below demonstrates how to construct EvaluatorBenchmarkerPack using a LabelledPairwiseEvaluatorDataset downloaded from llama-hub and a PairwiseComparisonEvaluator that uses GPT-4 as the LLM. Note that this pack can also be used on a LabelledEvaluatorDataset with a BaseEvaluator that performs single-grading evaluation; the usage flow remains the same (a single-grading sketch is included after the output below).

from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI

# download a LabelledPairwiseEvaluatorDataset from llama-hub
pairwise_dataset, _ = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
evaluator = PairwiseComparisonEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4")
)


# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# PERFORM EVALUATION
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)

Output:

number_examples                1689
inconclusives                  140
ties                           379
agreement_rate_with_ties       0.657844
agreement_rate_without_ties    0.828205
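
Since the async arun() is also supported, the benchmark can be driven from your own event loop. A minimal sketch with asyncio (assuming arun() mirrors run() and takes no required arguments):

import asyncio

# run the same benchmark asynchronously
benchmark_df = asyncio.run(evaluator_benchmarker_pack.arun())
print(benchmark_df)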

Note that evaluator_benchmarker_pack.run() also saves benchmark_df to disk as benchmark.csv in the current working directory:

.
└── benchmark.csv
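
As noted above, the same flow applies to single-grading evaluation on a LabelledEvaluatorDataset. A minimal sketch, using the MiniMtBenchSingleGradingDataset from llama-hub and the built-in CorrectnessEvaluator purely as illustrative choices; any single-grading llama-dataset and BaseEvaluator pair can be substituted:

from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.llms.openai import OpenAI

# download a LabelledEvaluatorDataset (single-grading) from llama-hub
single_grading_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./single_grading_data"
)

# any BaseEvaluator that produces a single grade works here
evaluator = CorrectnessEvaluator(llm=OpenAI(temperature=0, model="gpt-4"))

# same construction and run flow as the pairwise example above
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=single_grading_dataset,
    show_progress=True,
)
benchmark_df = evaluator_benchmarker_pack.run()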

Download files

Source distribution: llama_index_packs_evaluator_benchmarker-0.3.0.tar.gz
Built distribution: llama_index_packs_evaluator_benchmarker-0.3.0-py3-none-any.whl
File details

Hashes for llama_index_packs_evaluator_benchmarker-0.3.0.tar.gz:

Algorithm     Hash digest
SHA256        bdc246bc5bef3ce47ce6d7018535ab73f5472df1925749606de9189b6e6ddf04
MD5           9f24dea3b9fe801122379ada79991e5c
BLAKE2b-256   62058d14c4ba9ed6df9c75fb3d76c26bbdb27d11c621c61cfcf974605d7ab6e7

File details

Hashes for llama_index_packs_evaluator_benchmarker-0.3.0-py3-none-any.whl:

Algorithm     Hash digest
SHA256        c729fd578eb4ec36dc6ac3b21c835c20a29341d15fe3ffe89d41bffc70bfee10
MD5           c6b8d5b52054bbcc72958a278b01ac21
BLAKE2b-256   f4fdfe4f0199dbc028034cbffd930083627b2790e2527c5c5b7079952440cf61
