Bench

Bench is a tool for evaluating LLMs for production use cases. Whether you are comparing different LLMs, trying different prompts, or tuning generation hyperparameters such as temperature and maximum tokens, Bench provides a single touch point for all of your LLM performance evaluation.

If you have encountered a need for any of the following in your LLM work, then Bench can help with your evaluation:

  • to standardize the workflow of LLM evaluation with a common interface across tasks and use cases
  • to test whether open-source LLMs perform as well as the top closed-source LLM API providers on your specific data
  • to translate the rankings on LLM leaderboards and benchmarks into scores that you care about for your actual use case

Join the Bench community on Discord.

For bug fixes and feature requests, please file a GitHub issue.

Package installation

Install Bench into your Python environment with the optional dependencies for serving results locally (recommended): pip install 'arthur-bench[server]'

Alternatively, install Bench with minimal dependencies only: pip install arthur-bench
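
To confirm the install, you can print the installed version (a minimal check that uses only the Python standard library):

from importlib.metadata import version

# Look up the installed arthur-bench distribution and print its version.
print(version("arthur-bench"))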

For further setup instructions, visit our installation guide.

Using Bench

For a more in-depth walkthrough of using Bench, visit the quickstart walkthrough and the test suite creation guide in our docs.

To make sure you can run test suites in Bench, run the following snippet, which creates a test suite and runs it to score a set of candidate outputs:

from arthur_bench.run.testsuite import TestSuite

# Create a test suite that scores candidate outputs by exact match against reference answers
suite = TestSuite(
    "bench_quickstart",
    "exact_match",
    input_text_list=["What year was FDR elected?", "What is the opposite of down?"],
    reference_output_list=["1932", "up"]
)

# Run the suite on a set of candidate outputs
suite.run("quickstart_run", candidate_output_list=["1932", "up is the opposite of down"])

Saved test suites can be loaded later to benchmark performance over time, without needing to re-prepare reference data:

existing_suite = TestSuite("bench_quickstart", "exact_match")
existing_suite.run("quickstart_new_run", candidate_output_list=["1936", "up"])
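
Exact match is a strict scorer; for free-form outputs, a similarity-based scorer is often a better fit. A sketch assuming the bertscore scoring method is available in your installation (see the Bench docs for the current list of scorers and any extra dependencies they require):

semantic_suite = TestSuite(
    "bench_semantic",
    "bertscore",  # assumed scorer name; check the docs for available scoring methods
    input_text_list=["What is the opposite of down?"],
    reference_output_list=["up"]
)
semantic_suite.run("semantic_run", candidate_output_list=["The opposite of down is up."])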

To view the results for these runs in the local UI that ships with the Bench package, run bench from the command line (this requires the optional server dependencies to be installed):

bench

Viewing examples in the Bench UI looks something like this:

[Screenshot: Examples UI]

Running Bench from source

To launch Bench from source:

  1. Install the dependencies
    • pip install -e '.[server]'
  2. Build the front end
    • cd arthur_bench/server/js
    • npm i
    • npm run build
  3. Launch the server
    • bench

Because the package is installed in editable mode (pip install -e), local changes are picked up automatically; however, the server must be restarted for those changes to take effect.
