Validate models for production

Project description

Bench

Bench is a tool for evaluating LLMs for production use cases. Whether you are comparing different LLMs, experimenting with different prompts, or tuning generation hyperparameters like temperature and the number of tokens, Bench provides a single touch point for all of your LLM performance evaluation.

If you have encountered a need for any of the following in your LLM work, then Bench can help with your evaluation:

  • to standardize the workflow of LLM evaluation with a common interface across tasks and use cases
  • to test whether open source LLMs can do as well as the top closed-source LLM API providers on your specific data
  • to translate the rankings on LLM leaderboards and benchmarks into scores that you care about for your actual use case

Package installation

Install Bench into your Python environment with the optional dependencies for serving results locally (recommended):
pip install 'arthur-bench[server]'

Alternatively, install Bench into your Python environment with only the minimum dependencies:
pip install arthur-bench

For further setup instructions, visit our installation guide.

Using Bench

For a more in-depth walkthrough of using Bench, visit the quickstart walkthrough and the test suite creation guide in our docs.

To make sure you can run test suites in Bench, run the following code snippet, which creates a test suite and then runs it to score a set of candidate outputs:

from arthur_bench.run.testsuite import TestSuite

# Create a test suite that scores candidates by exact string match
# against the reference outputs.
suite = TestSuite(
    "bench_quickstart",
    "exact_match",
    input_text_list=["What year was FDR elected?", "What is the opposite of down?"],
    reference_output_list=["1932", "up"]
)

# Score a set of candidate outputs; the run is saved under the suite.
suite.run("quickstart_run", candidate_output_list=["1932", "up is the opposite of down"])

Saved test suites can be loaded later to benchmark performance over time, without needing to re-prepare reference data:

# Reload the saved suite by name and scoring method, then score a new run.
existing_suite = TestSuite("bench_quickstart", "exact_match")
existing_suite.run("quickstart_new_run", candidate_output_list=["1936", "up"])
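
Exact string match is the strictest of Bench's scorers. As a minimal sketch of a softer comparison, assuming a similarity-based scoring method named "bertscore" is available in your installed version (scorer names may vary across releases):

from arthur_bench.run.testsuite import TestSuite

# "bertscore" is an assumed scorer name here; swap in any similarity-based
# scoring method that your Bench version provides.
semantic_suite = TestSuite(
    "bench_semantic",
    "bertscore",
    input_text_list=["What year was FDR elected?"],
    reference_output_list=["FDR was elected president in 1932."]
)

# A paraphrased candidate can still score well under a similarity scorer,
# where exact_match would give it a zero.
semantic_suite.run("semantic_run", candidate_output_list=["He first won election in 1932."])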

To view the results for these runs in the local UI that ships with the bench package, run bench from the command line (this requires the optional server dependencies to be installed):

bench
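
This starts a local server and prints the address to open in your browser (typically http://localhost:8000, though the exact host and port may vary by version).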

Viewing examples in the Bench UI looks something like this:

[Screenshot: Examples UI]

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arthur_bench-0.2.1.tar.gz (5.1 MB)

Built Distribution

arthur_bench-0.2.1-py3-none-any.whl (5.1 MB)

File details

Details for the file arthur_bench-0.2.1.tar.gz.

File metadata

  • Download URL: arthur_bench-0.2.1.tar.gz
  • Upload date:
  • Size: 5.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for arthur_bench-0.2.1.tar.gz

  • SHA256: bb3e3e05c52d6bfcffc911dc7d9b3fe0f23a9a3c9297e81ee920fcdb35323929
  • MD5: 8d3b64e75f62df7eb284baa2e7dd3ea6
  • BLAKE2b-256: 71f400bfe78aa7313601030ef958b60c8469f5cda4cae2c9038263336b125f44

See more details on using hashes here.
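
If you want to verify a downloaded archive against the digests above yourself, here is a minimal check in Python, assuming the file has been downloaded to the current directory:

import hashlib

# Expected SHA256 for arthur_bench-0.2.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "bb3e3e05c52d6bfcffc911dc7d9b3fe0f23a9a3c9297e81ee920fcdb35323929"

# Hash the downloaded archive and compare against the published digest.
with open("arthur_bench-0.2.1.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

# A mismatch means the file is corrupted or was tampered with in transit.
assert actual == EXPECTED_SHA256, f"SHA256 mismatch: {actual}"
print("SHA256 verified")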

File details

Details for the file arthur_bench-0.2.1-py3-none-any.whl.

File hashes

Hashes for arthur_bench-0.2.1-py3-none-any.whl

  • SHA256: f8ddd8a5d516cc278bdb38b6091f8dfe7758eb30ac3e9b71a80b653cf28a3bad
  • MD5: 44d585e9e7c446ccd4704fb4fc8ff514
  • BLAKE2b-256: da5bd8b5946c4c475f740bad2eca7242796ce71cc6e9f65ba2949573a82d7546

See more details on using hashes here.
