A library containing LLM benchmarking tools
Flow Benchmark Tools
Create and run LLM benchmarks.
Installation
Just the library:
pip install flow-benchmark-tools==1.2.0
Library + Example benchmarks (see below):
pip install "flow-benchmark-tools[examples]:1.1.0"
Usage
- Create an agent by inheriting from `BenchmarkAgent` and implementing the `run_benchmark_case` method.
- Create a `Benchmark` by compiling a list of `BenchmarkCase`s. These can be read from a JSONL file.
- Associate agent and benchmark in a `BenchmarkRun`.
- Use a `BenchmarkRunner` to run your `BenchmarkRun`.
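For illustration, here is a minimal sketch of that pattern. The class names come from this README, but the import path, constructor arguments, and field names used below (`input=`, `output=`, `cases=`, `agent=`, `.run()`) are assumptions and may differ from the actual API:

```python
# Minimal sketch of the usage pattern. The class names come from this
# README; the import path, constructor arguments, and field names used
# below (input=, output=, cases=, agent=, .run()) are assumptions.
from flow_benchmark_tools import (
    Benchmark,
    BenchmarkAgent,
    BenchmarkCase,
    BenchmarkCaseResponse,
    BenchmarkRun,
    BenchmarkRunner,
)


class EchoAgent(BenchmarkAgent):
    def run_benchmark_case(self, case: BenchmarkCase) -> BenchmarkCaseResponse:
        # A real agent would call an LLM application here; this one
        # simply echoes the case input back as the response.
        return BenchmarkCaseResponse(output=case.input)


benchmark = Benchmark(cases=[BenchmarkCase(input="What is an LLM benchmark?")])
run = BenchmarkRun(agent=EchoAgent(), benchmark=benchmark)
BenchmarkRunner().run(run)
```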
Running example RAG benchmarks
Two end-to-end benchmark examples are provided in the examples folder: a LangChain RAG application and an OpenAI Assistant agent.
To run the LangChain RAG benchmark:
python src/examples/langchain_rag_agent.py
To run the OpenAI Assistant benchmark:
python src/examples/openai_assistant_agent.py
The RAG benchmark cases are defined in data/rag_benchmark.jsonl.
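To inspect the cases, each JSONL line can be parsed on its own. This stdlib-only snippet prints the first case without assuming anything about its schema:

```python
# Print the first RAG benchmark case without assuming its schema;
# each line of the JSONL file holds one case as a JSON object.
import json

with open("data/rag_benchmark.jsonl") as f:
    for line in f:
        if line.strip():
            print(json.loads(line))
            break
```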
The two examples follow the typical usage pattern of the library:
- define an agent by implementing the `BenchmarkAgent` interface and overriding the `run_benchmark_case` method (you can also override the `before` and `after` methods, if needed; see the sketch after this list),
- create a set of benchmark cases, typically as a JSONL file such as data/rag_benchmark.jsonl,
- use a BenchmarkRunner to run the benchmark.
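As referenced above, here is a hedged sketch of an agent that overrides the optional `before` and `after` hooks. Only the class and method names come from this README; the hook signatures and the case/response field names are assumptions:

```python
# Sketch of an agent that overrides the optional before/after hooks.
# Only the class and method names come from this README; the hook
# signatures and the case/response field names are assumptions.
from flow_benchmark_tools import BenchmarkAgent, BenchmarkCase, BenchmarkCaseResponse


class ToyRAGAgent(BenchmarkAgent):
    def before(self, case: BenchmarkCase) -> None:
        # Per-case setup, e.g. loading a document index. Here: a toy corpus.
        self.corpus = [
            "Flow Benchmark Tools creates and runs LLM benchmarks.",
            "RAG applications retrieve documents, then generate an answer.",
        ]

    def run_benchmark_case(self, case: BenchmarkCase) -> BenchmarkCaseResponse:
        # Naive "retrieval": pick the document sharing the most words
        # with the question. A real agent would query a vector store
        # and call an LLM to generate the final answer.
        words = set(case.input.lower().split())
        best = max(self.corpus, key=lambda d: len(words & set(d.lower().split())))
        return BenchmarkCaseResponse(output=best)

    def after(self, case: BenchmarkCase) -> None:
        # Per-case teardown: release resources acquired in before().
        self.corpus = None
```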
Running example criteria benchmark
A criteria benchmark example is also provided in the examples folder: an application that assesses the quality of pre-computed LLM outputs against the criteria defined in each benchmark case.
To run the Criteria benchmark:
python src/examples/criteria_evaluation_agent.py
The criteria benchmark cases are defined in data/criteria_benchmark.jsonl.
This example follows a different application of the library:
- define an agent implementing the `BenchmarkAgent` interface. In this application, each case already has the output we want to evaluate, so we override the `run_benchmark_case` method to simply repackage each `BenchmarkCase` as a `BenchmarkCaseResponse` (see the sketch after this list),
- create a set of quality benchmark cases, typically as a JSONL file such as data/criteria_benchmark.jsonl. In this application, each case's "extra" dictionary includes a "criteria" string,
- use a custom `CriteriaBenchmarkRunner` which overrides the `_execute_benchmark_case` method to run the benchmark using an evaluator that inherits from `CriteriaEvaluator`.
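A sketch of the pass-through agent described above. The class names come from this README; where the pre-computed output is stored on the case (shown here as `case.extra["output"]`) is an assumption about the schema:

```python
# Sketch of the pass-through agent used for criteria evaluation. The
# class names come from this README; where the pre-computed output is
# stored on the case (here: case.extra["output"]) is a schema assumption.
from flow_benchmark_tools import BenchmarkAgent, BenchmarkCase, BenchmarkCaseResponse


class PrecomputedOutputAgent(BenchmarkAgent):
    def run_benchmark_case(self, case: BenchmarkCase) -> BenchmarkCaseResponse:
        # The output to judge already exists; repackage it so that the
        # evaluator can score it against case.extra["criteria"].
        return BenchmarkCaseResponse(output=case.extra["output"])
```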
File details
Details for the file flow_benchmark_tools-1.2.0.tar.gz.
File metadata
- Download URL: flow_benchmark_tools-1.2.0.tar.gz
- Upload date:
- Size: 855.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d48fe427b7bf5b30db0aec3fa493fe9a9a7269f5bf97c838c59fdda0de70ad9c |
| MD5 | d498548852a9aba449930c7f7f4b5744 |
| BLAKE2b-256 | f24694b9a1c492c6c012b50828c84dc3ea36295dc9c2dae128791bbcb15752ad |
File details
Details for the file flow_benchmark_tools-1.2.0-py3-none-any.whl.
File metadata
- Download URL: flow_benchmark_tools-1.2.0-py3-none-any.whl
- Upload date:
- Size: 26.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8b7121fe9bd8dd9c9dc308bfa688fdec7066a5ab0bde0e6d5963ad278293f001 |
| MD5 | 1c5c62175f23efc28409270459527417 |
| BLAKE2b-256 | b4b82a95f759d5f4bfe49d1908009483b7a102fd68c70d6b645e7aaa5016a54e |