A tool for easy benchmarking.
BenchFlow
BenchFlow is an open-source benchmark hub and evaluation infrastructure for AI product developers and benchmark developers.
Overview
BenchFlow provides two interfaces: BaseAgent and BenchClient (shown inside the dashed box of the architecture diagram). As a benchmark user, you extend and implement the BaseAgent interface to interact with a benchmark. Its call_api method receives a step_input that carries the input for each step of a task (a task may consist of one or more steps).
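As a sketch, a call_api implementation typically turns the step input into a model prompt and returns the agent's response as a string. The class below is a toy stand-in (it does not subclass the real benchflow BaseAgent), and the input keys `instruction` and `observation` are hypothetical, shown only to illustrate the shape; consult your benchmark's documentation for the real schema:

```python
from typing import Any, Dict


class EchoAgent:
    """Toy stand-in for a BaseAgent subclass (the real base class lives in benchflow).

    call_api receives one step's input and must return the agent's
    response for that step as a string.
    """

    def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
        # Hypothetical keys -- the actual schema depends on the benchmark.
        instruction = task_step_inputs.get("instruction", "")
        observation = task_step_inputs.get("observation", "")
        # A real agent would send this prompt to an LLM and return the
        # model's answer; here we simply return the assembled prompt.
        prompt = f"{instruction}\n\nObservation:\n{observation}"
        return prompt
```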
Quick Start For Benchmark Users
- **Install BenchFlow**

  ```bash
  git clone https://github.com/benchflow-ai/benchflow.git
  cd benchflow
  pip install -e .
  ```
- **Browse Benchmarks**

  Find benchmarks tailored to your needs on our Benchmark Hub.
- **Implement Your Agent**

  Extend the BaseAgent interface:

  ```python
  def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
      pass
  ```

  Optional: you can include a `requirements.txt` file to install additional dependencies, such as `openai` and `requests`.
- **Test Your Agent**

  Here is a quick example that runs your agent:

  ```python
  import os

  from benchflow import load_benchmark
  from benchflow.agents.webarena_openai import WebarenaAgent

  # The benchmark name follows the schema: org_name/benchmark_name.
  # You can obtain the benchmark name from the Benchmark Hub.
  bench = load_benchmark(benchmark_name="benchflow/webarena", bf_token=os.getenv("BF_TOKEN"))

  your_agents = WebarenaAgent()

  run_ids = bench.run(
      task_ids=[0],
      agents=your_agents,
      api={
          "provider": "openai",
          "model": "gpt-4o-mini",
          "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
      },
      requirements_txt="webarena_requirements.txt",
      args={},
  )

  results = bench.get_results(run_ids)
  ```
Quick Start for Benchmark Developers
- **Install BenchFlow**

  Install BenchFlow via pip:

  ```bash
  pip install benchflow
  ```
- **Embed BenchClient into Your Benchmark Evaluation Scripts**

  Refer to this example for how MMLU-Pro integrates `BenchClient`.
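Conceptually, the client connects your evaluation script to the agent under test: the benchmark sends each step's input, receives the agent's response, and records the result. The sketch below mimics that loop with a hypothetical stand-in class; the real `BenchClient` API may differ, so follow the MMLU-Pro example above for the actual integration:

```python
from typing import Any, Callable, Dict, List


class FakeBenchClient:
    """Hypothetical stand-in illustrating the evaluation loop.

    The real benchflow.BenchClient API may differ; this only shows the
    request/response pattern a benchmark evaluation script follows.
    """

    def __init__(self, agent_call: Callable[[Dict[str, Any]], str]):
        self.agent_call = agent_call
        self.log: List[Dict[str, Any]] = []

    def run_task(self, steps: List[Dict[str, Any]]) -> List[str]:
        responses = []
        for step_input in steps:
            # Forward each step's input to the agent and record its response.
            response = self.agent_call(step_input)
            self.log.append({"input": step_input, "response": response})
            responses.append(response)
        return responses
```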
- **Containerize Your Benchmark and Upload the Image to Docker Hub**

  Ensure your benchmark can run in a single container without any additional steps. Below is an example Dockerfile for MMLU-Pro:

  ```dockerfile
  FROM python:3.11-slim

  COPY . /app
  WORKDIR /app
  COPY scripts/entrypoint.sh /app/entrypoint.sh
  RUN chmod +x /app/entrypoint.sh
  RUN pip install -r requirements.txt

  ENTRYPOINT ["/app/entrypoint.sh"]
  ```
- **Extend BaseBench to Run Your Benchmarks**

  See this example for how MMLU-Pro extends `BaseBench`.
- **Upload Your Benchmark to BenchFlow**

  Go to the Benchmark Hub and click `+ new benchmarks` to upload your benchmark Git repository. Make sure you place the `benchflow_interface.py` file at the root of your project.
License
This project is licensed under the MIT License.
File details
Details for the file benchflow-0.1.11.tar.gz.
File metadata
- Download URL: benchflow-0.1.11.tar.gz
- Upload date:
- Size: 324.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ba6dccf6e101e2e9b20209e9035827443dfe3a2aab28a576ffc846365a0bc01d` |
| MD5 | `38dbd851563fddd62606367c7115753d` |
| BLAKE2b-256 | `f569b61bd4f204994e369907e733ebe56284a2664a15442143c7e248c7fc5681` |
File details
Details for the file benchflow-0.1.11-py3-none-any.whl.
File metadata
- Download URL: benchflow-0.1.11-py3-none-any.whl
- Upload date:
- Size: 41.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bca5a7c0c8efc12250464fa5aed2130475747a2267802432cd67461345748079` |
| MD5 | `96b10e9296851e9e4316d73573f5a3ec` |
| BLAKE2b-256 | `3bc94ed9a31c07e2f12ca2dc17e0fe5633c12f743f0b220bb8e1c4ebb88c38d6` |