A tool for easy benchmarking.


BenchFlow


BenchFlow is an open-source benchmark hub and evaluation infrastructure for AI product developers and benchmark developers.

BenchFlow Diagram

Overview

BenchFlow Overview

Within the dashed box are the interfaces (BaseAgent, BenchClient) provided by BenchFlow. As a benchmark user, you extend and implement the BaseAgent interface to interact with a benchmark: the framework calls your call_api method with a step_input carrying the input for each step of a task (a task may consist of one or more steps), and your implementation returns the agent's response for that step.
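As a hypothetical illustration of that contract (the class name and input keys below are stand-ins, not the real BenchFlow API), an agent implements call_api and the harness invokes it once per step:

```python
from typing import Any, Dict


class EchoAgent:
    """Illustrative agent implementing the call_api contract described above."""

    def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
        # A real agent would query an LLM here; this one just echoes the input.
        return f"echo: {task_step_inputs.get('prompt', '')}"


agent = EchoAgent()
# A task may have one or more steps; call_api runs once per step.
outputs = [agent.call_api({"prompt": p}) for p in ("step 1", "step 2")]
```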

Quick Start For Benchmark Users

  1. Install BenchFlow

    git clone https://github.com/benchflow-ai/benchflow.git
    cd benchflow
    pip install -e .
    
  2. Browse Benchmarks

    Find benchmarks tailored to your needs on our Benchmark Hub.

  3. Implement Your Agent

    Extend the BaseAgent interface:

    class YourAgent(BaseAgent):
        def call_api(self, task_step_inputs: Dict[str, Any]) -> str:
            # Return your agent's response for this task step.
            ...
    

    Optional: You can include a requirements.txt file to install additional dependencies, such as openai and requests.
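    For instance, a minimal requirements.txt listing the dependencies mentioned above might look like:

    ```text
    openai
    requests
    ```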

  4. Test Your Agent

    Here is a quick example to run your agent:

    import os
    from benchflow import load_benchmark
    from benchflow.agents.webarena_openai import WebarenaAgent
    
    # The benchmark name follows the schema: org_name/benchmark_name.
    # You can obtain the benchmark name from the Benchmark Hub.
    bench = load_benchmark(benchmark_name="benchflow/webarena", bf_token=os.getenv("BF_TOKEN"))
    
    your_agent = WebarenaAgent()
    
    run_ids = bench.run(
        task_ids=[0],
        agents=your_agent,
        api={"provider": "openai", "model": "gpt-4o-mini", "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY")},
        requirements_txt="webarena_requirements.txt",
        args={}
    )
    
    results = bench.get_results(run_ids)
    

Quick Start for Benchmark Developers

  1. Install BenchFlow

    Install BenchFlow via pip:

    pip install benchflow
    
  2. Embed BenchClient into Your Benchmark Evaluation Scripts

    Refer to this example of how MMLU-Pro integrates BenchClient.
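    The pattern BenchClient mediates can be sketched as follows. All names below are illustrative stand-ins, not the real BenchClient API; see the MMLU-Pro example for the actual calls. The benchmark serves one input per task step, forwards it to the agent, and records the agent's response for scoring:

    ```python
    from typing import Any, Dict, List


    class StubBenchClient:
        """Stand-in for BenchClient: hands out step inputs, collects responses."""

        def __init__(self, steps: List[Dict[str, Any]]):
            self.steps = steps
            self.responses: List[str] = []

        def get_step_input(self, step: int) -> Dict[str, Any]:
            return self.steps[step]

        def record_response(self, response: str) -> None:
            self.responses.append(response)


    client = StubBenchClient([{"question": "2 + 2 = ?"}, {"question": "3 + 3 = ?"}])
    for i in range(len(client.steps)):
        step_input = client.get_step_input(i)
        # In a real benchmark, the agent's call_api produces this response.
        client.record_response(f"answered: {step_input['question']}")
    ```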

  3. Containerize Your Benchmark and Upload the Image to Dockerhub

    Ensure your benchmark can run in a single container without any additional steps. Below is an example Dockerfile for MMLU-Pro:

    FROM python:3.11-slim
    
    COPY . /app
    WORKDIR /app
    COPY scripts/entrypoint.sh /app/entrypoint.sh
    
    RUN chmod +x /app/entrypoint.sh
    RUN pip install -r requirements.txt
    
    ENTRYPOINT ["/app/entrypoint.sh"]
    
  4. Extend BaseBench to Run Your Benchmarks

    See this example of how MMLU-Pro extends BaseBench.

  5. Upload Your Benchmark into BenchFlow

    Go to the Benchmark Hub and click on +new benchmarks to upload your benchmark Git repository. Make sure you place the benchflow_interface.py file at the root of your project.
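    Putting the pieces from the steps above together, a benchmark repository might be laid out like this (file names other than benchflow_interface.py are illustrative):

    ```text
    your-benchmark/
    ├── benchflow_interface.py   # extends BaseBench (step 4)
    ├── Dockerfile               # single-container build (step 3)
    ├── requirements.txt
    └── scripts/
        └── entrypoint.sh        # container entrypoint
    ```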


License

This project is licensed under the MIT License.
