A tool for easy benchmarking.

BenchFlow: AI Benchmark Runtime

BenchFlow is an AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks. The latest version leverages a new BaseBench design to manage logs, results, and environment variable configurations consistently.


Table of Contents

  • Installation Requirements
  • Agent Development Guide
  • Benchmark Integration Guide
  • API Reference
  • License

Installation Requirements

  • Python 3.11+
  • Docker

Install the BenchFlow package using pip:

pip install benchflow

Agent Development Guide

Step 1: Define Your Agent

Create your Agent by extending BaseAgent. The Agent processes the environment data provided via self.env_info and generates a solution for the task.

from benchflow import BaseAgent

class YourAgent(BaseAgent):
    def __init__(self):
        super().__init__()
    
    def call_api(self) -> str:
        """
        IMPLEMENTATION CONTRACT:
        Process environment data and generate task solution.

        Access:
        - self.env_info: dict containing benchmark-specific data

        Returns:
            str: Unified diff patch or any prediction as a formatted string.
        """
        # Access task parameters
        instance_id = self.env_info['instance_id']
        # Process the data provided in `env_info` and return your prediction
        return (
            "diff --git a/src/rules/L031.py b/src/rules/L031.py\n"
            "--- a/src/rules/L031.py\n"
            "+++ b/src/rules/L031.py\n"
            "@@ -211,7 +211,7 @@ def _lint_aliases_in_join(\n"
            "    violation_buff.append(\n"
            "        LintResult(\n"
            "            anchor=alias_info.alias_identifier_ref,\n"
            " -          description=\"Original message\",\n"
            " +          description=\"Updated message\",\n"
            "            fixes=fixes,\n"
            "        )\n"
            "    )"
        )

Step 2: Test Your Agent

Test your Agent with the benchmark by loading the benchmark module and running the evaluation.

from benchflow import load_benchmark
from your_agent import YourAgent

# Initialize the benchmark (for example, "SWE-Bench")
bench = load_benchmark("SWE-Bench")

# Instantiate your agent
agent = YourAgent()

# Define execution parameters
config = {
    "task_ids": ["astropy__astropy-12907"],
    "agents": agent,
    "install_sh_dir": "setup.sh",
    "requirements_dir": "requirements.txt",
    "api": {"OPENAI_API_KEY": "your_api_key_here"}
}

# Run the evaluation
results = bench.run(**config)

Benchmark Integration Guide

The Benchmark Integration Guide comprises three steps:

Step 1: Implement BenchClient

Create a class extending BenchClient to transform the raw state into the agent's input and parse the agent's output.

from benchflow import BenchClient
from typing import Dict, Any

class YourClient(BenchClient):
    def prepare_environment(self, state_update: Dict) -> Dict:
        """Transform raw state into agent inputs."""
        return {
            "env_info": {
                "observation": state_update["trajectory"][-1],
                "intent": state_update.get("intent", "")
            }
        }
    
    def parse_action(self, raw_action: str) -> str:
        """Process the agent response."""
        parsed_action = raw_action  # Optionally add post-processing here
        return parsed_action
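
To make the BenchClient contract concrete, the same transformation can be sketched as plain functions, without the framework. The state_update shape used here is a hypothetical example, not a fixed schema:

```python
# Standalone sketch of the prepare_environment / parse_action contract.
# The state_update shape below is a hypothetical example, not a fixed schema.
from typing import Any, Dict

def prepare_environment(state_update: Dict[str, Any]) -> Dict[str, Any]:
    """Transform raw benchmark state into the agent's env_info."""
    return {
        "env_info": {
            "observation": state_update["trajectory"][-1],
            "intent": state_update.get("intent", ""),
        }
    }

def parse_action(raw_action: str) -> str:
    """Post-process the agent's raw response (here, just trim whitespace)."""
    return raw_action.strip()

state = {"trajectory": ["page loaded", "clicked login"], "intent": "log in"}
agent_input = prepare_environment(state)
action = parse_action("  click [42]  ")
```

Only the latest trajectory entry reaches the agent; anything else you want the agent to see must be packed into env_info by prepare_environment.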

Step 2: Package and Upload Your Benchmark Docker Image

Before integrating your benchmark, ensure that you have:

  • Packaged your benchmark logic into a Docker image.
  • Configured the image to read required environment variables (such as AGENT_URL, TEST_START_IDX, etc.).
  • Uploaded the Docker image to a public registry (e.g., DockerHub).

For example, tag your image as yourusername/benchmark-name:tag. No code snippet is required for this step.

Step 3: Integrate Your Benchmark

Integrate your benchmark by subclassing BaseBench. In the new implementation, you must implement the following abstract methods:

  • get_config(params: Dict[str, Any], task_id: str) -> BaseBenchConfig
    Returns a configuration instance (derived from BaseBenchConfig) to validate and prepare environment variables.

  • get_image_name() -> str
    Returns the Docker image name for running the benchmark.

  • get_results_dir_in_container() -> str
    Returns the directory inside the container where results will be stored.

  • get_log_files_dir_in_container() -> str
    Returns the directory inside the container where log files will be stored.

  • get_result(task_id: str) -> Dict[str, Any]
    Reads and parses the benchmark results (for example, from log files) and returns a dictionary containing:

    • task_id
    • is_resolved (a boolean indicating success)
    • score (a numerical score)
    • message (a dictionary with details or error messages)
    • log (log details as a string)
  • get_all_tasks(split: str) -> Dict[str, Any]
    Returns all available task IDs and an optional error message.

  • cleanup()
    Cleans up any temporary resources created during benchmark execution.

Below is an example integration using WebArenaBench:

# webarena_bench.py
import os
import subprocess
from typing import Any, Dict

from benchflow import BaseBench, BaseBenchConfig

# ------------------------------------------------------------------------------
# WebArenaConfig: Define the configuration for WebArenaBench.
# ------------------------------------------------------------------------------
class WebArenaConfig(BaseBenchConfig):
    # For this benchmark, we require the TEST_END_IDX variable.
    required_env = ["TEST_END_IDX"]
    optional_env = []
    defaults = {
        "RESULTS_DIR": "/app/results"
    }

# ------------------------------------------------------------------------------
# WebArenaBench Implementation
# ------------------------------------------------------------------------------
class WebArenaBench(BaseBench):
    def __init__(self):
        super().__init__()

    def get_config(self, params: Dict[str, Any], task_id: str) -> BaseBenchConfig:
        """
        Return a WebArenaConfig instance that validates the input parameters.
        Here, we set TEST_END_IDX so that each run processes only one task.
        """
        params["TEST_END_IDX"] = str(int(task_id) + 1)
        return WebArenaConfig(params)
    
    def get_image_name(self) -> str:
        """
        Return the Docker image name for running the WebArena benchmark.
        """
        return "kirk2000/benchflow:webarena-v1"
    
    def get_results_dir_in_container(self) -> str:
        """
        Return the directory inside the container where benchmark results will be stored.
        """
        return "/app/results"
    
    def get_log_files_dir_in_container(self) -> str:
        """
        Return the directory inside the container where log files will be stored.
        """
        return "/app/log_files"
    
    def get_result(self, task_id: str) -> Dict[str, Any]:
        """
        Read and parse the benchmark result from log files.
        This method expects a file named 'log_files.txt' in the results directory.
        It reads the content of each log file listed, aggregates the logs, and extracts
        the average score and pass status.
        """
        log_files_txt = os.path.join(self.results_dir, "log_files.txt")
        if not os.path.exists(log_files_txt):
            return {
                "task_id": task_id,
                "is_resolved": False,
                "score": 0,
                "message": {"error": "No results found"},
                "log": ""
            }
        
        log_content = ""
        try:
            with open(log_files_txt, 'r') as f:
                for line in f:
                    log_file_name = os.path.basename(line.strip())
                    # Assume log files are located in the log_files directory under the task_id folder.
                    full_log_path = os.path.join(self.log_files_dir, str(task_id), log_file_name)
                    with open(full_log_path, 'r') as log_file:
                        log_content += log_file.read() + "\n"
        except Exception as e:
            return {
                "task_id": task_id,
                "is_resolved": False,
                "score": 0,
                "message": {"error": f"Failed to read log files: {e}"},
                "log": log_content
            }
        
        # Parse the log content to extract score and resolution status.
        is_resolved = False
        score = 0.0
        for line in log_content.splitlines():
            if "Average score:" in line:
                try:
                    score = float(line.split(":")[-1].strip())
                except ValueError:
                    score = 0.0
            if "[Result]" in line and "(PASS)" in line:
                is_resolved = True
                    
        return {
            "task_id": task_id,
            "is_resolved": is_resolved,
            "score": score,
            "message": {"details": "Task runs successfully."},
            "log": log_content
        }
    
    def get_all_tasks(self, split: str) -> Dict[str, Any]:
        """
        Return a dictionary containing all task IDs and an optional error message.
        For the 'train' split, return 200 tasks; for other splits, return 812 tasks.
        """
        if split == "train":
            task_ids = [str(i) for i in range(200)]
        else:
            task_ids = [str(i) for i in range(812)]
        return {"task_ids": task_ids, "error_message": None}
    
    def cleanup(self):
        """
        Clean up benchmark resources by removing local results and log directories.
        """
        if os.path.exists(self.results_dir):
            self.logger.info(f"Removing {self.results_dir}")
            subprocess.run(['sudo', 'rm', '-rf', self.results_dir], check=True)
        if os.path.exists(self.log_files_dir):
            self.logger.info(f"Removing {self.log_files_dir}")
            subprocess.run(['sudo', 'rm', '-rf', self.log_files_dir], check=True)

API Reference

BaseBench Class

  • run_bench(task_id: str, agent_url: str, params: Dict[str, Any]) -> Dict[str, Any]
    Runs the benchmark inside a Docker container, captures logs, and returns the execution result. task_id is the task identifier, agent_url is the agent service endpoint, and params is a dictionary of runtime parameters.

  • format_result(...) -> Dict[str, Any]
    Formats the benchmark result to include task_id, is_resolved, score, message, and log. See the implementation for the full parameter list.

  • get_volumes() -> Dict[str, Dict[str, str]]
    Defines the Docker volume mappings for the results and log directories.

  • validate_result(result: Dict[str, Any]) -> bool
    Validates that the benchmark result contains all required fields.

  • Abstract methods
    Must be implemented in your subclass: get_config(), get_image_name(), get_results_dir_in_container(), get_log_files_dir_in_container(), get_result(), get_all_tasks(), and cleanup().
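
The field check performed by validate_result can be sketched as follows; the exact required-field list is an assumption based on the result schema described in the integration guide:

```python
# Plausible sketch of a validate_result-style check, assuming the required
# fields match the result schema described above (an assumption, not the
# actual benchflow implementation).
from typing import Any, Dict

REQUIRED_FIELDS = ("task_id", "is_resolved", "score", "message", "log")

def validate_result(result: Dict[str, Any]) -> bool:
    """Return True only if every required field is present in the result."""
    return all(field in result for field in REQUIRED_FIELDS)

ok = validate_result({
    "task_id": "0", "is_resolved": True, "score": 1.0,
    "message": {"details": "ok"}, "log": "",
})
bad = validate_result({"task_id": "0", "score": 1.0})
```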

BaseBenchConfig Class

Used to define and validate the environment variables required for benchmark execution. Extend this class to customize the configuration by overriding required_env, optional_env, and defaults.
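
A minimal sketch of what such a configuration class does, written without the framework; the validation behavior shown (required variables must be present, defaults fill in missing values) is an assumption based on the class description, not the actual benchflow code:

```python
# Standalone sketch of BaseBenchConfig-style validation: required variables
# must be present in params, and defaults fill in anything missing.
# The exact benchflow behavior may differ; this only illustrates the idea.
from typing import Any, Dict

class SketchConfig:
    required_env = ["TEST_END_IDX"]
    optional_env = ["TEST_START_IDX"]
    defaults = {"RESULTS_DIR": "/app/results"}

    def __init__(self, params: Dict[str, Any]):
        missing = [k for k in self.required_env if k not in params]
        if missing:
            raise ValueError(f"Missing required env vars: {missing}")
        # Defaults are applied first, then overridden by supplied params.
        self.env: Dict[str, Any] = {**self.defaults, **params}

config = SketchConfig({"TEST_END_IDX": "1"})
```

Keeping defaults and overrides in one merged dictionary makes it easy to pass the final environment straight into the Docker container.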


License

This project is licensed under the MIT License.


By following these steps, you can quickly implement and integrate your own AI benchmarks using the latest version of BaseBench. If you have any questions or suggestions, please feel free to submit an issue or pull request.
