BenchFlow: AI Benchmark Runtime
A tool for easy benchmarking.
BenchFlow is an AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks. The latest version leverages a new BaseBench design to manage logs, results, and environment variable configurations consistently.
Table of Contents
- Installation Requirements
- Agent Development Guide
- Benchmark Integration Guide
- API Reference
- License
Installation Requirements
- Python 3.11+
- Docker
Install the BenchFlow package using pip:
pip install benchflow
Agent Development Guide
Step 1: Define Your Agent
Create your Agent by extending BaseAgent. The Agent processes the environment data provided via self.env_info and generates a solution for the task.
from benchflow import BaseAgent

class YourAgent(BaseAgent):
    def __init__(self):
        super().__init__()

    def call_api(self) -> str:
        """
        IMPLEMENTATION CONTRACT:
        Process environment data and generate task solution.

        Access:
            - self.env_info: dict containing benchmark-specific data

        Returns:
            str: Unified diff patch or any prediction as a formatted string.
        """
        # Access task parameters
        instance_id = self.env_info['instance_id']
        # Process the data provided in `env_info` and return your prediction
        return (
            "diff --git a/src/rules/L031.py b/src/rules/L031.py\n"
            "--- a/src/rules/L031.py\n"
            "+++ b/src/rules/L031.py\n"
            "@@ -211,7 +211,7 @@ def _lint_aliases_in_join(\n"
            "     violation_buff.append(\n"
            "         LintResult(\n"
            "             anchor=alias_info.alias_identifier_ref,\n"
            "-            description=\"Original message\",\n"
            "+            description=\"Updated message\",\n"
            "             fixes=fixes,\n"
            "         )\n"
            "     )"
        )
Step 2: Test Your Agent
Test your Agent with the benchmark by loading the benchmark module and running the evaluation.
from benchflow import load_benchmark
from your_agent import YourAgent

# Initialize the benchmark (for example, "SWE-Bench")
bench = load_benchmark("SWE-Bench")

# Instantiate your agent
agent = YourAgent()

# Define execution parameters
config = {
    "task_ids": ["astropy__astropy-12907"],
    "agents": agent,
    "install_sh_dir": "setup.sh",
    "requirements_dir": "requirements.txt",
    "api": {"OPENAI_API_KEY": "your_api_key_here"}
}

# Run the evaluation
results = bench.run(**config)
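The structure of the returned results depends on the benchmark. As a minimal sketch, assuming run() yields one formatted result per task with the fields described in the API Reference below (task_id, is_resolved, score, message, and log), you could inspect them like this:

# Sketch only: the exact return structure of bench.run() is an assumption here.
for result in results:
    print(f"Task {result['task_id']}: resolved={result['is_resolved']}, score={result['score']}")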
Benchmark Integration Guide
Integrating your own benchmark involves three steps:
Step 1: Implement BenchClient
Create a class extending BenchClient to transform the raw state into the agent's input and parse the agent's output.
from benchflow import BenchClient
from typing import Dict, Any

class YourClient(BenchClient):
    def prepare_environment(self, state_update: Dict[str, Any]) -> Dict[str, Any]:
        """Transform raw state into agent inputs."""
        return {
            "env_info": {
                "observation": state_update["trajectory"][-1],
                "intent": state_update.get("intent", "")
            }
        }

    def parse_action(self, raw_action: str) -> str:
        """Process the agent response."""
        parsed_action = raw_action  # Optionally add post-processing here
        return parsed_action
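The benchmark side then uses this client to talk to the agent service. The snippet below is a hypothetical sketch: the constructor argument and the get_action() helper (which would call prepare_environment(), forward the result to the agent, and pass the reply through parse_action()) are assumptions, not confirmed BenchClient API.

# Hypothetical usage inside the benchmark loop; agent_url and get_action() are assumed names.
client = YourClient(agent_url="http://localhost:9000")

state_update = {"trajectory": ["<latest observation>"], "intent": "find the cheapest laptop"}
action = client.get_action(state_update)  # prepare_environment -> agent call -> parse_action
print(action)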
Step 2: Package and Upload Your Benchmark Docker Image
Before integrating your benchmark, ensure that you have:
- Packaged your benchmark logic into a Docker image.
- Configured the image to read the required environment variables (such as AGENT_URL, TEST_START_IDX, etc.).
- Uploaded the Docker image to a public registry (e.g., Docker Hub).
For example, tag your image as yourusername/benchmark-name:tag. No code snippet is required for this step.
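Although this step itself needs no code, the entrypoint baked into your image typically reads those variables at startup. Below is a minimal illustrative sketch; variable names other than AGENT_URL and TEST_START_IDX are assumptions.

# entrypoint.py -- illustrative only; adapt to your benchmark's actual entrypoint.
import os

agent_url = os.environ["AGENT_URL"]                                 # agent service endpoint
start_idx = int(os.environ.get("TEST_START_IDX", "0"))              # first task index to run
end_idx = int(os.environ.get("TEST_END_IDX", str(start_idx + 1)))   # assumed name, mirrors WebArenaConfig below
results_dir = os.environ.get("RESULTS_DIR", "/app/results")         # where results are written

print(f"Running tasks {start_idx}..{end_idx - 1} against agent at {agent_url}")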
Step 3: Integrate Your Benchmark
Integrate your benchmark by subclassing BaseBench. In the new implementation, you must implement the following abstract methods:
- get_config(params: Dict[str, Any], task_id: str) -> BaseBenchConfig
  Returns a configuration instance (derived from BaseBenchConfig) to validate and prepare environment variables.
- get_image_name() -> str
  Returns the Docker image name for running the benchmark.
- get_results_dir_in_container() -> str
  Returns the directory inside the container where results will be stored.
- get_log_files_dir_in_container() -> str
  Returns the directory inside the container where log files will be stored.
- get_result(task_id: str) -> Dict[str, Any]
  Reads and parses the benchmark results (for example, from log files) and returns a dictionary containing: task_id, is_resolved (a boolean indicating success), score (a numerical score), message (a dictionary with details or error messages), and log (log details as a string).
- get_all_tasks(split: str) -> Dict[str, Any]
  Returns all available task IDs and an optional error message.
- cleanup()
  Cleans up any temporary resources created during benchmark execution.
Below is an example integration using WebArenaBench:
# webarena_bench.py
import os
import subprocess
from typing import Any, Dict

from benchflow import BaseBench, BaseBenchConfig


# ------------------------------------------------------------------------------
# WebArenaConfig: Define the configuration for WebArenaBench.
# ------------------------------------------------------------------------------
class WebArenaConfig(BaseBenchConfig):
    # For this benchmark, we require the TEST_END_IDX variable.
    required_env = ["TEST_END_IDX"]
    optional_env = []
    defaults = {
        "RESULTS_DIR": "/app/results"
    }


# ------------------------------------------------------------------------------
# WebArenaBench Implementation
# ------------------------------------------------------------------------------
class WebArenaBench(BaseBench):
    def __init__(self):
        super().__init__()

    def get_config(self, params: Dict[str, Any], task_id: str) -> BaseBenchConfig:
        """
        Return a WebArenaConfig instance that validates the input parameters.
        Here, we set TEST_END_IDX so that each run processes only one task.
        """
        params["TEST_END_IDX"] = str(int(task_id) + 1)
        return WebArenaConfig(params)

    def get_image_name(self) -> str:
        """
        Return the Docker image name for running the WebArena benchmark.
        """
        return "kirk2000/benchflow:webarena-v1"

    def get_results_dir_in_container(self) -> str:
        """
        Return the directory inside the container where benchmark results will be stored.
        """
        return "/app/results"

    def get_log_files_dir_in_container(self) -> str:
        """
        Return the directory inside the container where log files will be stored.
        """
        return "/app/log_files"

    def get_result(self, task_id: str) -> Dict[str, Any]:
        """
        Read and parse the benchmark result from log files.
        This method expects a file named 'log_files.txt' in the results directory.
        It reads the content of each log file listed, aggregates the logs, and extracts
        the average score and pass status.
        """
        log_files_txt = os.path.join(self.results_dir, "log_files.txt")
        if not os.path.exists(log_files_txt):
            return {
                "is_resolved": False,
                "score": 0,
                "message": {"error": "No results found"},
                "log": ""
            }

        log_content = ""
        try:
            with open(log_files_txt, 'r') as f:
                for line in f:
                    log_file_name = os.path.basename(line.strip())
                    # Assume log files are located in the log_files directory under the task_id folder.
                    full_log_path = os.path.join(self.log_files_dir, str(task_id), log_file_name)
                    with open(full_log_path, 'r') as log_file:
                        log_content += log_file.read() + "\n"
        except Exception as e:
            return {
                "is_resolved": False,
                "score": 0,
                "message": {"error": f"Failed to read log files: {e}"},
                "log": log_content
            }

        # Parse the log content to extract score and resolution status.
        is_resolved = False
        score = 0.0
        for line in log_content.splitlines():
            if "Average score:" in line:
                try:
                    score = float(line.split(":")[-1].strip())
                except ValueError:
                    score = 0.0
            if "[Result]" in line and "(PASS)" in line:
                is_resolved = True

        return {
            "is_resolved": is_resolved,
            "score": score,
            "message": {"details": "Task runs successfully."},
            "log": log_content
        }

    def get_all_tasks(self, split: str) -> Dict[str, Any]:
        """
        Return a dictionary containing all task IDs and an optional error message.
        For the 'train' split, return 200 tasks; for other splits, return 812 tasks.
        """
        if split == "train":
            task_ids = [str(i) for i in range(200)]
        else:
            task_ids = [str(i) for i in range(812)]
        return {"task_ids": task_ids, "error_message": None}

    def cleanup(self):
        """
        Clean up benchmark resources by removing local results and log directories.
        """
        if os.path.exists(self.results_dir):
            self.logger.info(f"Removing {self.results_dir}")
            subprocess.run(['sudo', 'rm', '-rf', self.results_dir], check=True)
        if os.path.exists(self.log_files_dir):
            self.logger.info(f"Removing {self.log_files_dir}")
            subprocess.run(['sudo', 'rm', '-rf', self.log_files_dir], check=True)
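As a quick local sanity check, the subclass above can be exercised directly. The sketch below calls only the methods defined in this example and assumes, as shown above, that BaseBench's constructor needs no arguments.

# Minimal sketch exercising the example subclass defined above.
bench = WebArenaBench()

config = bench.get_config({}, task_id="0")         # sets TEST_END_IDX = "1"
print(bench.get_image_name())                      # kirk2000/benchflow:webarena-v1
print(bench.get_results_dir_in_container())        # /app/results

tasks = bench.get_all_tasks(split="train")
print(len(tasks["task_ids"]))                      # 200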
API Reference
BaseBench Class
| Method | Parameters | Returns | Description |
|---|---|---|---|
| run_bench(task_id: str, agent_url: str, params: Dict[str, Any]) | task_id: Task identifier; agent_url: Agent service endpoint; params: Runtime parameters dictionary | Dict[str, Any] | Runs the benchmark inside a Docker container, captures logs, and returns the execution result. |
| format_result(...) | See implementation | Dict[str, Any] | Formats the benchmark result to include task_id, is_resolved, score, message, and log. |
| get_volumes() | None | Dict[str, Dict[str, str]] | Defines Docker volume mappings for the results and log directories. |
| validate_result(result: Dict[str, Any]) | result: Result dictionary | bool | Validates that the benchmark result contains all required fields. |
| Abstract methods | See documentation | — | Must be implemented in your subclass: get_config(), get_image_name(), get_results_dir_in_container(), get_log_files_dir_in_container(), get_result(), get_all_tasks(), and cleanup(). |
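For orientation, a direct call to run_bench using only the signature in the table might look like the following sketch; the agent URL and parameter values are placeholders.

# Hypothetical invocation based on the run_bench signature above; values are placeholders.
bench = WebArenaBench()
result = bench.run_bench(
    task_id="0",
    agent_url="http://localhost:9000",
    params={"TEST_END_IDX": "1"},
)
print(result["is_resolved"], result["score"])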
BaseBenchConfig Class
Used to define and validate the environment variables required for benchmark execution. Extend this class to customize the configuration by overriding required_env, optional_env, and defaults.
License
This project is licensed under the MIT License.
By following these steps, you can quickly implement and integrate your own AI benchmarks using the latest version of BaseBench. If you have any questions or suggestions, please feel free to submit an issue or pull request.