Evaluating Realism of User-Proxy Agents
MirrorBench is an automatic, extensible framework for evaluating user-proxy agents for human-likeness. It provides a modular architecture to benchmark different user-proxy agents against a variety of realism metrics, and it is designed to be extensible: researchers and developers can bring their own agents and metrics into the framework.
⭐ Drop a star to help us grow!
Requirements and Setup
The project requires Python 3.12 or higher, and a virtual environment is recommended for managing dependencies. You can install the package directly from the repository with pip:

```shell
pip install git+https://github.com/SAP/mirrorbench.git
```
Alternatively, you can install it in editable (development) mode by cloning the repository and installing it locally:

```shell
git clone https://github.com/SAP/mirrorbench.git
cd mirrorbench
pip install -e ".[dev]"
```
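The install steps above can be isolated in a virtual environment first, for example:

```shell
# Create and activate a virtual environment (Python 3.12+ required)
python3 -m venv .venv
source .venv/bin/activate

# Verify which interpreter is active
python --version
```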
Quick Start
To get started with benchmarking your user-proxy agents, you can use either the Python API or the CLI.
To run a benchmark, you first define a job configuration in a YAML file. Below is an example of a simple job configuration:
```yaml
# Job run settings (seed, sync/async, concurrency, cache, observability etc.)
run:
  name: my_run
  # ...(trimmed for brevity)...

# Define user-proxies to benchmark
user_proxies:
  - name: proxy:langchain/claude-3.7-sonnet
    # ...(trimmed for brevity)...

# Define datasets to use for benchmarking
datasets:
  - name: dataset:jsonl/chatbot_arena_mirror
    # ...(trimmed for brevity)...

# Define metrics
metrics:
  - name: metric:judge/gteval
    # ...(trimmed for brevity)...

# Map datasets to task drivers
task_drivers:
  dataset:jsonl/chatbot_arena_mirror:
    driver: task:mirror/conversation
    # ...(trimmed for brevity)...
```
As shown above, a job configuration consists of several sections: `run`, `user_proxies`, `datasets`, `metrics`, and `task_drivers`. Each section lets you specify one class of components of your benchmark. You can find more examples of job configurations in the `configs` directory.
We provide a quick code snippet to run a benchmark using the above job configuration:

```python
from mirrorbench.core.config import load_job_config
from mirrorbench.core.runner import Runner

# Load the YAML job configuration and execute the benchmark
job_cfg = load_job_config("path/to/your/job_config.yaml")
runner = Runner(job_cfg)
result_summary = runner.run()
```
LLM Usage
To use LLMs or other external API services, you will most likely need API keys or authentication tokens. By default, the package reads environment variables added to a `.env` file in your working directory. Alternatively, you can set the environment variables directly in your process:

```python
import os

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
```
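For illustration, the `.env` lookup can be sketched with a minimal stdlib parser. This is only an approximation of how such files are read; the package's actual loader may handle more (quoting rules, interpolation, export statements):

```python
import os

def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file and export them.

    Minimal illustration only; MirrorBench's actual loader may differ.
    """
    values: dict[str, str] = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and malformed entries
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(values)
    return values
```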
The package has built-in support for LangChain-based LLM clients. If you would like to use other LLM clients, you can implement and register a custom LLM wrapper, as shown for LangChainChatClient.
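The wrap-and-register pattern can be illustrated generically. The names below (`ChatClient`, `register_client`, `CLIENT_REGISTRY`) are illustrative, not MirrorBench's actual API; consult the LangChainChatClient source for the real interface:

```python
from abc import ABC, abstractmethod

# Illustrative registry; MirrorBench's actual registration mechanism may differ.
CLIENT_REGISTRY: dict[str, type] = {}

def register_client(name: str):
    """Class decorator that records a client class under a lookup name."""
    def decorator(cls: type) -> type:
        CLIENT_REGISTRY[name] = cls
        return cls
    return decorator

class ChatClient(ABC):
    """Minimal chat-client interface: messages in, reply text out."""

    @abstractmethod
    def invoke(self, messages: list[dict[str, str]]) -> str: ...

@register_client("echo")
class EchoChatClient(ChatClient):
    """Toy client that echoes the last user message back."""

    def invoke(self, messages: list[dict[str, str]]) -> str:
        return messages[-1]["content"]

client = CLIENT_REGISTRY["echo"]()
print(client.invoke([{"role": "user", "content": "hello"}]))  # hello
```

The registry lets the framework instantiate clients by the name referenced in a job configuration, which is the same indirection that makes user-proxies and metrics pluggable.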
MirrorBench CLI
MirrorBench provides a command-line interface (CLI) for running benchmarks, managing runs and the cache, and validating job configurations. Below is an overview of the available commands; for detailed usage instructions, run `mirrorbench --help`.
mirrorbench plan
The `mirrorbench plan` command lets you inspect and validate your job configuration file before executing a benchmarking job. It generates a summary file, `plan.json`, listing the components defined in the job configuration, including user-proxies, datasets, metrics, and task drivers.

```shell
mirrorbench plan -c path/to/your/job_config.yaml
```
mirrorbench dryrun
The `mirrorbench dryrun` command performs a dry run with credential checks and dependency validation, without actually executing any benchmarking tasks. It generates a `manifest.json` file containing detailed parsed information (units and episodes) that would be executed in a real run.

```shell
mirrorbench dryrun -c path/to/your/job_config.yaml
```
mirrorbench run
This command executes or resumes a benchmarking job based on the provided job configuration file. It manages the execution of tasks, computes metrics, and aggregates results.
```shell
# Execute a job from scratch
mirrorbench run -c path/to/your/job_config.yaml

# Resume a previously interrupted job
mirrorbench run -c path/to/your/job_config.yaml --resume
```
mirrorbench report
The `mirrorbench report` command generates a comprehensive report of the benchmarking results from a completed run.
```shell
# Currently only JSON report generation is supported
mirrorbench report json <run-id> --output path/to/output/report.json
```
mirrorbench runs
The `mirrorbench runs` command offers subcommands to manage and inspect previous benchmarking runs. You can list all runs, view details of a specific run, or delete runs.
```shell
# List all previous runs
mirrorbench runs list

# Inspect the output of a specific episode of a run
mirrorbench runs inspect <run_id> --index <episode-index> --output episode.json

# Delete an existing run
mirrorbench runs delete <run_id> --force
```
mirrorbench cache
This command provides subcommands to check statistics of the cache or clear the cache.
```shell
# Show cache statistics
mirrorbench cache stats

# Clear the cache
mirrorbench cache purge
```
By default, the cache is retained for 24 hours unless specified otherwise in the job configuration.
Support, Feedback, Contributing
This project is open to feature requests, suggestions, and bug reports via GitHub issues. Contributions and feedback are encouraged and always welcome. For information on how to contribute, the project structure, and additional contribution details, see our Contribution Guidelines.
Security / Disclosure
If you find a bug that may be a security problem, please follow the instructions in our security policy on how to report it. Please do not create GitHub issues for security-related concerns.
Code of Conduct
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone. By participating in this project, you agree to abide by its Code of Conduct at all times.
Licensing
Copyright 2025 SAP SE or an SAP affiliate company and mirrorbench contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
Citation
If you like our work and find MirrorBench useful in your research, please consider citing the following paper:
```bibtex
@misc{hathidara2026mirrorbenchextensibleframeworkevaluate,
  title={MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness},
  author={Ashutosh Hathidara and Julien Yu and Vaishali Senthil and Sebastian Schreiber and Anil Babu Ankisettipalli},
  year={2026},
  eprint={2601.08118},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.08118},
}
```
Contributors
- Ashutosh Hathidara 🔬 💻 🎨 🤔 🚧
- sebastian-schreiber-sap 🤔 🧑‍🏫
- Vaishali Senthil 🤔
- aanilbabu 🤔 🧑‍🏫
- Yue (Julien) Yu 🤔