Evaluating Realism of User-Proxy Agents
MirrorBench is an automatic, extensible framework for evaluating user-proxy agents for human-likeness. It provides a modular architecture to benchmark different user-proxy agents against a variety of realism metrics, and it is designed to be extensible: researchers and developers can bring their own agents and metrics into the framework.
⭐ Drop a star to help us grow!
Requirements and Setup
The project requires Python 3.12 or higher, and a virtual environment is recommended for managing dependencies. You can install the package directly from the repository with pip:

```shell
pip install git+https://github.com/SAP/mirrorbench.git
```
Alternatively, you can install it in editable (development) mode by cloning the repository and installing it locally:

```shell
git clone https://github.com/SAP/mirrorbench.git
cd mirrorbench
pip install -e ".[dev]"
```
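The install steps above can be isolated in a virtual environment first, for example:

```shell
# Create and activate a virtual environment (Python 3.12+ required)
python3 -m venv .venv
source .venv/bin/activate

# Verify which interpreter is active
python --version
```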
Quick Start
To get started with benchmarking your user-proxy agents, you can use either the Python API or the CLI.
To run a benchmark, you first define a job configuration in a YAML file. Below is an example of a simple job configuration:
```yaml
# Job run settings (seed, sync/async, concurrency, cache, observability etc.)
run:
  name: my_run
  # ...(trimmed for brevity)...

# Define user-proxies to benchmark
user_proxies:
  - name: proxy:langchain/claude-3.7-sonnet
    # ...(trimmed for brevity)...

# Define datasets to use for benchmarking
datasets:
  - name: dataset:jsonl/chatbot_arena_mirror
    # ...(trimmed for brevity)...

# Define metrics
metrics:
  - name: metric:judge/gteval
    # ...(trimmed for brevity)...

# Map datasets to task drivers
task_drivers:
  dataset:jsonl/chatbot_arena_mirror:
    driver: task:mirror/conversation
    # ...(trimmed for brevity)...
```
As shown above, a job configuration consists of several sections: `run`, `user_proxies`, `datasets`, `metrics`, and `task_drivers`. Each section lets you specify one class of components of your benchmark. You can find more examples of job configurations in the `configs` directory.
We provide a quick code snippet to run a benchmark using the above job configuration:

```python
from mirrorbench.core.config import load_job_config
from mirrorbench.core.runner import Runner

# Load the YAML job configuration and execute the benchmark
job_cfg = load_job_config("path/to/your/job_config.yaml")
runner = Runner(job_cfg)
result_summary = runner.run()
```
LLM Usage
To use LLMs or other external API services, you will most likely need API keys or authentication tokens. By default, the package reads environment variables added to a `.env` file in your working directory. Alternatively, you can set the environment variables directly in your process:

```python
import os

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
```
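For illustration, the `.env` lookup can be sketched with a minimal stdlib parser. This is only an approximation of how such files are read; the package's actual loader may handle more (quoting rules, interpolation, export statements):

```python
import os

def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file and export them.

    Minimal illustration only; MirrorBench's actual loader may differ.
    """
    values: dict[str, str] = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and malformed entries
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(values)
    return values
```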
The package has built-in support for LangChain-based LLM clients. If you would like to use other LLM clients, you can implement and register a custom LLM wrapper, as shown for LangChainChatClient.
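The wrap-and-register pattern can be illustrated generically. The names below (`ChatClient`, `register_client`, `CLIENT_REGISTRY`) are illustrative, not MirrorBench's actual API; consult the LangChainChatClient source for the real interface:

```python
from abc import ABC, abstractmethod

# Illustrative registry; MirrorBench's actual registration mechanism may differ.
CLIENT_REGISTRY: dict[str, type] = {}

def register_client(name: str):
    """Class decorator that records a client class under a lookup name."""
    def decorator(cls: type) -> type:
        CLIENT_REGISTRY[name] = cls
        return cls
    return decorator

class ChatClient(ABC):
    """Minimal chat-client interface: messages in, reply text out."""

    @abstractmethod
    def invoke(self, messages: list[dict[str, str]]) -> str: ...

@register_client("echo")
class EchoChatClient(ChatClient):
    """Toy client that echoes the last user message back."""

    def invoke(self, messages: list[dict[str, str]]) -> str:
        return messages[-1]["content"]

client = CLIENT_REGISTRY["echo"]()
print(client.invoke([{"role": "user", "content": "hello"}]))  # hello
```

The registry lets the framework instantiate clients by the name referenced in a job configuration, which is the same indirection that makes user-proxies and metrics pluggable.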
MirrorBench CLI
MirrorBench provides a command-line interface (CLI) for running benchmarks, managing runs and the cache, and validating job configurations. Below is an overview of the available commands; for detailed usage instructions, run `mirrorbench --help`.
mirrorbench plan
The `mirrorbench plan` command lets you inspect and validate your job configuration file before executing a benchmarking job. It generates a summary file, `plan.json`, listing the components defined in the job configuration, including user-proxies, datasets, metrics, and task drivers.

```shell
mirrorbench plan -c path/to/your/job_config.yaml
```
mirrorbench dryrun
The `mirrorbench dryrun` command performs a dry run with credential checks and dependency validation, without actually executing any benchmarking tasks. It generates a `manifest.json` file containing detailed parsed information (units and episodes) that would be executed in a real run.

```shell
mirrorbench dryrun -c path/to/your/job_config.yaml
```
mirrorbench run
This command executes or resumes a benchmarking job based on the provided job configuration file. It manages the execution of tasks, computes metrics, and aggregates results.
```shell
# Execute a job from scratch
mirrorbench run -c path/to/your/job_config.yaml

# Resume a previously interrupted job
mirrorbench run -c path/to/your/job_config.yaml --resume
```
mirrorbench report
The `mirrorbench report` command generates a comprehensive report of the benchmarking results from a completed run.
```shell
# Currently only JSON report generation is supported
mirrorbench report json <run-id> --output path/to/output/report.json
```
mirrorbench runs
The `mirrorbench runs` command offers subcommands to manage and inspect previous benchmarking runs. You can list all runs, view details of a specific run, or delete runs.
```shell
# List all previous runs
mirrorbench runs list

# Inspect the output of a specific episode of a run
mirrorbench runs inspect <run_id> --index <episode-index> --output episode.json

# Delete an existing run
mirrorbench runs delete <run_id> --force
```
mirrorbench cache
This command provides subcommands to check statistics of the cache or clear the cache.
```shell
# Show cache statistics
mirrorbench cache stats

# Clear the cache
mirrorbench cache purge
```
By default, the cache is retained for 24 hours unless specified otherwise in the job configuration.
Support, Feedback, Contributing
This project is open to feature requests, suggestions, and bug reports via GitHub issues. Contributions and feedback are encouraged and always welcome. For information on how to contribute, the project structure, and additional contribution details, see our Contribution Guidelines.
Security / Disclosure
If you find a bug that may be a security problem, please follow the instructions in our security policy on how to report it. Please do not create GitHub issues for security-related concerns.
Code of Conduct
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone. By participating in this project, you agree to abide by its Code of Conduct at all times.
Licensing
Copyright 2025 SAP SE or an SAP affiliate company and mirrorbench contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
Citation
If you like our work and find MirrorBench useful in your research, please consider citing the following paper:
```bibtex
@misc{hathidara2026mirrorbenchextensibleframeworkevaluate,
  title={MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness},
  author={Ashutosh Hathidara and Julien Yu and Vaishali Senthil and Sebastian Schreiber and Anil Babu Ankisettipalli},
  year={2026},
  eprint={2601.08118},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.08118},
}
```
Contributors
- Ashutosh Hathidara 🔬 💻 🎨 🤔 🚧
- sebastian-schreiber-sap 🤔 🧑‍🏫
- Vaishali Senthil 🤔
- aanilbabu 🤔 🧑‍🏫
- Yue (Julien) Yu 🤔