ToolGym: An Open-world Tool-using Environment for LLM Agent Evaluation

ToolGym

An Open-world Tool-using Environment for Scalable Agent Testing

Overview

ToolGym is a large-scale, open-world benchmark for evaluating LLM agents' tool-using capabilities. Built on 5,571 real tools across 204 applications, ToolGym enables realistic testing with:

Long-horizon workflows: Multi-step tasks requiring complex tool coordination
Wild constraints: Natural language requirements that must be satisfied
Robustness testing: A State Controller that systematically perturbs tools, state, and constraints

Key Statistics

Metric	Value
Total Tools	5,571
Applications	204
Task Instances	3,091
Avg. Tools per Task	4.77
Avg. Steps per Task	7.46

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         ToolGym                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────┐ │
│  │ Task Creation   │    │ Tool Retrieval  │    │    State     │ │
│  │    Engine       │    │     Index       │    │  Controller  │ │
│  │                 │    │                 │    │              │ │
│  │ • Workflow      │    │ • BGE-M3        │    │ • Tool-level │ │
│  │   Synthesis     │    │ • FAISS         │    │ • State-level│ │
│  │ • Constraint    │    │ • 5,571 tools   │    │ • Constraint │ │
│  │   Generation    │    │                 │    │   -level     │ │
│  └─────────────────┘    └─────────────────┘    └──────────────┘ │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                    Planner-Actor Framework                       │
│  ┌─────────────────┐              ┌─────────────────────────┐   │
│  │     Planner     │ ──prompts──▶ │         Actor           │   │
│  │  (Decomposes    │              │  (Executes tools via    │   │
│  │   into subtasks)│ ◀─feedback── │   ReAct reasoning)      │   │
│  └─────────────────┘              └─────────────────────────┘   │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                       LLM-as-Judge                               │
│            Multi-model evaluation with majority voting           │
└─────────────────────────────────────────────────────────────────┘

Installation

# Clone the repository
git clone https://github.com/Ziqiao-git/ToolGym.git
cd ToolGym

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

Quick Start

Running an Agent

# Basic usage with semantic tool discovery
python runtime/run_react_agent.py "Search for latest AI news"

# With trajectory logging
python runtime/run_react_agent.py "Find GitHub repos about ML" --save-trajectory

# Custom model
python runtime/run_react_agent.py "Your query" \
  --model anthropic/claude-3.5-sonnet \
  --max-iterations 10
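
To run several queries from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of ToolGym: `build_agent_command` is a hypothetical helper, and it relies only on the script path and flags shown above (`--save-trajectory`, `--model`, `--max-iterations`).

```python
import sys


def build_agent_command(query, model=None, max_iterations=None, save_trajectory=False):
    """Hypothetical helper: build the argv list for runtime/run_react_agent.py.

    Only the flags that appear in this README are used.
    """
    cmd = [sys.executable, "runtime/run_react_agent.py", query]
    if save_trajectory:
        cmd.append("--save-trajectory")
    if model is not None:
        cmd += ["--model", model]
    if max_iterations is not None:
        cmd += ["--max-iterations", str(max_iterations)]
    return cmd


queries = ["Search for latest AI news", "Find GitHub repos about ML"]
for q in queries:
    cmd = build_agent_command(q, max_iterations=10, save_trajectory=True)
    # To actually launch the agent, run this from the repository root:
    #   subprocess.run(cmd, check=True)
    print(cmd)
```

Passing the command as a list avoids shell quoting problems when a query contains spaces or quotes.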

Core Components

1. Task Creation Engine

Synthesizes realistic, long-horizon tasks through:

Workflow synthesis: Chains tool calls into coherent task sequences
Constraint generation: Adds natural language requirements
Diversity sampling: Ensures coverage across tool categories

Location: task_creation_engine/
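
As an illustration of how diversity sampling and constraint attachment could fit together, here is a small sketch. It is an assumption about the general idea, not the code in `task_creation_engine/`: the function names, the toy tool catalog, and the round-robin strategy are all hypothetical.

```python
import random


def sample_diverse_workflow(tools_by_category, num_tools, rng):
    """Pick up to num_tools tools, cycling through categories so that no
    category is used twice before every non-empty category is used once."""
    categories = sorted(tools_by_category)
    rng.shuffle(categories)
    pools = {c: list(tools_by_category[c]) for c in categories}
    for pool in pools.values():
        rng.shuffle(pool)

    workflow = []
    while len(workflow) < num_tools and any(pools.values()):
        for c in categories:
            if pools[c] and len(workflow) < num_tools:
                workflow.append((c, pools[c].pop()))
    return workflow


def attach_constraints(workflow, constraints):
    """Bundle the tool sequence with natural language requirements."""
    return {"steps": [tool for _, tool in workflow], "constraints": list(constraints)}


# Toy catalog with invented tool names, for illustration only.
catalog = {
    "search": ["web_search", "news_search"],
    "code": ["repo_lookup"],
    "files": ["read_file", "write_file"],
}
workflow = sample_diverse_workflow(catalog, 3, random.Random(0))
task = attach_constraints(workflow, ["Only use sources from the last 7 days"])
print(task)
```

With three categories and three requested tools, each category contributes exactly one step; which tool is drawn from each category depends on the random seed.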

2. Tool Retrieval Index

Semantic search over 5,571 tools using:

Embeddings: BGE-M3 (multilingual, 1024 dimensions)
Index: FAISS for efficient similarity search
Dynamic loading: On-demand MCP server connections

Location: tool_retrieval_index/
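
The retrieval step amounts to a nearest-neighbor search over embedding vectors. The sketch below shows the idea with pure-Python cosine similarity over toy 3-dimensional vectors. It is not the project's implementation: in ToolGym the vectors would come from BGE-M3 (1024 dimensions) and the search would be handled by FAISS, and the tool names and vectors here are invented.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, tool_vecs, k=2):
    """Return the k tools whose embeddings have the highest cosine
    similarity to the query embedding, best match first."""
    q = normalize(query_vec)
    scored = []
    for name, vec in tool_vecs.items():
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), name))
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Invented toy embeddings; real ones would be produced by an embedding model.
TOOLS = {
    "web_search": [0.9, 0.1, 0.0],
    "github_repo_search": [0.1, 0.9, 0.1],
    "send_email": [0.0, 0.1, 0.9],
}
query = [0.2, 0.8, 0.0]  # stands in for the embedding of "Find GitHub repos about ML"

for name, score in top_k(query, TOOLS, k=2):
    print(f"{name}: {score:.3f}")
# github_repo_search ranks first, then web_search
```

On unit-length vectors, cosine similarity equals the inner product, which is why normalized embeddings are commonly paired with an inner-product index.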

3. State Controller

Systematic robustness testing with three control types:

Control Type	Strategies
Tool-level	Timeout, Rate limit, Unavailable, Schema change, Partial failure
State-level	Response delay, Data corruption, Truncation, Session timeout, Stale data
Constraint-level	Add constraint, Modify constraint, Tighten deadline, Resource limit

Location: toolgym/state_controller/
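
One way to picture a perturbation is as a wrapper around a tool call. The sketch below is a simplified illustration under that assumption, not the API of `toolgym/state_controller/`: `perturb`, `ToolUnavailable`, and the strategy strings are hypothetical names, and only three of the strategies from the table are modeled.

```python
class ToolUnavailable(Exception):
    """Raised by the sketch when a tool is marked unavailable."""


def perturb(tool_fn, strategy, max_chars=20):
    """Wrap tool_fn so that every call exhibits the chosen perturbation.

    Strategies modeled here:
      "unavailable" (tool-level)  : the call fails before reaching the tool
      "timeout"     (tool-level)  : the call raises TimeoutError immediately
                                    (no real waiting is simulated)
      "truncation"  (state-level) : the tool runs, but its string output is cut
    Any other value leaves the tool unchanged.
    """
    def wrapped(*args, **kwargs):
        if strategy == "unavailable":
            raise ToolUnavailable(f"{tool_fn.__name__} is unavailable")
        if strategy == "timeout":
            raise TimeoutError(f"{tool_fn.__name__} timed out")
        result = tool_fn(*args, **kwargs)
        if strategy == "truncation":
            return result[:max_chars]
        return result
    return wrapped


def fetch_headlines(topic):
    # Stand-in for a real tool call.
    return f"Top stories about {topic}: story A, story B, story C"


truncated = perturb(fetch_headlines, "truncation", max_chars=20)
print(truncated("AI"))

flaky = perturb(fetch_headlines, "timeout")
try:
    flaky("AI")
except TimeoutError as exc:
    print("agent observes:", exc)
```

Because the wrapper keeps the tool's call signature, an agent under test does not need to know whether it is talking to the original tool or a perturbed one.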

4. Planner-Actor Framework

Two-stage agent architecture:

Planner: Decomposes tasks into subtask sequences
Actor: Executes subtasks using ReAct reasoning with tool calls

Location: Orchestrator/mcpuniverse/agent/
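
The control flow between the two stages can be sketched as a simple loop. This is a toy illustration, not the code in `Orchestrator/mcpuniverse/agent/`: both stages are stubs (a real Planner and Actor would call an LLM, and the Actor would run a ReAct thought/action/observation loop), and the function names and tool names are hypothetical.

```python
def plan(task):
    """Stub planner: split the task on ' then ' into ordered subtasks."""
    return [s.strip() for s in task.split(" then ") if s.strip()]


def act(subtask, tools):
    """Stub actor: call the first tool whose name appears in the subtask."""
    for name, fn in tools.items():
        if name in subtask:
            return {"subtask": subtask, "tool": name, "observation": fn(subtask), "ok": True}
    return {"subtask": subtask, "tool": None, "observation": "no matching tool", "ok": False}


def run(task, tools):
    """Plan once, then execute subtasks in order, stopping at the first
    failure (a real planner might re-plan using the actor's feedback)."""
    trajectory = []
    for subtask in plan(task):
        step = act(subtask, tools)
        trajectory.append(step)
        if not step["ok"]:
            break
    return trajectory


# Invented tools for illustration.
tools = {
    "search": lambda s: "3 results",
    "summarize": lambda s: "summary of results",
}
for step in run("search AI news then summarize the results", tools):
    print(step["subtask"], "->", step["tool"], ":", step["observation"])
```

The returned trajectory is the kind of record that `--save-trajectory` would plausibly persist, although the actual trajectory format is not specified here.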

5. LLM-as-Judge Evaluation

Multi-dimensional evaluation with:

5 scoring dimensions: Task fulfillment, Grounding, Tool choice, Tool execution, Requirement satisfaction
Multi-model voting: Several LLM judges score each trajectory independently, which reduces single-judge bias
Majority voting: The final score is the judges' consensus

Location: Orchestrator/mcpuniverse/evaluator/
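
Majority voting across judges can be sketched as follows. This is an assumption about the aggregation step, not the evaluator's actual code: it treats each judge's verdict on a dimension as pass/fail (the real judges may emit numeric scores), and the dimension keys and function name are hypothetical.

```python
DIMENSIONS = [
    "task_fulfillment",
    "grounding",
    "tool_choice",
    "tool_execution",
    "requirement_satisfaction",
]


def majority_vote(judgements):
    """judgements: one dict per judge, mapping each dimension to True/False.

    A dimension passes only if strictly more than half of the judges pass it,
    so ties count as a fail.
    """
    verdict = {}
    for dim in DIMENSIONS:
        votes = [j[dim] for j in judgements]
        verdict[dim] = sum(votes) * 2 > len(votes)
    return verdict


all_pass = {dim: True for dim in DIMENSIONS}
judge_a = dict(all_pass)
judge_b = dict(all_pass, grounding=False)
judge_c = dict(all_pass, grounding=False, tool_choice=False)

final = majority_vote([judge_a, judge_b, judge_c])
print(final)
# grounding fails (1 of 3 pass); tool_choice passes (2 of 3 pass)
```

Using an odd number of judges avoids ties under a pass/fail scheme.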

Project Structure

ToolGym/
├── README.md                    # This file
├── docs/                        # GitHub Pages website
│   └── index.html              # Leaderboard & documentation
│
├── task_creation_engine/        # Task synthesis
│   └── query_generate.py       # Workflow generation
│
├── tool_retrieval_index/        # Semantic tool search
│   └── server.py               # MCP server with search
│
├── toolgym/                     # Core library
│   └── state_controller/       # Robustness testing
│
├── Orchestrator/                # Agent framework
│   └── mcpuniverse/
│       ├── agent/              # Planner-Actor implementation
│       └── evaluator/          # LLM-as-Judge
│
├── MCP_INFO_MGR/                # Tool data management
│   ├── mcp_data/               # Tool metadata
│   └── semantic_search/        # FAISS index
│
├── runtime/                     # Agent runtime
│   └── run_react_agent.py      # CLI interface
│
└── evaluation/                  # Evaluation scripts

Dataset

The ToolGym dataset is available on HuggingFace:

🤗 ToolGym: https://huggingface.co/ToolGym

Contents:

3,091 task instances with ground-truth tool sequences
Tool metadata for 5,571 tools across 204 applications
Constraint annotations and perturbation configurations

Citation

@inproceedings{toolgym2025,
  title={ToolGym: An Open-world Tool-using Environment for LLM Agent Evaluation},
  author={...},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}

License

This project is licensed under the MIT License; see the LICENSE file for details.

Acknowledgments

Built on the Model Context Protocol (MCP) ecosystem
Tool data sourced from Smithery and other MCP registries
Evaluation framework inspired by recent LLM-as-Judge research

Website: https://ziqiao-git.github.io/ToolGym/
Dataset: https://huggingface.co/ToolGym
GitHub: https://github.com/Ziqiao-git/ToolGym

Package: iflow-mcp_ziqiao-git-toolgym, version 0.1.0 (released Feb 26, 2026)
