ToolGym: An Open-world Tool-using Environment for LLM Agent Evaluation

ToolGym

An Open-world Tool-using Environment for Scalable Agent Testing

Overview

ToolGym is a large-scale, open-world benchmark for evaluating LLM agents' tool-using capabilities. Built on 5,571 real tools across 204 applications, ToolGym enables realistic testing with:

  • Long-horizon workflows: Multi-step tasks requiring complex tool coordination
  • Wild constraints: Natural language requirements that must be satisfied
  • Robustness testing: State Controller for systematic perturbation testing

Key Statistics

Metric                 Value
Total Tools            5,571
Applications           204
Task Instances         3,091
Avg. Tools per Task    4.77
Avg. Steps per Task    7.46

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         ToolGym                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────┐ │
│  │ Task Creation   │    │ Tool Retrieval  │    │    State     │ │
│  │    Engine       │    │     Index       │    │  Controller  │ │
│  │                 │    │                 │    │              │ │
│  │ • Workflow      │    │ • BGE-M3        │    │ • Tool-level │ │
│  │   Synthesis     │    │ • FAISS         │    │ • State-level│ │
│  │ • Constraint    │    │ • 5,571 tools   │    │ • Constraint │ │
│  │   Generation    │    │                 │    │   -level     │ │
│  └─────────────────┘    └─────────────────┘    └──────────────┘ │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                    Planner-Actor Framework                       │
│  ┌─────────────────┐              ┌─────────────────────────┐   │
│  │     Planner     │ ──prompts──▶ │         Actor           │   │
│  │  (Decomposes    │              │  (Executes tools via    │   │
│  │   into subtasks)│ ◀─feedback── │   ReAct reasoning)      │   │
│  └─────────────────┘              └─────────────────────────┘   │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                       LLM-as-Judge                               │
│            Multi-model evaluation with majority voting           │
└─────────────────────────────────────────────────────────────────┘

Installation

# Clone the repository
git clone https://github.com/Ziqiao-git/ToolGym.git
cd ToolGym

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

Quick Start

Running an Agent

# Basic usage with semantic tool discovery
python runtime/run_react_agent.py "Search for latest AI news"

# With trajectory logging
python runtime/run_react_agent.py "Find GitHub repos about ML" --save-trajectory

# Custom model
python runtime/run_react_agent.py "Your query" \
  --model anthropic/claude-3.5-sonnet \
  --max-iterations 10
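
For batch runs, the same command line can be assembled from Python. The following is a minimal sketch that assumes it is run from the repository root. `build_agent_command` is a hypothetical helper, not part of ToolGym, and it uses only the flags shown above.

```python
import shlex
import sys
from typing import List, Optional


def build_agent_command(
    query: str,
    model: Optional[str] = None,
    max_iterations: Optional[int] = None,
    save_trajectory: bool = False,
) -> List[str]:
    """Build the argument list for runtime/run_react_agent.py (hypothetical helper)."""
    cmd = [sys.executable, "runtime/run_react_agent.py", query]
    if save_trajectory:
        cmd.append("--save-trajectory")
    if model is not None:
        cmd += ["--model", model]
    if max_iterations is not None:
        cmd += ["--max-iterations", str(max_iterations)]
    return cmd


if __name__ == "__main__":
    cmd = build_agent_command(
        "Find GitHub repos about ML",
        model="anthropic/claude-3.5-sonnet",
        max_iterations=10,
        save_trajectory=True,
    )
    # Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems when a query contains spaces or quotes.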

Core Components

1. Task Creation Engine

Synthesizes realistic, long-horizon tasks through:

  • Workflow synthesis: Chains tool calls into coherent task sequences
  • Constraint generation: Adds natural language requirements
  • Diversity sampling: Ensures coverage across tool categories

Location: task_creation_engine/
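
One common way to get coverage across tool categories is round-robin sampling over the categories. The sketch below illustrates that idea only. It is an assumption, not the code in task_creation_engine/, and `diverse_sample` and the example tool names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def diverse_sample(tools: List[Tuple[str, str]], k: int, seed: int = 0) -> List[str]:
    """Pick up to k tool names by cycling over categories so no category dominates.

    `tools` is a list of (tool_name, category) pairs.
    """
    rng = random.Random(seed)
    by_category: Dict[str, List[str]] = defaultdict(list)
    for name, category in tools:
        by_category[category].append(name)
    for names in by_category.values():
        rng.shuffle(names)  # randomise which tool is drawn within a category

    picked: List[str] = []
    categories = sorted(by_category)
    # Take one tool per category per pass until k tools are picked or none remain.
    while len(picked) < k and any(by_category[c] for c in categories):
        for category in categories:
            if by_category[category] and len(picked) < k:
                picked.append(by_category[category].pop())
    return picked


# Hypothetical catalogue for illustration only.
CATALOGUE = [
    ("web_search", "search"),
    ("news_search", "search"),
    ("repo_search", "search"),
    ("forecast", "weather"),
    ("run_python", "code"),
]

if __name__ == "__main__":
    print(diverse_sample(CATALOGUE, k=3))
```

With `k=3` and three categories, each category contributes exactly one tool, even though "search" holds most of the catalogue.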

2. Tool Retrieval Index

Semantic search over 5,571 tools using:

  • Embeddings: BGE-M3 (multilingual, 1024 dimensions)
  • Index: FAISS for efficient similarity search
  • Dynamic loading: On-demand MCP server connections

Location: tool_retrieval_index/
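
The retrieval step amounts to embedding the query, embedding each tool description, and returning the nearest neighbours. The sketch below shows that flow with the standard library only. The bag-of-words `embed` function is a toy stand-in for BGE-M3, the linear scan is a stand-in for a FAISS index, and the tool names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for a dense encoder: an L2-normalised bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """For unit vectors, the dot product equals the cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def search(query: str, tools: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k tools whose descriptions are most similar to the query."""
    q = embed(query)
    scored = [(name, cosine(q, embed(desc))) for name, desc in tools.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical tool descriptions, not taken from the ToolGym index.
TOOLS = {
    "github_search_repos": "search github repositories by keyword",
    "news_headlines": "fetch latest news headlines for a topic",
    "weather_forecast": "get the weather forecast for a city",
}

if __name__ == "__main__":
    for name, score in search("latest AI news", TOOLS):
        print(f"{name}: {score:.3f}")
```

A dense encoder would also match paraphrases that share no words with the description, which this toy embedding cannot do. An index such as FAISS replaces the linear scan when the catalogue is large.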

3. State Controller

Systematic robustness testing with three control types:

Control Type        Strategies
Tool-level          Timeout, Rate limit, Unavailable, Schema change, Partial failure
State-level         Response delay, Data corruption, Truncation, Session timeout, Stale data
Constraint-level    Add constraint, Modify constraint, Tighten deadline, Resource limit

Location: toolgym/state_controller/
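
A perturbation of this kind can be pictured as a wrapper around a tool call that injects a failure on chosen calls. The following is a minimal sketch covering three of the listed strategies (Timeout, Unavailable, Truncation). The names `perturb`, `ToolUnavailableError`, and `lookup` are hypothetical and do not reflect the actual State Controller API.

```python
from typing import Any, Callable, FrozenSet

Tool = Callable[..., Any]
STRATEGIES = ("timeout", "unavailable", "truncation")


class ToolUnavailableError(RuntimeError):
    """Raised by the wrapper to simulate an unavailable tool."""


def perturb(
    tool: Tool, strategy: str, fail_on_calls: FrozenSet[int] = frozenset({1})
) -> Tool:
    """Wrap `tool` so that the listed calls (1-indexed) are perturbed."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    calls = 0

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        nonlocal calls
        calls += 1
        if calls not in fail_on_calls:
            return tool(*args, **kwargs)  # unperturbed call
        if strategy == "timeout":
            raise TimeoutError("simulated tool timeout")
        if strategy == "unavailable":
            raise ToolUnavailableError("simulated tool outage")
        # "truncation": return only the first half of the real output.
        result = tool(*args, **kwargs)
        return result[: len(result) // 2]

    return wrapped


def lookup(city: str) -> str:
    """Hypothetical tool used only for this illustration."""
    return f"Forecast for {city}: sunny"


if __name__ == "__main__":
    flaky = perturb(lookup, "timeout")
    try:
        flaky("Paris")
    except TimeoutError as exc:
        print("first call failed:", exc)
    print("second call:", flaky("Paris"))
```

Selecting the failing calls by index instead of at random keeps every run reproducible, so two agents can be compared under the same perturbation.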

4. Planner-Actor Framework

Two-stage agent architecture:

  • Planner: Decomposes tasks into subtask sequences
  • Actor: Executes subtasks using ReAct reasoning with tool calls

Location: Orchestrator/mcpuniverse/agent/
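
The control flow between the two stages can be sketched without any LLM. In the example below both stages are rule-based stubs: `plan` splits the task on the word "then" and `act` performs a single tool call per subtask. These functions and the two toy tools are assumptions for illustration, not the classes in Orchestrator/mcpuniverse/agent/.

```python
from typing import Callable, Dict, List, Tuple

# A tool takes a string argument and returns a string observation.
ToolFn = Callable[[str], str]


def plan(task: str) -> List[Tuple[str, str]]:
    """Stub planner: split a task on ' then ' into (tool_name, argument) subtasks.

    A real planner would prompt an LLM; this rule-based split is an assumption.
    """
    subtasks = []
    for part in task.split(" then "):
        tool_name, _, argument = part.strip().partition(" ")
        subtasks.append((tool_name, argument))
    return subtasks


def act(subtask: Tuple[str, str], tools: Dict[str, ToolFn]) -> Tuple[bool, str]:
    """Stub actor: perform one tool call and return (success, observation)."""
    tool_name, argument = subtask
    if tool_name not in tools:
        return False, f"unknown tool: {tool_name}"
    return True, tools[tool_name](argument)


def run(task: str, tools: Dict[str, ToolFn]) -> List[str]:
    """Execute subtasks in order and stop at the first failure."""
    observations = []
    for subtask in plan(task):
        ok, observation = act(subtask, tools)
        observations.append(observation)
        if not ok:
            break  # in a full system this feedback would go back to the planner
    return observations


# Hypothetical tools for illustration only.
TOOLS: Dict[str, ToolFn] = {"upper": str.upper, "reverse": lambda s: s[::-1]}

if __name__ == "__main__":
    print(run("upper hello then reverse abc", TOOLS))
```

Keeping planning and execution in separate functions means either stage can be swapped for an LLM-backed version without touching the loop in `run`.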

5. LLM-as-Judge Evaluation

Multi-dimensional evaluation with:

  • 5 scoring dimensions: Task fulfillment, Grounding, Tool choice, Tool execution, Requirement satisfaction
  • Multi-model voting: Uses multiple LLM judges for robustness
  • Majority voting: Final score from consensus

Location: Orchestrator/mcpuniverse/evaluator/
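
Majority voting across judges can be applied per dimension, as in the sketch below. The integer score scale, the dimension keys, and the tie-breaking rule (the lower score wins) are assumptions made for this illustration. The evaluator's actual aggregation may differ.

```python
from collections import Counter
from typing import Dict, List

DIMENSIONS = [
    "task_fulfillment",
    "grounding",
    "tool_choice",
    "tool_execution",
    "requirement_satisfaction",
]


def majority_vote(scores: List[int]) -> int:
    """Return the most common score; ties go to the lower (more conservative) score."""
    counts = Counter(scores)
    best = max(counts.values())
    return min(score for score, n in counts.items() if n == best)


def aggregate(judgements: List[Dict[str, int]]) -> Dict[str, int]:
    """Apply majority voting independently to each scoring dimension."""
    return {dim: majority_vote([j[dim] for j in judgements]) for dim in DIMENSIONS}


# Made-up scores from three hypothetical judges.
JUDGEMENTS = [
    dict(zip(DIMENSIONS, [4, 5, 3, 4, 2])),
    dict(zip(DIMENSIONS, [4, 4, 3, 5, 2])),
    dict(zip(DIMENSIONS, [3, 5, 4, 3, 2])),
]

if __name__ == "__main__":
    for dim, score in aggregate(JUDGEMENTS).items():
        print(f"{dim}: {score}")
```

In this made-up data the three judges disagree completely on tool execution (4, 5, 3), so the tie-break decides that dimension. Any real aggregation needs an explicit rule for that case.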

Project Structure

ToolGym/
├── README.md                    # This file
├── docs/                        # GitHub Pages website
│   └── index.html              # Leaderboard & documentation
│
├── task_creation_engine/        # Task synthesis
│   └── query_generate.py       # Workflow generation
│
├── tool_retrieval_index/        # Semantic tool search
│   └── server.py               # MCP server with search
│
├── toolgym/                     # Core library
│   └── state_controller/       # Robustness testing
│
├── Orchestrator/                # Agent framework
│   └── mcpuniverse/
│       ├── agent/              # Planner-Actor implementation
│       └── evaluator/          # LLM-as-Judge
│
├── MCP_INFO_MGR/                # Tool data management
│   ├── mcp_data/               # Tool metadata
│   └── semantic_search/        # FAISS index
│
├── runtime/                     # Agent runtime
│   └── run_react_agent.py      # CLI interface
│
└── evaluation/                  # Evaluation scripts

Dataset

The ToolGym dataset is available on HuggingFace:

🤗 ToolGym

Contents:

  • 3,091 task instances with ground-truth tool sequences
  • Tool metadata for 5,571 tools across 204 applications
  • Constraint annotations and perturbation configurations

Citation

@inproceedings{toolgym2025,
  title={ToolGym: An Open-world Tool-using Environment for LLM Agent Evaluation},
  author={...},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on the Model Context Protocol (MCP) ecosystem
  • Tool data sourced from Smithery and other MCP registries
  • Evaluation framework inspired by recent LLM-as-Judge research

Website: https://ziqiao-git.github.io/ToolGym/
Dataset: https://huggingface.co/ToolGym
GitHub: https://github.com/Ziqiao-git/ToolGym
