OSWorld-MCP: A comprehensive MCP server for computer-use agents with 158 validated tools
OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents
🔔 Updates
2025-10-28: We released our paper and project page! 🎉
📄 Read the Paper | 🌐 Visit the Project Page
📑 Overview & Key Highlights
OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.
Key Features & Findings
- 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities); 25 of them are distractor tools for robustness testing
- 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
- Multi-round tool invocation possible, posing real decision-making challenges
- MCP tools boost model accuracy & efficiency — e.g., OpenAI o3: 8.3% → 20.4% (15 steps)
- Highest observed Tool Invocation Rate (TIR) = 36.3% (Claude-4-Sonnet, 50 steps) → indicating ample room for improvement
- MCP tools improve agent metrics
- Higher tool invocation correlates with higher accuracy
- Combining tools introduces significant challenges
Architecture Overview
Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.
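As a rough illustration of what "integrating GUI actions and MCP tool invocations" can mean for an agent loop, the sketch below models a step as either a GUI action or a tool call and counts both kinds. Every name here (`GuiAction`, `ToolCall`, `Done`, `run_episode`, `hypothetical_tool`) is an assumption made for illustration and is not taken from the OSWorld-MCP code base.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class GuiAction:
    """A low-level GUI operation such as a click (hypothetical shape)."""
    command: str


@dataclass
class ToolCall:
    """An MCP tool invocation by name with arguments (hypothetical shape)."""
    tool_name: str
    arguments: dict


@dataclass
class Done:
    """Signals that the agent considers the task finished."""


Action = Union[GuiAction, ToolCall, Done]


def run_episode(policy: Callable[[list], Action], max_steps: int) -> dict:
    """Run one episode and count GUI steps and tool steps separately."""
    history: list = []
    counts = {"gui": 0, "tool": 0}
    for _ in range(max_steps):
        action = policy(history)
        if isinstance(action, Done):
            break
        counts["tool" if isinstance(action, ToolCall) else "gui"] += 1
        history.append(action)
    return {"steps": len(history), **counts}


def scripted_policy(history: list) -> Action:
    """Toy policy: one tool call, one GUI click, then stop."""
    script = [
        ToolCall("hypothetical_tool", {"path": "report.ods"}),
        GuiAction("click(100, 200)"),
        Done(),
    ]
    return script[len(history)]


result = run_episode(scripted_policy, max_steps=15)
print(result)  # {'steps': 2, 'gui': 1, 'tool': 1}
```

A real agent would replace `scripted_policy` with a model call that sees screenshots and tool descriptions; the point of the sketch is only that both action types share one step budget.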
⚙️ Installation & Usage
1️⃣ Preparation: Code Setup
# Clone OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git
# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git
Integrate OSWorld-MCP files into OSWorld to enable MCP support.
2️⃣ Preparation: Docker Environment
- Copy MCP files into /home inside Docker:
/home/
└── mcp_server/
└── osworld_mcp_client.py
- Install dependencies:
pip install -r requirements.txt
- Install Node.js
- Launch MCP server:
cd mcp_server
bash debug_server.sh
A successful launch opens the local MCP debug UI in your browser.
3️⃣ Running Evaluation
Example: Evaluate Claude 4 Sonnet (15 steps):
python run_multienv_e2e.py \
--api_url <your_api_url> \
--api_key <your_api_key> \
--model 'claude-sonnet-4-20250514-thinking' \
--test_all_meta_path 'evaluation_examples/test_all.json' \
--num_envs 1 \
--action_space mcp \
--max_steps 15 \
--max_trajectory_length 15
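To run the same evaluation at several step budgets (the leaderboard reports 15 and 50), the command above can be assembled programmatically. This is a minimal sketch: the flags are copied from the example command, while the helper name `build_eval_command` and the choice to set `--max_trajectory_length` equal to `--max_steps` are assumptions that simply mirror the example.

```python
import shlex


def build_eval_command(model: str, max_steps: int, api_url: str, api_key: str) -> list:
    """Build the argument list for run_multienv_e2e.py (flags from the README example)."""
    return [
        "python", "run_multienv_e2e.py",
        "--api_url", api_url,
        "--api_key", api_key,
        "--model", model,
        "--test_all_meta_path", "evaluation_examples/test_all.json",
        "--num_envs", "1",
        "--action_space", "mcp",
        "--max_steps", str(max_steps),
        # Assumption: keep the trajectory length equal to the step budget,
        # as in the README example.
        "--max_trajectory_length", str(max_steps),
    ]


for steps in (15, 50):
    cmd = build_eval_command(
        model="claude-sonnet-4-20250514-thinking",
        max_steps=steps,
        api_url="<your_api_url>",
        api_key="<your_api_key>",
    )
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```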
📐 Key Metrics
- Task Accuracy (Acc): % of tasks successfully completed.
- Tool Invocation Rate (TIR): % of tasks on which the agent correctly decides whether or not to invoke an MCP tool.
- Average Completion Steps (ACS): average number of actions per completed task.
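The three metrics can be sketched from per-task records as follows. This is an illustrative reading of the one-line definitions above, not the benchmark's scoring code: the record fields are hypothetical, TIR is assumed to be the share of tasks where tool use matches whether the task is tool-beneficial, and ACS is assumed to average over every recorded episode. The paper's exact formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Outcome of one evaluated task (hypothetical field names)."""
    success: bool          # task passed its checker
    tool_beneficial: bool  # task is labeled as benefiting from an MCP tool
    used_tool: bool        # agent invoked at least one MCP tool
    steps: int             # number of actions taken


def compute_metrics(records: list) -> dict:
    """Return Acc and TIR as percentages and ACS as a step count."""
    n = len(records)
    acc = 100.0 * sum(r.success for r in records) / n
    # Assumption: a tool decision is "correct" when the agent uses a tool
    # exactly on the tasks labeled tool-beneficial.
    tir = 100.0 * sum(r.used_tool == r.tool_beneficial for r in records) / n
    # Assumption: every recorded episode counts as completed.
    acs = sum(r.steps for r in records) / n
    return {"acc": acc, "tir": tir, "acs": acs}


records = [
    TaskRecord(success=True, tool_beneficial=True, used_tool=True, steps=6),
    TaskRecord(success=False, tool_beneficial=True, used_tool=False, steps=15),
    TaskRecord(success=True, tool_beneficial=False, used_tool=False, steps=9),
    TaskRecord(success=False, tool_beneficial=False, used_tool=True, steps=15),
]
metrics = compute_metrics(records)
print(metrics)  # {'acc': 50.0, 'tir': 50.0, 'acs': 11.25}
```

Restricting ACS to successful tasks only would be a one-line change to the `acs` computation.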
📊 Leaderboard (Sorted by Accuracy)
🔗 Live Leaderboard: osworld-mcp.github.io
Max Steps: 15
| Model / Agent | Acc | TIR | ACS |
|---|---|---|---|
| Agent-S2.5 | 42.1 | 30.0 | 10.0 |
| Claude-4-Sonnet | 35.3 | 30.0 | 10.4 |
| Seed1.5-VL | 32.0 | 25.1 | 10.2 |
| Qwen3-VL | 31.3 | 24.5 | 10.5 |
| Gemini-2.5-Pro | 20.5 | 16.8 | 11.4 |
| OpenAI o3 | 20.4 | 16.7 | 11.6 |
| Qwen2.5-VL | 15.8 | 13.1 | 13.5 |
Max Steps: 50
| Model / Agent | Acc | TIR | ACS |
|---|---|---|---|
| Agent-S2.5 | 49.5 | 35.3 | 17.0 |
| Claude-4-Sonnet | 43.3 | 36.3 | 20.1 |
| Qwen3-VL | 39.1 | 29.5 | 21.1 |
| Seed1.5-VL | 38.4 | 29.0 | 23.0 |
| Gemini-2.5-Pro | 27.2 | 21.5 | 29.7 |
| OpenAI o3 | 25.2 | 21.0 | 32.1 |
| Qwen2.5-VL | 14.8 | 10.9 | 37.2 |
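The finding that higher tool invocation correlates with higher accuracy can be checked informally against the Max Steps: 15 table. The sketch below computes a Pearson correlation between the TIR and Acc columns; the values are copied from that table, the helper `pearson` is written here for illustration, and with only seven models the result is a rough indication rather than evidence of causation.

```python
import math

# (model, Acc, TIR) rows copied from the Max Steps: 15 leaderboard table.
rows_15 = [
    ("Agent-S2.5", 42.1, 30.0),
    ("Claude-4-Sonnet", 35.3, 30.0),
    ("Seed1.5-VL", 32.0, 25.1),
    ("Qwen3-VL", 31.3, 24.5),
    ("Gemini-2.5-Pro", 20.5, 16.8),
    ("OpenAI o3", 20.4, 16.7),
    ("Qwen2.5-VL", 15.8, 13.1),
]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


acc = [row[1] for row in rows_15]
tir = [row[2] for row in rows_15]
r = pearson(tir, acc)
print(f"Pearson r between TIR and Acc (15 steps): {r:.3f}")
```

On these rows the coefficient is strongly positive, which matches the stated finding.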
📚 Citation
@article{jia2025osworldmcp,
title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
year={2025},
journal={arXiv preprint arXiv:2510.24563}
}