OSWorld-MCP: A comprehensive MCP server for computer-use agents with 158 validated tools

OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

🔔 Updates

2025-10-28: We released our paper and project page! 🎉

📄 Read the Paper  |  🌐 Visit the Project Page


📑 Overview & Key Highlights

OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.

Key Features & Findings

  • 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities), including 25 distractor tools for robustness testing
  • 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
  • Multi-round tool invocation possible, posing real decision-making challenges
  • MCP tools boost model accuracy & efficiency — e.g., OpenAI o3: 8.3% → 20.4% (15 steps)
  • Highest observed Tool Invocation Rate (TIR) = 36.3% (Claude-4-Sonnet, 50 steps) → indicating ample room for improvement
  • MCP tools improve agent metrics
  • Higher tool invocation correlates with higher accuracy
  • Combining tools introduces significant challenges
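
To make the headline figures concrete, the short arithmetic sketch below uses only numbers quoted in the list above. The variable names are illustrative, and it assumes the o3 figures read as accuracy without MCP tools followed by accuracy with them.

```python
# Figures quoted in the feature list; variable names are illustrative only.
total_tools = 158
distractor_tools = 25
o3_acc_without_mcp = 8.3  # % accuracy at 15 steps (assumed: no MCP tools)
o3_acc_with_mcp = 20.4    # % accuracy at 15 steps (assumed: with MCP tools)

# Tools that are not distractors.
non_distractor_tools = total_tools - distractor_tools

# Gain in percentage points and as a ratio.
absolute_gain = round(o3_acc_with_mcp - o3_acc_without_mcp, 1)
relative_gain = round(o3_acc_with_mcp / o3_acc_without_mcp, 2)

print(non_distractor_tools)  # 133
print(absolute_gain)         # 12.1
print(relative_gain)         # 2.46
```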

Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.


⚙️ Installation & Usage

1️⃣ Preparation: Code Setup

# Clone OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git

Integrate OSWorld-MCP files into OSWorld to enable MCP support.


2️⃣ Preparation: Docker Environment

  1. Copy MCP files into /home inside Docker:
/home/
├── mcp_server/
└── osworld_mcp_client.py
  2. Install dependencies:
pip install -r requirements.txt
  3. Install Node.js.
  4. Launch MCP server:
cd mcp_server
bash debug_server.sh

A successful launch opens the local MCP debug UI in your browser.
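
Before launching the server, it can help to confirm that the files landed where the steps above expect them. The following is a minimal sketch, not part of OSWorld-MCP: the helper name `missing_mcp_entries` is hypothetical, and the two expected entries are taken from the /home listing above.

```python
from pathlib import Path

# Entries the setup steps place under /home (taken from the listing above).
EXPECTED_ENTRIES = ("mcp_server", "osworld_mcp_client.py")


def missing_mcp_entries(home: Path = Path("/home")) -> list[str]:
    """Return the expected entries that are absent under `home` (hypothetical helper)."""
    return [name for name in EXPECTED_ENTRIES if not (home / name).exists()]


if __name__ == "__main__":
    missing = missing_mcp_entries()
    print("missing:", missing if missing else "none")
```

Run it inside the container; an empty result suggests the layout matches the listing.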


3️⃣ Running Evaluation

Example: Evaluate Claude 4 Sonnet (15 steps):

python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
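
To repeat the run at several step budgets (the leaderboard reports 15 and 50), the command can be assembled programmatically. This is a sketch under assumptions: `build_eval_command` is a hypothetical helper, it uses only the flags shown in the example above, and it assumes `--max_trajectory_length` should track `--max_steps` as it does in the 15-step example.

```python
import shlex


def build_eval_command(api_url: str, api_key: str, model: str, max_steps: int) -> list[str]:
    """Assemble the argument list using only the flags shown in the example (hypothetical helper)."""
    return [
        "python", "run_multienv_e2e.py",
        "--api_url", api_url,
        "--api_key", api_key,
        "--model", model,
        "--test_all_meta_path", "evaluation_examples/test_all.json",
        "--num_envs", "1",
        "--action_space", "mcp",
        "--max_steps", str(max_steps),
        # Assumption: trajectory length matches the step budget, as in the 15-step example.
        "--max_trajectory_length", str(max_steps),
    ]


for steps in (15, 50):
    cmd = build_eval_command("<your_api_url>", "<your_api_key>",
                             "claude-sonnet-4-20250514-thinking", steps)
    # Print a shell-quoted command line; to launch it, pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```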

📐 Key Metrics

  1. Task Accuracy (Acc) — % of tasks successfully completed.
  2. Tool Invocation Rate (TIR) — % of tasks on which the agent makes the correct decision to invoke a tool or not.
  3. Average Completion Steps (ACS) — average number of actions per completed task.
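
The three metrics can be sketched from per-task records as follows. This is an illustration under stated assumptions, not the benchmark's scoring code. The `TaskResult` fields are hypothetical. TIR is assumed to count a task as correct when a tool is used exactly when the task is tool-beneficial, and the paper's definition may add conditions. ACS is assumed to average over every finished episode.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are not taken from the benchmark code.
    success: bool          # task passed its checker
    tool_beneficial: bool  # task is labelled as benefiting from MCP tools
    used_tool: bool        # agent invoked at least one MCP tool
    steps: int             # actions taken before the episode ended


def summarize(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    acc = 100 * sum(r.success for r in results) / n
    # Assumed TIR: the tool decision is correct when usage matches the task label.
    tir = 100 * sum(r.used_tool == r.tool_beneficial for r in results) / n
    # Assumed ACS: mean number of steps over all finished episodes.
    acs = sum(r.steps for r in results) / n
    return {"Acc": round(acc, 1), "TIR": round(tir, 1), "ACS": round(acs, 1)}


results = [
    TaskResult(success=True, tool_beneficial=True, used_tool=True, steps=6),
    TaskResult(success=False, tool_beneficial=True, used_tool=False, steps=15),
    TaskResult(success=True, tool_beneficial=False, used_tool=False, steps=9),
    TaskResult(success=False, tool_beneficial=False, used_tool=True, steps=14),
]
print(summarize(results))  # {'Acc': 50.0, 'TIR': 50.0, 'ACS': 11.0}
```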

📊 Leaderboard (Sorted by Accuracy)

🔗 Live Leaderboard: osworld-mcp.github.io

Max Steps: 15

Model / Agent      Acc (%)   TIR (%)   ACS (steps)
Agent-S2.5         42.1      30.0      10.0
Claude-4-Sonnet    35.3      30.0      10.4
Seed1.5-VL         32.0      25.1      10.2
Qwen3-VL           31.3      24.5      10.5
Gemini-2.5-Pro     20.5      16.8      11.4
OpenAI o3          20.4      16.7      11.6
Qwen2.5-VL         15.8      13.1      13.5

Max Steps: 50

Model / Agent      Acc (%)   TIR (%)   ACS (steps)
Agent-S2.5         49.5      35.3      17.0
Claude-4-Sonnet    43.3      36.6      20.1
Qwen3-VL           39.1      29.5      21.1
Seed1.5-VL         38.4      29.0      23.0
Gemini-2.5-Pro     27.2      21.5      29.7
OpenAI o3          25.2      21.0      32.1
Qwen2.5-VL         14.8      10.9      37.2
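
One way to read the two leaderboards together is to compute how much accuracy each agent gains when the step budget grows from 15 to 50. The sketch below copies the Acc columns from the tables above; the dictionary layout and names are illustrative.

```python
# (Acc at 15 steps, Acc at 50 steps), copied from the two leaderboards above.
ACC = {
    "Agent-S2.5": (42.1, 49.5),
    "Claude-4-Sonnet": (35.3, 43.3),
    "Seed1.5-VL": (32.0, 38.4),
    "Qwen3-VL": (31.3, 39.1),
    "Gemini-2.5-Pro": (20.5, 27.2),
    "OpenAI o3": (20.4, 25.2),
    "Qwen2.5-VL": (15.8, 14.8),
}

# Accuracy change in percentage points when moving from 15 to 50 steps.
gains = {name: round(a50 - a15, 1) for name, (a15, a50) in ACC.items()}

# List agents from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {gain:+.1f}")
```

In these figures every agent except Qwen2.5-VL gains accuracy with the larger budget.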

📚 Citation

@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}
