
OSWorld-MCP: A comprehensive MCP server for computer-use agents with 158 validated tools


OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

🔔 Updates

2025-10-28: We released our paper and project page! 🎉

📄 Read the Paper  |  🌐 Visit the Project Page


📑 Overview & Key Highlights

OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.

Key Features & Findings

  • 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities), of which 25 are distractor tools for robustness testing
  • 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
  • Tasks may require multiple rounds of tool invocation, which poses real decision-making challenges
  • MCP tools boost model accuracy & efficiency — e.g., OpenAI o3: 8.3% → 20.4% (15 steps)
  • Highest observed Tool Invocation Rate (TIR) = 36.3% (Claude-4-Sonnet, 50 steps) → indicating ample room for improvement
  • MCP tools improve agent metrics
  • Higher tool invocation correlates with higher accuracy
  • Tasks that require combining several tools remain significantly harder for agents

Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.


⚙️ Installation & Usage

1️⃣ Preparation: Code Setup

# Clone OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git

Integrate OSWorld-MCP files into OSWorld to enable MCP support.


2️⃣ Preparation: Docker Environment

  1. Copy MCP files into /home inside Docker:
     /home/
     ├── mcp_server/
     └── osworld_mcp_client.py
  2. Install dependencies:
     pip install -r requirements.txt
  3. Install Node.js.
  4. Launch MCP server:
     cd mcp_server
     bash debug_server.sh

A successful launch opens the local MCP debug UI in your browser.
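
Before launching the server, it can help to confirm that the files landed where the steps above expect them. The following is a minimal sketch, not part of OSWorld-MCP: the helper name `missing_mcp_files` is hypothetical, and the assumption that `debug_server.sh` sits directly inside `mcp_server/` is inferred from the launch commands above.

```python
from pathlib import Path

# Paths named in the setup steps, relative to the copy target (/home).
# Assumption: debug_server.sh lives directly inside mcp_server/.
EXPECTED = (
    "mcp_server",
    "osworld_mcp_client.py",
    "mcp_server/debug_server.sh",
)


def missing_mcp_files(base="/home"):
    """Return the expected paths that do not exist under `base`."""
    root = Path(base)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_mcp_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("MCP files are in place.")
```

Run it inside the container; an empty result means the layout matches the tree shown in step 1.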


3️⃣ Running Evaluation

Example: Evaluate Claude 4 Sonnet (15 steps):

python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
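
To sweep step budgets or models, the same command can be assembled programmatically. This is a sketch under stated assumptions: `build_eval_command` is a hypothetical helper, it uses only the flags shown in the example above, and defaulting `--max_trajectory_length` to the value of `--max_steps` simply mirrors that example rather than any documented rule.

```python
import shlex


def build_eval_command(api_url, api_key, model, max_steps=15,
                       max_trajectory_length=None, num_envs=1,
                       meta_path="evaluation_examples/test_all.json"):
    """Build the argument list for run_multienv_e2e.py from the flags shown above."""
    if max_trajectory_length is None:
        # Assumption: mirror the example, where both values are 15.
        max_trajectory_length = max_steps
    return [
        "python", "run_multienv_e2e.py",
        "--api_url", api_url,
        "--api_key", api_key,
        "--model", model,
        "--test_all_meta_path", meta_path,
        "--num_envs", str(num_envs),
        "--action_space", "mcp",
        "--max_steps", str(max_steps),
        "--max_trajectory_length", str(max_trajectory_length),
    ]


if __name__ == "__main__":
    cmd = build_eval_command("<your_api_url>", "<your_api_key>",
                             "claude-sonnet-4-20250514-thinking", max_steps=50)
    # Print a shell-quoted command line for inspection; pass `cmd` to
    # subprocess.run(cmd, check=True) from the OSWorld directory to execute it.
    print(shlex.join(cmd))
```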

📐 Key Metrics

  1. Task Accuracy (Acc): percentage of tasks successfully completed.
  2. Tool Invocation Rate (TIR): percentage of tasks on which the agent makes the correct decision to invoke, or not invoke, an MCP tool.
  3. Average Completion Steps (ACS): average number of actions per completed task.
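
The three metrics can be illustrated with a small sketch computed from per-task records. It follows only the one-line definitions above, so several details are assumptions: the record fields are hypothetical, a "correct decision" is read as using a tool exactly when the task is tool-beneficial, and ACS is averaged over successful tasks. The paper's exact formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluated task (hypothetical record layout)."""
    success: bool          # task completed successfully
    tool_beneficial: bool  # task is labeled as benefiting from MCP tools
    used_tool: bool        # agent invoked at least one MCP tool
    steps: int             # number of actions the agent took


def summarize(results):
    """Return Acc (%), TIR (%), and ACS under the assumptions stated above."""
    if not results:
        raise ValueError("no task results to summarize")
    n = len(results)
    completed = [r for r in results if r.success]
    acc = 100.0 * len(completed) / n
    # Assumption: the decision is correct when tool use matches the task label.
    tir = 100.0 * sum(r.used_tool == r.tool_beneficial for r in results) / n
    # Assumption: ACS averages steps over successfully completed tasks only.
    acs = sum(r.steps for r in completed) / len(completed) if completed else float("nan")
    return {"acc": acc, "tir": tir, "acs": acs}
```

Higher Acc and TIR are better, while a lower ACS indicates that tasks were finished in fewer actions.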

📊 Leaderboard (Sorted by Accuracy)

🔗 Live Leaderboard: osworld-mcp.github.io

Max Steps: 15

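| Model / Agent   | Acc (%) | TIR (%) | ACS  |
|-----------------|---------|---------|------|
| Agent-S2.5      | 42.1    | 30.0    | 10.0 |
| Claude-4-Sonnet | 35.3    | 30.0    | 10.4 |
| Seed1.5-VL      | 32.0    | 25.1    | 10.2 |
| Qwen3-VL        | 31.3    | 24.5    | 10.5 |
| Gemini-2.5-Pro  | 20.5    | 16.8    | 11.4 |
| OpenAI o3       | 20.4    | 16.7    | 11.6 |
| Qwen2.5-VL      | 15.8    | 13.1    | 13.5 |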

Max Steps: 50

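| Model / Agent   | Acc (%) | TIR (%) | ACS  |
|-----------------|---------|---------|------|
| Agent-S2.5      | 49.5    | 35.3    | 17.0 |
| Claude-4-Sonnet | 43.3    | 36.6    | 20.1 |
| Qwen3-VL        | 39.1    | 29.5    | 21.1 |
| Seed1.5-VL      | 38.4    | 29.0    | 23.0 |
| Gemini-2.5-Pro  | 27.2    | 21.5    | 29.7 |
| OpenAI o3       | 25.2    | 21.0    | 32.1 |
| Qwen2.5-VL      | 14.8    | 10.9    | 37.2 |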
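
To see how much each agent gains from the larger step budget, the two leaderboards can be compared directly. The sketch below copies the accuracy figures from the tables above; the helper names are hypothetical and the snippet is only an illustration of the arithmetic.

```python
# Accuracy (%) per agent at each step budget, copied from the leaderboards above.
ACCURACY = {
    "Agent-S2.5":      {15: 42.1, 50: 49.5},
    "Claude-4-Sonnet": {15: 35.3, 50: 43.3},
    "Seed1.5-VL":      {15: 32.0, 50: 38.4},
    "Qwen3-VL":        {15: 31.3, 50: 39.1},
    "Gemini-2.5-Pro":  {15: 20.5, 50: 27.2},
    "OpenAI o3":       {15: 20.4, 50: 25.2},
    "Qwen2.5-VL":      {15: 15.8, 50: 14.8},
}


def accuracy_gain(model):
    """Accuracy change in percentage points when moving from 15 to 50 steps."""
    scores = ACCURACY[model]
    return round(scores[50] - scores[15], 1)


def ranked_gains():
    """All agents sorted by accuracy gain, largest first."""
    return sorted(ACCURACY, key=accuracy_gain, reverse=True)


if __name__ == "__main__":
    for name in ranked_gains():
        print(f"{name:16s} {accuracy_gain(name):+.1f}")
```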

📚 Citation

@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}
