
OSWorld-MCP: A comprehensive MCP server for computer-use agents with 158 validated tools


OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

🔔 Updates

2025-10-28: We released our paper and project page! 🎉

📄 Read the Paper  |  🌐 Visit the Project Page


📑 Overview & Key Highlights

OSWorld-MCP is a comprehensive and fair benchmark for evaluating computer-use agents in real-world scenarios.
It jointly measures Model Context Protocol (MCP) tool invocation capabilities, graphical user interface (GUI) operation skills, and decision-making performance.
Designed as an extension of OSWorld, it significantly improves realism, balance, and comparability in evaluation.

Key Features & Findings

  • 158 validated MCP tools spanning 7 common applications (LibreOffice Writer, Calc, Impress, VS Code, Google Chrome, VLC, OS utilities), including 25 distractor tools for robustness testing
  • 250 tool-beneficial tasks → 69% of benchmark tasks benefit from MCP tools
  • Multi-round tool invocation possible, posing real decision-making challenges
  • MCP tools boost model accuracy & efficiency — e.g., OpenAI o3: 8.3% → 20.4% (15 steps)
  • Highest observed Tool Invocation Rate (TIR) = 36.3% (Claude-4-Sonnet, 50 steps), which leaves ample room for improvement
  • MCP tools improve agent metrics
  • Higher tool invocation correlates with higher accuracy
  • Combining tools introduces significant challenges

Architecture Overview

Figure: OSWorld-MCP evaluation framework integrating GUI actions and MCP tool invocations.


⚙️ Installation & Usage

1️⃣ Preparation: Code Setup

# Clone OSWorld base repo
git clone https://github.com/xlang-ai/OSWorld.git

# Clone OSWorld-MCP
git clone https://github.com/X-PLUG/OSWorld-MCP.git

Integrate OSWorld-MCP files into OSWorld to enable MCP support.


2️⃣ Preparation: Docker Environment

  1. Copy MCP files into /home inside Docker:
/home/
├── mcp_server/
└── osworld_mcp_client.py
  2. Install dependencies:
pip install -r requirements.txt
  3. Install Node.js.
  4. Launch MCP server:
cd mcp_server
bash debug_server.sh

A successful launch opens the local MCP debug UI in your browser.
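
Before launching the server, it can help to confirm that the files listed above are actually in place. The following is a minimal sketch, not part of OSWorld-MCP: the helper name `missing_mcp_files` is hypothetical, and it assumes `mcp_server/` and `osworld_mcp_client.py` sit side by side directly under /home.

```python
from pathlib import Path

# Entries expected directly under /home, per the layout listed above.
# Assumption: both are siblings; the helper name is illustrative only.
EXPECTED = ("mcp_server", "osworld_mcp_client.py")


def missing_mcp_files(home: Path = Path("/home")) -> list[str]:
    """Return the expected entries that are absent under `home`."""
    return [name for name in EXPECTED if not (home / name).exists()]


if __name__ == "__main__":
    missing = missing_mcp_files()
    print("layout OK" if not missing else f"missing: {missing}")
```

Run it inside the container; an empty result means both entries were found.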


3️⃣ Running Evaluation

Example: Evaluate Claude 4 Sonnet (15 steps):

python run_multienv_e2e.py \
    --api_url <your_api_url> \
    --api_key <your_api_key> \
    --model 'claude-sonnet-4-20250514-thinking' \
    --test_all_meta_path 'evaluation_examples/test_all.json' \
    --num_envs 1 \
    --action_space mcp \
    --max_steps 15 \
    --max_trajectory_length 15
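
To run the same evaluation at more than one step budget (the leaderboard reports 15 and 50 steps), the command above can be assembled programmatically. This is a sketch only: `build_eval_command` is a hypothetical helper, it uses just the flags shown above, and it assumes `--max_trajectory_length` should track `--max_steps` as in the 15-step example.

```python
import shlex


def build_eval_command(api_url: str, api_key: str, max_steps: int = 15) -> list[str]:
    """Assemble the run_multienv_e2e.py invocation shown above as an argv list."""
    return [
        "python", "run_multienv_e2e.py",
        "--api_url", api_url,
        "--api_key", api_key,
        "--model", "claude-sonnet-4-20250514-thinking",
        "--test_all_meta_path", "evaluation_examples/test_all.json",
        "--num_envs", "1",
        "--action_space", "mcp",
        "--max_steps", str(max_steps),
        # Assumption: trajectory length mirrors the step budget, as in the example.
        "--max_trajectory_length", str(max_steps),
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed first.
    for steps in (15, 50):
        print(shlex.join(build_eval_command("<your_api_url>", "<your_api_key>", steps)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the OSWorld checkout once the placeholders are replaced with real values.
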

📐 Key Metrics

  1. Task Accuracy (Acc): % of tasks successfully completed.
  2. Tool Invocation Rate (TIR): % of tasks on which the agent makes the correct decision about whether to invoke an MCP tool.
  3. Average Completion Steps (ACS): average number of actions per completed task.
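
The three metrics can be sketched from per-task records as follows. This is an illustration under stated assumptions, not the benchmark's scoring code: the `TaskResult` fields are hypothetical, a tool decision is counted as correct when the agent invokes a tool on a tool-beneficial task or refrains on any other task, and ACS is averaged over every evaluated task.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    success: bool          # task passed its checker
    tool_beneficial: bool  # task is labeled as benefiting from MCP tools
    used_tool: bool        # agent invoked at least one MCP tool
    steps: int             # actions taken before the episode ended


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Compute Acc (%), TIR (%), and ACS under the assumptions stated above."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    acc = 100 * sum(r.success for r in results) / n
    # Assumption: a decision is correct when tool use matches the task label.
    tir = 100 * sum(r.used_tool == r.tool_beneficial for r in results) / n
    # Assumption: every evaluated task contributes its step count.
    acs = sum(r.steps for r in results) / n
    return {"Acc": acc, "TIR": tir, "ACS": acs}


demo = [
    TaskResult(True, True, True, 6),
    TaskResult(False, True, False, 15),
    TaskResult(True, False, False, 9),
    TaskResult(False, False, True, 15),
]
print(summarize(demo))  # {'Acc': 50.0, 'TIR': 50.0, 'ACS': 11.25}
```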

📊 Leaderboard (Sorted by Accuracy)

🔗 Live Leaderboard: osworld-mcp.github.io

Max Steps: 15

Model / Agent      Acc (%)   TIR (%)   ACS
Agent-S2.5         42.1      30.0      10.0
Claude-4-Sonnet    35.3      30.0      10.4
Seed1.5-VL         32.0      25.1      10.2
Qwen3-VL           31.3      24.5      10.5
Gemini-2.5-Pro     20.5      16.8      11.4
OpenAI o3          20.4      16.7      11.6
Qwen2.5-VL         15.8      13.1      13.5
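
The observation that higher tool invocation goes with higher accuracy can be checked against the 15-step rows above. The sketch below computes a plain Pearson correlation over the seven (Acc, TIR) pairs; the choice of Pearson is ours for illustration and is not necessarily the analysis used in the paper.

```python
from math import sqrt

# (Acc, TIR) pairs copied from the 15-step leaderboard above.
ROWS_15 = {
    "Agent-S2.5": (42.1, 30.0),
    "Claude-4-Sonnet": (35.3, 30.0),
    "Seed1.5-VL": (32.0, 25.1),
    "Qwen3-VL": (31.3, 24.5),
    "Gemini-2.5-Pro": (20.5, 16.8),
    "OpenAI o3": (20.4, 16.7),
    "Qwen2.5-VL": (15.8, 13.1),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


acc, tir = zip(*ROWS_15.values())
r = pearson(acc, tir)
print(f"Pearson r(Acc, TIR) at 15 steps: {r:.2f}")
```

With only seven models this is a descriptive check, and a strong correlation does not by itself show that invoking tools causes the accuracy gain.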

Max Steps: 50

Model / Agent      Acc (%)   TIR (%)   ACS
Agent-S2.5         49.5      35.3      17.0
Claude-4-Sonnet    43.3      36.6      20.1
Qwen3-VL           39.1      29.5      21.1
Seed1.5-VL         38.4      29.0      23.0
Gemini-2.5-Pro     27.2      21.5      29.7
OpenAI o3          25.2      21.0      32.1
Qwen2.5-VL         14.8      10.9      37.2
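
To see how much each model gains from the larger step budget, the two leaderboards can be compared row by row. A small sketch using only the Acc values listed above (the helper name `acc_gain` is ours):

```python
# Acc (%) at 15 and 50 steps, copied from the two leaderboards above.
ACC = {
    "Agent-S2.5": (42.1, 49.5),
    "Claude-4-Sonnet": (35.3, 43.3),
    "Seed1.5-VL": (32.0, 38.4),
    "Qwen3-VL": (31.3, 39.1),
    "Gemini-2.5-Pro": (20.5, 27.2),
    "OpenAI o3": (20.4, 25.2),
    "Qwen2.5-VL": (15.8, 14.8),
}


def acc_gain(model: str) -> float:
    """Accuracy change in percentage points when going from 15 to 50 steps."""
    acc_15, acc_50 = ACC[model]
    return round(acc_50 - acc_15, 1)


# Print models from largest to smallest gain.
for model in sorted(ACC, key=acc_gain, reverse=True):
    print(f"{model:<16} {acc_gain(model):+.1f}")
```

Six of the seven models improve by 4.8 to 8.0 points with the larger budget, while Qwen2.5-VL drops by 1.0 point.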

📚 Citation

@article{jia2025osworldmcp,
  title={OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents},
  author={Jia, Hongrui and Liao, Jitong and Zhang, Xi and Xu, Haiyang and Xie, Tianbao and Jiang, Chaoya and Yan, Ming and Liu, Si and Ye, Wei and Huang, Fei},
  year={2025},
  journal={arXiv preprint arXiv:2510.24563}
}
