# Composio Tool Router Simulator

A benchmarking tool that tests how different LLM models perform when using Composio's Tool Router (the system powering Rube). Compare models across accuracy, speed, cost, and tool-selection quality.
## What is Tool Router?

Tool Router is Composio's agentic system that:
- Searches 10,000+ tools automatically based on natural language
- Plans multi-step workflows
- Executes tools with proper authentication
- Powers the Rube MCP server
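The search → plan → execute flow above can be sketched roughly as follows. Every name in this snippet is a hypothetical stand-in for illustration only; none of it is Composio's actual API:

```python
# Hypothetical sketch of a Tool Router-style loop: search a tool catalog,
# plan steps, then execute them. All names here are illustrative, not
# Composio's real SDK.

def search_tools(query: str) -> list[str]:
    # Stand-in for semantic search over a large tool catalog.
    catalog = {
        "unread emails": ["GMAIL_FETCH_EMAILS"],
        "slack message": ["SLACK_SEND_MESSAGE"],
    }
    return next(
        (tools for key, tools in catalog.items() if key in query.lower()), []
    )

def plan_steps(task: str, tools: list[str]) -> list[dict]:
    # Stand-in for LLM planning: one step per selected tool.
    return [{"tool": tool, "task": task} for tool in tools]

def execute_step(step: dict) -> dict:
    # Stand-in for authenticated tool execution.
    return {"tool": step["tool"], "status": "ok"}

def run_task(task: str) -> list[dict]:
    return [execute_step(step) for step in plan_steps(task, search_tools(task))]
```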
## Installation

```bash
pip install composio-tool-router-sim
```
## Quick Start

```bash
# Set your Composio API key
export COMPOSIO_API_KEY="your-api-key"

# Run a single task across all models
tool-router-sim run --task "List my 5 most recent unread emails"

# Dry run (no API calls - for testing)
tool-router-sim run --task "List my unread emails" --dry-run

# Run a benchmark suite
tool-router-sim benchmark --suite simple_tasks

# Interactive mode
tool-router-sim interactive
```
## Supported Models

### Via Vercel AI Gateway

| Model | Description |
|---|---|
| claude-sonnet-4 | Anthropic's balanced model with excellent tool use |
| claude-haiku-4.5 | Fast and cost-effective Claude model |
| gpt-4o | OpenAI's flagship multimodal model |
| gpt-4o-mini | Cost-effective OpenAI model |
| gemini-2.0-flash | Google's fast multimodal model |
### Via Groq (Fast Inference)

| Model | Description |
|---|---|
| llama-3.3-70b | Meta's large, versatile model |
| llama-3.1-8b | Fast, small Llama model |
| mixtral-8x7b | Mistral's mixture-of-experts model |
| gemma2-9b | Google's instruction-tuned Gemma model |
## CLI Commands

### Run a Single Task

```bash
# Test all models
tool-router-sim run --task "Send a Slack message to #general"

# Test specific models
tool-router-sim run --task "..." --models claude-sonnet-4,gpt-4o,llama-3.3-70b

# Export results
tool-router-sim run --task "..." --output results.json
```
### Run Benchmark Suites

```bash
# Built-in suites
tool-router-sim benchmark --suite simple_tasks
tool-router-sim benchmark --suite multi_step_tasks
tool-router-sim benchmark --suite edge_cases

# Custom suite from a JSON file
tool-router-sim benchmark --suite my_tasks.json

# Save results to a directory
tool-router-sim benchmark --suite simple_tasks --output-dir ./results
```
### Other Commands

```bash
# List available models with pricing
tool-router-sim list-models

# List built-in benchmark tasks
tool-router-sim list-tasks

# Interactive mode
tool-router-sim interactive
```
## Sample Output

```text
╔══════════════════════════════════════════════════════════════════════════════╗
║ COMPOSIO TOOL ROUTER - MODEL BENCHMARK RESULTS                               ║
║ Task: "List my 5 most recent unread emails"                                  ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ Model                    │ Tool Selection │ Execution │ Speed  │ Cost   │ Grade ║
╠══════════════════════════╪════════════════╪═══════════╪════════╪════════╪═══════╣
║ claude-sonnet-4          │ 100%           │ 100%      │ 1.8s   │ $0.004 │ A+    ║
║ gpt-4o                   │ 100%           │ 100%      │ 2.1s   │ $0.005 │ A+    ║
║ llama-3.3-70b            │ 95%            │ 90%       │ 0.4s   │ $0.001 │ A     ║
║ gpt-4o-mini              │ 90%            │ 85%       │ 1.2s   │ $0.001 │ B+    ║
║ llama-3.1-8b             │ 70%            │ 60%       │ 0.2s   │ $0.0002│ C     ║
╚══════════════════════════════════════════════════════════════════════════════╝

🏆 BEST TOOL SELECTION: claude-sonnet-4, gpt-4o (100%)
⚡ FASTEST: llama-3.1-8b (0.2s)
💰 CHEAPEST: llama-3.1-8b ($0.0002)
⭐ BEST OVERALL: claude-sonnet-4
```
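The cost column in output like this is typically derived from token counts and per-token pricing. A minimal sketch of that calculation, using illustrative prices (not the gateways' actual rates, which vary and change over time):

```python
# Illustrative per-million-token prices in USD. These are assumptions
# for the sketch, not real Vercel AI Gateway or Groq pricing.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "llama-3.1-8b": {"input": 0.05, "output": 0.08},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one run, given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```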
## Evaluation Metrics

### Tool Selection Score

- Found relevant tools via search
- Selected the correct tool for the task
- Avoided irrelevant tools
- Formulated search queries properly

### Execution Score

- Passed correct parameters
- Executed successfully
- Returned complete results
- Stayed efficient (minimal API calls)

### Planning Score

- Ordered steps logically (search → connect → execute)
- Handled prerequisites
- Took minimal steps
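One plausible way to combine scores like these into a letter grade is a weighted average with cutoffs. The weights and thresholds below are assumptions for illustration, not the tool's documented formula:

```python
def composite_grade(tool_selection: float, execution: float, planning: float) -> str:
    """Map three 0-1 scores to a letter grade via a weighted average.
    Weights and cutoffs are illustrative assumptions, not the tool's spec."""
    score = 0.4 * tool_selection + 0.4 * execution + 0.2 * planning
    for cutoff, grade in [(0.97, "A+"), (0.90, "A"), (0.85, "B+"),
                          (0.75, "B"), (0.60, "C")]:
        if score >= cutoff:
            return grade
    return "F"
```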
## Creating Custom Benchmark Tasks

Create a JSON file with your tasks:
```json
{
  "name": "My Custom Tasks",
  "description": "Custom benchmark suite",
  "tasks": [
    {
      "id": "custom_gmail",
      "task": "List my 5 most recent unread emails",
      "expected_tools": ["GMAIL_FETCH_EMAILS"],
      "expected_params": {
        "max_results": 5,
        "query": "is:unread"
      },
      "required_keywords": ["gmail", "email", "unread"],
      "difficulty": "easy"
    }
  ]
}
```
Then run:

```bash
tool-router-sim benchmark --suite my_tasks.json
```
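Before running a custom suite, it can help to sanity-check the file's shape. A small standalone validator, with field names taken from the example above (the required-field set is an assumption, not the tool's documented schema):

```python
# Minimal sanity check for a custom benchmark suite dict.
# Required fields are inferred from the example task; adjust as needed.
REQUIRED_TASK_FIELDS = {"id", "task", "expected_tools"}

def validate_suite(data: dict) -> list[str]:
    """Return a list of problems found in a suite dict; empty means OK."""
    problems = []
    if "tasks" not in data or not isinstance(data["tasks"], list):
        problems.append("suite must contain a 'tasks' list")
        return problems
    for i, task in enumerate(data["tasks"]):
        missing = REQUIRED_TASK_FIELDS - task.keys()
        if missing:
            problems.append(f"task {i} missing fields: {sorted(missing)}")
    return problems
```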
## Python API

```python
from tool_router_sim.composio_client import create_client
from tool_router_sim.simulator.vercel_runner import create_vercel_runner
from tool_router_sim.evaluator.scorer import CompositeScorer

# Create a client
client = create_client(dry_run=False)

# Create a runner for a specific model
runner = create_vercel_runner(client, "claude-sonnet-4")

# Run the simulation
result = runner.simulate("List my unread emails")

# Score the result
scorer = CompositeScorer(expected_tools=["GMAIL_FETCH_EMAILS"])
score = scorer.score(result)

print(f"Grade: {score.grade}")
print(f"Tool Selection: {score.tool_selection.score * 100}%")
print(f"Execution: {score.execution.score * 100}%")
```
## Environment Variables

| Variable | Description |
|---|---|
| COMPOSIO_API_KEY | Your Composio API key (required) |
## Development

```bash
# Clone the repo
git clone https://github.com/composio/tool-router-simulator
cd tool-router-simulator

# Install in dev mode
pip install -e ".[dev]"

# Run tests
pytest

# Format and lint
black tool_router_sim/
ruff check tool_router_sim/
```
## License

MIT License - see LICENSE for details.