Benchmark tool for testing LLMs with Composio's Tool Router

Composio Tool Router Simulator

Python 3.10+ | License: MIT

A benchmarking tool that measures how different LLMs perform when driving Composio's Tool Router (the system powering Rube). It compares models on accuracy, speed, cost, and tool selection quality.

What is Tool Router?

Tool Router is Composio's agentic system that:

  • Searches 10,000+ tools automatically based on natural language
  • Plans multi-step workflows
  • Executes tools with proper authentication
  • Powers the Rube MCP server

Installation

pip install composio-tool-router-sim

Quick Start

# Set your Composio API key
export COMPOSIO_API_KEY="your-api-key"

# Run a single task across all models
tool-router-sim run --task "List my 5 most recent unread emails"

# Dry run (no API calls - for testing)
tool-router-sim run --task "List my unread emails" --dry-run

# Run a benchmark suite
tool-router-sim benchmark --suite simple_tasks

# Interactive mode
tool-router-sim interactive

Supported Models

Via Vercel AI Gateway

Model             Description
claude-sonnet-4   Anthropic's balanced model with excellent tool use
claude-haiku-4.5  Fast and cost-effective Claude model
gpt-4o            OpenAI's flagship multimodal model
gpt-4o-mini       Cost-effective OpenAI model
gemini-2.0-flash  Google's fast multimodal model

Via Groq (Fast Inference)

Model          Description
llama-3.3-70b  Meta's large versatile model
llama-3.1-8b   Fast, small Llama model
mixtral-8x7b   Mistral's mixture of experts
gemma2-9b      Google's instruction-tuned Gemma

CLI Commands

Run a Single Task

# Test all models
tool-router-sim run --task "Send a Slack message to #general"

# Test specific models
tool-router-sim run --task "..." --models claude-sonnet-4,gpt-4o,llama-3.3-70b

# Export results
tool-router-sim run --task "..." --output results.json
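
To script the `run` command, for example to loop over several tasks, one option is to build the argument list in Python and pass it to `subprocess`. The sketch below uses only the flags shown above (`--task`, `--models`, `--output`); the helper names `build_run_command` and `run_task` are hypothetical and not part of the package.

```python
import shlex
import subprocess


def build_run_command(task, models=None, output=None):
    """Build the argv list for `tool-router-sim run` from the documented flags."""
    cmd = ["tool-router-sim", "run", "--task", task]
    if models:
        # The CLI examples pass models as one comma-separated value.
        cmd += ["--models", ",".join(models)]
    if output:
        cmd += ["--output", output]
    return cmd


def run_task(task, models=None, output=None):
    """Invoke the CLI; needs the package installed and COMPOSIO_API_KEY set."""
    return subprocess.run(build_run_command(task, models, output), check=True)


if __name__ == "__main__":
    cmd = build_run_command(
        "List my unread emails",
        models=["claude-sonnet-4", "gpt-4o"],
        output="results.json",
    )
    # Print the shell-quoted command instead of executing it.
    print(shlex.join(cmd))
```

Passing a list to `subprocess.run` avoids shell quoting problems when the task text contains quotes or `#` characters.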

Run Benchmark Suites

# Built-in suites
tool-router-sim benchmark --suite simple_tasks
tool-router-sim benchmark --suite multi_step_tasks
tool-router-sim benchmark --suite edge_cases

# Custom suite from JSON file
tool-router-sim benchmark --suite my_tasks.json

# Save results to directory
tool-router-sim benchmark --suite simple_tasks --output-dir ./results

Other Commands

# List available models with pricing
tool-router-sim list-models

# List built-in benchmark tasks
tool-router-sim list-tasks

# Interactive mode
tool-router-sim interactive

Sample Output

╔══════════════════════════════════════════════════════════════════════════════╗
║              COMPOSIO TOOL ROUTER - MODEL BENCHMARK RESULTS                  ║
║              Task: "List my 5 most recent unread emails"                     ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ Model                    │ Tool Selection │ Execution │ Speed  │ Cost   │ Grade ║
╠══════════════════════════╪════════════════╪═══════════╪════════╪════════╪═══════╣
║ claude-sonnet-4          │ 100%           │ 100%      │ 1.8s   │ $0.004 │ A+    ║
║ gpt-4o                   │ 100%           │ 100%      │ 2.1s   │ $0.005 │ A+    ║
║ llama-3.3-70b            │ 95%            │ 90%       │ 0.4s   │ $0.001 │ A     ║
║ gpt-4o-mini              │ 90%            │ 85%       │ 1.2s   │ $0.001 │ B+    ║
║ llama-3.1-8b             │ 70%            │ 60%       │ 0.2s   │ $0.0002│ C     ║
╚══════════════════════════════════════════════════════════════════════════════╝

🏆 BEST TOOL SELECTION: claude-sonnet-4, gpt-4o (100%)
⚡ FASTEST: llama-3.1-8b (0.2s)
💰 CHEAPEST: llama-3.1-8b ($0.0002)
⭐ BEST OVERALL: claude-sonnet-4
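
The first three summary lines can be derived directly from the table rows. The sketch below copies the numbers from the sample table above into plain dictionaries; the key names are hypothetical, and "BEST OVERALL" is left out because the weighting behind it is not documented here.

```python
# Rows copied from the sample table; the dictionary keys are illustrative names.
ROWS = [
    {"model": "claude-sonnet-4", "tool_selection": 100, "seconds": 1.8, "cost_usd": 0.004},
    {"model": "gpt-4o", "tool_selection": 100, "seconds": 2.1, "cost_usd": 0.005},
    {"model": "llama-3.3-70b", "tool_selection": 95, "seconds": 0.4, "cost_usd": 0.001},
    {"model": "gpt-4o-mini", "tool_selection": 90, "seconds": 1.2, "cost_usd": 0.001},
    {"model": "llama-3.1-8b", "tool_selection": 70, "seconds": 0.2, "cost_usd": 0.0002},
]


def summarize(rows):
    """Return the top tool selectors (ties kept), the fastest and the cheapest model."""
    best = max(r["tool_selection"] for r in rows)
    return {
        "best_tool_selection": [r["model"] for r in rows if r["tool_selection"] == best],
        "fastest": min(rows, key=lambda r: r["seconds"])["model"],
        "cheapest": min(rows, key=lambda r: r["cost_usd"])["model"],
    }


if __name__ == "__main__":
    for label, value in summarize(ROWS).items():
        print(f"{label}: {value}")
```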

Evaluation Metrics

Tool Selection Score

  • Found relevant tools via search
  • Selected the correct tool for the task
  • Avoided irrelevant tools
  • Formulated effective search queries
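
One way to turn "selected the correct tool" and "avoided irrelevant tools" into a single number is an F1 score over tool names. This is only an illustrative sketch: the package's actual scoring formula is not described here, and the function name `tool_selection_f1` is hypothetical.

```python
def tool_selection_f1(expected, selected):
    """F1 overlap between expected and selected tool names, in [0, 1]."""
    expected_set, selected_set = set(expected), set(selected)
    if not expected_set and not selected_set:
        return 1.0  # nothing expected, nothing selected
    hits = len(expected_set & selected_set)
    if hits == 0:
        return 0.0
    precision = hits / len(selected_set)  # penalizes irrelevant tools
    recall = hits / len(expected_set)     # penalizes missed tools
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(tool_selection_f1(["GMAIL_FETCH_EMAILS"], ["GMAIL_FETCH_EMAILS"]))
```

Under this measure, a model that picks the right tool plus one irrelevant tool scores lower than a model that picks only the right tool.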

Execution Score

  • Passed correct parameters
  • Executed successfully
  • Returned complete results
  • Ran efficiently (minimal API calls)

Planning Score

  • Followed a logical step order (search → connect → execute)
  • Handled prerequisites
  • Took minimal steps
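
The step-order criterion can be illustrated with a small check that a trace never moves backwards through the search → connect → execute phases. The phase labels and the function name are assumptions made for this sketch; the real evaluator may represent steps differently, and a legitimate multi-step task that searches again after executing would need a looser rule.

```python
# Hypothetical phase labels, ranked in the order given above.
PHASE_ORDER = {"search": 0, "connect": 1, "execute": 2}


def follows_phase_order(steps):
    """Return True if known phases appear in non-decreasing order."""
    ranks = [PHASE_ORDER[s] for s in steps if s in PHASE_ORDER]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


if __name__ == "__main__":
    print(follows_phase_order(["search", "connect", "execute"]))
    print(follows_phase_order(["execute", "search"]))
```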

Creating Custom Benchmark Tasks

Create a JSON file with your tasks:

{
  "name": "My Custom Tasks",
  "description": "Custom benchmark suite",
  "tasks": [
    {
      "id": "custom_gmail",
      "task": "List my 5 most recent unread emails",
      "expected_tools": ["GMAIL_FETCH_EMAILS"],
      "expected_params": {
        "max_results": 5,
        "query": "is:unread"
      },
      "required_keywords": ["gmail", "email", "unread"],
      "difficulty": "easy"
    }
  ]
}

Then run:

tool-router-sim benchmark --suite my_tasks.json
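
Before running a custom suite, it can help to check the file's structure. The sketch below validates the fields used in the example above. Which fields the tool actually requires is not documented here, so the `REQUIRED_TASK_FIELDS` tuple is an assumption, and `validate_suite` is a hypothetical helper rather than part of the package.

```python
import json

# Assumption: these three fields are treated as mandatory for every task.
REQUIRED_TASK_FIELDS = ("id", "task", "expected_tools")


def validate_suite(suite):
    """Return a list of human-readable problems; an empty list means none found."""
    tasks = suite.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        return ["'tasks' must be a non-empty list"]
    problems, seen_ids = [], set()
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            problems.append(f"task {i}: must be an object")
            continue
        for field in REQUIRED_TASK_FIELDS:
            if field not in task:
                problems.append(f"task {i}: missing '{field}'")
        task_id = task.get("id")
        if task_id is not None:
            if task_id in seen_ids:
                problems.append(f"task {i}: duplicate id '{task_id}'")
            seen_ids.add(task_id)
    return problems


if __name__ == "__main__":
    example = json.loads("""
    {
      "name": "My Custom Tasks",
      "tasks": [
        {"id": "custom_gmail",
         "task": "List my 5 most recent unread emails",
         "expected_tools": ["GMAIL_FETCH_EMAILS"]}
      ]
    }
    """)
    print(validate_suite(example) or "suite looks well-formed")
```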

Python API

from tool_router_sim.composio_client import create_client
from tool_router_sim.simulator.vercel_runner import create_vercel_runner
from tool_router_sim.evaluator.scorer import CompositeScorer

# Create client
client = create_client(dry_run=False)

# Create runner for a specific model
runner = create_vercel_runner(client, "claude-sonnet-4")

# Run simulation
result = runner.simulate("List my unread emails")

# Score the result
scorer = CompositeScorer(expected_tools=["GMAIL_FETCH_EMAILS"])
score = scorer.score(result)

print(f"Grade: {score.grade}")
# Scores are assumed to be fractions in [0, 1]; ".0%" prints them as whole percentages
print(f"Tool Selection: {score.tool_selection.score:.0%}")
print(f"Execution: {score.execution.score:.0%}")

Environment Variables

Variable          Description
COMPOSIO_API_KEY  Your Composio API key (required)
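
A script that wraps the tool can fail fast when the key is missing. This is a minimal sketch; `require_api_key` is a hypothetical helper, and whether dry-run mode also needs the key is not stated here.

```python
import os


def require_api_key(env=os.environ):
    """Return COMPOSIO_API_KEY from the environment, or raise a clear error."""
    key = env.get("COMPOSIO_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "COMPOSIO_API_KEY is not set; export it before running tool-router-sim"
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("API key found")
    except RuntimeError as err:
        print(err)
```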

Development

# Clone the repo
git clone https://github.com/composio/tool-router-simulator
cd tool-router-simulator

# Install in dev mode
pip install -e ".[dev]"

# Run tests
pytest

# Format and lint code
black tool_router_sim/
ruff check tool_router_sim/

License

MIT License - see LICENSE for details.
