MCP Server for LLM comparison, benchmarks, and pricing — find the best model for any task

Project description

LLM Benchmark MCP Server

An MCP server that gives AI agents access to LLM benchmark data, pricing comparisons, and model recommendations.

Features

  • compare_models — Side-by-side benchmark comparison of LLMs (MMLU, HumanEval, MATH, GPQA, ARC, HellaSwag)
  • get_model_details — Detailed info about a specific model including strengths/weaknesses
  • recommend_model — Get the best model recommendation for your task and budget
  • list_top_models — Top models ranked by category (coding, math, reasoning, chat)
  • get_pricing — Pricing comparison via OpenRouter API
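
These tools can be exercised from any MCP client, not just Claude Desktop. Below is a minimal sketch using the official mcp Python SDK over stdio; the compare_models argument names are an assumption for illustration, since the real input schema comes from list_tools():

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the installed server as a stdio subprocess.
    params = StdioServerParameters(command="benchmark-server")
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the five tools listed above.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Hypothetical arguments; consult the tool's input schema for the real ones.
            result = await session.call_tool(
                "compare_models",
                {"models": ["gpt-4o", "claude-3.5-sonnet"]},
            )
            print(result.content)

asyncio.run(main())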

Supported Models

GPT-4o, GPT-4o-mini, GPT-4 Turbo, o1, o3-mini, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, Gemini 2.0 Flash, Gemini 2.0 Pro, Gemini 1.5 Pro, Llama 3.1 (8B/70B/405B), Llama 3.3 70B, Mistral Large, Mistral Small, Mixtral 8x22B, DeepSeek V3, DeepSeek R1, Qwen 2.5 72B

Installation

pip install llm-benchmark-mcp-server
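
This installs the benchmark-server console script, which is the command referenced in the configurations below.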

Usage with Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "llm-benchmark": {
      "command": "benchmark-server"
    }
  }
}

Or via uvx (no install needed):

{
  "mcpServers": {
    "llm-benchmark": {
      "command": "uvx",
      "args": ["llm-benchmark-mcp-server"]
    }
  }
}
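
After editing the config, restart Claude Desktop so it picks up the new server. The file typically lives at ~/Library/Application Support/Claude/claude_desktop_config.json on macOS and %APPDATA%\Claude\claude_desktop_config.json on Windows.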

Example Queries

  • "Compare GPT-4o vs Claude 3.5 Sonnet vs Gemini 2.0 Pro"
  • "Which model is best for coding on a low budget?"
  • "Show me the top 10 models for math"
  • "What does GPT-4o cost compared to Claude?"
  • "Give me details about DeepSeek R1"

Data Sources

  • Benchmarks: static scores bundled with the package, compiled from official papers and public leaderboards (MMLU, HumanEval, MATH, GPQA, ARC-Challenge, HellaSwag)
  • Pricing: live data from the OpenRouter API (see the sketch after this list)
  • Arena Rankings: Chatbot Arena Leaderboard (when available)
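
The pricing lookup presumably boils down to a single call against OpenRouter's public model catalog. A rough sketch of that pattern using only the standard library (the endpoint is OpenRouter's documented /api/v1/models; how this package actually parses the response is an assumption):

import json
import urllib.request

# OpenRouter's public model catalog; no API key is required for this endpoint.
URL = "https://openrouter.ai/api/v1/models"

with urllib.request.urlopen(URL) as resp:
    catalog = json.load(resp)

for model in catalog["data"]:
    pricing = model.get("pricing", {})
    # Prices are strings in USD per token; scale to USD per million tokens.
    prompt = float(pricing.get("prompt", 0)) * 1_000_000
    completion = float(pricing.get("completion", 0)) * 1_000_000
    print(f"{model['id']}: ${prompt:.2f} in / ${completion:.2f} out per 1M tokens")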

License

MIT



Download files

Download the file for your platform.

Source Distribution

llm_benchmark_mcp_server-0.1.0.tar.gz (9.4 kB, source)

Built Distribution

llm_benchmark_mcp_server-0.1.0-py3-none-any.whl (11.7 kB, Python 3 wheel)

File details

Details for the file llm_benchmark_mcp_server-0.1.0.tar.gz.

File hashes

  • SHA256: ec988809043ad7eb093471ef5c11396fd05c07a08baa320c550f6f5d6a3a7155
  • MD5: 9adb12a90ccac25f117d6933be37ac03
  • BLAKE2b-256: 01be2cd9a39b0a0bf5d672e2c06e62017b602eb0b618b6a0fe2ba7518a0a50f5

File details

Details for the file llm_benchmark_mcp_server-0.1.0-py3-none-any.whl.

File hashes

  • SHA256: 3da7aa55b3c5249cd487b3ce05b705c5718a281cbc061fd48df4ac1a96f0be98
  • MD5: 04a19eda0955cb3c59a615e5b32bde52
  • BLAKE2b-256: bad3638a41393384487cf6dd23511b75bda33c24c3076e9ba3e0eccb2cb4fe51
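
To verify a downloaded file against the digests above, a few lines of standard-library Python are enough (the expected value is the published SHA256 for the sdist, copied from the table above):

import hashlib

# Published SHA256 for the sdist, copied from the hash table above.
EXPECTED = "ec988809043ad7eb093471ef5c11396fd05c07a08baa320c550f6f5d6a3a7155"

with open("llm_benchmark_mcp_server-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == EXPECTED else "MISMATCH")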
