
ModelScout Python SDK

Find the best LLM for your product. Run benchmarks across multiple models on your own data to see which performs best for quality, cost, and latency.

Installation

pip install modelscout-sdk

Quick Start

from modelscout import Benchmark

# Set MODELSCOUT_API_KEY in your environment, or pass api_key="ms_..."
# Models are selected at checkout and locked to your purchase
results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # from dashboard checkout
    prompts=["Write a SQL query to find active users", "Explain quantum computing"],
)

print(results.best_model_for("quality"))  # Best quality model
print(results.best_model_for("cost"))     # Cheapest model

Features

Benchmarking

Compare LLMs side-by-side on your evaluation data. Get quality scores, cost analysis, latency metrics, and statistical significance.
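
The Quick Start above already ranks models by quality and cost; latency is presumably ranked the same way. A minimal sketch, where the "latency" key is an assumption inferred from the metrics listed here rather than an accessor shown on this page:

from modelscout import Benchmark

results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # models fixed at checkout
    prompts=["Draft a friendly reply to a refund request"],
)

# "quality" and "cost" appear in the Quick Start; "latency" is an assumed key.
print(results.best_model_for("latency"))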

Data Generation

Need synthetic test data? Generate evaluation datasets from the dashboard — describe your use case and get representative prompts in minutes.

Dataset Upload

Upload your own evaluation data:

from modelscout import Benchmark

benchmark = Benchmark()  # reads MODELSCOUT_API_KEY from the environment

dataset_id = benchmark.upload_dataset(
    name="My Test Data",
    samples=[
        {"input": "What is machine learning?"},
        {"input": "Explain neural networks"},
    ],
)
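
Presumably the returned dataset_id can then stand in for inline prompts when running a benchmark. A minimal sketch: the dataset_id keyword below is an assumption inferred from the return value above, not a parameter documented on this page.

# Hypothetical follow-up, reusing benchmark and dataset_id from the upload
# example above. The dataset_id keyword is an assumption; check the SDK docs
# for the actual parameter name.
results = benchmark.run(
    purchased_benchmark_id="pb_...",
    dataset_id=dataset_id,
)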

Agentic Evaluation

Test tool-calling capabilities with multi-turn evaluation (SDK-only):

from modelscout import Benchmark, AgenticConfig, ToolDefinition

def my_search_function(query: str) -> str:
    # Stub tool: a real implementation would call a search API.
    return f"Results for: {query}"

config = AgenticConfig(tools=[
    ToolDefinition(
        name="search",
        description="Search the web",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        implementation=my_search_function,
    )
])

# Models are locked to your purchase — selected at checkout
results = Benchmark().run(
    name="Agent Eval",
    purchased_benchmark_id="pb_...",
    prompts=["Find information about quantum computing"],
    agentic_config=config,
)
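
Because AgenticConfig takes a list of tools, registering several follows the same pattern. A sketch adding a second stub tool alongside the search tool above (the tool body is a placeholder, not a real weather API):

def get_weather(city: str) -> str:
    # Stub: a real implementation would call a weather API.
    return f"Sunny and 22°C in {city}"

config = AgenticConfig(tools=[
    ToolDefinition(
        name="search",
        description="Search the web",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        implementation=my_search_function,
    ),
    ToolDefinition(
        name="get_weather",
        description="Get the current weather for a city",
        parameters={"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
        implementation=get_weather,
    ),
])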

Execution Graders

Pass a grader= to Benchmark.run() to add a deterministic pass/fail layer that runs on your machine against your ground truth. LLM judges score style well but can miss correctness on structured outputs like SQL, JSON, or code; graders fill that gap. The verdict is passed to the judge as primary evidence of correctness and is also surfaced independently as execution_pass_rate_by_model.

Pre-built graders: SQLGrader, JSONSchemaGrader, NumericGrader. Subclass Grader for anything else.

from modelscout import Benchmark
from modelscout.graders import SQLGrader

grader = SQLGrader(db_path="./warehouse.sqlite")

result = Benchmark(api_key="ms_...").run(
    name="SQL Benchmark",
    purchased_benchmark_id="pb_...",
    samples=[
        {
            "input": "Which products sold the most last quarter?",
            # expected_output is the canonical row set as JSON
            "expected_output": '[["Widget", 1200], ["Gadget", 900]]',
        },
    ],
    grader=grader,
)

print(result.execution_pass_rate_by_model)
# {'anthropic/claude-opus-4-7': 48.0, 'openai/gpt-5.4': 28.0}
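
As noted above, anything the pre-built graders don't cover can be handled by subclassing Grader. A minimal sketch, assuming the base class lives in modelscout.graders and exposes a grade(output, expected_output) -> bool hook; the real interface isn't shown on this page, so treat the import path, method name, and signature as assumptions and see the Graders guide for the actual contract:

from modelscout.graders import Grader  # assumed import path, matching SQLGrader above

class KeywordGrader(Grader):
    """Pass only if the model output mentions every required keyword."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    # Assumed hook name and signature; the actual Grader interface may differ.
    def grade(self, output: str, expected_output: str) -> bool:
        text = output.lower()
        return all(k in text for k in self.keywords)

Passed as grader=KeywordGrader(["index", "join"]) to Benchmark.run(), this would feed execution_pass_rate_by_model the same way SQLGrader does above.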

Full guide: Graders.

Supported Models

32 models across 10 providers:

OpenAI: gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5-mini, gpt-5-nano, gpt-oss-120b, gpt-oss-20b
Anthropic: claude-opus-4-7, claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001
Google: gemini-3.1-pro, gemini-3-flash, gemini-3.1-flash-lite, gemini-2.5-flash-lite, gemma-3-27b-it
DeepSeek: deepseek-v3.2, deepseek-v3.2-speciale, deepseek-r1
Qwen: qwen3.5-397b-a17b, qwen3.5-flash-02-23, qwen3-235b-a22b
Meta: llama-4-maverick, llama-4-scout
Mistral: mistral-large-2512, mistral-small-2603
xAI: grok-4, grok-4.1-fast
Zhipu: glm-5, glm-5-turbo, glm-5.1
Moonshot: kimi-k2.5

Pricing

Pay-as-you-go: purchase benchmarks from the dashboard. The price depends on the selected models, sample count, and judge tier; benchmarks start at about $4.50.

Launch Discount: 10% off all benchmarks during our launch period. Applied automatically at checkout.

Documentation

Full documentation: modelscout.co/docs/sdk


License

Proprietary. See LICENSE for details.
