
ModelScout Python SDK

Find the best LLM for your product. Run benchmarks across multiple models on your own data to see which performs best for quality, cost, and latency.

Installation

pip install modelscout-sdk

Quick Start

from modelscout import Benchmark

# Set MODELSCOUT_API_KEY in your environment, or pass api_key="ms_..."
# Models are selected at checkout and locked to your purchase
results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # from dashboard checkout
    prompts=["Write a SQL query to find active users", "Explain quantum computing"],
)

print(results.best_model_for("quality"))  # Best quality model
print(results.best_model_for("cost"))     # Cheapest model

Features

Benchmarking

Compare LLMs side-by-side on your evaluation data. Get quality scores, cost analysis, latency metrics, and statistical significance.
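
The Quick Start above already ranks models by quality and cost; latency is presumably ranked the same way. A minimal sketch, where the "latency" key is an assumption inferred from the metrics listed here rather than an accessor shown on this page:

from modelscout import Benchmark

results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # models fixed at checkout
    prompts=["Draft a friendly reply to a refund request"],
)

# "quality" and "cost" appear in the Quick Start; "latency" is an assumed key.
print(results.best_model_for("latency"))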

Data Generation

Need synthetic test data? Generate evaluation datasets from the dashboard — describe your use case and get representative prompts in minutes.

Dataset Upload

Upload your own evaluation data:

from modelscout import Benchmark

benchmark = Benchmark()  # reads MODELSCOUT_API_KEY from the environment

dataset_id = benchmark.upload_dataset(
    name="My Test Data",
    samples=[
        {"input": "What is machine learning?"},
        {"input": "Explain neural networks"},
    ],
)
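
Presumably the returned dataset_id can then stand in for inline prompts when running a benchmark. A minimal sketch: the dataset_id keyword below is an assumption inferred from the return value above, not a parameter documented on this page.

# Hypothetical follow-up, reusing benchmark and dataset_id from the upload
# example above. The dataset_id keyword is an assumption; check the SDK docs
# for the actual parameter name.
results = benchmark.run(
    purchased_benchmark_id="pb_...",
    dataset_id=dataset_id,
)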

Agentic Evaluation

Test tool-calling capabilities with multi-turn evaluation (SDK-only):

from modelscout import Benchmark, AgenticConfig, ToolDefinition

def my_search_function(query: str) -> str:
    # Stub tool: a real implementation would call a search API.
    return f"Results for: {query}"

config = AgenticConfig(tools=[
    ToolDefinition(
        name="search",
        description="Search the web",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        implementation=my_search_function,
    )
])

# Models are locked to your purchase — selected at checkout
results = Benchmark().run(
    name="Agent Eval",
    purchased_benchmark_id="pb_...",
    prompts=["Find information about quantum computing"],
    agentic_config=config,
)
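
Because AgenticConfig takes a list of tools, registering several follows the same pattern. A sketch adding a second stub tool alongside the search tool above (the tool body is a placeholder, not a real weather API):

def get_weather(city: str) -> str:
    # Stub: a real implementation would call a weather API.
    return f"Sunny and 22°C in {city}"

config = AgenticConfig(tools=[
    ToolDefinition(
        name="search",
        description="Search the web",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        implementation=my_search_function,
    ),
    ToolDefinition(
        name="get_weather",
        description="Get the current weather for a city",
        parameters={"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
        implementation=get_weather,
    ),
])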

Execution Graders

Pass a grader= to Benchmark.run() to add a deterministic pass/fail layer that runs on your machine against your ground truth. LLM judges score style well but can miss correctness on structured outputs like SQL, JSON, or code; graders fill that gap. The verdict is passed to the judge as primary evidence of correctness and is also surfaced independently as execution_pass_rate_by_model.

Pre-built graders: SQLGrader, JSONSchemaGrader, NumericGrader. Subclass Grader for anything else.

from modelscout import Benchmark
from modelscout.graders import SQLGrader

grader = SQLGrader(db_path="./warehouse.sqlite")

result = Benchmark(api_key="ms_...").run(
    name="SQL Benchmark",
    purchased_benchmark_id="pb_...",
    samples=[
        {
            "input": "Which products sold the most last quarter?",
            # expected_output is the canonical row set as JSON
            "expected_output": '[["Widget", 1200], ["Gadget", 900]]',
        },
    ],
    grader=grader,
)

print(result.execution_pass_rate_by_model)
# {'anthropic/claude-opus-4-7': 48.0, 'openai/gpt-5.4': 28.0}
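
As noted above, anything the pre-built graders don't cover can be handled by subclassing Grader. A minimal sketch, assuming the base class lives in modelscout.graders and exposes a grade(output, expected_output) -> bool hook; the real interface isn't shown on this page, so treat the import path, method name, and signature as assumptions and see the Graders guide for the actual contract:

from modelscout.graders import Grader  # assumed import path, matching SQLGrader above

class KeywordGrader(Grader):
    """Pass only if the model output mentions every required keyword."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    # Assumed hook name and signature; the actual Grader interface may differ.
    def grade(self, output: str, expected_output: str) -> bool:
        text = output.lower()
        return all(k in text for k in self.keywords)

Passed as grader=KeywordGrader(["index", "join"]) to Benchmark.run(), this would feed execution_pass_rate_by_model the same way SQLGrader does above.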

Full guide: Graders.

Supported Models

32 models across 10 providers:

OpenAI: gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5-mini, gpt-5-nano, gpt-oss-120b, gpt-oss-20b
Anthropic: claude-opus-4-7, claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001
Google: gemini-3.1-pro, gemini-3-flash, gemini-3.1-flash-lite, gemini-2.5-flash-lite, gemma-3-27b-it
DeepSeek: deepseek-v3.2, deepseek-v3.2-speciale, deepseek-r1
Qwen: qwen3.5-397b-a17b, qwen3.5-flash-02-23, qwen3-235b-a22b
Meta: llama-4-maverick, llama-4-scout
Mistral: mistral-large-2512, mistral-small-2603
xAI: grok-4, grok-4.1-fast
Zhipu: glm-5, glm-5-turbo, glm-5.1
Moonshot: kimi-k2.5

Pricing

Pay-as-you-go: purchase benchmarks from the dashboard. The price depends on the selected models, sample count, and judge tier; benchmarks start at about $4.50.

Launch Discount: 10% off all benchmarks during our launch period. Applied automatically at checkout.

Documentation

Full documentation: modelscout.co/docs/sdk


License

Proprietary. See LICENSE for details.
