ModelScout Python SDK
Find the best LLM for your product. Run benchmarks across multiple models on your own data to see which performs best for quality, cost, and latency.
Installation
pip install modelscout-sdk
Quick Start
from modelscout import Benchmark
# Set MODELSCOUT_API_KEY in your environment, or pass api_key="ms_..."
# Models are selected at checkout and locked to your purchase
results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # from dashboard checkout
    prompts=["Write a SQL query to find active users", "Explain quantum computing"],
)
print(results.best_model_for("quality")) # Best quality model
print(results.best_model_for("cost")) # Cheapest model
Features
Benchmarking
Compare LLMs side-by-side on your evaluation data. Get quality scores, cost analysis, latency metrics, and statistical significance.
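For instance, the dimensions named here map onto the best_model_for() accessor from Quick Start. A minimal sketch; "quality" and "cost" are confirmed by the Quick Start example, while "latency" as an accepted argument is an assumption based on the metrics listed above:
from modelscout import Benchmark

results = Benchmark().run(
    purchased_benchmark_id="pb_...",  # from dashboard checkout
    prompts=["Summarize this support ticket for an engineer"],
)

# Print the winning model on each dimension this section describes.
# "latency" is an ASSUMPTION; only "quality" and "cost" appear in
# this README's own examples.
for dimension in ("quality", "cost", "latency"):
    print(dimension, "->", results.best_model_for(dimension))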
Data Generation
Need synthetic test data? Generate evaluation datasets from the dashboard — describe your use case and get representative prompts in minutes.
Dataset Upload
Upload your own evaluation data:
from modelscout import Benchmark

benchmark = Benchmark()
dataset_id = benchmark.upload_dataset(
    name="My Test Data",
    samples=[
        {"input": "What is machine learning?"},
        {"input": "Explain neural networks"},
    ],
)
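Presumably the returned ID then points a run at the uploaded data. The dataset_id= parameter below is a guess, since this README only demonstrates inline prompts= and samples=:
results = benchmark.run(
    purchased_benchmark_id="pb_...",
    dataset_id=dataset_id,  # ASSUMPTION: parameter name not confirmed by this README
)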
Agentic Evaluation
Test tool-calling capabilities with multi-turn evaluation (SDK-only):
from modelscout import Benchmark, AgenticConfig, ToolDefinition
def my_search_function(query: str) -> str:
    return f"Results for: {query}"

config = AgenticConfig(tools=[
    ToolDefinition(
        name="search",
        description="Search the web",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        implementation=my_search_function,
    )
])
# Models are locked to your purchase — selected at checkout
results = Benchmark().run(
    name="Agent Eval",
    purchased_benchmark_id="pb_...",
    prompts=["Find information about quantum computing"],
    agentic_config=config,
)
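Agentic runs come back through the same results object, so the cross-model accessors from Quick Start apply unchanged:
print(results.best_model_for("quality"))  # best tool-calling model on quality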
Execution Graders
Pass a grader= to Benchmark.run() to add a deterministic pass/fail layer that runs on your machine, against your ground truth. LLM judges score style well but can miss correctness on structured outputs like SQL, JSON, or code; graders fill that gap. The verdict is passed to the judge as primary evidence of correctness and is also surfaced independently as execution_pass_rate_by_model.
Pre-built graders: SQLGrader, JSONSchemaGrader, NumericGrader. Subclass Grader for anything else (a custom-grader sketch follows the example below).
from modelscout import Benchmark
from modelscout.graders import SQLGrader
grader = SQLGrader(db_path="./warehouse.sqlite")
result = Benchmark(api_key="ms_...").run(
    name="SQL Benchmark",
    purchased_benchmark_id="pb_...",
    samples=[
        {
            "input": "Which products sold the most last quarter?",
            # expected_output is the canonical row set as JSON
            "expected_output": '[["Widget", 1200], ["Gadget", 900]]',
        },
    ],
    grader=grader,
)
print(result.execution_pass_rate_by_model)
# {'anthropic/claude-opus-4-7': 48.0, 'openai/gpt-5.4': 28.0}
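As noted above, anything beyond the pre-built graders can be handled by subclassing Grader. A minimal sketch of an exact-match grader; the hook name grade() and its signature are assumptions, since this README does not document the base-class interface (see the Graders guide for the real one):
from modelscout.graders import Grader

class ExactMatchGrader(Grader):
    # ASSUMPTION: the override point is a grade(output, expected_output)
    # -> bool method; the actual interface may differ. The boolean verdict
    # feeds the judge and execution_pass_rate_by_model as described above.
    def grade(self, output: str, expected_output: str) -> bool:
        return output.strip() == expected_output.strip()

Pass an instance via grader=ExactMatchGrader() exactly as with SQLGrader above.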
Full guide: Graders.
Supported Models
32 models across 10 providers:
| Provider | Models |
|---|---|
| OpenAI | gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5-mini, gpt-5-nano, gpt-oss-120b, gpt-oss-20b |
| Anthropic | claude-opus-4-7, claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001 |
| Google | gemini-3.1-pro, gemini-3-flash, gemini-3.1-flash-lite, gemini-2.5-flash-lite, gemma-3-27b-it |
| DeepSeek | deepseek-v3.2, deepseek-v3.2-speciale, deepseek-r1 |
| Qwen | qwen3.5-397b-a17b, qwen3.5-flash-02-23, qwen3-235b-a22b |
| Meta | llama-4-maverick, llama-4-scout |
| Mistral | mistral-large-2512, mistral-small-2603 |
| xAI | grok-4, grok-4.1-fast |
| Zhipu | glm-5, glm-5-turbo, glm-5.1 |
| Moonshot | kimi-k2.5 |
Pricing
Pay-as-you-go: Purchase benchmarks from the dashboard. Price depends on selected models, sample count, and judge tier. Starting from ~$4.50.
Launch Discount: 10% off all benchmarks during our launch period. Applied automatically at checkout.
Documentation
Full documentation: modelscout.co/docs/sdk
License
Proprietary. See LICENSE for details.
Download files
File details
Details for the file modelscout_sdk-0.2.0.tar.gz.
File metadata
- Download URL: modelscout_sdk-0.2.0.tar.gz
- Upload date:
- Size: 78.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | efc8b178bdadaf49fcef0c042f58e681c9db4e515f5d07709c6cb6b502258387 |
| MD5 | 879b4852961e84be53c72bc0929b8041 |
| BLAKE2b-256 | 17f8e5dfef7e521b4bc06c8867661301ce514f35c69b462cc1fd63837ecb7441 |
File details
Details for the file modelscout_sdk-0.2.0-py3-none-any.whl.
File metadata
- Download URL: modelscout_sdk-0.2.0-py3-none-any.whl
- Upload date:
- Size: 80.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c546a267267a0e51230ad1148db9087629aa5b8b8a54b2eb850eb808d36a8966 |
| MD5 | df86c73fdbb99ad2b316f40f9d591f7c |
| BLAKE2b-256 | 2f0d39b8a928ff890a50e5dd3ac9973a15bfa4d9822a9739e95d5924d71e130b |