Production-grade AI system evaluation and regression testing platform
ModelProbe
AI system evaluation and regression testing. Works locally with zero config and scales to a shared team server by changing one line.
How it works
flowchart TD
A[Test Cases] --> B[Runner — calls your model]
B --> C{Evaluators}
C --> C1[exact]
C --> C2[contains]
C --> C3[regex]
C --> C4[json_schema]
C --> C5[llm_judge]
C --> C6[hallucination]
C6 --> H1[Self-Consistency]
C6 --> H2[Wikidata Grounding]
C1 & C2 & C3 & C4 & C5 & H1 & H2 --> D[Pass / Fail + Score]
D --> E[(SQLite or API)]
E --> F[Dashboard]
Install
pip install modelprobe # SDK only, no server required
pip install "modelprobe[server]" # SDK + dashboard + REST API (quoted so shells such as zsh do not expand the brackets)
SDK — three lines to start tracing
from modelprobe import trace

@trace(suite="invoice-agent", version="v1")
def call_llm(prompt):
    # my_model is a placeholder for your own model call
    return my_model(prompt)
Every call writes a run record to ~/.modelprobe/modelprobe.db automatically.
Nested traces
Wrapping multiple functions with @trace produces a parent/child tree sharing one trace_id.
from modelprobe import trace

@trace(suite="invoice-agent", version="v2", tags={"feature": "invoice"})
def run_agent(query):
    result = call_llm(query)
    data = call_tool(result)
    return call_llm(data)

@trace(suite="invoice-agent", version="v2")
def call_llm(prompt):
    return my_model(prompt)

@trace(suite="invoice-agent", version="v2")
def call_tool(data):
    return my_tool(data)
All three runs share trace_id. call_llm and call_tool have parent_id pointing at run_agent.
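One way a parent/child tree like this can be built is with a context variable that remembers which run is currently executing. The sketch below is not ModelProbe's implementation; `traced`, `RUNS`, and `_current` are hypothetical names used only to illustrate how a shared trace_id and a parent_id could be propagated through nested calls.

```python
import contextvars
import functools
import uuid

# Hypothetical in-memory store standing in for the run database.
RUNS = []

# Holds (trace_id, run_id) of the run currently executing, if any.
_current = contextvars.ContextVar("current_run", default=None)


def traced(func):
    """Record a run for each call, linking nested calls to their caller."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current.get()
        run_id = str(uuid.uuid4())
        # Reuse the caller's trace_id, or start a new trace at the root.
        trace_id = parent[0] if parent else str(uuid.uuid4())
        RUNS.append({
            "id": run_id,
            "trace_id": trace_id,
            "parent_id": parent[1] if parent else None,
            "name": func.__name__,
        })
        token = _current.set((trace_id, run_id))
        try:
            return func(*args, **kwargs)
        finally:
            # Restore the caller's context even if func raises.
            _current.reset(token)
    return wrapper


@traced
def call_llm(prompt):
    return f"answer to {prompt}"


@traced
def run_agent(query):
    return call_llm(query)


run_agent("What is the invoice total?")
root, child = RUNS
```

Here `root` is the run_agent record with no parent, and `child` is the call_llm record whose parent_id equals `root["id"]`.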
Run a test suite
from modelprobe import run_suite

test_cases = [
    {
        "test_case_id": "tc_001",
        "input": "What is the invoice total?",
        "expected_output": "$500",
        "eval_type": "contains",
        "eval_config": {"values": ["$500"]},
    }
]

result = run_suite(
    suite_name="invoice-agent",
    version="v2",
    test_cases=test_cases,
    runner=lambda tc: my_model(tc["input"]),
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Passed: {result.passed} / Failed: {result.failed} / Errored: {result.errored}")
Inline assertion
from modelprobe import assert_eval
assert_eval("The total is $500", "contains", {"values": ["$500"]})
Raises AssertionError if the evaluation fails.
Team mode — remote server
import modelprobe
modelprobe.configure(server="http://modelprobe.internal:8000")
Or set the environment variable:
export MODELPROBE_SERVER=http://modelprobe.internal:8000
All SDK calls route to the remote server. No code changes required.
Server
pip install "modelprobe[server]"
modelprobe start --port 8000
Dashboard at http://localhost:8000. REST API at http://localhost:8000/api. OpenAPI docs at http://localhost:8000/api/docs.
CLI
modelprobe status # config and connection info
modelprobe run-suite my-agent --version v2 --file cases.json
modelprobe start --port 8000
modelprobe migrate
Evaluators
| Type | Description |
|---|---|
| exact | Exact string match. config: {"case_sensitive": true} |
| contains | Substring check. config: {"values": [...], "mode": "any"} (mode is "any" or "all") |
| regex | Regex match. config: {"pattern": "..."} |
| json_schema | JSON Schema validation. config: {"schema": {...}} |
| llm_judge | LLM-graded rubric. config: {"model": "...", "rubric": "..."} |
| hallucination | Detects hallucinations via self-consistency and Wikidata verification. See below. |
All evaluators return {passed, score, reason, status} where status is one of pass, fail, error, skipped.
llm_judge timeouts and errors produce status="skipped" — never status="fail".
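To make the result shape concrete, here is a minimal sketch of what a contains-style evaluator could look like. It is an illustration, not ModelProbe's code: `evaluate_contains` is a hypothetical name, and the default mode of "any", the score as the fraction of values found, and the "error" status for an empty config are all assumptions.

```python
def evaluate_contains(output, config):
    """Check whether output contains the configured substrings.

    Assumptions: mode "any" passes if at least one value is present,
    mode "all" requires every value, and score is the fraction found.
    """
    values = config.get("values", [])
    mode = config.get("mode", "any")
    if not values:
        return {"passed": False, "score": 0.0,
                "reason": "no values configured", "status": "error"}
    found = [v for v in values if v in output]
    missing = [v for v in values if v not in output]
    passed = not missing if mode == "all" else bool(found)
    reason = "all values found" if not missing else f"missing: {missing}"
    return {"passed": passed, "score": len(found) / len(values),
            "reason": reason, "status": "pass" if passed else "fail"}


any_result = evaluate_contains(
    "The total is $500", {"values": ["$500", "USD"], "mode": "any"}
)
all_result = evaluate_contains(
    "The total is $500", {"values": ["$500", "USD"], "mode": "all"}
)
```

The same output passes under "any" and fails under "all", because only one of the two values is present.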
Hallucination evaluator
Detects hallucinations without paid APIs using two strategies:
Self-consistency
Re-queries the same model multiple times with the same prompt and measures response stability. A model that knows the answer will produce it reliably; one that is guessing will vary. Based on Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (2022).
from modelprobe import assert_eval

assert_eval(
    output="Paris",
    eval_type="hallucination",
    config={
        "strategy": "consistency",
        "prompt": "What is the capital of France? Reply in one word.",
        "model": "llama3",
        "endpoint": "http://localhost:11434/api/generate",
        "samples": 5,
        "threshold": 0.5,
    },
)
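The scoring idea behind self-consistency can be sketched without a model at all. The example below assumes the score is the fraction of sampled responses that agree with the evaluated output after light normalization, and that the check passes when the score reaches the threshold; the actual evaluator may compare responses differently. `normalize` and `consistency_score` are hypothetical helpers.

```python
from collections import Counter


def normalize(text):
    """Lowercase and strip surrounding whitespace and end punctuation."""
    return text.strip().strip(".!?").lower()


def consistency_score(output, samples):
    """Fraction of sampled responses that agree with the evaluated output."""
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(normalize(s) for s in samples)
    return counts[normalize(output)] / len(samples)


# Five hypothetical re-queries of the same prompt.
samples = ["Paris", "paris.", "Paris", "Lyon", "Paris"]
score = consistency_score("Paris", samples)
passed = score >= 0.5  # threshold from the config above
```

Four of the five samples agree with "Paris", so the score is 0.8 and the check passes at a threshold of 0.5.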
Factual grounding (Wikidata)
Verifies factual claims in the output against the Wikidata knowledge graph via its REST API. Catches fabricated facts, wrong dates, and incorrect attributions.
result = assert_eval(
    output="The capital of France is Paris.",
    eval_type="hallucination",
    config={
        "strategy": "factual",
        "claims": [
            {"subject": "Q142", "property": "P36", "expected_label": "Paris"}
        ],
        "threshold": 0.5,
    },
)
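A rough sketch of how claims in this shape could be scored: count how many expected labels the knowledge graph confirms and compare the fraction to the threshold. This is an assumption about the scoring, not the evaluator's actual logic. `grounding_score` and `lookup_labels` are hypothetical, and the lookup is stubbed so the example runs offline (Q142 is France and P36 is the "capital" property on Wikidata).

```python
def grounding_score(claims, lookup_labels):
    """Return the fraction of claims whose expected label is confirmed.

    lookup_labels(subject, prop) is a hypothetical helper returning the
    labels of the values stored for that subject/property pair.
    """
    if not claims:
        raise ValueError("at least one claim is required")
    verified = 0
    for claim in claims:
        labels = lookup_labels(claim["subject"], claim["property"])
        expected = claim["expected_label"].lower()
        if expected in (label.lower() for label in labels):
            verified += 1
    return verified / len(claims)


def fake_lookup(subject, prop):
    """Stub standing in for a live Wikidata query."""
    known = {("Q142", "P36"): ["Paris"]}
    return known.get((subject, prop), [])


claims = [
    {"subject": "Q142", "property": "P36", "expected_label": "Paris"},
    {"subject": "Q142", "property": "P36", "expected_label": "Lyon"},
]
score = grounding_score(claims, fake_lookup)
```

One of the two claims is confirmed, giving a score of 0.5, which would just meet a threshold of 0.5.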
Benchmark results
Benchmarked 3 local models across 60 test cases (math, factual QA, instruction following, code generation, hallucination detection):
| Model | Overall | Math | Factual | Instruction | Code | Hallucination |
|---|---|---|---|---|---|---|
| gemma3:4b | 88% | 100% | 92% | 50% | 100% | 100% |
| codegemma:7b | 82% | 92% | 83% | 50% | 100% | 83% |
| llama3 (8B) | 78% | 67% | 83% | 50% | 100% | 92% |
Key findings:
- Hallucination evaluator flagged up to 17% of hallucination test cases as confabulated (codegemma:7b passed 83%)
- Self-consistency correctly flagged uncertain knowledge (population statistics, obscure trivia) while confirming stable recall on well-known facts
- All Wikidata factual claims verified successfully against the knowledge graph
- gemma3:4b (4B params) outperformed llama3 (8B) overall (88% vs. 78%), so the smaller model was not the weaker one here
Reproduce locally:
ollama pull gemma3:4b && ollama pull llama3 && ollama pull codegemma:7b
python benchmarks/run_benchmark.py
Configuration
Priority order (lowest to highest):
- Hardcoded defaults
- ~/.modelprobe/config.toml
- Environment variables
- modelprobe.configure(**kwargs) (highest priority)
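The layering above amounts to merging dictionaries in order, with later layers overriding earlier ones. The following is a sketch under assumptions, not the library's code: `resolve_config`, `DEFAULTS`, and `ENV_KEYS` are hypothetical, and the default of `None` for the server is assumed (only the default database path comes from this document).

```python
DEFAULTS = {"server": None, "db_path": "~/.modelprobe/modelprobe.db"}

# Maps environment variable names to config keys.
ENV_KEYS = {"MODELPROBE_SERVER": "server", "MODELPROBE_DB_PATH": "db_path"}


def resolve_config(file_values, environ, overrides):
    """Merge config layers; each later layer wins over the ones before it."""
    env_values = {
        key: environ[name] for name, key in ENV_KEYS.items() if name in environ
    }
    config = dict(DEFAULTS)
    for layer in (file_values, env_values, overrides):
        config.update(layer)
    return config


from_env = resolve_config(
    file_values={"server": "http://from-file:8000"},
    environ={"MODELPROBE_SERVER": "http://from-env:8000"},
    overrides={},
)
from_kwargs = resolve_config(
    file_values={"server": "http://from-file:8000"},
    environ={"MODELPROBE_SERVER": "http://from-env:8000"},
    overrides={"server": "http://from-kwargs:8000"},
)
```

In real use, `environ` would be `os.environ` and `file_values` would come from parsing the TOML file.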
Environment variables:
| Variable | Purpose |
|---|---|
| MODELPROBE_SERVER | Remote server URL |
| MODELPROBE_DB_PATH | Local SQLite path (default: ~/.modelprobe/modelprobe.db) |
| MODELPROBE_API_KEY | Auth token for remote server |
| MODELPROBE_LLM_ENDPOINT | LLM endpoint for llm_judge |
| MODELPROBE_LLM_API_KEY | API key for LLM endpoint |
Data model
{
  "id": "uuid",
  "trace_id": "uuid",
  "parent_id": "uuid | null",
  "suite": "invoice-agent",
  "version": "v2",
  "run_group": "experiment_1",
  "commit_hash": "abc123",
  "tags": {"env": "staging"},
  "input": "...",
  "output": "...",
  "status": "pass | fail | error | skipped",
  "latency_ms": 142.3,
  "token_count": 218,
  "timestamp": "2026-04-11T12:00:00Z",
  "steps": []
}
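Because each record carries id and parent_id, a flat list of runs from one trace can be folded back into a tree. The sketch below shows one way to do that; `build_tree` and the `children` key are hypothetical, and whether the server derives its step tree this way is an assumption.

```python
from collections import defaultdict


def build_tree(runs):
    """Nest flat run records under their parents using parent_id."""
    children = defaultdict(list)
    for run in runs:
        children[run["parent_id"]].append(run)

    def attach(run):
        # Copy the record and add its nested children recursively.
        return {**run, "children": [attach(c) for c in children[run["id"]]]}

    # Roots are the runs with no parent.
    return [attach(r) for r in children[None]]


runs = [
    {"id": "a", "parent_id": None, "suite": "invoice-agent"},
    {"id": "b", "parent_id": "a", "suite": "invoice-agent"},
    {"id": "c", "parent_id": "a", "suite": "invoice-agent"},
]
tree = build_tree(runs)
```

The result has a single root "a" whose children are "b" and "c", in the order they were recorded.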
REST API
| Method | Path | Description |
|---|---|---|
| POST | /api/runs | Submit a run |
| GET | /api/runs | List runs with filters |
| GET | /api/runs/{id} | Run detail with step tree |
| GET | /api/suites | List suites with pass rates |
| GET | /api/suites/{name} | Suite detail + version history |
| GET | /api/suites/{name}/compare?v1=x&v2=y | Per-test-case version diff |
| GET | /api/suites/{name}/regressions | Test cases that regressed |
| GET | /api/health | Server health + uptime |
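As an illustration of what a per-test-case comparison involves, the sketch below treats a regression as a test case that passed in the older version and fails or errors in the newer one. That definition, and the `find_regressions` helper itself, are assumptions for the example; the server's own rules may differ.

```python
def find_regressions(old_results, new_results):
    """Return IDs of test cases that passed before and now fail or error."""
    return sorted(
        tc_id
        for tc_id, old_status in old_results.items()
        if old_status == "pass" and new_results.get(tc_id) in ("fail", "error")
    )


# Hypothetical per-test-case statuses for two versions of a suite.
v1 = {"tc_001": "pass", "tc_002": "pass", "tc_003": "fail"}
v2 = {"tc_001": "pass", "tc_002": "fail", "tc_003": "fail"}
regressed = find_regressions(v1, v2)
```

Only tc_002 is reported: tc_003 was already failing in v1, so it is not counted as a regression.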
All responses follow the envelope:
{
  "data": {},
  "version": "0.1.0",
  "timestamp": "...",
  "request_id": "uuid"
}
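A client can validate and unwrap this envelope before using the payload. The helper below is a small sketch with a hypothetical name (`unwrap`); it assumes all four fields are always present, which matches the envelope shown but is not otherwise confirmed.

```python
import json

REQUIRED_FIELDS = {"data", "version", "timestamp", "request_id"}


def unwrap(body):
    """Parse an envelope and return (data, request_id)."""
    envelope = json.loads(body)
    missing = REQUIRED_FIELDS - envelope.keys()
    if missing:
        raise ValueError(f"envelope is missing fields: {sorted(missing)}")
    return envelope["data"], envelope["request_id"]


# Hypothetical response body in the envelope format.
body = json.dumps({
    "data": {"status": "ok"},
    "version": "0.1.0",
    "timestamp": "2026-04-11T12:00:00Z",
    "request_id": "req-1",
})
data, request_id = unwrap(body)
```

Keeping the request_id alongside the data makes it easier to correlate a client-side failure with server logs.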
Testing
Tests are organized into three tiers:
tests/
  unit/          # isolated component tests (evaluators, trace, suite, storage, CLI, config)
  regression/    # contract tests that lock down API shapes and behavior
  security/      # penetration tests (SQL injection, input validation, API safety)
# Run all tests
pytest
# Run by category
pytest tests/unit/ # fast, isolated
pytest tests/regression/ # contract stability
pytest tests/security/ # security / pen testing
Development setup
git clone https://github.com/KamalasankariS/ModelProbe
cd ModelProbe
pip install -e ".[server,dev]"
pytest
Dashboard (requires Node.js):
cd dashboard
npm install
npm run dev # dev server proxies /api to localhost:8000
npm run build # outputs to modelprobe/server/static/dist/