Production-grade AI system evaluation and regression testing platform

ModelProbe

AI system evaluation and regression testing. Works locally with zero config, scales to a shared team server by changing one line.

How it works

flowchart TD
    A[Test Cases] --> B[Runner — calls your model]
    B --> C{Evaluators}
    C --> C1[exact]
    C --> C2[contains]
    C --> C3[regex]
    C --> C4[json_schema]
    C --> C5[llm_judge]
    C --> C6[hallucination]
    C6 --> H1[Self-Consistency]
    C6 --> H2[Wikidata Grounding]
    C1 & C2 & C3 & C4 & C5 & H1 & H2 --> D[Pass / Fail + Score]
    D --> E[(SQLite or API)]
    E --> F[Dashboard]

Install

pip install modelprobe             # SDK only — no server required
pip install "modelprobe[server]"   # SDK + dashboard + REST API (quotes keep zsh from globbing the brackets)

SDK — three lines to start tracing

from modelprobe import trace

@trace(suite="invoice-agent", version="v1")
def call_llm(prompt):
    return my_model(prompt)

Every call writes a run record to ~/.modelprobe/modelprobe.db automatically.
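
To check what has been recorded, the local database can be opened with Python's built-in sqlite3 module. The sketch below only lists table names and row counts, because the table layout is not documented here. The helper name `summarize_db` is hypothetical and not part of the ModelProbe SDK.

```python
import sqlite3
from pathlib import Path


def summarize_db(db_path: Path) -> dict[str, int]:
    """Return {table_name: row_count} for every table in a SQLite file.

    Makes no assumption about ModelProbe's schema; it only reads
    SQLite's own catalog (sqlite_master).
    """
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names come straight from the catalog, so quoting them is enough here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
```

Calling `summarize_db(Path.home() / ".modelprobe" / "modelprobe.db")` after a few traced calls should show at least one table with a growing row count, if the default path is in use.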

Nested traces

Wrapping multiple functions with @trace produces a parent/child tree sharing one trace_id.

from modelprobe import trace

@trace(suite="invoice-agent", version="v2", tags={"feature": "invoice"})
def run_agent(query):
    result = call_llm(query)
    data = call_tool(result)
    return call_llm(data)

@trace(suite="invoice-agent", version="v2")
def call_llm(prompt):
    return my_model(prompt)

@trace(suite="invoice-agent", version="v2")
def call_tool(data):
    return my_tool(data)

All four runs (one run_agent, two call_llm, one call_tool) share a single trace_id. The call_llm and call_tool runs have parent_id pointing at run_agent.
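
One common way to get this parent/child linking is a context variable that remembers the run currently executing. The following is a minimal sketch of that technique, not ModelProbe's actual implementation; `sketch_trace`, `RECORDS`, and the record fields chosen are assumptions for illustration.

```python
import contextvars
import functools
import uuid

# Holds (trace_id, run_id) of the run currently executing, or None at top level.
_current = contextvars.ContextVar("current_run", default=None)
RECORDS = []  # stand-in for a real run store


def sketch_trace(func):
    """Hypothetical decorator that links nested calls via trace_id / parent_id."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current.get()
        run_id = str(uuid.uuid4())
        # Reuse the parent's trace_id, or start a new trace at the root.
        trace_id = parent[0] if parent else str(uuid.uuid4())
        parent_id = parent[1] if parent else None
        token = _current.set((trace_id, run_id))
        try:
            return func(*args, **kwargs)
        finally:
            _current.reset(token)  # restore the caller as the current run
            RECORDS.append(
                {
                    "id": run_id,
                    "trace_id": trace_id,
                    "parent_id": parent_id,
                    "name": func.__name__,
                }
            )

    return wrapper


@sketch_trace
def call_llm(prompt):
    return prompt.upper()  # placeholder for a model call


@sketch_trace
def run_agent(query):
    return call_llm(call_llm(query))


run_agent("hello")
```

After the call, `RECORDS` holds three entries with the same `trace_id`; the two `call_llm` entries carry the `run_agent` entry's `id` as their `parent_id`.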

Run a test suite

from modelprobe import run_suite

test_cases = [
    {
        "test_case_id": "tc_001",
        "input": "What is the invoice total?",
        "expected_output": "$500",
        "eval_type": "contains",
        "eval_config": {"values": ["$500"]},
    }
]

result = run_suite(
    suite_name="invoice-agent",
    version="v2",
    test_cases=test_cases,
    runner=lambda tc: my_model(tc["input"]),
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Passed: {result.passed} / Failed: {result.failed} / Errored: {result.errored}")
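
The result fields above suggest a simple tally over test cases. A rough sketch of such a loop follows, under the assumptions that a runner exception counts as "errored" and that errored cases stay in the pass-rate denominator; `SuiteSummary`, `run_cases`, and the `check` callback are hypothetical names, not the SDK's.

```python
from dataclasses import dataclass


@dataclass
class SuiteSummary:
    passed: int = 0
    failed: int = 0
    errored: int = 0

    @property
    def pass_rate(self) -> float:
        # Assumption: errored cases count against the pass rate.
        total = self.passed + self.failed + self.errored
        return self.passed / total if total else 0.0


def run_cases(test_cases, runner, check) -> SuiteSummary:
    """Call runner on each case and tally pass / fail / error."""
    summary = SuiteSummary()
    for tc in test_cases:
        try:
            output = runner(tc)
        except Exception:
            # A runner crash is tracked separately from a wrong answer.
            summary.errored += 1
            continue
        if check(output, tc):
            summary.passed += 1
        else:
            summary.failed += 1
    return summary
```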

Inline assertion

from modelprobe import assert_eval

assert_eval("The total is $500", "contains", {"values": ["$500"]})

Raises AssertionError if the evaluation fails.

Team mode — remote server

import modelprobe
modelprobe.configure(server="http://modelprobe.internal:8000")

Or set the environment variable:

export MODELPROBE_SERVER=http://modelprobe.internal:8000

All SDK calls route to the remote server. No code changes required.

Server

pip install "modelprobe[server]"
modelprobe start --port 8000

Dashboard at http://localhost:8000. REST API at http://localhost:8000/api. OpenAPI docs at http://localhost:8000/api/docs.

CLI

modelprobe status                                         # config and connection info
modelprobe run-suite my-agent --version v2 --file cases.json
modelprobe start --port 8000
modelprobe migrate

Evaluators

Type	Description
`exact`	Exact string match. `config: {"case_sensitive": true}`
`contains`	Substring check. `config: {"values": [...], "mode": "any|all"}`
`regex`	Regex match. `config: {"pattern": "..."}`
`json_schema`	JSON Schema validation. `config: {"schema": {...}}`
`llm_judge`	LLM-graded rubric. `config: {"model": "...", "rubric": "..."}`
`hallucination`	Detects hallucinations via self-consistency and Wikidata verification. See below.

All evaluators return {passed, score, reason, status} where status is one of pass, fail, error, skipped.
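
To make the result shape concrete, here is a minimal sketch of what a `contains`-style check could look like. It is an illustration only: the default mode of "any", the score definition (fraction of values found), and the function name `contains_eval` are assumptions rather than documented behavior.

```python
def contains_eval(output: str, config: dict) -> dict:
    """Substring check returning {passed, score, reason, status}."""
    values = config.get("values", [])
    mode = config.get("mode", "any")  # assumed default
    if not values or mode not in ("any", "all"):
        return {
            "passed": False,
            "score": 0.0,
            "reason": "invalid config: need non-empty values and mode any|all",
            "status": "error",
        }
    hits = [v for v in values if v in output]
    passed = len(hits) == len(values) if mode == "all" else bool(hits)
    return {
        "passed": passed,
        "score": len(hits) / len(values),  # assumed: fraction of values found
        "reason": f"matched {len(hits)} of {len(values)} values",
        "status": "pass" if passed else "fail",
    }
```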

llm_judge timeouts and errors produce status="skipped" — never status="fail".
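
The point of this rule is that an unavailable judge should not be counted as a model failure. A sketch of a wrapper with that behavior is shown below; `safe_judge`, the judge callable's signature, the 0.5 pass threshold, and the use of `None` for `passed` and `score` on skipped results are all assumptions.

```python
def safe_judge(judge, output: str, rubric: str, threshold: float = 0.5) -> dict:
    """Run a judge callable; infrastructure failures become status='skipped'."""
    try:
        score = float(judge(output, rubric))  # judge is assumed to return 0..1
    except Exception as exc:  # includes TimeoutError and connection errors
        return {
            "passed": None,
            "score": None,
            "reason": f"judge unavailable: {exc}",
            "status": "skipped",
        }
    passed = score >= threshold
    return {
        "passed": passed,
        "score": score,
        "reason": f"judge score {score:.2f} vs threshold {threshold:.2f}",
        "status": "pass" if passed else "fail",
    }
```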

Hallucination evaluator

Detects hallucinations without paid APIs using two strategies:

Self-consistency

Re-queries the same model multiple times with the same prompt and measures response stability. A model that knows the answer tends to produce it reliably; one that is guessing tends to vary. Based on Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (2022).

from modelprobe import assert_eval

assert_eval(
    output="Paris",
    eval_type="hallucination",
    config={
        "strategy": "consistency",
        "prompt": "What is the capital of France? Reply in one word.",
        "model": "llama3",
        "endpoint": "http://localhost:11434/api/generate",
        "samples": 5,
        "threshold": 0.5,
    },
)
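
The stability measurement described above can be sketched as an agreement ratio between the original output and the resampled answers. This is a simplified illustration that compares normalized strings; how ModelProbe actually compares samples is not specified here, and `normalize` and `consistency_score` are hypothetical names.

```python
def normalize(text: str) -> str:
    """Lowercase, trim surrounding punctuation, and collapse whitespace."""
    return " ".join(text.lower().strip(" .!?\n\t").split())


def consistency_score(output: str, samples: list[str]) -> float:
    """Fraction of resampled answers that agree with the original output."""
    if not samples:
        return 0.0
    target = normalize(output)
    agree = sum(1 for s in samples if normalize(s) == target)
    return agree / len(samples)


# Five resampled answers to the same prompt (made-up values for illustration).
samples = ["Paris", "paris.", "Lyon", "Paris", " Paris "]
score = consistency_score("Paris", samples)
print(score, score >= 0.5)
```

Here four of the five samples agree, so the score clears a 0.5 threshold. Exact string matching suits one-word answers like this one; free-form answers would need a looser similarity measure.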

Factual grounding (Wikidata)

Verifies factual claims in the output against the Wikidata knowledge graph via its REST API. Catches fabricated facts, wrong dates, and incorrect attributions.

from modelprobe import assert_eval

result = assert_eval(
    output="The capital of France is Paris.",
    eval_type="hallucination",
    config={
        "strategy": "factual",
        "claims": [
            {"subject": "Q142", "property": "P36", "expected_label": "Paris"}
        ],
        "threshold": 0.5,
    },
)
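
For readers unfamiliar with Wikidata identifiers: Q142 is France, P36 is the "capital" property, and Paris is item Q90. The sketch below shows how a claim like the one above could be checked against entity JSON in the layout served by `https://www.wikidata.org/wiki/Special:EntityData/Q142.json`. It works on an already-fetched document and a hand-supplied label table, so the network call is omitted; the endpoint ModelProbe uses and the function names here are assumptions.

```python
def claim_target_ids(entity_json: dict, subject: str, prop: str) -> list[str]:
    """Item IDs that `subject` points to via `prop` in Wikidata entity JSON."""
    claims = entity_json.get("entities", {}).get(subject, {}).get("claims", {})
    ids = []
    for statement in claims.get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" / "somevalue" snaks carry no datavalue
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def grounding_score(claims: list[dict], entity_json: dict, labels: dict[str, str]) -> float:
    """Fraction of claims whose expected_label matches a target item's label."""
    if not claims:
        return 0.0
    verified = 0
    for claim in claims:
        targets = claim_target_ids(entity_json, claim["subject"], claim["property"])
        found = {labels.get(t, "").casefold() for t in targets}
        if claim["expected_label"].casefold() in found:
            verified += 1
    return verified / len(claims)


# Trimmed-down fixture in the entity JSON layout (only the fields used above).
france = {
    "entities": {
        "Q142": {
            "claims": {
                "P36": [
                    {"mainsnak": {"snaktype": "value", "datavalue": {"value": {"id": "Q90"}}}}
                ]
            }
        }
    }
}
labels = {"Q90": "Paris"}  # would normally be resolved with a second lookup
claims = [{"subject": "Q142", "property": "P36", "expected_label": "Paris"}]
print(grounding_score(claims, france, labels))
```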

Benchmark results

Benchmarked 3 local models across 60 test cases (math, factual QA, instruction following, code generation, hallucination detection):

Model	Overall	Math	Factual	Instruction	Code	Hallucination
gemma3:4b	88%	100%	92%	50%	100%	100%
codegemma:7b	82%	92%	83%	50%	100%	83%
llama3 (8B)	78%	67%	83%	50%	100%	92%

Key findings:

The hallucination evaluator flagged confabulations in up to 17% of hallucination test cases (codegemma:7b, which passed 83% of that category)
Self-consistency correctly flagged uncertain knowledge (population statistics, obscure trivia) while confirming stable recall on well-known facts
All Wikidata factual claims verified successfully against the knowledge graph
gemma3:4b (4B params) outperformed llama3 (8B) overall — smaller does not mean worse
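
As a sanity check on the table, the Overall column can be recomputed from the per-category columns. The sketch assumes the 60 test cases are split evenly, 12 per category; that split is not stated above, but the category percentages (92% = 11/12, 83% = 10/12, 67% = 8/12) are consistent with it.

```python
CASES_PER_CATEGORY = 12  # assumption: 60 cases split evenly over 5 categories

# Math, Factual, Instruction, Code, Hallucination (percentages from the table)
category_pass_pct = {
    "gemma3:4b": [100, 92, 50, 100, 100],
    "codegemma:7b": [92, 83, 50, 100, 83],
    "llama3 (8B)": [67, 83, 50, 100, 92],
}


def overall_pct(pcts, per_category=CASES_PER_CATEGORY):
    """Recover passed counts per category, then the overall pass percentage."""
    passed = sum(round(p / 100 * per_category) for p in pcts)
    return round(100 * passed / (per_category * len(pcts)))


for model, pcts in category_pass_pct.items():
    print(model, overall_pct(pcts))
```

Under that assumption the recomputed values are 88, 82, and 78, matching the Overall column.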

Reproduce locally:

ollama pull gemma3:4b && ollama pull llama3 && ollama pull codegemma:7b
python benchmarks/run_benchmark.py

Configuration

Priority order (lowest to highest):

Hardcoded defaults
~/.modelprobe/config.toml
Environment variables
modelprobe.configure(**kwargs) — highest priority

Environment variables:

Variable	Purpose
`MODELPROBE_SERVER`	Remote server URL
`MODELPROBE_DB_PATH`	Local SQLite path (default: `~/.modelprobe/modelprobe.db`)
`MODELPROBE_API_KEY`	Auth token for remote server
`MODELPROBE_LLM_ENDPOINT`	LLM endpoint for `llm_judge`
`MODELPROBE_LLM_API_KEY`	API key for LLM endpoint
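
The priority order above amounts to layering four sources, each overriding the previous one. A minimal sketch of that merge follows. The setting keys (`server`, `db_path`, `api_key`) and the function `resolve_config` are assumptions for illustration; the file layer is passed in as an already-parsed dict (on Python 3.11+ the standard `tomllib` module could produce it from config.toml).

```python
import os

DEFAULTS = {
    "server": None,
    "db_path": "~/.modelprobe/modelprobe.db",
    "api_key": None,
}

# Setting name -> environment variable (variable names from the table above).
ENV_MAP = {
    "server": "MODELPROBE_SERVER",
    "db_path": "MODELPROBE_DB_PATH",
    "api_key": "MODELPROBE_API_KEY",
}


def resolve_config(file_settings=None, env=None, **overrides) -> dict:
    """Merge defaults < config file < environment < explicit keyword overrides."""
    env = os.environ if env is None else env
    settings = dict(DEFAULTS)
    settings.update(file_settings or {})
    for key, var in ENV_MAP.items():
        if var in env:
            settings[key] = env[var]
    # Explicit arguments win, mirroring modelprobe.configure(**kwargs).
    settings.update({k: v for k, v in overrides.items() if v is not None})
    return settings
```

For example, with `server` set in both the file and the environment, the environment value wins unless a keyword override is also given.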

Data model

{
  "id": "uuid",
  "trace_id": "uuid",
  "parent_id": "uuid | null",
  "suite": "invoice-agent",
  "version": "v2",
  "run_group": "experiment_1",
  "commit_hash": "abc123",
  "tags": {"env": "staging"},
  "input": "...",
  "output": "...",
  "status": "pass | fail | error | skipped",
  "latency_ms": 142.3,
  "token_count": 218,
  "timestamp": "2026-04-11T12:00:00Z",
  "steps": []
}
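
For code that builds or validates run records by hand, the same shape can be mirrored in a dataclass. This is a sketch only: which fields are required and which have defaults is an assumption, and `RunRecord` is a hypothetical name rather than a class exported by the SDK.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

STATUSES = {"pass", "fail", "error", "skipped"}


def _new_id() -> str:
    return str(uuid.uuid4())


def _utc_now() -> str:
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class RunRecord:
    suite: str
    version: str
    input: str
    output: str
    status: str
    latency_ms: float
    id: str = field(default_factory=_new_id)
    trace_id: str = field(default_factory=_new_id)
    parent_id: Optional[str] = None  # None for the root run of a trace
    run_group: Optional[str] = None
    commit_hash: Optional[str] = None
    tags: dict = field(default_factory=dict)
    token_count: Optional[int] = None
    timestamp: str = field(default_factory=_utc_now)
    steps: list = field(default_factory=list)

    def __post_init__(self):
        # Reject statuses outside the four documented values.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```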

REST API

Method	Path	Description
`POST`	`/api/runs`	Submit a run
`GET`	`/api/runs`	List runs with filters
`GET`	`/api/runs/{id}`	Run detail with step tree
`GET`	`/api/suites`	List suites with pass rates
`GET`	`/api/suites/{name}`	Suite detail + version history
`GET`	`/api/suites/{name}/compare?v1=x&v2=y`	Per-test-case version diff
`GET`	`/api/suites/{name}/regressions`	Test cases that regressed
`GET`	`/api/health`	Server health + uptime
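
How the regressions endpoint decides that a test case regressed is not spelled out here. One plausible reading, sketched below, is "passed in the older version and now fails or errors", with skipped results ignored so that judge outages are not reported as regressions. The function name and the input shape (test case ID mapped to status) are hypothetical.

```python
def find_regressions(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Test case IDs that passed in `old` but fail or error in `new`."""
    regressed = []
    for test_case_id, old_status in old.items():
        new_status = new.get(test_case_id)  # None if the case was removed
        if old_status == "pass" and new_status in ("fail", "error"):
            regressed.append(test_case_id)
    return sorted(regressed)


v1 = {"tc_001": "pass", "tc_002": "pass", "tc_003": "fail", "tc_004": "pass"}
v2 = {"tc_001": "pass", "tc_002": "fail", "tc_003": "fail", "tc_004": "skipped"}
print(find_regressions(v1, v2))
```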

All responses follow the envelope:

{
  "data": {},
  "version": "0.1.0",
  "timestamp": "...",
  "request_id": "uuid"
}
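
A client can rely on this envelope to separate payload from metadata. The helper below is a small sketch of that idea using only the four keys shown above; `unwrap` is a hypothetical name, and error responses may use a different shape that is not documented here.

```python
import json

REQUIRED_KEYS = {"data", "version", "timestamp", "request_id"}


def unwrap(body: str):
    """Parse a response body and return its `data` payload."""
    envelope = json.loads(body)
    if not isinstance(envelope, dict):
        raise ValueError("response is not a JSON object")
    missing = REQUIRED_KEYS - envelope.keys()
    if missing:
        raise ValueError(f"envelope is missing keys: {sorted(missing)}")
    return envelope["data"]
```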

Testing

Tests are organized into three tiers:

tests/
  unit/           # isolated component tests (evaluators, trace, suite, storage, CLI, config)
  regression/     # contract tests that lock down API shapes and behavior
  security/       # penetration tests (SQL injection, input validation, API safety)

# Run all tests
pytest

# Run by category
pytest tests/unit/          # fast, isolated
pytest tests/regression/    # contract stability
pytest tests/security/      # security / pen testing

Development setup

git clone https://github.com/KamalasankariS/ModelProbe
cd ModelProbe
pip install -e ".[server,dev]"
pytest

Dashboard (requires Node.js):

cd dashboard
npm install
npm run dev      # dev server proxies /api to localhost:8000
npm run build    # outputs to modelprobe/server/static/dist/

Project details

Maintainers

Kamalasankari

Release history

0.1.1 (Jun 15, 2026)
0.1.0 (Jun 15, 2026), this version

Download files

Source Distribution

modelprobe-0.1.0.tar.gz (37.6 kB)

Uploaded Jun 15, 2026 Source

Built Distribution

modelprobe-0.1.0-py3-none-any.whl (45.4 kB)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file modelprobe-0.1.0.tar.gz.

File metadata

Download URL: modelprobe-0.1.0.tar.gz
Upload date: Jun 15, 2026
Size: 37.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for modelprobe-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0bbbbb1cc5cf75f9f8d4bb4772970921c2e3ff5be3824104fa33098e579bc2e5`
MD5	`6a1f3f7094dfcd452632299101952e52`
BLAKE2b-256	`943bd1f213d39fddb804a695fefb3b0382e0e4be8014d3dc914dac9ebccc4dbb`

Provenance

The following attestation bundles were made for modelprobe-0.1.0.tar.gz:

Publisher: publish.yml on KamalasankariS/ModelProbe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: modelprobe-0.1.0.tar.gz
- Subject digest: 0bbbbb1cc5cf75f9f8d4bb4772970921c2e3ff5be3824104fa33098e579bc2e5
- Sigstore transparency entry: 1827685851
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: KamalasankariS/ModelProbe@6a0f8fb3ebe1af6d6b25fb7ebba1354f063a8a69
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/KamalasankariS
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6a0f8fb3ebe1af6d6b25fb7ebba1354f063a8a69
- Trigger Event: release

File details

Details for the file modelprobe-0.1.0-py3-none-any.whl.

File metadata

Download URL: modelprobe-0.1.0-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 45.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for modelprobe-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4dd9836c3e5182f5eb7e9f05e44be2abbd34c9932e04e8bb25fa5828b227bed8`
MD5	`5e4fd4eeba67ae2645a7eec6e9fa7ee7`
BLAKE2b-256	`11671633721cd3f98093ee5b479413ecb0dcf96ff3a25c44bcf002c1019576bb`

Provenance

The following attestation bundles were made for modelprobe-0.1.0-py3-none-any.whl:

Publisher: publish.yml on KamalasankariS/ModelProbe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: modelprobe-0.1.0-py3-none-any.whl
- Subject digest: 4dd9836c3e5182f5eb7e9f05e44be2abbd34c9932e04e8bb25fa5828b227bed8
- Sigstore transparency entry: 1827686004
- Sigstore integration time: Jun 15, 2026
Source repository:
- Permalink: KamalasankariS/ModelProbe@6a0f8fb3ebe1af6d6b25fb7ebba1354f063a8a69
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/KamalasankariS
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6a0f8fb3ebe1af6d6b25fb7ebba1354f063a8a69
- Trigger Event: release
