Production-grade AI system evaluation and regression testing platform

License: MIT. Requires Python 3.10+.

ModelProbe

AI system evaluation and regression testing. Works locally with zero config and scales to a shared team server by changing one line.


How it works

flowchart TD
    A[Test Cases] --> B[Runner — calls your model]
    B --> C{Evaluators}
    C --> C1[exact]
    C --> C2[contains]
    C --> C3[regex]
    C --> C4[json_schema]
    C --> C5[llm_judge]
    C --> C6[hallucination]
    C6 --> H1[Self-Consistency]
    C6 --> H2[Wikidata Grounding]
    C1 & C2 & C3 & C4 & C5 & H1 & H2 --> D[Pass / Fail + Score]
    D --> E[(SQLite or API)]
    E --> F[Dashboard]

Install

pip install modelprobe               # SDK only, no server required
pip install "modelprobe[server]"     # SDK + dashboard + REST API (quoted so zsh does not glob the brackets)

SDK — three lines to start tracing

from modelprobe import trace

@trace(suite="invoice-agent", version="v1")
def call_llm(prompt):
    return my_model(prompt)

Every call writes a run record to ~/.modelprobe/modelprobe.db automatically.


Nested traces

Wrapping multiple functions with @trace produces a parent/child tree sharing one trace_id.

from modelprobe import trace

@trace(suite="invoice-agent", version="v2", tags={"feature": "invoice"})
def run_agent(query):
    result = call_llm(query)
    data = call_tool(result)
    return call_llm(data)

@trace(suite="invoice-agent", version="v2")
def call_llm(prompt):
    return my_model(prompt)

@trace(suite="invoice-agent", version="v2")
def call_tool(data):
    return my_tool(data)

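A single run_agent call produces four runs (run_agent, call_llm twice, call_tool once), all sharing one trace_id. The call_llm and call_tool runs have parent_id pointing at the run_agent run.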
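
The parent/child linking can be illustrated with a small stand-alone decorator. This is a sketch of one way such propagation might work, using Python's contextvars; it is not ModelProbe's implementation, and sketch_trace, RUNS and _current_run are hypothetical names introduced only for this example.

```python
import contextvars
import functools
import uuid

# Hypothetical names throughout: this is not ModelProbe's internal code.
_current_run = contextvars.ContextVar("current_run", default=None)
RUNS = []  # in-memory stand-in for the run store


def sketch_trace(func):
    """Record one run per call and link it to the enclosing traced call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current_run.get()
        run = {
            "id": str(uuid.uuid4()),
            # Reuse the parent's trace_id, or start a new trace at the root.
            "trace_id": parent["trace_id"] if parent else str(uuid.uuid4()),
            "parent_id": parent["id"] if parent else None,
            "name": func.__name__,
        }
        RUNS.append(run)
        token = _current_run.set(run)
        try:
            return func(*args, **kwargs)
        finally:
            # Restore the previous run so sibling calls get the right parent.
            _current_run.reset(token)
    return wrapper


@sketch_trace
def call_llm(prompt):
    return prompt.upper()  # placeholder for a real model call


@sketch_trace
def run_agent(query):
    return call_llm(call_llm(query))
```

Calling run_agent("hi") appends three entries to RUNS: one root with parent_id None and two children whose parent_id is the root's id, all with the same trace_id.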


Run a test suite

from modelprobe import run_suite

test_cases = [
    {
        "test_case_id": "tc_001",
        "input": "What is the invoice total?",
        "expected_output": "$500",
        "eval_type": "contains",
        "eval_config": {"values": ["$500"]},
    }
]

result = run_suite(
    suite_name="invoice-agent",
    version="v2",
    test_cases=test_cases,
    runner=lambda tc: my_model(tc["input"]),
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Passed: {result.passed} / Failed: {result.failed} / Errored: {result.errored}")

Inline assertion

from modelprobe import assert_eval

assert_eval("The total is $500", "contains", {"values": ["$500"]})

Raises AssertionError if the evaluation fails.


Team mode — remote server

import modelprobe
modelprobe.configure(server="http://modelprobe.internal:8000")

Or set the environment variable:

export MODELPROBE_SERVER=http://modelprobe.internal:8000

All SDK calls then route to the remote server. With the environment variable set, no code changes are required.


Server

pip install "modelprobe[server]"
modelprobe start --port 8000

Dashboard at http://localhost:8000. REST API at http://localhost:8000/api. OpenAPI docs at http://localhost:8000/api/docs.


CLI

modelprobe status                                         # config and connection info
modelprobe run-suite my-agent --version v2 --file cases.json
modelprobe start --port 8000
modelprobe migrate

Evaluators

Type           Description
exact          Exact string match. config: {"case_sensitive": true}
contains       Substring check. config: {"values": [...], "mode": "any|all"}
regex          Regex match. config: {"pattern": "..."}
json_schema    JSON Schema validation. config: {"schema": {...}}
llm_judge      LLM-graded rubric. config: {"model": "...", "rubric": "..."}
hallucination  Detects hallucinations via self-consistency and Wikidata verification. See below.

All evaluators return {passed, score, reason, status} where status is one of pass, fail, error, skipped.
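
To make the result shape concrete, here is a minimal sketch of what a contains-style evaluator could look like. It is an illustration only, not the library's code: the default mode of "any", the score formula (fraction of values found) and the "error" result for an empty values list are all assumptions.

```python
def contains_eval(output, config):
    """Substring check returning the {passed, score, reason, status} shape."""
    values = config.get("values", [])
    mode = config.get("mode", "any")  # assumed default; the source does not state one
    if not values:
        return {"passed": False, "score": 0.0,
                "reason": "no values configured", "status": "error"}

    hits = [v for v in values if v in output]
    # "any" needs one match, "all" needs every value to be present.
    passed = bool(hits) if mode == "any" else len(hits) == len(values)
    return {
        "passed": passed,
        "score": len(hits) / len(values),  # assumed scoring rule
        "reason": f"matched {len(hits)} of {len(values)} values",
        "status": "pass" if passed else "fail",
    }
```

With mode "all", an output that contains only one of two values fails with a score of 0.5 under this scoring rule.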

llm_judge timeouts and errors produce status="skipped" — never status="fail".
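
The point of this rule is that an unavailable judge should not count against the model under test. A sketch of the idea, assuming a judge callable that returns a numeric score; judge_safely, timed_out_judge and the pass_score key are hypothetical and not part of the documented llm_judge config.

```python
def judge_safely(judge, output, config):
    """Run a judge callable; map any failure of the judge itself to 'skipped'."""
    try:
        score = float(judge(output, config))
    except Exception as exc:  # timeouts, connection errors, unparsable replies
        return {"passed": False, "score": 0.0,
                "reason": f"judge unavailable: {exc!r}", "status": "skipped"}

    passed = score >= config.get("pass_score", 0.5)  # hypothetical config key
    return {"passed": passed, "score": score,
            "reason": "graded by judge", "status": "pass" if passed else "fail"}


def timed_out_judge(output, config):
    """Stand-in for a judge endpoint that never answers."""
    raise TimeoutError("judge did not respond")
```

Here judge_safely(timed_out_judge, "text", {}) yields status "skipped", while a judge that returns a low score yields "fail".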


Hallucination evaluator

Detects hallucinations without paid APIs using two strategies:

Self-consistency

Re-queries the same model multiple times with the same prompt and measures response stability. A model that knows the answer will produce it reliably; one that is guessing will vary. Based on Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (2022).

from modelprobe import assert_eval

assert_eval(
    output="Paris",
    eval_type="hallucination",
    config={
        "strategy": "consistency",
        "prompt": "What is the capital of France? Reply in one word.",
        "model": "llama3",
        "endpoint": "http://localhost:11434/api/generate",
        "samples": 5,
        "threshold": 0.5,
    },
)
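
One simple way to turn "response stability" into a number is the fraction of re-sampled answers that agree with the output under test. The sketch below assumes exact match after light normalization and a pass condition of score >= threshold; the actual evaluator may compare answers differently. The function names are hypothetical, and sample_fn stands in for a call to the model endpoint.

```python
def normalize(text):
    """Lowercase, collapse whitespace and drop surrounding punctuation."""
    return " ".join(text.lower().split()).strip(" .!?")


def consistency_score(output, samples):
    """Fraction of sampled answers that agree with the output under test."""
    if not samples:
        raise ValueError("need at least one sample")
    target = normalize(output)
    agree = sum(1 for s in samples if normalize(s) == target)
    return agree / len(samples)


def consistency_eval(output, sample_fn, samples=5, threshold=0.5):
    """Re-query the model `samples` times via sample_fn() and score agreement."""
    answers = [sample_fn() for _ in range(samples)]
    score = consistency_score(output, answers)
    passed = score >= threshold  # assumed comparison
    return {"passed": passed, "score": score,
            "reason": f"{score:.0%} of {samples} samples agreed",
            "status": "pass" if passed else "fail"}
```

For example, if three of five samples normalize to "paris", the score is 0.6 and the check passes at a threshold of 0.5.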

Factual grounding (Wikidata)

Verifies factual claims in the output against the Wikidata knowledge graph via its REST API. Catches fabricated facts, wrong dates, and incorrect attributions.

result = assert_eval(
    output="The capital of France is Paris.",
    eval_type="hallucination",
    config={
        "strategy": "factual",
        "claims": [
            {"subject": "Q142", "property": "P36", "expected_label": "Paris"}
        ],
        "threshold": 0.5,
    },
)
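
In the claim above, Q142 is France and P36 is the "capital" property. The following sketch shows how such a claim could be checked against Wikidata's entity JSON. It is an assumption-laden illustration: it uses the public Special:EntityData JSON endpoint, which may not be the endpoint ModelProbe calls, and fetch_entity, claim_target_ids and verify_claim are hypothetical names.

```python
import json
import urllib.request

# Assumed endpoint for this sketch; ModelProbe may use a different one.
ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_entity(qid):
    """Download one entity document from Wikidata (needs network access)."""
    req = urllib.request.Request(
        ENTITY_URL.format(qid=qid),
        headers={"User-Agent": "modelprobe-readme-sketch/0.1"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["entities"][qid]


def claim_target_ids(entity, prop):
    """Return the item ids (e.g. 'Q90') that the entity's statements for prop point at."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def verify_claim(claim, fetch=fetch_entity, lang="en"):
    """True if any target of subject/property carries the expected label."""
    subject = fetch(claim["subject"])
    for target_id in claim_target_ids(subject, claim["property"]):
        label = fetch(target_id).get("labels", {}).get(lang, {}).get("value")
        if label == claim["expected_label"]:
            return True
    return False
```

Passing a custom fetch callable makes the check testable offline, for example with a dict of canned entity documents.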

Benchmark results

Benchmarked 3 local models across 60 test cases (math, factual QA, instruction following, code generation, hallucination detection):

Model          Overall  Math  Factual  Instruction  Code  Hallucination
gemma3:4b      88%      100%  92%      50%          100%  100%
codegemma:7b   82%      92%   83%      50%          100%  83%
llama3 (8B)    78%      67%   83%      50%          100%  92%

Key findings:

  • The hallucination evaluator measured confabulation rates of up to 17%, varying by model family
  • Self-consistency correctly flagged uncertain knowledge (population statistics, obscure trivia) while confirming stable recall on well-known facts
  • All Wikidata factual claims verified successfully against the knowledge graph
  • gemma3:4b (4B params) outperformed llama3 (8B) overall — smaller does not mean worse

Reproduce locally:

ollama pull gemma3:4b && ollama pull llama3 && ollama pull codegemma:7b
python benchmarks/run_benchmark.py

Configuration

Priority order (lowest to highest):

  1. Hardcoded defaults
  2. ~/.modelprobe/config.toml
  3. Environment variables
  4. modelprobe.configure(**kwargs) — highest priority
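
The layering above amounts to applying each source on top of the previous one, so later sources win. A minimal sketch, assuming flat key/value settings: the key names db_path and api_key are hypothetical (only server appears in the configure() example), and the TOML file is represented by an already-parsed dict.

```python
DEFAULTS = {"server": None, "db_path": "~/.modelprobe/modelprobe.db"}

# Environment variable -> setting name (setting names are assumptions).
ENV_KEYS = {
    "MODELPROBE_SERVER": "server",
    "MODELPROBE_DB_PATH": "db_path",
    "MODELPROBE_API_KEY": "api_key",
}


def resolve_config(file_values, environ, overrides):
    """Merge the four sources, lowest priority first."""
    config = dict(DEFAULTS)                      # 1. hardcoded defaults
    config.update(file_values)                   # 2. values parsed from config.toml
    for var, key in ENV_KEYS.items():            # 3. environment variables
        if var in environ:
            config[key] = environ[var]
    config.update(                               # 4. configure(**kwargs)
        {k: v for k, v in overrides.items() if v is not None}
    )
    return config
```

For instance, resolve_config({"server": "http://file:8000"}, {"MODELPROBE_SERVER": "http://env:8000"}, {}) resolves server to the environment value, and an explicit override would replace it again.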

Environment variables:

Variable                 Purpose
MODELPROBE_SERVER        Remote server URL
MODELPROBE_DB_PATH       Local SQLite path (default: ~/.modelprobe/modelprobe.db)
MODELPROBE_API_KEY       Auth token for remote server
MODELPROBE_LLM_ENDPOINT  LLM endpoint for llm_judge
MODELPROBE_LLM_API_KEY   API key for LLM endpoint

Data model

{
  "id": "uuid",
  "trace_id": "uuid",
  "parent_id": "uuid | null",
  "suite": "invoice-agent",
  "version": "v2",
  "run_group": "experiment_1",
  "commit_hash": "abc123",
  "tags": {"env": "staging"},
  "input": "...",
  "output": "...",
  "status": "pass | fail | error | skipped",
  "latency_ms": 142.3,
  "token_count": 218,
  "timestamp": "2026-04-11T12:00:00Z",
  "steps": []
}

REST API

Method  Path                                   Description
POST    /api/runs                              Submit a run
GET     /api/runs                              List runs with filters
GET     /api/runs/{id}                         Run detail with step tree
GET     /api/suites                            List suites with pass rates
GET     /api/suites/{name}                     Suite detail + version history
GET     /api/suites/{name}/compare?v1=x&v2=y   Per-test-case version diff
GET     /api/suites/{name}/regressions         Test cases that regressed
GET     /api/health                            Server health + uptime
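
As an illustration of what a per-test-case diff and a regression list might compute, the sketch below compares two versions' results keyed by test_case_id. The definition used here (passed in the baseline, then fail or error in the candidate) is an assumption; the server may define a regression differently, and find_regressions is a hypothetical helper.

```python
def find_regressions(baseline, candidate):
    """Return ids of test cases that passed in baseline but not in candidate.

    Both arguments map test_case_id -> status ("pass", "fail", "error", "skipped").
    Skipped candidate runs are not counted as regressions in this sketch.
    """
    return sorted(
        tc_id
        for tc_id, status in candidate.items()
        if baseline.get(tc_id) == "pass" and status in ("fail", "error")
    )
```

A test case that is new in the candidate version, or that already failed in the baseline, is not reported.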

All responses follow the envelope:

{
  "data": {},
  "version": "0.1.0",
  "timestamp": "...",
  "request_id": "uuid"
}
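
A client can validate the envelope once and hand only the payload to the rest of the code. A small sketch, assuming the four top-level keys shown above are always present; unwrap is a hypothetical helper, not part of the SDK.

```python
import json

REQUIRED_KEYS = ("data", "version", "timestamp", "request_id")


def unwrap(body):
    """Parse a response body and return its 'data' payload."""
    envelope = json.loads(body)
    missing = [key for key in REQUIRED_KEYS if key not in envelope]
    if missing:
        raise ValueError(f"malformed envelope, missing: {', '.join(missing)}")
    return envelope["data"]
```

Keeping request_id available in error messages or logs can help when correlating a client failure with server-side records.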

Testing

Tests are organized into three tiers:

tests/
  unit/           # isolated component tests (evaluators, trace, suite, storage, CLI, config)
  regression/     # contract tests that lock down API shapes and behavior
  security/       # penetration tests (SQL injection, input validation, API safety)

# Run all tests
pytest

# Run by category
pytest tests/unit/          # fast, isolated
pytest tests/regression/    # contract stability
pytest tests/security/      # security / pen testing

Development setup

git clone https://github.com/KamalasankariS/ModelProbe
cd ModelProbe
pip install -e ".[server,dev]"
pytest

Dashboard (requires Node.js):

cd dashboard
npm install
npm run dev      # dev server proxies /api to localhost:8000
npm run build    # outputs to modelprobe/server/static/dist/
