Production-grade AI system evaluation and regression testing platform

License: MIT. Requires Python 3.10+.

ModelProbe

AI system evaluation and regression testing. Works locally with zero config and scales to a shared team server by changing one line.


How it works

flowchart TD
    A[Test Cases] --> B[Runner — calls your model]
    B --> C{Evaluators}
    C --> C1[exact]
    C --> C2[contains]
    C --> C3[regex]
    C --> C4[json_schema]
    C --> C5[llm_judge]
    C --> C6[hallucination]
    C6 --> H1[Self-Consistency]
    C6 --> H2[Wikidata Grounding]
    C1 & C2 & C3 & C4 & C5 & H1 & H2 --> D[Pass / Fail + Score]
    D --> E[(SQLite or API)]
    E --> F[Dashboard]

Install

pip install modelprobe             # SDK only — no server required
pip install "modelprobe[server]"   # SDK + dashboard + REST API (quotes keep zsh from expanding the brackets)

SDK — three lines to start tracing

from modelprobe import trace

@trace(suite="invoice-agent", version="v1")
def call_llm(prompt):
    return my_model(prompt)

Every call writes a run record to ~/.modelprobe/modelprobe.db automatically.
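To make the mechanism concrete, here is a minimal sketch of a decorator that writes one row per call to SQLite. It is not ModelProbe's implementation: the name `sketch_trace`, the `runs` table layout, the in-memory database, and the choice to record `"pass"` for any call that returns without raising are all assumptions made for illustration.

```python
import functools
import json
import sqlite3
import time
import uuid

# Hypothetical in-memory store; the README says ModelProbe itself writes to
# ~/.modelprobe/modelprobe.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id TEXT, suite TEXT, version TEXT, "
    "input TEXT, output TEXT, status TEXT, latency_ms REAL)"
)


def sketch_trace(suite, version):
    """Illustrative decorator: records one row per call of the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output, status = None, "error"
            try:
                output = fn(*args, **kwargs)
                status = "pass"  # assumption: no exception means "pass"
                return output
            finally:
                # Runs on both success and failure, so every call is recorded.
                latency_ms = (time.perf_counter() - start) * 1000
                conn.execute(
                    "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (str(uuid.uuid4()), suite, version,
                     json.dumps({"args": args, "kwargs": kwargs}, default=str),
                     json.dumps(output, default=str), status, latency_ms),
                )
        return wrapper
    return decorator


@sketch_trace(suite="invoice-agent", version="v1")
def call_llm(prompt):
    return prompt.upper()  # stand-in for a real model call


call_llm("hello")
rows = conn.execute("SELECT suite, version, output, status FROM runs").fetchall()
```

Recording inside `finally` means a call that raises still leaves a row behind, which is usually what you want from a tracing layer.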


Nested traces

Wrapping multiple functions with @trace produces a parent/child tree sharing one trace_id.

from modelprobe import trace

@trace(suite="invoice-agent", version="v2", tags={"feature": "invoice"})
def run_agent(query):
    result = call_llm(query)
    data = call_tool(result)
    return call_llm(data)

@trace(suite="invoice-agent", version="v2")
def call_llm(prompt):
    return my_model(prompt)

@trace(suite="invoice-agent", version="v2")
def call_tool(data):
    return my_tool(data)

Every run produced by a single run_agent call shares one trace_id. The call_llm and call_tool runs have parent_id pointing at the run_agent run.
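One common way to get this parent/child linkage is a context variable that remembers the currently executing run. The sketch below shows the idea under stated assumptions: `sketch_trace`, the `records` list, and the record fields other than `trace_id` and `parent_id` are hypothetical, and ModelProbe may implement nesting differently.

```python
import contextvars
import functools
import uuid

# Holds (trace_id, run_id) of the traced call currently executing, if any.
_current = contextvars.ContextVar("current_run", default=None)
records = []  # hypothetical in-memory sink instead of a database


def sketch_trace(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current.get()
        run_id = str(uuid.uuid4())
        # A root call starts a new trace; a nested call inherits its parent's trace_id.
        trace_id = parent[0] if parent else str(uuid.uuid4())
        parent_id = parent[1] if parent else None
        token = _current.set((trace_id, run_id))
        try:
            return fn(*args, **kwargs)
        finally:
            _current.reset(token)  # restore the caller's context
            records.append({"name": fn.__name__, "id": run_id,
                            "trace_id": trace_id, "parent_id": parent_id})
    return wrapper


@sketch_trace
def call_llm(prompt):
    return f"answer({prompt})"


@sketch_trace
def call_tool(data):
    return f"tool({data})"


@sketch_trace
def run_agent(query):
    return call_llm(call_tool(call_llm(query)))


run_agent("total?")
root = next(r for r in records if r["name"] == "run_agent")
children = [r for r in records if r["name"] != "run_agent"]
```

In this sketch a single `run_agent` call records four runs: the root plus two `call_llm` children and one `call_tool` child, all with the same `trace_id`.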


Run a test suite

from modelprobe import run_suite

test_cases = [
    {
        "test_case_id": "tc_001",
        "input": "What is the invoice total?",
        "expected_output": "$500",
        "eval_type": "contains",
        "eval_config": {"values": ["$500"]},
    }
]

result = run_suite(
    suite_name="invoice-agent",
    version="v2",
    test_cases=test_cases,
    runner=lambda tc: my_model(tc["input"]),
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Passed: {result.passed} / Failed: {result.failed} / Errored: {result.errored}")
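For intuition about what the three counters mean, the following sketch runs a suite by hand. It is an assumption-laden illustration, not the library's code: `sketch_run_suite` and `SuiteResult` are hypothetical names, only the `contains` check is handled, and counting errored cases in the pass-rate denominator is a guess.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    passed: int = 0
    failed: int = 0
    errored: int = 0

    @property
    def pass_rate(self):
        # Assumption: errored cases count toward the denominator.
        total = self.passed + self.failed + self.errored
        return self.passed / total if total else 0.0


def sketch_run_suite(test_cases, runner):
    """Illustrative loop: run each case, then score it with a 'contains' check."""
    result = SuiteResult()
    for tc in test_cases:
        try:
            output = runner(tc)
        except Exception:
            result.errored += 1  # the runner crashed: tracked separately from a failure
            continue
        if any(v in output for v in tc["eval_config"]["values"]):
            result.passed += 1
        else:
            result.failed += 1
    return result


cases = [
    {"test_case_id": "tc_001", "input": "total?", "eval_config": {"values": ["$500"]}},
    {"test_case_id": "tc_002", "input": "tax?", "eval_config": {"values": ["$40"]}},
    {"test_case_id": "tc_003", "input": "boom", "eval_config": {"values": ["x"]}},
]


def fake_model(tc):
    if tc["input"] == "boom":
        raise RuntimeError("model unavailable")
    return "The total is $500"


summary = sketch_run_suite(cases, fake_model)
print(f"Pass rate: {summary.pass_rate:.1%}")
```

Separating "errored" from "failed" keeps infrastructure problems (a model endpoint being down) from being read as quality regressions.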

Inline assertion

from modelprobe import assert_eval

assert_eval("The total is $500", "contains", {"values": ["$500"]})

Raises AssertionError if the evaluation fails.


Team mode — remote server

import modelprobe
modelprobe.configure(server="http://modelprobe.internal:8000")

Or set the environment variable:

export MODELPROBE_SERVER=http://modelprobe.internal:8000

All SDK calls route to the remote server. No code changes required.


Server

pip install "modelprobe[server]"
modelprobe start --port 8000

Dashboard at http://localhost:8000. REST API at http://localhost:8000/api. OpenAPI docs at http://localhost:8000/api/docs.


CLI

modelprobe status                                         # config and connection info
modelprobe run-suite my-agent --version v2 --file cases.json
modelprobe start --port 8000
modelprobe migrate

Evaluators

Type            Description
exact           Exact string match. config: {"case_sensitive": true}
contains        Substring check. config: {"values": [...], "mode": "any|all"}
regex           Regex match. config: {"pattern": "..."}
json_schema     JSON Schema validation. config: {"schema": {...}}
llm_judge       LLM-graded rubric. config: {"model": "...", "rubric": "..."}
hallucination   Detects hallucinations via self-consistency and Wikidata verification. See below.

All evaluators return {passed, score, reason, status} where status is one of pass, fail, error, skipped.
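The result shape is easiest to see with the three string evaluators written out. This is a sketch, assuming a binary score, a default `mode` of `"any"`, and an `expected` argument for `exact`; the function names are hypothetical and the real evaluators may score differently.

```python
import re


def _result(passed, reason):
    # Same four keys the README lists; the 1.0/0.0 score is an assumption.
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "reason": reason, "status": "pass" if passed else "fail"}


def eval_exact(output, expected, config=None):
    case_sensitive = (config or {}).get("case_sensitive", True)
    a, b = (output, expected) if case_sensitive else (output.lower(), expected.lower())
    return _result(a == b, "exact match" if a == b else "strings differ")


def eval_contains(output, config):
    values = config["values"]
    mode = config.get("mode", "any")  # assumed default
    hits = [v for v in values if v in output]
    passed = len(hits) == len(values) if mode == "all" else bool(hits)
    return _result(passed, f"matched {len(hits)}/{len(values)} values (mode={mode})")


def eval_regex(output, config):
    match = re.search(config["pattern"], output)
    return _result(match is not None, "pattern matched" if match else "no match")


any_result = eval_contains("The total is $500", {"values": ["$500", "$600"]})
all_result = eval_contains("The total is $500", {"values": ["$500", "$600"], "mode": "all"})
```

Here `any_result` passes because one of the two values is present, while `all_result` fails because `"$600"` is missing.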

llm_judge timeouts and errors produce status="skipped" — never status="fail".
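The skip-instead-of-fail rule can be pictured as a thin wrapper around the judge call. The sketch below assumes a hypothetical `judge_fn(output, rubric)` that returns a score in [0, 1] and a hypothetical 0.5 pass mark; neither comes from the library.

```python
def judge_with_skip(judge_fn, output, rubric, pass_mark=0.5):
    """Call a judge; turn timeouts and other errors into 'skipped', never 'fail'."""
    try:
        score = judge_fn(output, rubric)
    except Exception as exc:  # TimeoutError, connection errors, bad responses, ...
        return {"passed": None, "score": None,
                "reason": f"judge unavailable: {exc}", "status": "skipped"}
    passed = score >= pass_mark
    return {"passed": passed, "score": score,
            "reason": f"judge score {score:.2f}",
            "status": "pass" if passed else "fail"}


def slow_judge(output, rubric):
    raise TimeoutError("judge timed out")


def good_judge(output, rubric):
    return 0.9


skipped = judge_with_skip(slow_judge, "The total is $500", "Is the total stated?")
graded = judge_with_skip(good_judge, "The total is $500", "Is the total stated?")
```

Treating judge outages as skipped keeps a flaky grading endpoint from showing up as a regression in the model under test.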


Hallucination evaluator

Detects hallucinations without paid APIs using two strategies:

Self-consistency

Re-queries the same model multiple times with the same prompt and measures response stability. A model that knows the answer will produce it reliably; one that is guessing will vary. Based on Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (2022).

from modelprobe import assert_eval

assert_eval(
    output="Paris",
    eval_type="hallucination",
    config={
        "strategy": "consistency",
        "prompt": "What is the capital of France? Reply in one word.",
        "model": "llama3",
        "endpoint": "http://localhost:11434/api/generate",
        "samples": 5,
        "threshold": 0.5,
    },
)
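The scoring step behind a consistency check can be sketched without a model at all. The version below assumes agreement is measured by normalized exact match against the output under test and that a score at or above `threshold` passes; how ModelProbe actually compares samples is not stated here, so treat `normalize` and `consistency_score` as hypothetical.

```python
def normalize(text):
    # Crude normalization (an assumption): trim, lowercase, drop trailing periods.
    return text.strip().lower().rstrip(".")


def consistency_score(output, samples):
    """Fraction of resampled answers that agree with the output under test."""
    target = normalize(output)
    agree = sum(1 for s in samples if normalize(s) == target)
    return agree / len(samples)


# Five hypothetical re-queries of the same prompt.
samples = ["Paris", "paris.", "Paris", "Lyon", "Paris"]
score = consistency_score("Paris", samples)
passed = score >= 0.5  # same threshold value as the config above
```

Four of the five samples agree with "Paris", so the score is 0.8 and the check passes at a threshold of 0.5. Exact match only suits short answers; free-form responses would need a softer similarity measure.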

Factual grounding (Wikidata)

Verifies factual claims in the output against the Wikidata knowledge graph via its REST API. Catches fabricated facts, wrong dates, and incorrect attributions.

from modelprobe import assert_eval

result = assert_eval(
    output="The capital of France is Paris.",
    eval_type="hallucination",
    config={
        "strategy": "factual",
        "claims": [
            {"subject": "Q142", "property": "P36", "expected_label": "Paris"}
        ],
        "threshold": 0.5,
    },
)
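In Wikidata, Q142 is the item for France and P36 is the "capital" property, so the claim above asks whether France's capital is labeled "Paris". The sketch below shows how a list of such claims might be scored. To stay runnable offline it takes a lookup function as a parameter and uses a stub; `check_claims`, `fake_lookup`, and the fraction-verified scoring rule are assumptions, not the library's API.

```python
def check_claims(claims, lookup_labels, threshold=0.5):
    """Score = fraction of claims whose expected label is among the looked-up values."""
    verified = 0
    for claim in claims:
        labels = lookup_labels(claim["subject"], claim["property"])
        if claim["expected_label"] in labels:
            verified += 1
    score = verified / len(claims) if claims else 0.0
    passed = score >= threshold
    return {"passed": passed, "score": score,
            "reason": f"{verified}/{len(claims)} claims verified",
            "status": "pass" if passed else "fail"}


def fake_lookup(subject, prop):
    # Stub standing in for a Wikidata query (Q142 = France, Q183 = Germany, P36 = capital).
    table = {("Q142", "P36"): ["Paris"], ("Q183", "P36"): ["Berlin"]}
    return table.get((subject, prop), [])


claims = [
    {"subject": "Q142", "property": "P36", "expected_label": "Paris"},
    {"subject": "Q183", "property": "P36", "expected_label": "Munich"},  # wrong on purpose
]
grounding = check_claims(claims, fake_lookup, threshold=0.5)
```

Passing the lookup in as an argument keeps the scoring logic testable without network access; a real lookup would resolve the property's value items to their labels.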

Benchmark results

Benchmarked 3 local models across 60 test cases (math, factual QA, instruction following, code generation, hallucination detection):

Model          Overall   Math   Factual   Instruction   Code   Hallucination
gemma3:4b      88%       100%   92%       50%           100%   100%
codegemma:7b   82%       92%    83%       50%           100%   83%
llama3 (8B)    78%       67%    83%       50%           100%   92%

Key findings:

  • The hallucination evaluator flagged confabulation in up to 17% of hallucination test cases (codegemma:7b), with lower rates for the other models
  • Self-consistency correctly flagged uncertain knowledge (population statistics, obscure trivia) while confirming stable recall on well-known facts
  • All Wikidata factual claims verified successfully against the knowledge graph
  • gemma3:4b (4B parameters) outperformed llama3 (8B) overall, 88% to 78%; a smaller model is not necessarily a worse one

Reproduce locally:

ollama pull gemma3:4b && ollama pull llama3 && ollama pull codegemma:7b
python benchmarks/run_benchmark.py

Configuration

Priority order (lowest to highest):

  1. Hardcoded defaults
  2. ~/.modelprobe/config.toml
  3. Environment variables
  4. modelprobe.configure(**kwargs) — highest priority
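The priority order amounts to merging four layers, with later layers overriding earlier ones. A minimal sketch, assuming hypothetical config key names (`db_path`, `api_key`) and a hypothetical `resolve_config` helper; only the `server` key and the environment variable names appear in this README:

```python
# Hypothetical mapping from environment variable to config key.
ENV_KEYS = {
    "MODELPROBE_SERVER": "server",
    "MODELPROBE_DB_PATH": "db_path",
    "MODELPROBE_API_KEY": "api_key",
}


def resolve_config(defaults, file_cfg, environ, overrides):
    """Merge four layers; each later layer wins over the ones before it."""
    env_cfg = {key: environ[var] for var, key in ENV_KEYS.items() if var in environ}
    merged = {}
    for layer in (defaults, file_cfg, env_cfg, overrides):
        merged.update(layer)
    return merged


layers = dict(
    defaults={"server": None, "db_path": "~/.modelprobe/modelprobe.db"},
    file_cfg={"server": "http://from-file:8000"},
    environ={"MODELPROBE_SERVER": "http://from-env:8000"},
)
from_env = resolve_config(overrides={}, **layers)
explicit = resolve_config(overrides={"server": "http://explicit:8000"}, **layers)
```

With no explicit override the environment variable beats the config file; once `configure`-style overrides are supplied, they beat everything else. Keys that no higher layer sets, such as `db_path` here, fall through from the defaults.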

Environment variables:

Variable                  Purpose
MODELPROBE_SERVER         Remote server URL
MODELPROBE_DB_PATH        Local SQLite path (default: ~/.modelprobe/modelprobe.db)
MODELPROBE_API_KEY        Auth token for remote server
MODELPROBE_LLM_ENDPOINT   LLM endpoint for llm_judge
MODELPROBE_LLM_API_KEY    API key for LLM endpoint

Data model

{
  "id": "uuid",
  "trace_id": "uuid",
  "parent_id": "uuid | null",
  "suite": "invoice-agent",
  "version": "v2",
  "run_group": "experiment_1",
  "commit_hash": "abc123",
  "tags": {"env": "staging"},
  "input": "...",
  "output": "...",
  "status": "pass | fail | error | skipped",
  "latency_ms": 142.3,
  "token_count": 218,
  "timestamp": "2026-04-11T12:00:00Z",
  "steps": []
}
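Because each record carries its own `parent_id`, a flat list of runs is enough to rebuild the call tree. The sketch below shows one way to do that; `build_tree` and the nested `children` output shape are hypothetical and unrelated to how the server populates `steps`.

```python
from collections import defaultdict


def build_tree(runs):
    """Group runs by parent_id, then nest each run's children under it."""
    children = defaultdict(list)
    for run in runs:
        children[run["parent_id"]].append(run)

    def attach(run):
        return {"id": run["id"],
                "children": [attach(child) for child in children[run["id"]]]}

    # Roots are the runs whose parent_id is None.
    return [attach(root) for root in children[None]]


flat_runs = [
    {"id": "a", "trace_id": "t1", "parent_id": None},
    {"id": "b", "trace_id": "t1", "parent_id": "a"},
    {"id": "c", "trace_id": "t1", "parent_id": "a"},
    {"id": "d", "trace_id": "t1", "parent_id": "b"},
]
tree = build_tree(flat_runs)
```

The result is a single root `a` with children `b` and `c`, where `b` in turn has the child `d`.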

REST API

Method   Path                                    Description
POST     /api/runs                               Submit a run
GET      /api/runs                               List runs with filters
GET      /api/runs/{id}                          Run detail with step tree
GET      /api/suites                             List suites with pass rates
GET      /api/suites/{name}                      Suite detail + version history
GET      /api/suites/{name}/compare?v1=x&v2=y    Per-test-case version diff
GET      /api/suites/{name}/regressions          Test cases that regressed
GET      /api/health                             Server health + uptime
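As a rough picture of what a per-test-case diff and a regression list compute, the sketch below compares two versions' statuses keyed by test case. The definition used (passed in the old version, present but not passing in the new one) and the names `diff_versions` and `find_regressions` are assumptions; the server's actual rules may differ.

```python
def diff_versions(old, new):
    """Pair up statuses for every test case seen in either version."""
    return {tc: (old.get(tc), new.get(tc)) for tc in sorted(set(old) | set(new))}


def find_regressions(old, new):
    """Test cases that passed before and are present but not passing now."""
    return [tc for tc, (before, after) in diff_versions(old, new).items()
            if before == "pass" and after is not None and after != "pass"]


v1 = {"tc_001": "pass", "tc_002": "pass", "tc_003": "fail"}
v2 = {"tc_001": "pass", "tc_002": "fail", "tc_003": "pass", "tc_004": "pass"}

diff = diff_versions(v1, v2)
regressions = find_regressions(v1, v2)
```

Only `tc_002` counts as a regression here: `tc_003` improved, and `tc_004` is new in the second version, so it has nothing to regress from.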

All responses follow the envelope:

{
  "data": {},
  "version": "0.1.0",
  "timestamp": "...",
  "request_id": "uuid"
}
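On the client side, a fixed envelope means every response can be unwrapped the same way. A small sketch, assuming a hypothetical `unwrap` helper and an invented sample payload:

```python
import json

ENVELOPE_KEYS = {"data", "version", "timestamp", "request_id"}


def unwrap(body):
    """Parse a response body, check the envelope fields, and return the payload."""
    envelope = json.loads(body)
    missing = ENVELOPE_KEYS - envelope.keys()
    if missing:
        raise ValueError(f"malformed envelope, missing: {sorted(missing)}")
    return envelope["data"]


# Invented example body; the field values are placeholders.
sample = json.dumps({
    "data": {"status": "ok"},
    "version": "0.1.0",
    "timestamp": "2026-04-11T12:00:00Z",
    "request_id": "00000000-0000-0000-0000-000000000000",
})
payload = unwrap(sample)
```

Logging `request_id` alongside any failure makes it easier to match a client-side error to the corresponding server log entry.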

Testing

Tests are organized into three tiers:

tests/
  unit/           # isolated component tests (evaluators, trace, suite, storage, CLI, config)
  regression/     # contract tests that lock down API shapes and behavior
  security/       # penetration tests (SQL injection, input validation, API safety)

# Run all tests
pytest

# Run by category
pytest tests/unit/          # fast, isolated
pytest tests/regression/    # contract stability
pytest tests/security/      # security / pen testing

Development setup

git clone https://github.com/KamalasankariS/ModelProbe
cd ModelProbe
pip install -e ".[server,dev]"
pytest

Dashboard (requires Node.js):

cd dashboard
npm install
npm run dev      # dev server proxies /api to localhost:8000
npm run build    # outputs to modelprobe/server/static/dist/
