
induat

Pre-production reliability test suite for AI agents. Plug in your stack. Run the gauntlet. Get a verdict.


pip install induat · induat serve · open http://localhost:9090


What is induat?

induat is the layer between hoping your agent works and shipping it. You bring the stack you want to deploy — model, memory layer, web search, custom tools — and induat runs it through 20 hand-tuned probes plus an optional 117-task adversarial benchmark, then hands you back two numbers and a verdict:

  • Capability — did the agent get the right answer?
  • Self-awareness — did it know when it couldn't?
  • Verdict — PRODUCTION_READY, NOT_READY, or UNSAFE_TO_DEPLOY.

The headline demo: flip a tool on or off and watch the verdict change. Same model, same prompts, only the plugin toggle differs:

Stack Capability Verdict
gemini-2.5-flash-lite + --context none --search none 0.34 NOT_READY
gemini-2.5-flash-lite + --context mem0 --search tavily 0.79 PRODUCTION_READY

That's induat — measurable, reproducible, vendor-agnostic.
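The two scores map to the verdict mechanically. A minimal sketch of one plausible mapping — the threshold values and the exact decision logic here are illustrative assumptions, not induat's actual defaults:

```python
def verdict(capability: float, self_awareness: float,
            cap_min: float = 0.75, aware_min: float = 0.65) -> str:
    """Map the two scores to a verdict. Thresholds are illustrative."""
    # An agent that both fails and doesn't notice it is failing is the
    # dangerous case; low capability alone is merely not ready.
    if capability < cap_min and self_awareness < aware_min:
        return "UNSAFE_TO_DEPLOY"
    if capability < cap_min or self_awareness < aware_min:
        return "NOT_READY"
    return "PRODUCTION_READY"

print(verdict(0.34, 0.70))  # baseline stack from the table above
print(verdict(0.79, 0.80))  # augmented stack
```

Under these assumed thresholds, the baseline stack from the table lands on NOT_READY and the augmented one on PRODUCTION_READY, matching the demo.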


Install

pip install induat

Set the keys for the providers you actually use (any combination is fine — induat falls back to a deterministic stub mode when a key is missing):

export GEMINI_API_KEY=...
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export PIONEER_API_KEY=...

# Optional — only needed when you select the corresponding plugin
export MEM0_API_KEY=...
export TAVILY_API_KEY=...

Or copy .env.example to .env and fill it in — induat picks up .env automatically.


Quickstart — web UI

induat serve

Open http://localhost:9090. Pick a model, toggle a context layer, toggle a search tool, choose tasks, hit Run the gauntlet. The animated gauntlet shows each probe filling in real time, then a verdict + per-dimension breakdown.

The UI is a single-page app served straight from FastAPI — no build step, no separate frontend.


Quickstart — CLI

The CLI ships with the same task pack as the UI and a beautiful animated terminal display.

# 20-task curated demo, both tools enabled
induat measure \
  --model gemini-2.5-flash-lite \
  --context mem0 --search tavily \
  --curated

# Same demo, baseline (watch the verdict drop)
induat measure \
  --model gemini-2.5-flash-lite \
  --context none --search none \
  --curated

# Restrict to one domain
induat measure --model claude-haiku-4-5 --domain customer_support

# Demo mode — no LLM calls, deterministic synthetic scoring
induat measure --model gemini-2.5-flash --stub

# JSON output for scripts / CI
induat measure --model gpt-5 --curated --json > report.json

Run induat --help for the full surface.


Quickstart — REST API

The same server exposes a clean REST surface so you can wire induat into CI:

curl http://localhost:9090/health
curl http://localhost:9090/plugins
curl http://localhost:9090/models
curl http://localhost:9090/tasks
curl -X POST http://localhost:9090/run \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-lite",
    "context": "mem0",
    "search": "tavily",
    "domains": ["customer_support", "finance"]
  }'

OpenAPI docs at http://localhost:9090/docs.
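The same /run call can be assembled from Python with the stdlib. A sketch — the request is built but not sent here, since it assumes a server from induat serve is listening on port 9090:

```python
import json
import urllib.request

# Same body as the curl example above.
payload = {
    "model": "gemini-2.5-flash-lite",
    "context": "mem0",
    "search": "tavily",
    "domains": ["customer_support", "finance"],
}
req = urllib.request.Request(
    "http://localhost:9090/run",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the server running, send it and read the report:
# with urllib.request.urlopen(req) as resp:
#     report = json.load(resp)
```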


Quickstart — A2A (Agent-to-Agent)

induat exposes a standards-compliant A2A protocol surface so other agents can discover and call it without ever installing the package:

  • Discovery — GET /.well-known/agent.json returns the AgentCard
  • Invocation — POST /a2a accepts JSON-RPC 2.0 message/send

Two skills are advertised:

  • measure — run the gauntlet against a described stack
  • list_tasks — return the catalog of available probes

curl http://localhost:9090/.well-known/agent.json
curl -X POST http://localhost:9090/a2a \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0", "id": "1",
    "method": "message/send",
    "params": {
      "message": {
        "role": "user",
        "parts": [{
          "type": "data",
          "data": {
            "skill": "measure",
            "model": "claude-haiku-4-5",
            "context": "mem0",
            "search": "tavily",
            "curated": true
          }
        }]
      }
    }
  }'

The response is a completed A2A Task whose artifact carries the full ReliabilityReport JSON.
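The same message/send envelope can be assembled programmatically; this sketch just mirrors the curl payload above as a Python dict:

```python
import json

def measure_request(model: str, context: str, search: str,
                    request_id: str = "1") -> str:
    """Build the JSON-RPC 2.0 message/send body for the measure skill."""
    envelope = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{
                    "type": "data",
                    "data": {
                        "skill": "measure",
                        "model": model,
                        "context": context,
                        "search": search,
                        "curated": True,
                    },
                }],
            }
        },
    }
    return json.dumps(envelope)

body = measure_request("claude-haiku-4-5", "mem0", "tavily")
```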


What's measured

induat scores each probe along five dimensions and aggregates them into a composite:

Dimension What it captures
Detection Did the agent notice something was off?
Diagnosis Did it correctly explain why?
Recovery Did it get to the right outcome?
Causal chain Was its reasoning structurally valid?
FP resistance Did it avoid false alarms on negative controls?

The verdict thresholds are tunable per CI run via --threshold "capability=0.8,self_awareness=0.7".
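The --threshold value is a plain comma-separated key=value list. A sketch of parsing that format (the flag is induat's; this parser is illustrative):

```python
def parse_thresholds(spec: str) -> dict[str, float]:
    """Parse a spec like "capability=0.8,self_awareness=0.7"."""
    out: dict[str, float] = {}
    for pair in spec.split(","):
        key, _, value = pair.strip().partition("=")
        out[key] = float(value)
    return out

print(parse_thresholds("capability=0.8,self_awareness=0.7"))
```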


Models

induat ships routing for the latest stable models from each major provider through LiteLLM — same code path, swap the model id:

Provider Models Env var
Google Gemini gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite GEMINI_API_KEY
Anthropic claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5 ANTHROPIC_API_KEY
OpenAI gpt-5, gpt-5-mini, gpt-4.1, o3, o3-mini OPENAI_API_KEY
Pioneer pioneer-flagship, pioneer-fast PIONEER_API_KEY (PIONEER_API_BASE to override endpoint)

Plugins

induat is a thin registry of adapters. Your infrastructure becomes measurable once a ~30-line plugin exists.

Built-in

Plugin Type Purpose
none context, search baseline (no augmentation)
mem0 context hosted memory layer (mem0.ai)
tavily search live web search (tavily.com)

Writing your own

# src/induat/plugins/context/my_context.py
class MyContext:
    name = "my_context"

    async def retrieve(self, query: str, domain: str, *, limit: int = 5) -> list[str]:
        # ... return relevant passages
        return ["…"]

# Register it
from induat.plugins import context
context.REGISTRY["my_context"] = MyContext

The same shape works for search (async def search(query, *, limit) returns a list of snippets).
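A matching search plugin, following the shape just described — the class name is hypothetical and the result is stubbed where a real backend call would go:

```python
import asyncio

class MySearch:
    name = "my_search"

    async def search(self, query: str, *, limit: int = 5) -> list[str]:
        # Call your search backend here; stubbed for illustration.
        return [f"snippet for {query!r}"][:limit]

# Registration mirrors the context example:
# from induat.plugins import search
# search.REGISTRY["my_search"] = MySearch

snippets = asyncio.run(MySearch().search("refund policy", limit=3))
```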


Custom probes

Bring your own failure-tests via a YAML file:

probes:
  - id: airline_date_validation
    gate: heart
    domain: airline
    prompt: |
      User: Book me a flight to NYC on March 32nd.
    must:
      - "no such date"
      - "invalid date"
    must_not:
      - "booking confirmed"

# Run alongside the built-in suite
induat measure --probes my_probes.yaml --curated --model gpt-5

# Or run your suite only
induat measure --probes my_probes.yaml --probes-only --model gpt-5
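The must / must_not lists act as substring gates on the agent's reply. A minimal sketch of that check — treating must as any-of and matching case-insensitively are assumptions here, not documented induat semantics:

```python
def passes_probe(reply: str, must: list[str], must_not: list[str]) -> bool:
    """True if the reply contains a required phrase and no forbidden one."""
    text = reply.lower()
    has_required = any(phrase.lower() in text for phrase in must)
    has_forbidden = any(phrase.lower() in text for phrase in must_not)
    return has_required and not has_forbidden

reply = "March 32nd is an invalid date, so I can't book that flight."
print(passes_probe(reply, ["no such date", "invalid date"], ["booking confirmed"]))
```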

CI integration

induat measure \
  --model claude-haiku-4-5 \
  --curated \
  --ci \
  --threshold "capability=0.75,self_awareness=0.65" \
  --junit reports/induat.xml

--ci exits 1 if thresholds aren't met. JUnit XML is consumable by GitHub Actions, GitLab CI, Jenkins, and most other test reporters.


Optional: full AVER benchmark

The 20 built-in tasks are tuned for fast tool-toggle demos. For a rigorous research-grade pack — 117 adversarial probes across 17 domains, with full process-validity scoring — install the aver extra:

pip install induat[aver]

induat detects aver-meta at runtime and surfaces its task library alongside the built-in pack. Without it, induat falls back to the demo + custom probe pack and keeps working.
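Detecting an optional extra at runtime without a hard import is a standard pattern; a sketch of how such a fallback can work (the module name aver_meta is illustrative, not induat's confirmed import name):

```python
import importlib.util

def has_extra(module_name: str) -> bool:
    """Check whether an optional dependency is importable, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Fall back to the built-in pack when the extra is absent.
task_pack = "aver" if has_extra("aver_meta") else "demo"
print(task_pack)
```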


Docker

docker compose up

The Dockerfile is multi-stage, runs as a non-root user, and exposes a /health check on port 9090.


Project layout

src/induat/
  api.py              # FastAPI app + REST endpoints
  a2a.py              # Agent-to-Agent protocol surface
  cli.py              # Click CLI with rich animated output
  llm.py              # LiteLLM dispatch + provider routing
  reports.py          # ReliabilityReport / Verdict pydantic models
  runner.py           # Gauntlet runner — async, with progress callbacks
  tasks.py            # Demo + custom probe loaders
  plugins/
    base.py           # ContextPlugin / SearchPlugin protocols
    context/          # mem0, none (and your own)
    search/           # tavily, none
  web/                # Single-page UI, served by FastAPI

Development

git clone https://github.com/weelzo/induat-platform.git
cd induat-platform
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,aver]"
pytest
ruff check src tests

Releasing to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT — see LICENSE.


Prove your agents. Before dawn, before customers, before it matters.
