
InferGuard


Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d.

What is this?

InferGuard is an OSS CLI and MCP server for validating inference benchmark evidence, profiling OpenAI-compatible endpoints, collecting engine/GPU timelines, and turning completed runs into refusal-gated operator reports. It is built for engineers running production-like vLLM, SGLang, Dynamo, LMCache, and llm-d stacks on GPU fleets where incomplete evidence is worse than no evidence. InferGuard does not promise every model fits every GPU. It tells the operator what fits, what fails, why it fails, and what hardware/config to use next.

Quick start (60 seconds)

pip install inferguard

# Generate a local synthetic GPU bundle for smoke testing.
inferguard simulate-gpu --results-root /tmp/inferguard-smoke --hardware b200 --engine vllm

# Validate a completed run. Synthetic smoke tests intentionally do not pass --strict.
inferguard validate-completed --results-root /tmp/inferguard-smoke || true

# Profile per-request latency against an OpenAI-compatible endpoint.
cat >/tmp/inferguard-requests.jsonl <<'JSONL'
{"request_id":"doc-001","messages":[{"role":"user","content":"Reply with one short sentence about InferGuard."}],"max_tokens":24}
JSONL

inferguard request-profile \
  --output-dir /tmp/inferguard-profile \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --input-jsonl /tmp/inferguard-requests.jsonl \
  --concurrency 1 \
  --stream

# Diagnose a completed job directory once request, launch, metrics, and validation artifacts exist.
inferguard diagnose-bottleneck --job-dir /path/to/results/jobs/<job-id>

From a source checkout, replace inferguard with PYTHONPATH=src python3 -m inferguard.cli.
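The request JSONL rows above can also be generated programmatically. A minimal sketch: the row schema mirrors the heredoc example, but `write_requests` itself is a hypothetical helper, not part of the InferGuard CLI.

```python
import json
import os
import tempfile

def write_requests(path, prompts, max_tokens=24):
    """Write one request-profile input row per prompt, one JSON object per line."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts, start=1):
            row = {
                "request_id": f"doc-{i:03d}",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            }
            f.write(json.dumps(row) + "\n")

path = os.path.join(tempfile.gettempdir(), "inferguard-requests.jsonl")
write_requests(path, ["Reply with one short sentence about InferGuard."])
```

Pass the resulting file to request-profile via --input-jsonl, as in the quick start.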

Why InferGuard?

NeoCloud and platform engineers need honest evidence for DSv4-class serving stacks on H100, H200, B200, B300, GB200, and GB300. Most benchmark wrappers are happy to emit a report even when the request rows are empty, the healthcheck failed, DCGM was missing, or the model never actually fit in HBM.

InferGuard's bias is the opposite:

  • refuse or downgrade when required artifacts are missing;
  • separate synthetic smoke tests from live evidence;
  • keep network behavior limited to endpoints you pass explicitly;
  • preserve request, engine, GPU, launch, failure, cost, and cliff artifacts in structured schemas;
  • make every recommendation trace back to claim_status, claim_reason, and file-level evidence.
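In practice, the last bullet means a downstream consumer can refuse to quote any number that is not backed by measured evidence. A hypothetical sketch, assuming per-metric dicts carrying claim_status and claim_reason fields (the layout here is illustrative, not the published schema):

```python
def quotable(metric: dict) -> bool:
    """Only quote values backed by live evidence; surface the reason otherwise."""
    status = metric.get("claim_status", "not_proven")
    if status == "measured":
        return True
    print(f"refusing to quote: claim_status={status}, "
          f"claim_reason={metric.get('claim_reason', 'unknown')}")
    return False

# Example: a TTFT value downgraded because part of the evidence was missing.
ttft = {"value_ms": 41.2, "claim_status": "inferred",
        "claim_reason": "engine metrics timeline missing"}
quotable(ttft)
```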

Commands

Command What it does
validate-completed Publishability gate; classifies a run as synthetic_only, live_complete, live_incomplete, missing_required_artifacts, or not_publishable.
request-profile Per-request truth: TTFT, TPOT, E2E latency, tokens, HTTP status, errors, and per-field claim status.
collect-metrics Normalized engine /metrics plus DCGM GPU timelines for live evidence.
launch-engine Launch or externally validate vLLM, SGLang, LMCache, or Dynamo-SGLang and capture command/healthcheck artifacts.
diagnose-bottleneck Classify one completed job as prefill, decode, KV, queue, network, host, launch, or not-enough-evidence.
classify-failures Turn logs and artifacts into ranked operator-actionable failure classes.
report-completed Produce refusal-gated operator recommendations from completed validation evidence.
find-cliffs Detect capacity cliffs across completed sweeps.
compute-cost Compute cost per useful task and safe concurrency envelopes.
agentx-ingest / ingest-agentx Convert AgentX result CSVs into canonical InferGuard artifacts.
simulate-gpu Generate synthetic GPU/Slurm artifacts for local bundle smoke tests.
serve-mimic Run a tiny fake OpenAI-compatible endpoint for local demos.
preflight Run read-only launch compatibility and tokenizer mismatch checks before paid traffic.
analyze Analyze existing InferGuard, InferenceX, AgentX, or eval result directories.
bench ... Replay traces, run KVCast/KV stress, compare runs, and run upstream-compatible benchmark modes.
disagg status Scrape prefill/decode/transfer Prometheus endpoints and emit disaggregated-serving findings.
profile live / profile retro Observe existing /metrics traffic or inspect saved live-profile artifacts.
agent trace Capture local agent-trace/v1 DAG events for supported agent frameworks.
daemon ... Local harness sidecar and multi-node leader/follower fan-in.
telemetry ... Local-only telemetry consent and payload audit commands; telemetry is disabled by default.
workload analyze Pre-flight workload fingerprinting for routing and reporting.
router classify Rule-based execution-path routing from workload fingerprints.
emit-bundle Emit a deployment bundle from a router verdict.

See CLI reference for full --help output for every command and subcommand.

Hardware coverage

InferGuard ships with the DSv4 6-SKU capability matrix: H100, H200, B200, B300, GB200, and GB300 × DSv4 Flash/Pro × vLLM/SGLang × long-context chat/coding = 48 cells. Each cell is honestly classified:

  • WORKING_TEMPLATE (28 cells)
  • INFEASIBLE_DOCUMENTED (4 cells: H100 × DSv4-Pro single-node)
  • FUTURE_EXTERNAL (16 cells: GB200/GB300, awaiting rack-level external access)
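The cell counts above follow directly from the cross product; a quick arithmetic check (the classification rules are restated from the bullets, the labels and axes are from this section):

```python
from itertools import product

SKUS = ["H100", "H200", "B200", "B300", "GB200", "GB300"]
MODELS = ["DSv4-Flash", "DSv4-Pro"]
ENGINES = ["vllm", "sglang"]
WORKLOADS = ["long-context-chat", "coding"]

def classify(sku, model, engine, workload):
    if sku in ("GB200", "GB300"):
        return "FUTURE_EXTERNAL"        # awaiting rack-level external access
    if sku == "H100" and model == "DSv4-Pro":
        return "INFEASIBLE_DOCUMENTED"  # Pro does not fit single-node H100
    return "WORKING_TEMPLATE"

cells = [classify(*c) for c in product(SKUS, MODELS, ENGINES, WORKLOADS)]
counts = {s: cells.count(s) for s in set(cells)}
# 6 x 2 x 2 x 2 = 48 cells: 28 working, 4 infeasible, 16 future-external
```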

See hardware coverage for the full matrix and status definitions.

Documentation

Claim status discipline

InferGuard never lies about what it measured. Every publishable artifact uses the canonical claim_status enum:

Value Meaning
synthetic No real GPU evidence; dry-run or synthetic mimic only.
inferred Indirect evidence; read claim_reason or claim_caveat before quoting.
measured Live evidence with the required artifact set.
not_proven Claim could not be verified.

live_complete requires five gates:

  1. non-empty request-profile rows;
  2. at least one successful request;
  3. launch healthcheck with status code 200 or an equivalent success status;
  4. non-empty engine metrics timeline with recognized live engine metrics;
  5. non-empty GPU metrics timeline with required DCGM signals.

If any gate is missing, InferGuard downgrades the claim instead of filling the gap with guesses.
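The five gates amount to an all-or-downgrade check. A sketch under stated assumptions: the evidence keys are hypothetical, and all downgrades are collapsed into live_incomplete, whereas validate-completed distinguishes several downgrade classes.

```python
def classify_run(evidence: dict) -> str:
    """Return live_complete only when all five gates pass; downgrade otherwise."""
    rows = evidence.get("request_rows", [])
    gates = [
        len(rows) > 0,                               # 1. non-empty request rows
        any(r.get("ok") for r in rows),              # 2. at least one success
        evidence.get("healthcheck_status") == 200,   # 3. healthy launch (simplified)
        len(evidence.get("engine_metrics", [])) > 0, # 4. live engine timeline
        len(evidence.get("gpu_metrics", [])) > 0,    # 5. DCGM GPU timeline
    ]
    return "live_complete" if all(gates) else "live_incomplete"
```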

Privacy and network behavior

InferGuard has zero telemetry by default. CLI network calls happen only to endpoints passed with flags such as --endpoint, --engine-metrics-url, --dcgm-metrics-url, --prefill, or --decode. Telemetry commands are local audit/consent tooling; hard overrides such as INFERGUARD_TELEMETRY=disabled and DO_NOT_TRACK=1 are honored.
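The hard overrides can be honored with an environment check along these lines. Only the two variable names and values come from the text; the helper and the "enabled" opt-in value are assumptions.

```python
import os

def telemetry_enabled(env=os.environ) -> bool:
    """Telemetry is off by default and hard-disabled by either override."""
    if env.get("INFERGUARD_TELEMETRY") == "disabled":
        return False
    if env.get("DO_NOT_TRACK") == "1":
        return False
    # Default is off; only an explicit local opt-in enables it.
    return env.get("INFERGUARD_TELEMETRY") == "enabled"
```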

License

Apache-2.0. See LICENSE.

Citation

If you use InferGuard in academic work, please cite:

@software{inferguard2026,
  author = {Chen, William},
  title = {InferGuard: Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d},
  year = {2026},
  url = {https://github.com/OCWC22/inferguard},
  version = {0.7.1}
}

See CITATION.cff.

Download files

Download the file for your platform.

Source Distribution

inferguard-0.7.1.tar.gz (749.6 kB)


Built Distribution


inferguard-0.7.1-py3-none-any.whl (339.4 kB)


File details

Details for the file inferguard-0.7.1.tar.gz.

File metadata

  • Download URL: inferguard-0.7.1.tar.gz
  • Upload date:
  • Size: 749.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for inferguard-0.7.1.tar.gz

  • SHA256: c09bfc4b562ce34100fbb0e9b7c8cf86584865135af9c5e4ccf0f933ba2b1d83
  • MD5: 29d0fb6efdd7deee24185da28b1768a4
  • BLAKE2b-256: 267c034e58b62383a60360375a0767a0ad3883261929493a32a3653c40b25de8


File details

Details for the file inferguard-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: inferguard-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 339.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for inferguard-0.7.1-py3-none-any.whl

  • SHA256: 2a486a4a6a164990e37aba688703a8eb7cf77821643ce3b25c075ed88d1ff313
  • MD5: 700329f7b020efb23d96131efc267677
  • BLAKE2b-256: 885a63e35ef3a8b1e1f95b580ea6c33219e72f8509490fe187944129d750b39f

