
InferGuard


Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d.

What is this?

InferGuard is an OSS CLI and MCP server for validating inference benchmark evidence, profiling OpenAI-compatible endpoints, collecting engine/GPU timelines, and turning completed runs into refusal-gated operator reports. It is built for engineers running production-like vLLM, SGLang, Dynamo, LMCache, and llm-d stacks on GPU fleets where incomplete evidence is worse than no evidence. InferGuard does not promise every model fits every GPU. It tells the operator what fits, what fails, why it fails, and what hardware/config to use next.

Quick start (60 seconds)

pip install inferguard

# Generate a local synthetic GPU bundle for smoke testing.
inferguard simulate-gpu --results-root /tmp/inferguard-smoke --hardware b200 --engine vllm

# Validate a completed run. Synthetic smoke tests intentionally do not pass --strict.
inferguard validate-completed --results-root /tmp/inferguard-smoke || true

# Profile per-request latency against an OpenAI-compatible endpoint.
cat >/tmp/inferguard-requests.jsonl <<'JSONL'
{"request_id":"doc-001","messages":[{"role":"user","content":"Reply with one short sentence about InferGuard."}],"max_tokens":24}
JSONL

inferguard request-profile \
  --output-dir /tmp/inferguard-profile \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --input-jsonl /tmp/inferguard-requests.jsonl \
  --concurrency 1 \
  --stream

# Diagnose a completed job directory once request, launch, metrics, and validation artifacts exist.
inferguard diagnose-bottleneck --job-dir /path/to/results/jobs/<job-id>

From a source checkout, replace inferguard with PYTHONPATH=src python3 -m inferguard.cli.
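The request JSONL rows above can also be generated programmatically. A minimal sketch: the row schema mirrors the heredoc example, but `write_requests` itself is a hypothetical helper, not part of the InferGuard CLI.

```python
import json
import os
import tempfile

def write_requests(path, prompts, max_tokens=24):
    """Write one request-profile input row per prompt, one JSON object per line."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts, start=1):
            row = {
                "request_id": f"doc-{i:03d}",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            }
            f.write(json.dumps(row) + "\n")

path = os.path.join(tempfile.gettempdir(), "inferguard-requests.jsonl")
write_requests(path, ["Reply with one short sentence about InferGuard."])
```

Pass the resulting file to request-profile via --input-jsonl, as in the quick start.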

Why InferGuard?

NeoCloud and platform engineers need honest evidence for DSv4-class serving stacks on H100, H200, B200, B300, GB200, and GB300. Most benchmark wrappers are happy to emit a report even when the request rows are empty, the healthcheck failed, DCGM was missing, or the model never actually fit in HBM.

InferGuard's bias is the opposite:

  • refuse or downgrade when required artifacts are missing;
  • separate synthetic smoke tests from live evidence;
  • keep network behavior limited to endpoints you pass explicitly;
  • preserve request, engine, GPU, launch, failure, cost, and cliff artifacts in structured schemas;
  • make every recommendation trace back to claim_status, claim_reason, and file-level evidence.
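In practice, the last bullet means a downstream consumer can refuse to quote any number that is not backed by measured evidence. A hypothetical sketch, assuming per-metric dicts carrying claim_status and claim_reason fields (the layout here is illustrative, not the published schema):

```python
def quotable(metric: dict) -> bool:
    """Only quote values backed by live evidence; surface the reason otherwise."""
    status = metric.get("claim_status", "not_proven")
    if status == "measured":
        return True
    print(f"refusing to quote: claim_status={status}, "
          f"claim_reason={metric.get('claim_reason', 'unknown')}")
    return False

# Example: a TTFT value downgraded because part of the evidence was missing.
ttft = {"value_ms": 41.2, "claim_status": "inferred",
        "claim_reason": "engine metrics timeline missing"}
quotable(ttft)
```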

Commands

Command What it does
validate-completed Publishability gate; classifies a run as synthetic_only, live_complete, live_incomplete, missing_required_artifacts, or not_publishable.
request-profile Per-request truth: TTFT, TPOT, E2E latency, tokens, HTTP status, errors, and per-field claim status.
collect-metrics Normalized engine /metrics plus DCGM GPU timelines for live evidence.
launch-engine Launch or externally validate vLLM, SGLang, LMCache, or Dynamo-SGLang and capture command/healthcheck artifacts.
diagnose-bottleneck Classify one completed job as prefill, decode, KV, queue, network, host, launch, or not-enough-evidence.
classify-failures Turn logs and artifacts into ranked operator-actionable failure classes.
report-completed Produce refusal-gated operator recommendations from completed validation evidence.
find-cliffs Detect capacity cliffs across completed sweeps.
compute-cost Compute cost per useful task and safe concurrency envelopes.
agentx-ingest / ingest-agentx Convert AgentX result CSVs into canonical InferGuard artifacts.
simulate-gpu Generate synthetic GPU/Slurm artifacts for local bundle smoke tests.
serve-mimic Run a tiny fake OpenAI-compatible endpoint for local demos.
preflight Run read-only launch compatibility and tokenizer mismatch checks before paid traffic.
analyze Analyze existing InferGuard, InferenceX, AgentX, or eval result directories.
bench ... Replay traces, run KVCast/KV stress, compare runs, and run upstream-compatible benchmark modes.
disagg status Scrape prefill/decode/transfer Prometheus endpoints and emit disaggregated-serving findings.
profile live / profile retro Observe existing /metrics traffic or inspect saved live-profile artifacts.
agent trace Capture local agent-trace/v1 DAG events for supported agent frameworks.
daemon ... Local harness sidecar and multi-node leader/follower fan-in.
telemetry ... Local-only telemetry consent and payload audit commands; telemetry is disabled by default.
workload analyze Pre-flight workload fingerprinting for routing and reporting.
router classify Rule-based execution-path routing from workload fingerprints.
emit-bundle Emit a deployment bundle from a router verdict.

See CLI reference for full --help output for every command and subcommand.

Hardware coverage

InferGuard ships with the DSv4 6-SKU capability matrix: H100, H200, B200, B300, GB200, and GB300 × DSv4 Flash/Pro × vLLM/SGLang × long-context chat/coding = 48 cells. Each cell is honestly classified:

  • WORKING_TEMPLATE (28 cells)
  • INFEASIBLE_DOCUMENTED (4 cells: H100 × DSv4-Pro single-node)
  • FUTURE_EXTERNAL (16 cells: GB200/GB300, awaiting rack-level external access)
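The cell counts above follow directly from the cross product; a quick arithmetic check (the classification rules are restated from the bullets, the labels and axes are from this section):

```python
from itertools import product

SKUS = ["H100", "H200", "B200", "B300", "GB200", "GB300"]
MODELS = ["DSv4-Flash", "DSv4-Pro"]
ENGINES = ["vllm", "sglang"]
WORKLOADS = ["long-context-chat", "coding"]

def classify(sku, model, engine, workload):
    if sku in ("GB200", "GB300"):
        return "FUTURE_EXTERNAL"        # awaiting rack-level external access
    if sku == "H100" and model == "DSv4-Pro":
        return "INFEASIBLE_DOCUMENTED"  # Pro does not fit single-node H100
    return "WORKING_TEMPLATE"

cells = [classify(*c) for c in product(SKUS, MODELS, ENGINES, WORKLOADS)]
counts = {s: cells.count(s) for s in set(cells)}
# 6 x 2 x 2 x 2 = 48 cells: 28 working, 4 infeasible, 16 future-external
```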

See hardware coverage for the full matrix and status definitions.

Documentation

Claim status discipline

InferGuard never lies about what it measured. Every publishable artifact uses the canonical claim_status enum:

Value Meaning
synthetic No real GPU evidence; dry-run or synthetic mimic only.
inferred Indirect evidence; read claim_reason or claim_caveat before quoting.
measured Live evidence with the required artifact set.
not_proven Claim could not be verified.

live_complete requires five gates:

  1. non-empty request-profile rows;
  2. at least one successful request;
  3. launch healthcheck with status code 200 or an equivalent success status;
  4. non-empty engine metrics timeline with recognized live engine metrics;
  5. non-empty GPU metrics timeline with required DCGM signals.

If any gate is missing, InferGuard downgrades the claim instead of filling the gap with guesses.
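The five gates amount to an all-or-downgrade check. A sketch under stated assumptions: the evidence keys are hypothetical, and all downgrades are collapsed into live_incomplete, whereas validate-completed distinguishes several downgrade classes.

```python
def classify_run(evidence: dict) -> str:
    """Return live_complete only when all five gates pass; downgrade otherwise."""
    rows = evidence.get("request_rows", [])
    gates = [
        len(rows) > 0,                               # 1. non-empty request rows
        any(r.get("ok") for r in rows),              # 2. at least one success
        evidence.get("healthcheck_status") == 200,   # 3. healthy launch (simplified)
        len(evidence.get("engine_metrics", [])) > 0, # 4. live engine timeline
        len(evidence.get("gpu_metrics", [])) > 0,    # 5. DCGM GPU timeline
    ]
    return "live_complete" if all(gates) else "live_incomplete"
```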

Privacy and network behavior

InferGuard has zero telemetry by default. CLI network calls happen only to endpoints passed with flags such as --endpoint, --engine-metrics-url, --dcgm-metrics-url, --prefill, or --decode. Telemetry commands are local audit/consent tooling; hard overrides such as INFERGUARD_TELEMETRY=disabled and DO_NOT_TRACK=1 are honored.
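The hard overrides can be honored with an environment check along these lines. Only the two variable names and values come from the text; the helper and the "enabled" opt-in value are assumptions.

```python
import os

def telemetry_enabled(env=os.environ) -> bool:
    """Telemetry is off by default and hard-disabled by either override."""
    if env.get("INFERGUARD_TELEMETRY") == "disabled":
        return False
    if env.get("DO_NOT_TRACK") == "1":
        return False
    # Default is off; only an explicit local opt-in enables it.
    return env.get("INFERGUARD_TELEMETRY") == "enabled"
```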

License

Apache-2.0. See LICENSE.

Citation

If you use InferGuard in academic work, please cite:

@software{inferguard2026,
  author = {Chen, William},
  title = {InferGuard: Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d},
  year = {2026},
  url = {https://github.com/OCWC22/inferguard},
  version = {0.7.1}
}

See CITATION.cff.

Download files

Download the file for your platform.

Source Distribution

inferguard-0.7.1.tar.gz (749.6 kB)


Built Distribution


inferguard-0.7.1-py3-none-any.whl (339.4 kB)


File details

Details for the file inferguard-0.7.1.tar.gz.

File metadata

  • Download URL: inferguard-0.7.1.tar.gz
  • Upload date:
  • Size: 749.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for inferguard-0.7.1.tar.gz

  • SHA256: c09bfc4b562ce34100fbb0e9b7c8cf86584865135af9c5e4ccf0f933ba2b1d83
  • MD5: 29d0fb6efdd7deee24185da28b1768a4
  • BLAKE2b-256: 267c034e58b62383a60360375a0767a0ad3883261929493a32a3653c40b25de8


File details

Details for the file inferguard-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: inferguard-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 339.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for inferguard-0.7.1-py3-none-any.whl

  • SHA256: 2a486a4a6a164990e37aba688703a8eb7cf77821643ce3b25c075ed88d1ff313
  • MD5: 700329f7b020efb23d96131efc267677
  • BLAKE2b-256: 885a63e35ef3a8b1e1f95b580ea6c33219e72f8509490fe187944129d750b39f

