
InferGuard


Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d.

What is this?

InferGuard is a source-available CLI and MCP server for validating inference benchmark evidence, profiling OpenAI-compatible endpoints, collecting engine/GPU timelines, and turning completed runs into refusal-gated operator reports. It is built for engineers running production-like vLLM, SGLang, Dynamo, LMCache, and llm-d stacks on GPU fleets where incomplete evidence is worse than no evidence. InferGuard does not promise every model fits every GPU. It tells the operator what fits, what fails, why it fails, and what hardware/config to use next.

InferGuard is distributed under the Business Source License 1.1 (BUSL-1.1). The Additional Use Grant allows teams to use InferGuard in their own source repositories, CI/CD, staging, internal tools, and internal production environments to benchmark, monitor, diagnose, validate, or optimize inference workloads they own, operate, or are authorized to evaluate. Offering InferGuard as a paid or hosted observability, benchmarking, diagnostics, optimization, inference-operations, managed-service, SaaS, or substantially similar competing commercial product requires a separate commercial license from Touchdown Labs. Each covered version converts to Apache-2.0 on the Change Date specified in LICENSE, or earlier if the BUSL-1.1 terms require it.

Quick start (60 seconds)

pip install inferguard

# Generate a local synthetic GPU bundle for smoke testing.
inferguard simulate-gpu --results-root /tmp/inferguard-smoke --hardware b200 --engine vllm

# Validate a completed run. Synthetic smoke tests intentionally do not pass --strict.
inferguard validate-completed --results-root /tmp/inferguard-smoke || true

# Profile per-request latency against an OpenAI-compatible endpoint.
cat >/tmp/inferguard-requests.jsonl <<'JSONL'
{"request_id":"doc-001","messages":[{"role":"user","content":"Reply with one short sentence about InferGuard."}],"max_tokens":24}
JSONL

inferguard request-profile \
  --output-dir /tmp/inferguard-profile \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --input-jsonl /tmp/inferguard-requests.jsonl \
  --concurrency 1 \
  --stream

# Diagnose a completed job directory once request, launch, metrics, and validation artifacts exist.
inferguard diagnose-bottleneck --job-dir /path/to/results/jobs/<job-id>

From a source checkout, replace inferguard with PYTHONPATH=src python3 -m inferguard.cli.
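
For example, the first quick-start command becomes the following; this is a mechanical substitution, so all flags stay the same.

# Same synthetic bundle generation, run directly from a source checkout.
PYTHONPATH=src python3 -m inferguard.cli simulate-gpu --results-root /tmp/inferguard-smoke --hardware b200 --engine vllm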

Why InferGuard?

NeoCloud and platform engineers need honest evidence for DSv4-class serving stacks on H100, H200, B200, B300, GB200, and GB300. Most benchmark wrappers are happy to emit a report even when the request rows are empty, the healthcheck failed, DCGM was missing, or the model never actually fit in HBM.

InferGuard's bias is the opposite:

  • refuse or downgrade when required artifacts are missing;
  • separate synthetic smoke tests from live evidence;
  • keep network behavior limited to endpoints you pass explicitly;
  • preserve request, engine, GPU, launch, failure, cost, and cliff artifacts in structured schemas;
  • make every recommendation trace back to claim_status, claim_reason, and file-level evidence.

Commands

  • validate-completed – Publishability gate; classifies a run as synthetic_only, live_complete, live_incomplete, missing_required_artifacts, or not_publishable.
  • request-profile – Per-request truth: TTFT, TPOT, E2E latency, tokens, HTTP status, errors, and per-field claim status.
  • collect-metrics – Normalized engine /metrics plus DCGM GPU timelines for live evidence.
  • launch-engine – Launch or externally validate vLLM, SGLang, LMCache, or Dynamo-SGLang and capture command/healthcheck artifacts.
  • diagnose-bottleneck – Classify one completed job as prefill, decode, KV, queue, network, host, launch, or not-enough-evidence.
  • classify-failures – Turn logs and artifacts into ranked, operator-actionable failure classes.
  • report-completed – Produce refusal-gated operator recommendations from completed validation evidence.
  • find-cliffs – Detect capacity cliffs across completed sweeps.
  • compute-cost – Compute cost per useful task and safe concurrency envelopes.
  • agentx-ingest / ingest-agentx – Convert AgentX result CSVs into canonical InferGuard artifacts.
  • simulate-gpu – Generate synthetic GPU/Slurm artifacts for local bundle smoke tests.
  • serve-mimic – Run a tiny fake OpenAI-compatible endpoint for local demos.
  • preflight – Run read-only launch-compatibility and tokenizer-mismatch checks before paid traffic.
  • analyze – Analyze existing InferGuard, InferenceX, AgentX, or eval result directories.
  • bench ... – Replay traces, run KVCast/KV stress, compare runs, and run upstream-compatible benchmark modes.
  • disagg status – Scrape prefill/decode/transfer Prometheus endpoints and emit disaggregated-serving findings.
  • profile live / profile retro – Observe existing /metrics traffic or inspect saved live-profile artifacts.
  • agent trace – Capture local agent-trace/v1 DAG events for supported agent frameworks.
  • daemon ... – Local harness sidecar and multi-node leader/follower fan-in.
  • telemetry ... – Local-only telemetry consent and payload-audit commands; telemetry is disabled by default.
  • workload analyze – Pre-flight workload fingerprinting for routing and reporting.
  • router classify – Rule-based execution-path routing from workload fingerprints.
  • emit-bundle – Emit a deployment bundle from a router verdict.

See the CLI reference for full --help output for every command and subcommand.
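
For example, a read-only disaggregated health pass might look like the sketch below. The host names, ports, and paths are hypothetical placeholders; --prefill and --decode are the documented flags for passing Prometheus endpoints explicitly.

# Scrape hypothetical prefill/decode metrics endpoints and emit findings.
inferguard disagg status \
  --prefill http://prefill-0:9100/metrics \
  --decode http://decode-0:9100/metrics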

Hardware coverage

InferGuard ships with the DSv4 6-SKU capability matrix: H100, H200, B200, B300, GB200, and GB300 × DSv4 Flash/Pro × vLLM/SGLang × long-context chat/coding, i.e. 6 SKUs × 2 model variants × 2 engines × 2 workload classes = 48 cells. Each cell is honestly classified:

  • WORKING_TEMPLATE (28 cells)
  • INFEASIBLE_DOCUMENTED (4 cells: H100 × DSv4-Pro single-node)
  • FUTURE_EXTERNAL (16 cells: GB200/GB300, awaiting rack-level external access)

See hardware coverage for the full matrix and status definitions.

Claim status discipline

InferGuard never lies about what it measured. Every publishable artifact uses the canonical claim_status enum:

  • synthetic – No real GPU evidence; dry-run or synthetic mimic only.
  • inferred – Indirect evidence; read claim_reason or claim_caveat before quoting.
  • measured – Live evidence with the required artifact set.
  • not_proven – Claim could not be verified.
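
For a quick look at how a run's rows are labeled, something like the sketch below works. The artifact file name and its JSONL shape are assumptions for illustration; only the claim_status field name is documented.

# Hypothetical per-request artifact: one JSON object per line with a
# top-level claim_status field. Counts rows per claim_status value.
jq -r '.claim_status' /path/to/results/jobs/<job-id>/requests.jsonl | sort | uniq -c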

live_complete requires five gates:

  1. non-empty request-profile rows;
  2. at least one successful request;
  3. launch healthcheck with status code 200 or an equivalent success status;
  4. non-empty engine metrics timeline with recognized live engine metrics;
  5. non-empty GPU metrics timeline with required DCGM signals.

If any gate is missing, InferGuard downgrades the claim instead of filling the gap with guesses.
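
In practice that means a live run should be validated with --strict, the flag the synthetic smoke test deliberately omits; the results path below is a hypothetical placeholder.

# Assumption: with --strict, a missing live_complete gate produces a
# nonzero exit instead of a quietly downgraded classification.
inferguard validate-completed --results-root /data/results/live-run --strict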

Privacy and network behavior

InferGuard has zero telemetry by default. CLI network calls happen only to endpoints passed with flags such as --endpoint, --engine-metrics-url, --dcgm-metrics-url, --prefill, or --decode. Telemetry commands are local audit/consent tooling; hard overrides such as INFERGUARD_TELEMETRY=disabled and DO_NOT_TRACK=1 are honored.
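
Both overrides can be set inline; this sketch reuses the smoke-test bundle from the quick start.

# Either variable alone is honored as a hard kill switch.
INFERGUARD_TELEMETRY=disabled DO_NOT_TRACK=1 \
  inferguard validate-completed --results-root /tmp/inferguard-smoke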

License

Business Source License 1.1 (BUSL-1.1). See LICENSE.

Citation

If you use InferGuard in academic work, please cite:

@software{inferguard2026,
  author = {Chen, William},
  title = {InferGuard: Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d},
  year = {2026},
  url = {https://github.com/OCWC22/inferguard},
  version = {0.7.4}
}

See CITATION.cff.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

inferguard-0.7.4.tar.gz (1.4 MB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

inferguard-0.7.4-py3-none-any.whl (401.8 kB)


File details

Details for the file inferguard-0.7.4.tar.gz.

File metadata

  • Download URL: inferguard-0.7.4.tar.gz
  • Upload date:
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for inferguard-0.7.4.tar.gz:

  • SHA256: 957195126a8d3241a002c43649c68868d5031190b6bab649fc0fe8495ab75339
  • MD5: f589c2bb37018e9dfaa8d38018d18cce
  • BLAKE2b-256: 69ce3ee818bf3253f27f0325a72c5e7561fb0df9e36bb3c910d808c0503f36f7

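To verify a downloaded sdist against the published digest, a minimal sketch with GNU coreutils (run from the directory containing the file):

# Prints 'inferguard-0.7.4.tar.gz: OK' when the digest matches.
echo "957195126a8d3241a002c43649c68868d5031190b6bab649fc0fe8495ab75339  inferguard-0.7.4.tar.gz" | sha256sum --check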

File details

Details for the file inferguard-0.7.4-py3-none-any.whl.

File metadata

  • Download URL: inferguard-0.7.4-py3-none-any.whl
  • Upload date:
  • Size: 401.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for inferguard-0.7.4-py3-none-any.whl:

  • SHA256: 01da01103b947b4d47f79ec5bb1431ea8c0ba09d56273ab41c8a38d059ba0b78
  • MD5: 3fae628a976c3fcd62b9eb6ffc435f60
  • BLAKE2b-256: 978ec82586c99c2011a27b212384568abb439f07ce3a66e0ebb6828596186480

