InferGuard
Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d.
What is this?
InferGuard is a source-available CLI and MCP server for validating inference benchmark evidence, profiling OpenAI-compatible endpoints, collecting engine/GPU timelines, and turning completed runs into refusal-gated operator reports. It is built for engineers running production-like vLLM, SGLang, Dynamo, LMCache, and llm-d stacks on GPU fleets where incomplete evidence is worse than no evidence. InferGuard does not promise every model fits every GPU. It tells the operator what fits, what fails, why it fails, and what hardware/config to use next.
InferGuard is distributed under the Business Source License 1.1 (BUSL-1.1). The Additional Use Grant allows teams to use InferGuard in their own source repositories, CI/CD, staging, internal tools, and internal production environments to benchmark, monitor, diagnose, validate, or optimize inference workloads they own, operate, or are authorized to evaluate. Offering InferGuard as a paid or hosted observability, benchmarking, diagnostics, optimization, inference operations, managed service, SaaS, or substantially similar competing commercial product requires a separate commercial license from Touchdown Labs. Each covered version converts to Apache-2.0 on the Change Date specified in LICENSE, or earlier if required by the BSL 1.1 terms.
Quick start (60 seconds)
```bash
pip install inferguard

# Generate a local synthetic GPU bundle for smoke testing.
inferguard simulate-gpu --results-root /tmp/inferguard-smoke --hardware b200 --engine vllm

# Validate a completed run. Synthetic smoke tests intentionally do not pass --strict.
inferguard validate-completed --results-root /tmp/inferguard-smoke || true

# Profile per-request latency against an OpenAI-compatible endpoint.
cat >/tmp/inferguard-requests.jsonl <<'JSONL'
{"request_id":"doc-001","messages":[{"role":"user","content":"Reply with one short sentence about InferGuard."}],"max_tokens":24}
JSONL
inferguard request-profile \
  --output-dir /tmp/inferguard-profile \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --input-jsonl /tmp/inferguard-requests.jsonl \
  --concurrency 1 \
  --stream

# Diagnose a completed job directory once request, launch, metrics, and validation artifacts exist.
inferguard diagnose-bottleneck --job-dir /path/to/results/jobs/<job-id>
```
From a source checkout, replace `inferguard` with `PYTHONPATH=src python3 -m inferguard.cli`.
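The quick start writes the request-profile input file by hand with a heredoc. For larger runs, the same JSONL schema (`request_id`, `messages`, `max_tokens`) can be generated programmatically; a minimal sketch, where `write_requests` is a hypothetical helper and not part of InferGuard:

```python
import json

def write_requests(path, prompts, max_tokens=24):
    """Write a request-profile input file, one JSON object per line,
    in the schema shown in the quick start above."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts, start=1):
            row = {
                "request_id": f"doc-{i:03d}",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            }
            f.write(json.dumps(row) + "\n")

write_requests("/tmp/inferguard-requests.jsonl", [
    "Reply with one short sentence about InferGuard.",
    "Name one GPU SKU InferGuard covers.",
])
```

The resulting file can be passed to `inferguard request-profile --input-jsonl` exactly like the hand-written one.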
Why InferGuard?
NeoCloud and platform engineers need honest evidence for DSv4-class serving stacks on H100, H200, B200, B300, GB200, and GB300. Most benchmark wrappers are happy to emit a report even when the request rows are empty, the healthcheck failed, DCGM was missing, or the model never actually fit in HBM.
InferGuard's bias is the opposite:
- refuse or downgrade when required artifacts are missing;
- separate synthetic smoke tests from live evidence;
- keep network behavior limited to endpoints you pass explicitly;
- preserve request, engine, GPU, launch, failure, cost, and cliff artifacts in structured schemas;
- make every recommendation trace back to `claim_status`, `claim_reason`, and file-level evidence.
Commands
| Command | What it does |
|---|---|
| `validate-completed` | Publishability gate; classifies a run as `synthetic_only`, `live_complete`, `live_incomplete`, `missing_required_artifacts`, or `not_publishable`. |
| `request-profile` | Per-request truth: TTFT, TPOT, E2E latency, tokens, HTTP status, errors, and per-field claim status. |
| `collect-metrics` | Normalized engine `/metrics` plus DCGM GPU timelines for live evidence. |
| `launch-engine` | Launch or externally validate vLLM, SGLang, LMCache, or Dynamo-SGLang and capture command/healthcheck artifacts. |
| `diagnose-bottleneck` | Classify one completed job as prefill, decode, KV, queue, network, host, launch, or not-enough-evidence. |
| `classify-failures` | Turn logs and artifacts into ranked operator-actionable failure classes. |
| `report-completed` | Produce refusal-gated operator recommendations from completed validation evidence. |
| `find-cliffs` | Detect capacity cliffs across completed sweeps. |
| `compute-cost` | Compute cost per useful task and safe concurrency envelopes. |
| `agentx-ingest` / `ingest-agentx` | Convert AgentX result CSVs into canonical InferGuard artifacts. |
| `simulate-gpu` | Generate synthetic GPU/Slurm artifacts for local bundle smoke tests. |
| `serve-mimic` | Run a tiny fake OpenAI-compatible endpoint for local demos. |
| `preflight` | Run read-only launch compatibility and tokenizer mismatch checks before paid traffic. |
| `analyze` | Analyze existing InferGuard, InferenceX, AgentX, or eval result directories. |
| `bench ...` | Replay traces, run KVCast/KV stress, compare runs, and run upstream-compatible benchmark modes. |
| `disagg status` | Scrape prefill/decode/transfer Prometheus endpoints and emit disaggregated-serving findings. |
| `profile live` / `profile retro` | Observe existing `/metrics` traffic or inspect saved live-profile artifacts. |
| `agent trace` | Capture local `agent-trace/v1` DAG events for supported agent frameworks. |
| `daemon ...` | Local harness sidecar and multi-node leader/follower fan-in. |
| `telemetry ...` | Local-only telemetry consent and payload audit commands; telemetry is disabled by default. |
| `workload analyze` | Pre-flight workload fingerprinting for routing and reporting. |
| `router classify` | Rule-based execution-path routing from workload fingerprints. |
| `emit-bundle` | Emit a deployment bundle from a router verdict. |
See the CLI reference for full --help output for every command and subcommand.
Hardware coverage
InferGuard ships with the DSv4 6-SKU capability matrix: H100, H200, B200, B300, GB200, and GB300 × DSv4 Flash/Pro × vLLM/SGLang × long-context chat/coding = 48 cells. Each cell is honestly classified:
- `WORKING_TEMPLATE` (28 cells)
- `INFEASIBLE_DOCUMENTED` (4 cells: H100 × DSv4-Pro single-node)
- `FUTURE_EXTERNAL` (16 cells: GB200/GB300, awaiting rack-level external access)
See hardware coverage for the full matrix and status definitions.
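The 48-cell count is just the Cartesian product of the four axes named above (6 SKUs × 2 model variants × 2 engines × 2 workload types). A quick sketch confirming the arithmetic, with axis labels taken from the matrix description:

```python
from itertools import product

# The four axes of the DSv4 capability matrix, as described above.
skus = ["H100", "H200", "B200", "B300", "GB200", "GB300"]
models = ["DSv4-Flash", "DSv4-Pro"]
engines = ["vLLM", "SGLang"]
workloads = ["long-context chat", "coding"]

# Each combination is one matrix cell.
cells = list(product(skus, models, engines, workloads))
print(len(cells))  # 6 * 2 * 2 * 2 = 48
```

The per-cell statuses (28 + 4 + 16) also sum to 48, so every cell carries exactly one classification.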
Documentation
- Documentation
- Architecture
- CLI reference
- Hardware coverage
- Schemas
- Examples
- Troubleshooting
- Contributing
- Security policy
- Code of conduct
- Changelog
Claim status discipline
InferGuard never lies about what it measured. Every publishable artifact uses the canonical `claim_status` enum:
| Value | Meaning |
|---|---|
| `synthetic` | No real GPU evidence; dry-run or synthetic mimic only. |
| `inferred` | Indirect evidence; read `claim_reason` or `claim_caveat` before quoting. |
| `measured` | Live evidence with the required artifact set. |
| `not_proven` | Claim could not be verified. |
`live_complete` requires five gates:

- non-empty request-profile rows;
- at least one successful request;
- launch healthcheck with status code `200` or an equivalent success status;
- non-empty engine metrics timeline with recognized live engine metrics;
- non-empty GPU metrics timeline with required DCGM signals.
If any gate is missing, InferGuard downgrades the claim instead of filling the gap with guesses.
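The downgrade-instead-of-guess policy can be sketched as a simple all-gates check. This is an illustrative simplification, not InferGuard's actual implementation: the field name `http_status` and the "non-empty timeline" checks are assumptions, and the real gates additionally validate which metrics appear in each timeline.

```python
def classify_run(rows, launch_status, engine_timeline, gpu_timeline):
    """Return 'live_complete' only if every gate passes; otherwise downgrade."""
    gates = [
        len(rows) > 0,                                   # non-empty request-profile rows
        any(r.get("http_status") == 200 for r in rows),  # at least one successful request
        launch_status == 200,                            # launch healthcheck succeeded
        len(engine_timeline) > 0,                        # engine /metrics evidence present
        len(gpu_timeline) > 0,                           # DCGM GPU evidence present
    ]
    # A missing gate never gets papered over with a guess.
    return "live_complete" if all(gates) else "live_incomplete"
```

The key property is that a single missing artifact flips the whole classification, which is exactly the refusal-gated behavior the docs describe.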
Privacy and network behavior
InferGuard has zero telemetry by default. CLI network calls happen only to endpoints passed with flags such as --endpoint, --engine-metrics-url, --dcgm-metrics-url, --prefill, or --decode. Telemetry commands are local audit/consent tooling; hard overrides such as INFERGUARD_TELEMETRY=disabled and DO_NOT_TRACK=1 are honored.
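The override precedence can be sketched as follows. The environment variable names come from the paragraph above; the `"enabled"` opt-in value is an assumption for illustration (in practice, consent goes through the local `telemetry` tooling):

```python
def telemetry_enabled(env):
    """Sketch of off-by-default telemetry with hard opt-out overrides."""
    # Hard overrides always win, regardless of any opt-in.
    if env.get("INFERGUARD_TELEMETRY") == "disabled" or env.get("DO_NOT_TRACK") == "1":
        return False
    # Off by default: only an explicit (hypothetical) opt-in enables it.
    return env.get("INFERGUARD_TELEMETRY") == "enabled"
```

Checking the overrides before the opt-in guarantees that `DO_NOT_TRACK=1` wins even when a consent flag is also set.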
Examples
License
Business Source License 1.1 (BUSL-1.1). See LICENSE.
Citation
If you use InferGuard in academic work, please cite:
```bibtex
@software{inferguard2026,
  author  = {Chen, William},
  title   = {InferGuard: Read-only disaggregated-serving diagnostics for vLLM, SGLang, Dynamo, and llm-d},
  year    = {2026},
  url     = {https://github.com/OCWC22/inferguard},
  version = {0.7.1}
}
```
See CITATION.cff.