Open-source inference sweep for llama.cpp and vLLM: TPS, TTFT, ITL, and PPL across 16 configs.
sigilant-sweep
Evaluation orchestration for inference stacks (llama.cpp, vLLM) with Local and Modal backends: TPS, TTFT, ITL, PPL proxy, and artifacted comparisons.
Scope • Install • Run paths • Metrics • Reproducibility
Scope
sigilant-sweep orchestrates config sweeps and reporting on top of existing inference engines.
It provides:
- config generation
- execution via adapters (`llama.cpp`, `vllm`)
- metric parsing (TPS, TTFT, ITL, PPL proxy)
- scoring and artifact export
It is not a new inference runtime.
Why use this instead of running one-off engine commands
- Runs a full config grid (`quant × context × KV`) with consistent run settings.
- Uses trial-first rotated execution to reduce ordering bias across configs.
- Ranks configs on a composite score (TPS, TTFT, PPL proxy), not a single metric.
- Supports depth passes (8k/14k/28k prompts) for context-window behavior checks.
- Adds a structured-output smoke gate for quick post-ranking sanity checks.
- Exports reproducible artifacts (`json`, `md`, `svg`, terminal log) for review and sharing.
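To make the `quant × context × KV` grid concrete, here is a minimal sketch of how such a grid could be enumerated. It is not taken from the sigilant-sweep source: the function name `build_grid`, the axis values, and the `k8v8` KV label are hypothetical and only illustrate the Cartesian-product idea.

```python
from itertools import product

# Hypothetical axis values for illustration; the real grid is chosen by the tool.
QUANTS = ["Q4_K_M", "Q5_K_M", "Q8_0"]
CONTEXTS = [8192, 16384]
KV_TYPES = ["k16v16", "k8v8"]  # "k8v8" is an assumed label, not from the docs


def build_grid(quants, contexts, kv_types, limit=None):
    """Return the Cartesian product quant x context x KV, optionally capped."""
    grid = [
        {"quant": q, "ctx": c, "kv": kv}
        for q, c, kv in product(quants, contexts, kv_types)
    ]
    return grid if limit is None else grid[:limit]


# A cap of 16 mirrors the default of --configs; 3 * 2 * 2 = 12 fits under it.
grid = build_grid(QUANTS, CONTEXTS, KV_TYPES, limit=16)
print(len(grid))
```

Capping the list is the simplest way to honor a maximum config count; a real harness might prioritize configs rather than truncate.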
Not in scope
- custom kernels or scheduler innovation
- replacing engine internals (`llama.cpp`, `vllm`)
- claiming production safety certification from throughput measurements
Install
# Refresh installer tooling first (recommended)
python3 -m pip install -U pip
# Base (lightweight CLI + reporting)
pip install sigilant-sweep
# Hugging Face integration only
pip install 'sigilant-sweep[hf]'
# llama-cpp-python fallback only — not needed if llama-cli is on PATH
pip install 'sigilant-sweep[llama]'
# With llama-cpp-python fallback + CUDA acceleration
CMAKE_ARGS="-DGGML_CUDA=on" pip install 'sigilant-sweep[llama]'
# With vLLM (Linux + CUDA only)
pip install 'sigilant-sweep[vllm]'
# With Modal cloud backend
pip install 'sigilant-sweep[modal]'
# Everything
pip install 'sigilant-sweep[all]'
If your pip config points to a private/stale mirror, force official PyPI:
pip install --index-url https://pypi.org/simple sigilant-sweep
Run paths
Use one of these four paths:
1) Local + llama.cpp
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install sigilant-sweep
Local llama.cpp execution uses the `llama-cli` binary by default.
If `llama-cli` is not on PATH, set its location explicitly:
export SIGILANT_LLAMA_CLI=/abs/path/to/llama-cli
If you do not have a `llama-cli` binary, install the Python fallback:
pip install "sigilant-sweep[llama]"
Sanity run:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend local \
--engine llama.cpp \
--configs 1 \
--trials 1
2) Local + vLLM (Linux + CUDA)
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[vllm]"
Set family repo IDs (required for full-family runs):
export SIGILANT_VLLM_FP16_BASELINE_REPO="microsoft/Phi-3.5-mini-instruct"
export SIGILANT_VLLM_INT8_W8A8_REPO="anhbn/Phi-3.5-mini-instruct-quantized.w8a8"
export SIGILANT_VLLM_AWQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-awq"
export SIGILANT_VLLM_GPTQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-GPTQ-4bit"
Sanity run:
sigilant-sweep run \
--model microsoft/Phi-3.5-mini-instruct \
--backend local \
--engine vllm \
--configs 1 \
--trials 1
3) Modal + llama.cpp
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[modal]"
modal token new
sigilant-sweep info
Sanity run:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 1 \
--trials 1
4) Modal + vLLM
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[modal]"
modal token new
Set family repo IDs (required for full-family runs):
unset SIGILANT_VLLM_FAMILY_REPOS
export SIGILANT_VLLM_FP16_BASELINE_REPO="microsoft/Phi-3.5-mini-instruct"
export SIGILANT_VLLM_INT8_W8A8_REPO="anhbn/Phi-3.5-mini-instruct-quantized.w8a8"
export SIGILANT_VLLM_AWQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-awq"
export SIGILANT_VLLM_GPTQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-GPTQ-4bit"
Sanity run:
sigilant-sweep run \
--model microsoft/Phi-3.5-mini-instruct \
--backend modal \
--engine vllm \
--hardware l4 \
--configs 1 \
--trials 1
Intel macOS note (Modal extras)
If you see `Failed building wheel for cbor2`:
pip uninstall -y modal cbor2
pip install --only-binary=:all: "cbor2==5.6.5"
pip install "sigilant-sweep[modal]"
Then verify:
python3 -c "import modal, cbor2; print('modal', modal.__version__, 'cbor2_ok', hasattr(cbor2, 'dumps'))"
Quick start
# 1. Check hardware and credentials
sigilant-sweep setup
# 2. Show what's detected on this machine
sigilant-sweep info
# 3. Run a sweep (local GPU, llama.cpp)
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3
# 4. Save results to JSON
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --json
Example: Modal run (llama.cpp)
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--score-profile balanced \
--agent-smoke
Expected output:
- ranked config table
- recommended config + baseline delta
- artifacts: `sigilant_results.json`, `sigilant_summary.md`, `sigilant_frontier.svg`, `sigilant_terminal.txt`
Example output (truncated):
Config TPS TTFT ITL PPL Score
──────────────────────────────────────────────────────────────────────────────────────
Q4_K_M · ctx:16384 · kv:k16v16 · long ← best 74.1 1728ms 13.5ms 14.32 97
Q4_K_M · ctx:8192 · kv:k16v16 · default 74.0 1729ms 13.5ms 14.32 97
Q5_K_M · ctx:8192 · kv:k16v16 · default 71.4 1792ms 14.0ms 13.61 97
Best config: Q4_K_M · ctx:16384 · kv:k16v16 · long
Auto baseline compare (auto:max_precision(Q8_0)): score Δ=+6.00 TPS Δ=+8.20 TTFT Δ=-233.9ms PPL Δ=+0.19
Artifacts: artifacts/runs/20260524_171722/sigilant_results.json,
artifacts/runs/20260524_171722/sigilant_summary.md,
artifacts/runs/20260524_171722/sigilant_frontier.svg,
artifacts/runs/20260524_171722/sigilant_terminal.txt
Example artifacts bundle:
artifacts/runs/20260524_171722/
├── sigilant_results.json
├── sigilant_summary.md
├── sigilant_frontier.svg
└── sigilant_terminal.txt
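For scripted review of a bundle like the one above, a small helper can locate the newest run directory and report which of the four artifact files are missing. This is a sketch under two assumptions: run IDs follow the timestamp pattern shown in the example (so they sort chronologically as strings), and the helper names `latest_run` and `missing_artifacts` are hypothetical, not part of the sigilant-sweep CLI or API.

```python
from pathlib import Path

# File names as listed in the example bundle.
EXPECTED = (
    "sigilant_results.json",
    "sigilant_summary.md",
    "sigilant_frontier.svg",
    "sigilant_terminal.txt",
)


def latest_run(root="artifacts/runs"):
    """Return the newest run directory, or None if there are no runs.

    Assumes run IDs look like 20260524_171722, so name order is time order.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    runs = sorted(p for p in root_path.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def missing_artifacts(run_dir):
    """List expected artifact names that are absent from run_dir."""
    return [name for name in EXPECTED if not (Path(run_dir) / name).is_file()]


run_dir = latest_run()
if run_dir is not None:
    print(run_dir, missing_artifacts(run_dir))
```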
Live run examples
[Capture: full vLLM sweep example (Modal, L4)]
[Capture: depth profile example (8k/14k/28k passes)]
Notes:
- Captures below are from real runs of this repository.
- Results vary by model, prompt set, hardware, and backend.
Run notes:
- Default `--trials` is 12.
- Lower `--trials` for faster/cheaper sweeps; increase for stability.
- Artifacts include confidence inputs (for example top-2 gap).
Common run patterns
llama.cpp: single config
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 1 \
--only-config "Q4_K_M,8192,k16v16,default"
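The `--only-config` value is a comma-separated `QUANT,CTX,KV,REGIME` tuple. The sketch below shows one way such a string could be validated and turned into the row label used in the ranked table. The names `ConfigSelector`, `parse_only_config`, and `label` are hypothetical; the tool's own parser may accept or reject inputs differently.

```python
from typing import NamedTuple


class ConfigSelector(NamedTuple):
    quant: str
    ctx: int
    kv: str
    regime: str


def parse_only_config(value):
    """Split 'QUANT,CTX,KV,REGIME' into typed fields, rejecting malformed input."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected QUANT,CTX,KV,REGIME, got {value!r}")
    quant, ctx, kv, regime = parts
    return ConfigSelector(quant, int(ctx), kv, regime)  # int() rejects non-numeric CTX


def label(sel):
    """Format a selector the way rows appear in the example output table."""
    return f"{sel.quant} · ctx:{sel.ctx} · kv:{sel.kv} · {sel.regime}"


sel = parse_only_config("Q4_K_M,8192,k16v16,default")
print(label(sel))
```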
llama.cpp depth profile
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 5 \
--evaluation-mode depth_profile \
--depth-prompt-8k prompts/hard_quality_8k_prompt.txt \
--depth-prompt-14k prompts/hard_quality_14k_prompt.txt \
--depth-prompt-28k prompts/hard_quality_28k_prompt.txt
llama.cpp with structured-output smoke
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 5 \
--agent-smoke
vLLM: full-family sweep (Modal)
unset SIGILANT_VLLM_FAMILY_REPOS
export SIGILANT_VLLM_FP16_BASELINE_REPO="microsoft/Phi-3.5-mini-instruct"
export SIGILANT_VLLM_INT8_W8A8_REPO="anhbn/Phi-3.5-mini-instruct-quantized.w8a8"
export SIGILANT_VLLM_AWQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-awq"
export SIGILANT_VLLM_GPTQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-GPTQ-4bit"
sigilant-sweep run \
--model microsoft/Phi-3.5-mini-instruct \
--backend modal \
--engine vllm \
--hardware l4 \
--configs 16 \
--trials 1
Execution model
- CLI resolves model files, builds the config grid, dispatches to backend, and scores results.
- llama.cpp path runs timed generation and perplexity per config/trial, then aggregates (`p50`, `p95`, mean PPL).
- Multi-trial runs are rotated trial-first to avoid running all trials of one config back-to-back.
- Artifacts are written under `artifacts/runs/<run_id>/`.
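Trial-first rotation can be pictured as follows: every config runs once per trial, and the starting config shifts by one on each trial so that no config always runs first or last. This is only a sketch of the idea; the exact rotation rule used by sigilant-sweep is not documented here, and `trial_first_schedule` is a hypothetical name.

```python
def trial_first_schedule(configs, trials):
    """Return (trial_index, config) pairs, outer loop over trials.

    The start position rotates by one config per trial (an assumed rule),
    so drift in the machine's state is spread across configs.
    """
    n = len(configs)
    schedule = []
    for t in range(trials):
        offset = t % n if n else 0
        rotated = configs[offset:] + configs[:offset]
        schedule.extend((t, cfg) for cfg in rotated)
    return schedule


schedule = trial_first_schedule(["A", "B", "C"], trials=2)
print(schedule)
# [(0, 'A'), (0, 'B'), (0, 'C'), (1, 'B'), (1, 'C'), (1, 'A')]
```

Compared with a config-first loop (all trials of A, then all trials of B), this ordering keeps thermal throttling or a noisy neighbor from landing entirely on one config.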
Troubleshooting
- `Model resolution failed: huggingface-hub is required`: run `pip install "sigilant-sweep[hf]"` or `pip install "sigilant-sweep[modal]"`.
- `Error: modal is not installed`: run `pip install "sigilant-sweep[modal]"`.
- `Version ... of modal is deprecated`: upgrade modal in the venv with `pip install -U modal`.
- `Failed building wheel for cbor2` (Intel macOS path): run `pip uninstall -y modal cbor2 && pip install --only-binary=:all: "cbor2==5.6.5" && pip install "sigilant-sweep[modal]"`.
- vLLM local failures on macOS/Windows: expected; run vLLM through Modal.
Hardware options
Backend location:
| Flag | Where it runs |
|---|---|
| `--backend local` | Your machine (default) |
| `--backend modal` | Modal cloud (your account) |
GPU targets:
| `--hardware` value | GPU | VRAM |
|---|---|---|
| `auto` | auto-detect | n/a |
| `a10g` | NVIDIA A10G | 24 GB |
| `a100` | NVIDIA A100 | 40 GB |
| `h100` | NVIDIA H100 | 80 GB |
| `l4` | NVIDIA L4 | 24 GB |
| `t4` | NVIDIA T4 | 16 GB |
| `rtx4090` | RTX 4090 | 24 GB |
| `rtx3090` | RTX 3090 | 24 GB |
| `rtxa6000` | RTX A6000 | 48 GB |
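As a rough way to pick a target from the table, one can compare an estimate of weight memory against each GPU's VRAM. The sketch below is a back-of-envelope calculation, not the estimate sigilant-sweep computes: it assumes nominal bits per weight (16 for FP16, 4 for a 4-bit quant), ignores KV cache and activations apart from a flat overhead factor, and the names `estimate_weights_gb`, `fitting_targets`, and the 1.2 factor are hypothetical.

```python
# VRAM sizes copied from the table above, in GB.
VRAM_GB = {
    "a10g": 24, "a100": 40, "h100": 80, "l4": 24,
    "t4": 16, "rtx4090": 24, "rtx3090": 24, "rtxa6000": 48,
}


def estimate_weights_gb(params_b, bits_per_weight):
    """Weights only: billions of params * bits / 8 gives decimal gigabytes."""
    return params_b * bits_per_weight / 8.0


def fitting_targets(params_b, bits_per_weight, overhead=1.2):
    """Return --hardware values whose VRAM covers weights times an assumed overhead."""
    need = estimate_weights_gb(params_b, bits_per_weight) * overhead
    return sorted(name for name, gb in VRAM_GB.items() if gb >= need)


# A 7B model in FP16 needs about 14 GB for weights alone.
print(estimate_weights_gb(7.0, 16))
print(fitting_targets(7.0, 16))
```

Long contexts can add several gigabytes of KV cache on top of this, so treat a marginal fit as a likely out-of-memory case.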
Engine options
| Flag | Supported Backends | Notes |
|---|---|---|
| `--engine llama.cpp` | local, modal | GGUF-based flow |
| `--engine vllm` | local, modal | Linux + CUDA required |
Full CLI reference
sigilant-sweep run [OPTIONS]
--model -m HuggingFace repo ID or local .gguf path [required]
--backend -b local | modal [default: local]
--engine -e llama.cpp | vllm [default: llama.cpp]
--hardware GPU target (see table above) [default: auto]
--params-b Model size in billions (for VRAM estimate) [default: 7.0]
--configs Max number of configs to sweep [default: 16]
--confidence-target low | medium | high [default: medium] (reporting only)
--score-profile balanced | latency | quality [default: balanced]
--evaluation-mode ranking | depth_profile [default: ranking]
--depth-prompt-8k Path to 8k prompt file [default: prompts/hard_quality_8k_prompt.txt]
--depth-prompt-14k Path to 14k prompt file [default: prompts/hard_quality_14k_prompt.txt]
--depth-prompt-28k Path to 28k prompt file [default: prompts/hard_quality_28k_prompt.txt]
--only-config QUANT,CTX,KV,REGIME [optional]
--trials Trials per config [default: 12]
--json Also write results to sigilant_results.json
sigilant-sweep setup Check credentials for all backends (interactive)
sigilant-sweep info Show detected hardware and installed engines
sigilant-sweep --version
To check the exact options in your installed version:
sigilant-sweep --help
sigilant-sweep run --help
What this measures
| Metric | Description |
|---|---|
| TPS | Output tokens per second |
| TTFT | Time to first token (ms) |
| ITL | Inter-token latency (ms) |
| PPL | Perplexity on a fixed corpus, used as a lightweight quality proxy |
| Score | Sigilant composite (preset-based): balanced/latency/quality profiles |
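The latency metrics in this table can be derived from per-token arrival times. The sketch below uses common definitions and is not taken from the sigilant-sweep parser: TTFT is the delay to the first token, ITL is the mean gap between consecutive tokens, and TPS is measured over the decode phase only. The decode-phase choice is an assumption; it is at least consistent with the sample output, where 13.5 ms ITL and 74.1 TPS are roughly reciprocal. The function name `latency_metrics` is hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT (ms), mean ITL (ms), and decode TPS from timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0
    decode_time = token_times[-1] - token_times[0]
    tps = len(gaps) / decode_time if decode_time > 0 else 0.0
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms, "tps": tps}


# First token after 0.5 s, then one token every 0.25 s.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

If TPS is instead computed over the whole request (including TTFT), short generations will report noticeably lower throughput, so the definition should be stated alongside any published number.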
What this does NOT measure
- Tool calling correctness
- Structured JSON / schema output validity
- Hallucination resistance
- Prompt injection resistance
- Long-context retrieval (NIAH)
PPL is a lightweight quality proxy. It is not a safety or capability evaluation.
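For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of natural-log token probabilities; how sigilant-sweep obtains those values from each engine is not shown here, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_logprobs):
    """PPL = exp(-mean(log p)) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has PPL of about 4:
# it is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```

Because PPL depends on the tokenizer and the corpus, values are only comparable between configs of the same model on the same text.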
Prompt corpus note:
- Prompt and corpus files in `prompts/` are evaluation assets for this harness.
- They are for relative config comparison, not a standard external evaluation set.
Verification and reproducibility
- Keep raw artifacts with reported tables (`sigilant_results.json`, `sigilant_terminal.txt`).
- Re-run top candidates with `--only-config` before final selection:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 3 \
--only-config "Q4_K_M,16384,k16v16,long"
- Separate infra/control-plane failures from model/runtime failures.
- Treat PPL as a ranking proxy within comparable runs.
PPL corpus note:
- The default PPL corpus is lightweight and coarse.
- Close winners may need higher trials and/or a larger domain-specific corpus.
Boundary:
- OSS `sigilant-sweep`: config ranking, runtime metrics, and lightweight smoke triage.
- For broader capability/safety validation on production workloads, use Sigilant Optimizer.
Score profiles
- `balanced`: 40% TPS + 20% TTFT + 40% PPL
- `latency`: 50% TPS + 30% TTFT + 20% PPL
- `quality`: 30% TPS + 20% TTFT + 50% PPL
If PPL is unavailable, TPS/TTFT weights are renormalized automatically.
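The weighting and renormalization can be sketched as below. The weights come from the profiles listed here; everything else is an assumption. In particular, the sketch takes per-metric values that have already been mapped to [0, 1] with higher meaning better (how the tool normalizes TPS, TTFT, and PPL is not described in this document), and the names `effective_weights` and `composite` and the 0 to 100 scale are hypothetical.

```python
PROFILES = {
    "balanced": {"tps": 0.40, "ttft": 0.20, "ppl": 0.40},
    "latency": {"tps": 0.50, "ttft": 0.30, "ppl": 0.20},
    "quality": {"tps": 0.30, "ttft": 0.20, "ppl": 0.50},
}


def effective_weights(profile, ppl_available=True):
    """Drop the PPL weight when PPL is missing and rescale the rest to sum to 1."""
    weights = dict(PROFILES[profile])
    if not ppl_available:
        weights.pop("ppl")
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def composite(normalized, profile, scale=100.0):
    """Weighted sum of normalized metrics; normalized['ppl'] may be None."""
    ppl_available = normalized.get("ppl") is not None
    weights = effective_weights(profile, ppl_available)
    return scale * sum(weights[name] * normalized[name] for name in weights)


# Without PPL, balanced becomes 2/3 TPS + 1/3 TTFT.
print(effective_weights("balanced", ppl_available=False))
print(composite({"tps": 0.9, "ttft": 0.8, "ppl": 0.7}, "quality"))
```

Renormalizing keeps scores on the same scale whether or not PPL was measured, but scores from runs with and without PPL still should not be ranked against each other.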
License
Apache 2.0. See LICENSE.