Open-source inference sweep for llama.cpp and vLLM: TPS, TTFT, ITL, and PPL across 16 configs.
sigilant-sweep
Benchmark orchestration for inference stacks (llama.cpp, vLLM): TPS, TTFT, ITL, PPL proxy, and artifacted comparisons.
Scope • Install • First-time success • Metrics • Reproducibility
Scope
sigilant-sweep is orchestration and reporting around existing inference engines.
It handles:
- config generation
- benchmark execution via adapters (`llama.cpp`, `vllm`)
- metric parsing (TPS, TTFT, ITL, PPL proxy)
- scoring and artifact export
It is not a new inference runtime.
Non-goals
- custom kernels or scheduler innovation
- replacing engine internals (`llama.cpp`, `vllm`)
- claiming production safety certification from throughput benchmarks
Install
# Base (lightweight CLI + reporting)
pip install sigilant-sweep
# Hugging Face integration only
pip install 'sigilant-sweep[hf]'
# With llama.cpp
pip install 'sigilant-sweep[llama]'
# With llama.cpp + CUDA acceleration
CMAKE_ARGS="-DGGML_CUDA=on" pip install 'sigilant-sweep[llama]'
# With vLLM (Linux + CUDA only)
pip install 'sigilant-sweep[vllm]'
# With Modal cloud backend
pip install 'sigilant-sweep[modal]'
# With RunPod cloud backend
pip install 'sigilant-sweep[runpod]'
# Everything
pip install 'sigilant-sweep[all]'
First-time success guide
Golden path: Modal (recommended)
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[modal]"
modal token new
sigilant-sweep info
Run a cheap sanity test:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 1 \
--trials 1
Golden path: Local llama.cpp
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install sigilant-sweep
Requirements:
- `llama-cli` must be installed and discoverable on `PATH`, or set `SIGILANT_LLAMA_CLI=/abs/path/to/llama-cli`.
- The local backend is compute-dependent; on CPU-only machines it will be slow.
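The binary lookup described above can be sketched as follows. This is a minimal illustration, not the package's actual resolver: the function name `resolve_llama_cli` and the precedence (environment override first, then `PATH`) are assumptions.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_llama_cli(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a usable llama-cli path, or None if none is found.

    Assumed lookup order (not confirmed against the sigilant-sweep source):
      1. SIGILANT_LLAMA_CLI, if set, must point at an executable file
      2. otherwise the first `llama-cli` found on PATH
    """
    env = os.environ if env is None else env
    override = env.get("SIGILANT_LLAMA_CLI")
    if override:
        # An explicit override that is wrong should fail loudly upstream,
        # so do not silently fall back to PATH here.
        if os.path.isfile(override) and os.access(override, os.X_OK):
            return override
        return None
    return shutil.which("llama-cli", path=env.get("PATH"))
```

Running this check before a local sweep gives a clear "binary not found" result up front instead of a failure partway through the first trial.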
Compatibility matrix (current recommendation)
| Scenario | Recommended install | Notes |
|---|---|---|
| Any OS, Modal-only | `pip install "sigilant-sweep[modal]"` | Best first-run success path |
| Any OS, HF-only | `pip install "sigilant-sweep[hf]"` | For model listing/download integration |
| Local llama.cpp | `pip install sigilant-sweep` | Requires external `llama-cli` binary |
| Local vLLM | `pip install "sigilant-sweep[vllm]"` | Linux + CUDA only |
Known install issue (Intel macOS + Modal extras)
If you see Failed building wheel for cbor2:
pip uninstall -y modal cbor2
pip install --only-binary=:all: "cbor2==5.6.5"
pip install "sigilant-sweep[modal]"
Then verify:
python3 -c "import modal, cbor2; print('modal', modal.__version__, 'cbor2_ok', hasattr(cbor2, 'dumps'))"
Quick start
# 1. Check hardware and credentials
sigilant-sweep setup
# 2. Show what's detected on this machine
sigilant-sweep info
# 3. Run a sweep (local GPU, llama.cpp)
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3
# 4. Save results to JSON
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --json
Modal run example
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--score-profile balanced \
--agent-smoke
Output:
- ranked configs with score and status
- baseline delta line
- `sigilant_results.json`, `sigilant_summary.md`, `sigilant_frontier.svg`
- optional smoke diagnosis (`model_limited` vs `harness_limited` vs `mixed`)
Stability notes:
- The default is `--trials 12`, chosen for stronger stability out of the box.
- You can override `--trials` manually for faster/cheaper or deeper runs.
- Artifacts include confidence inputs: top-2 gap and variance proxy.
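The two confidence inputs are not defined further here. One plausible reading, sketched below under stated assumptions, is that the top-2 gap is the difference between the mean scores of the two best configs, and the variance proxy is the coefficient of variation of the winner's per-trial scores. The function name and the returned keys are hypothetical and do not reflect the artifact schema.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def confidence_inputs(trial_scores: Dict[str, List[float]]) -> Dict[str, Optional[object]]:
    """Compute illustrative confidence inputs from per-trial composite scores.

    trial_scores maps a config label to its per-trial scores (higher is better).
    Both definitions below are assumptions made for this sketch.
    """
    ranked = sorted(trial_scores.items(), key=lambda kv: mean(kv[1]), reverse=True)
    best_label, best_trials = ranked[0]
    best_mean = mean(best_trials)
    # Gap between the best and second-best mean score; None with a single config.
    top2_gap = best_mean - mean(ranked[1][1]) if len(ranked) > 1 else None
    # Spread of the winner's trials relative to its mean (coefficient of variation).
    variance_proxy = pstdev(best_trials) / best_mean if best_mean else None
    return {"best": best_label, "top2_gap": top2_gap, "variance_proxy": variance_proxy}
```

A small gap combined with a large spread suggests the ranking between the top two configs could flip on a re-run, which is the case where more trials help most.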
Common run patterns
Single config only:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 1 \
--only-config "Q4_K_M,8192,k16v16,default"
Depth profile:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 5 \
--benchmark-mode depth_profile \
--depth-prompt-8k prompts/hard_quality_8k_prompt.txt \
--depth-prompt-14k prompts/hard_quality_14k_prompt.txt \
--depth-prompt-28k prompts/hard_quality_28k_prompt.txt
Run with smoke check:
sigilant-sweep run \
--model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
--backend modal \
--engine llama.cpp \
--hardware l4 \
--configs 16 \
--trials 5 \
--agent-smoke
Execution model
- CLI resolves model files, builds the config grid, dispatches to backend, and scores results.
- llama.cpp path runs timed generation and perplexity per config/trial, then aggregates (`p50`, `p95`, `mean PPL`).
- Multi-trial runs are rotated trial-first to avoid running all trials of one config back-to-back.
- Artifacts are written under `artifacts/runs/<run_id>/`.
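The trial-first ordering can be illustrated with a short scheduler sketch. The outer loop runs over trials and the inner loop over configs, so each config's trials are spread across the run. Shifting the starting config on each trial is an assumption about what "rotated" means here, and `trial_first_schedule` is a hypothetical name.

```python
from typing import Iterator, List, Tuple


def trial_first_schedule(configs: List[str], trials: int) -> Iterator[Tuple[int, str]]:
    """Yield (trial_index, config) pairs in trial-first order.

    Every config runs once per trial before any config runs again.
    The per-trial rotation of the starting config is an assumption of this sketch.
    """
    if not configs:
        return
    for trial in range(trials):
        offset = trial % len(configs)
        # Rotate so the same config does not always run first (e.g. on a cold GPU).
        for cfg in configs[offset:] + configs[:offset]:
            yield trial, cfg
```

For example, three configs and two trials give the order `a, b, c` for trial 0 and `b, c, a` for trial 1. Interleaving this way means slow drift in the environment (thermal throttling, noisy neighbours on shared cloud hardware) affects all configs roughly equally rather than biasing one.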
Troubleshooting
- `Model resolution failed: huggingface-hub is required`: run `pip install "sigilant-sweep[hf]"` or `pip install "sigilant-sweep[modal]"`.
- `Error: modal is not installed`: run `pip install "sigilant-sweep[modal]"`.
- `Version ... of modal is deprecated`: upgrade modal in the venv with `pip install -U modal`.
- `Failed building wheel for cbor2` (Intel macOS path): run `pip uninstall -y modal cbor2 && pip install --only-binary=:all: "cbor2==5.6.5" && pip install "sigilant-sweep[modal]"`.
- vLLM local failures on macOS/Windows: expected; use the Modal backend for vLLM.
Hardware options
| Flag | Where it runs |
|---|---|
| `--backend local` | Your machine (default) |
| `--backend modal` | Modal cloud (your account) |
| `--backend runpod` | RunPod cloud (your account) |

| `--hardware` value | GPU | VRAM |
|---|---|---|
| `auto` | auto-detect | n/a |
| `a10g` | NVIDIA A10G | 24 GB |
| `a100` | NVIDIA A100 | 40 GB |
| `h100` | NVIDIA H100 | 80 GB |
| `l4` | NVIDIA L4 | 24 GB |
| `t4` | NVIDIA T4 | 16 GB |
| `rtx4090` | RTX 4090 | 24 GB |
| `rtx3090` | RTX 3090 | 24 GB |
| `rtxa6000` | RTX A6000 | 48 GB |
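The VRAM figures above allow a rough fit check before launching a sweep. The sketch below is a back-of-envelope estimate, not the tool's own VRAM logic: it counts model weights only, takes bits per weight as an input, and reserves a hypothetical 20% of VRAM for KV cache and runtime overhead. The function names and the `headroom` parameter are assumptions.

```python
# VRAM per --hardware value, copied from the table above (GB).
VRAM_GB = {
    "a10g": 24, "a100": 40, "h100": 80, "l4": 24, "t4": 16,
    "rtx4090": 24, "rtx3090": 24, "rtxa6000": 48,
}


def weight_footprint_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params * bits / 8 bits per byte."""
    return params_b * bits_per_weight / 8.0


def fits(hardware: str, params_b: float, bits_per_weight: float, headroom: float = 0.8) -> bool:
    """True if weights fit in the assumed usable fraction of VRAM.

    headroom=0.8 is an illustrative guess; real usage depends on context
    length, KV cache type, and engine overhead.
    """
    return weight_footprint_gb(params_b, bits_per_weight) <= VRAM_GB[hardware] * headroom
```

Under these assumptions a 7B model at 16 bits per weight needs about 14 GB for weights, which exceeds the usable budget on a `t4` but fits on an `l4`.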
Engine options
| Flag | Supported Backends | Notes |
|---|---|---|
| `--engine llama.cpp` | local, modal, runpod | GGUF-based flow |
| `--engine vllm` | local, modal | Linux + CUDA required |
Full CLI reference
sigilant-sweep run [OPTIONS]
--model -m HuggingFace repo ID or local .gguf path [required]
--backend -b local | modal | runpod [default: local]
--engine -e llama.cpp | vllm [default: llama.cpp]
--hardware GPU target (see table above) [default: auto]
--params-b Model size in billions (for VRAM estimate) [default: 7.0]
--configs Max number of configs to sweep [default: 16]
--confidence-target low | medium | high [default: medium] (reporting only)
--score-profile balanced | latency | quality [default: balanced]
--trials Trials per config [default: 12]
--json Also write results to sigilant_results.json
sigilant-sweep setup Check credentials for all backends (interactive)
sigilant-sweep info Show detected hardware and installed engines
sigilant-sweep --version
Cloud backend setup
Modal
pip install 'sigilant-sweep[modal]'
modal token new # saves credentials to ~/.modal.toml
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --backend modal --hardware a10g
RunPod
pip install 'sigilant-sweep[runpod]'
export RUNPOD_API_KEY=<your-key>
export SIGILANT_RUNPOD_ENDPOINT_ID=<your-predeployed-endpoint-id>
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --backend runpod --engine llama.cpp --hardware rtx4090
What this measures
| Metric | Description |
|---|---|
| TPS | Output tokens per second |
| TTFT | Time to first token (ms) |
| ITL | Inter-token latency (ms) |
| PPL | Perplexity on a fixed corpus, used as a lightweight quality proxy |
| Score | Sigilant composite (preset-based): balanced/latency/quality profiles |
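To make the three timing metrics concrete, the sketch below derives them from token arrival timestamps for a single request. It is illustrative only: the function name is hypothetical, and measuring TPS end-to-end (including the time to first token) is an assumption, since engines may instead report a decode-only rate.

```python
from statistics import mean
from typing import Dict, List


def latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, ITL, and TPS from per-token arrival times (seconds).

    token_times must be non-empty and monotonically increasing.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    # TTFT: delay between sending the request and receiving the first token.
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # ITL: mean gap between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl_ms = mean(gaps) * 1000.0 if gaps else 0.0
    # TPS: output tokens over total wall time (assumed end-to-end definition).
    total_s = token_times[-1] - request_start
    tps = len(token_times) / total_s if total_s > 0 else float("inf")
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms, "tps": tps}
```

With a request sent at `t=0.0` and tokens arriving at `0.5`, `0.75`, and `1.0` seconds, this gives a TTFT of 500 ms, an ITL of 250 ms, and 3 tokens per second.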
What this does NOT measure
- Tool calling correctness
- Structured JSON / schema output validity
- Hallucination resistance
- Prompt injection resistance
- Long-context retrieval (NIAH)
PPL catches gross quantization degradation. It does not validate production agent safety.
Prompt corpus note:
- Prompt and corpus files in `prompts/` are benchmark assets maintained for this harness.
- They are intended for relative configuration comparison, not as a standardized external evaluation set.
Verification and reproducibility
- Keep raw artifacts with reported tables (`sigilant_results.json`, `sigilant_terminal.txt`).
- Re-run top candidates with `--only-config` before final selection.
- Separate infra/control-plane failures from model/runtime failures.
- Treat PPL as a ranking proxy within comparable runs.
vLLM status
- Implemented:
  - local vLLM sweep
  - Modal vLLM sweep (HF model localized at run start and reused through the sweep)
- Not implemented yet:
  - RunPod vLLM backend
  - vLLM structured-output smoke
PPL corpus quality note:
- Current PPL corpus is intentionally lightweight and should be treated as a coarse proxy.
- For close winners, a small/synthetic corpus can under-separate configs.
- Use higher trials for stability, and treat PPL as directional unless you swap in a larger, domain-representative corpus.
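For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses that standard definition; how the harness obtains token log-probabilities from each engine is not shown here, and the helper name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.25 to every token has a perplexity of 4. Because the mean is taken over the corpus tokens, a short corpus leaves few samples behind each estimate, which is why small PPL differences between close configs should be read as directional.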
Boundary:
- OSS `sigilant-sweep`: config ranking, runtime metrics, and lightweight smoke triage.
- For broader capability/safety validation on production workloads, use Sigilant Optimizer.
Score profiles
- `balanced`: 40% TPS + 20% TTFT + 40% PPL
- `latency`: 50% TPS + 30% TTFT + 20% PPL
- `quality`: 30% TPS + 20% TTFT + 50% PPL
If PPL is unavailable, TPS/TTFT weights are renormalized automatically.
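The weighting and renormalization can be sketched as below. The weights come from the profiles above; everything else is an assumption for illustration: each component is taken to be already normalized to `[0, 1]` with higher meaning better (so TTFT and PPL are inverted upstream), and `composite_score` is a hypothetical name, not the package's API.

```python
from typing import Dict, Optional

# Weights per score profile, as listed above.
PROFILES = {
    "balanced": {"tps": 0.40, "ttft": 0.20, "ppl": 0.40},
    "latency": {"tps": 0.50, "ttft": 0.30, "ppl": 0.20},
    "quality": {"tps": 0.30, "ttft": 0.20, "ppl": 0.50},
}


def composite_score(components: Dict[str, Optional[float]], profile: str = "balanced") -> float:
    """Weighted sum of normalized components, renormalizing over available metrics.

    components maps "tps"/"ttft"/"ppl" to a value in [0, 1] (higher is better),
    or None / missing when the metric is unavailable.
    """
    weights = PROFILES[profile]
    available = {m: w for m, w in weights.items() if components.get(m) is not None}
    total = sum(available.values())
    if total == 0:
        raise ValueError("no scoreable metrics available")
    # Dividing by the sum of remaining weights keeps the score on the same 0..1 scale.
    return sum(components[m] * w / total for m, w in available.items())
```

With the `balanced` profile and PPL missing, the remaining weights 0.40 and 0.20 renormalize to 2/3 for TPS and 1/3 for TTFT, so scores with and without PPL stay on the same scale but are not directly comparable in meaning.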
License
Apache 2.0. See LICENSE.