
Open-source inference sweep for llama.cpp and vLLM: TPS, TTFT, ITL, and PPL across 16 configs.

sigilant-sweep

Evaluation orchestration for inference stacks (llama.cpp, vLLM) with Local and Modal backends: TPS, TTFT, ITL, PPL proxy, and artifacted comparisons.


Scope

sigilant-sweep orchestrates config sweeps and reporting on top of existing inference engines.

It provides:

  • config generation
  • execution via adapters (llama.cpp, vllm)
  • metric parsing (TPS, TTFT, ITL, PPL proxy)
  • scoring and artifact export

It is not a new inference runtime.

Why use this instead of running one-off engine commands

  • Runs a full config grid (quant × context × KV) with consistent run settings.
  • Uses trial-first rotated execution to reduce ordering bias across configs.
  • Ranks configs on a composite score (TPS, TTFT, PPL proxy), not a single metric.
  • Supports depth passes (8k/14k/28k prompts) for context-window behavior checks.
  • Adds a structured-output smoke gate for quick post-ranking sanity checks.
  • Exports reproducible artifacts (json, md, svg, terminal log) for review and sharing.
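
As a rough illustration of the quant × context × KV grid, here is a minimal Python sketch. The axis values are taken from labels that appear in the example output in this README; they are assumptions for illustration, and build_grid is a hypothetical helper, not part of the sigilant-sweep API.

```python
from itertools import product

# Hypothetical axis values (borrowed from labels in the example output);
# the real tool may sweep different values.
QUANTS = ["Q4_K_M", "Q5_K_M", "Q8_0"]
CONTEXTS = [8192, 16384]
KV_TYPES = ["k16v16"]


def build_grid(quants, contexts, kv_types, max_configs=16):
    """Return up to max_configs (quant, ctx, kv) tuples in product order."""
    grid = list(product(quants, contexts, kv_types))
    return grid[:max_configs]


grid = build_grid(QUANTS, CONTEXTS, KV_TYPES)
for quant, ctx, kv in grid:
    print(f"{quant} · ctx:{ctx} · kv:{kv}")
```

The max_configs cap mirrors the role of --configs; how the real tool chooses which configs to keep when the grid is larger than the cap is not documented here.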

Not in scope

  • custom kernels or scheduler innovation
  • replacing engine internals (llama.cpp, vllm)
  • claiming production safety certification from throughput measurements

Install

# Refresh installer tooling first (recommended)
python3 -m pip install -U pip

# Base (lightweight CLI + reporting)
pip install sigilant-sweep

# Hugging Face integration only
pip install 'sigilant-sweep[hf]'

# llama-cpp-python fallback only — not needed if llama-cli is on PATH
pip install 'sigilant-sweep[llama]'

# With llama-cpp-python fallback + CUDA acceleration
CMAKE_ARGS="-DGGML_CUDA=on" pip install 'sigilant-sweep[llama]'

# With vLLM (Linux + CUDA only)
pip install 'sigilant-sweep[vllm]'

# With Modal cloud backend
pip install 'sigilant-sweep[modal]'

# Everything
pip install 'sigilant-sweep[all]'

If your pip config points to a private/stale mirror, force official PyPI:

pip install --index-url https://pypi.org/simple sigilant-sweep

CLI path sanity check (recommended in every fresh venv):

hash -r
which sigilant-sweep
sigilant-sweep --version

If that points outside your active venv, use the explicit binary:

$VIRTUAL_ENV/bin/sigilant-sweep --version
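
The same check can be scripted. The sketch below uses only the Python standard library; is_inside is a hypothetical helper name, and the script assumes it is run with the interpreter of the environment you want to check.

```python
import os
import shutil
import sys


def is_inside(path, prefix):
    """True if path lives under the directory prefix."""
    path = os.path.realpath(path)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


cli = shutil.which("sigilant-sweep")
if cli is None:
    print("sigilant-sweep not found on PATH")
elif is_inside(cli, sys.prefix):
    print(f"OK: {cli} belongs to the active environment")
else:
    print(f"Warning: {cli} is outside {sys.prefix}")
```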

Run paths

Use one of these four paths:

1) Local + llama.cpp

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install sigilant-sweep

Local llama.cpp execution uses the llama-cli binary by default.

If llama-cli is not on PATH, set it explicitly:

export SIGILANT_LLAMA_CLI=/abs/path/to/llama-cli

If you do not have a llama-cli binary, install the Python fallback (llama-cpp-python):

pip install "sigilant-sweep[llama]"
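
The lookup order described above can be sketched in a few lines of Python. This is an illustration only: resolve_llama_cli is a hypothetical helper, and giving SIGILANT_LLAMA_CLI precedence over PATH is an assumption, not confirmed behavior of the tool.

```python
import os
import shutil


def resolve_llama_cli(env=None):
    """Return a path to llama-cli, or None if the Python fallback is needed."""
    env = os.environ if env is None else env
    explicit = env.get("SIGILANT_LLAMA_CLI")
    if explicit:
        # Assumption: an explicit override wins over whatever is on PATH.
        return explicit
    return shutil.which("llama-cli", path=env.get("PATH"))


binary = resolve_llama_cli()
print(binary or "llama-cli not found; install sigilant-sweep[llama]")
```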

Sanity run:

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend local \
  --engine llama.cpp \
  --configs 1 \
  --trials 1

2) Local + vLLM (Linux + CUDA)

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[vllm]"

Set family repo IDs (required for full-family runs):

export SIGILANT_VLLM_FP16_BASELINE_REPO="microsoft/Phi-3.5-mini-instruct"
export SIGILANT_VLLM_INT8_W8A8_REPO="anhbn/Phi-3.5-mini-instruct-quantized.w8a8"
export SIGILANT_VLLM_AWQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-awq"
export SIGILANT_VLLM_GPTQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-GPTQ-4bit"
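
Before a full-family run, it can help to confirm that all four variables are set in the current shell. A minimal Python sketch follows; missing_family_vars is a hypothetical helper, and only the variable names come from this README.

```python
import os

FAMILY_VARS = [
    "SIGILANT_VLLM_FP16_BASELINE_REPO",
    "SIGILANT_VLLM_INT8_W8A8_REPO",
    "SIGILANT_VLLM_AWQ4_MARLIN_REPO",
    "SIGILANT_VLLM_GPTQ4_MARLIN_REPO",
]


def missing_family_vars(env=None):
    """Return the family repo variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in FAMILY_VARS if not env.get(name)]


missing = missing_family_vars()
if missing:
    print("Unset family repo variables:", ", ".join(missing))
else:
    print("All four family repo variables are set.")
```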

Sanity run:

sigilant-sweep run \
  --model microsoft/Phi-3.5-mini-instruct \
  --backend local \
  --engine vllm \
  --configs 1 \
  --trials 1

3) Modal + llama.cpp

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[modal]"
modal token new
sigilant-sweep info

Sanity run:

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --configs 1 \
  --trials 1

4) Modal + vLLM

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip setuptools wheel
pip install "sigilant-sweep[modal]"
modal token new

Set family repo IDs (required for full-family runs):

unset SIGILANT_VLLM_FAMILY_REPOS
export SIGILANT_VLLM_FP16_BASELINE_REPO="microsoft/Phi-3.5-mini-instruct"
export SIGILANT_VLLM_INT8_W8A8_REPO="anhbn/Phi-3.5-mini-instruct-quantized.w8a8"
export SIGILANT_VLLM_AWQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-awq"
export SIGILANT_VLLM_GPTQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-GPTQ-4bit"

Sanity run:

sigilant-sweep run \
  --model microsoft/Phi-3.5-mini-instruct \
  --backend modal \
  --engine vllm \
  --hardware l4 \
  --configs 1 \
  --trials 1

Intel macOS note (Modal extras)

If you see Failed building wheel for cbor2:

pip uninstall -y modal cbor2
pip install --only-binary=:all: "cbor2==5.6.5"
pip install "sigilant-sweep[modal]"

Then verify:

python3 -c "import modal, cbor2; print('modal', modal.__version__, 'cbor2_ok', hasattr(cbor2, 'dumps'))"

Quick start

# 1. Check hardware and credentials
sigilant-sweep setup

# 2. Show what's detected on this machine
sigilant-sweep info

# 3. Run a sweep (local GPU, llama.cpp)
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3

# 4. Save results to JSON
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --json

Example: Modal run (llama.cpp)

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --score-profile balanced \
  --agent-smoke

Expected output:

  • ranked config table
  • recommended config + baseline delta
  • artifacts: sigilant_results.json, sigilant_summary.md, sigilant_frontier.svg, sigilant_terminal.txt

Example output (truncated):

Config                                           TPS     TTFT      ITL     PPL   Score
──────────────────────────────────────────────────────────────────────────────────────
Q4_K_M · ctx:16384 · kv:k16v16 · long  ← best   74.1   1728ms   13.5ms   14.32     97
Q4_K_M · ctx:8192 · kv:k16v16 · default         74.0   1729ms   13.5ms   14.32     97
Q5_K_M · ctx:8192 · kv:k16v16 · default         71.4   1792ms   14.0ms   13.61     97

Best config:  Q4_K_M · ctx:16384 · kv:k16v16 · long
Auto baseline compare (auto:max_precision(Q8_0)): score Δ=+6.00  TPS Δ=+8.20  TTFT Δ=-233.9ms  PPL Δ=+0.19
Artifacts: artifacts/runs/20260524_171722/sigilant_results.json,
          artifacts/runs/20260524_171722/sigilant_summary.md,
          artifacts/runs/20260524_171722/sigilant_frontier.svg,
          artifacts/runs/20260524_171722/sigilant_terminal.txt

Example artifacts bundle:

artifacts/runs/20260524_171722/
  ├── sigilant_results.json
  ├── sigilant_summary.md
  ├── sigilant_frontier.svg
  └── sigilant_terminal.txt
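
To confirm that a run produced the full bundle, a short Python sketch such as the following can locate the newest run directory and list any missing files. It assumes run IDs are timestamp-like names (as in the example above), so that sorting by name matches chronological order; latest_run and missing_artifacts are hypothetical helpers.

```python
from pathlib import Path

EXPECTED = [
    "sigilant_results.json",
    "sigilant_summary.md",
    "sigilant_frontier.svg",
    "sigilant_terminal.txt",
]


def latest_run(root="artifacts/runs"):
    """Return the newest run directory by name, or None if there is none."""
    root = Path(root)
    if not root.is_dir():
        return None
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    return [name for name in EXPECTED if not (Path(run_dir) / name).is_file()]


run = latest_run()
if run is None:
    print("No runs found under artifacts/runs/")
else:
    print(run, "missing:", missing_artifacts(run) or "nothing")
```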

Live run examples

Full vLLM sweep example (Modal, L4):

[Screenshot: vLLM full sweep terminal output]

Depth profile example (8k/14k/28k passes):

[Screenshot: vLLM depth profile terminal output]

Notes:

  • The captures above are from real runs of this repository.
  • Results vary by model, prompt set, hardware, and backend.

Run notes:

  • Default --trials is 12.
  • Lower --trials for faster/cheaper sweeps; increase for stability.
  • Artifacts include confidence inputs (for example top-2 gap).
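
As an illustration of a top-2 gap, the Python sketch below computes the difference between the best and second-best composite scores. The exact definition used in the artifacts is not documented here, so treat top2_gap as a hypothetical helper under that assumption.

```python
def top2_gap(scores):
    """Gap between the best and second-best score; None with fewer than two."""
    if len(scores) < 2:
        return None
    best, second = sorted(scores, reverse=True)[:2]
    return best - second


# Scores from the truncated example table in this README (97, 97, 97).
print(top2_gap([97, 97, 97]))  # 0
```

A gap of zero, as in the example table, is the kind of case where raising --trials is worth considering before picking a winner.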

Common run patterns

llama.cpp: single config

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --configs 16 \
  --trials 1 \
  --only-config "Q4_K_M,8192,k16v16,default"
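
The --only-config value follows the QUANT,CTX,KV,REGIME shape listed in the CLI reference. A minimal Python sketch of splitting such a string is shown here; ConfigSpec and parse_only_config are hypothetical names, and the real CLI may validate values differently.

```python
from typing import NamedTuple


class ConfigSpec(NamedTuple):
    quant: str
    ctx: int
    kv: str
    regime: str


def parse_only_config(value):
    """Split a 'QUANT,CTX,KV,REGIME' string into its four fields."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected QUANT,CTX,KV,REGIME, got {value!r}")
    quant, ctx, kv, regime = parts
    return ConfigSpec(quant, int(ctx), kv, regime)


spec = parse_only_config("Q4_K_M,8192,k16v16,default")
print(spec)
```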

llama.cpp depth profile

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --configs 16 \
  --trials 5 \
  --evaluation-mode depth_profile \
  --depth-prompt-8k prompts/hard_quality_8k_prompt.txt \
  --depth-prompt-14k prompts/hard_quality_14k_prompt.txt \
  --depth-prompt-28k prompts/hard_quality_28k_prompt.txt

llama.cpp with structured-output smoke

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --configs 16 \
  --trials 5 \
  --agent-smoke

vLLM: full-family sweep (Modal)

unset SIGILANT_VLLM_FAMILY_REPOS
export SIGILANT_VLLM_FP16_BASELINE_REPO="microsoft/Phi-3.5-mini-instruct"
export SIGILANT_VLLM_INT8_W8A8_REPO="anhbn/Phi-3.5-mini-instruct-quantized.w8a8"
export SIGILANT_VLLM_AWQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-awq"
export SIGILANT_VLLM_GPTQ4_MARLIN_REPO="thesven/Phi-3.5-mini-instruct-GPTQ-4bit"

sigilant-sweep run \
  --model microsoft/Phi-3.5-mini-instruct \
  --backend modal \
  --engine vllm \
  --hardware l4 \
  --configs 16 \
  --trials 1

Execution model

  • CLI resolves model files, builds the config grid, dispatches to backend, and scores results.
  • llama.cpp path runs timed generation and perplexity per config/trial, then aggregates (p50, p95, mean PPL).
  • Multi-trial runs are rotated trial-first to avoid running all trials of one config back-to-back.
  • Artifacts are written under artifacts/runs/<run_id>/.
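
One way to picture trial-first rotated execution is the Python sketch below: each trial makes one full pass over all configs, and each pass starts one position later in the list. The rotation rule is an assumption about what "rotated" means here, and trial_first_schedule is a hypothetical helper, not the tool's scheduler.

```python
def trial_first_schedule(configs, trials):
    """Yield (trial, config) pairs, one full pass over configs per trial.

    Each trial starts one position later in the config list, so no config
    always runs first (assumed meaning of "rotated").
    """
    n = len(configs)
    for trial in range(trials):
        offset = trial % n if n else 0
        for config in configs[offset:] + configs[:offset]:
            yield trial, config


for trial, config in trial_first_schedule(["A", "B", "C"], trials=2):
    print(trial, config)
```

With three configs and two trials, the order is A, B, C for trial 0 and B, C, A for trial 1, so warm-up or thermal effects are not always charged to the same config.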

Troubleshooting

  • "Model resolution failed: huggingface-hub is required": run pip install "sigilant-sweep[hf]" or pip install "sigilant-sweep[modal]".

  • "Error: modal is not installed": run pip install "sigilant-sweep[modal]".

  • "Version ... of modal is deprecated": upgrade modal inside the venv with pip install -U modal.

  • "Failed building wheel for cbor2" (Intel macOS): run pip uninstall -y modal cbor2 && pip install --only-binary=:all: "cbor2==5.6.5" && pip install "sigilant-sweep[modal]".

  • vLLM local failures on macOS/Windows: expected, because local vLLM needs Linux + CUDA; run vLLM through Modal instead.

Hardware options

Backend location:

Flag              Where it runs
--backend local   Your machine (default)
--backend modal   Modal cloud (your account)

GPU targets:

--hardware value   GPU           VRAM
auto               auto-detect   n/a
a10g               NVIDIA A10G   24 GB
a100               NVIDIA A100   40 GB
h100               NVIDIA H100   80 GB
l4                 NVIDIA L4     24 GB
t4                 NVIDIA T4     16 GB
rtx4090            RTX 4090      24 GB
rtx3090            RTX 3090      24 GB
rtxa6000           RTX A6000     48 GB

Engine options

Flag                 Supported backends   Notes
--engine llama.cpp   local, modal         GGUF-based flow
--engine vllm        local, modal         Linux + CUDA required for local runs

Full CLI reference

sigilant-sweep run [OPTIONS]

  --model              -m   Hugging Face repo ID or local .gguf path    [required]
  --backend            -b   local | modal                               [default: local]
  --engine             -e   llama.cpp | vllm                            [default: llama.cpp]
  --hardware                GPU target (see table above)                [default: auto]
  --params-b                Model size in billions (for VRAM estimate)  [default: 7.0]
  --configs                 Max number of configs to sweep              [default: 16]
  --confidence-target       low | medium | high                         [default: medium] (reporting only)
  --score-profile           balanced | latency | quality                [default: balanced]
  --evaluation-mode         ranking | depth_profile                     [default: ranking]
  --depth-prompt-8k         Path to 8k prompt file                      [default: prompts/hard_quality_8k_prompt.txt]
  --depth-prompt-14k        Path to 14k prompt file                     [default: prompts/hard_quality_14k_prompt.txt]
  --depth-prompt-28k        Path to 28k prompt file                     [default: prompts/hard_quality_28k_prompt.txt]
  --only-config             QUANT,CTX,KV,REGIME                         [optional]
  --trials                  Trials per config                           [default: 12]
  --json                    Also write results to sigilant_results.json

sigilant-sweep setup    Check credentials for all backends (interactive)
sigilant-sweep info     Show detected hardware and installed engines
sigilant-sweep --version

To check the exact options in your installed version:

sigilant-sweep --help
sigilant-sweep run --help

What this measures

Metric   Description
TPS      Output tokens per second
TTFT     Time to first token (ms)
ITL      Inter-token latency (ms)
PPL      Perplexity on a fixed corpus, used as a lightweight quality proxy
Score    Sigilant composite (preset-based): balanced/latency/quality profiles
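
To make the latency metrics concrete, the Python sketch below derives TTFT, mean ITL, and TPS from per-token arrival timestamps. In the example output earlier in this README, TPS is close to 1000 / ITL (74.1 tokens/s against 13.5 ms), which suggests TPS reflects decode-phase throughput; the sketch follows that reading as an assumption. latency_metrics is a hypothetical helper, not the tool's parser.

```python
def latency_metrics(request_start, token_times):
    """Derive TTFT (ms), mean ITL (ms), and decode TPS from token arrival times.

    request_start and token_times are seconds on the same monotonic clock.
    ITL and TPS are None when fewer than two tokens were produced.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft_ms = (token_times[0] - request_start) * 1000.0
    decode_time = token_times[-1] - token_times[0]
    intervals = len(token_times) - 1
    if intervals < 1 or decode_time <= 0:
        return {"ttft_ms": ttft_ms, "itl_ms": None, "tps": None}
    itl_ms = decode_time / intervals * 1000.0
    # Assumption: TPS counts tokens generated after the first one.
    tps = intervals / decode_time
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms, "tps": tps}


m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(m)
```

Here the first token arrives after 0.5 s and the rest follow 0.1 s apart, so TTFT is about 500 ms, ITL about 100 ms, and TPS about 10.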

What this does NOT measure

  • Tool calling correctness
  • Structured JSON / schema output validity
  • Hallucination resistance
  • Prompt injection resistance
  • Long-context retrieval (NIAH)

PPL is a lightweight quality proxy. It is not a safety or capability evaluation.
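
For reference, perplexity is conventionally the exponential of the negative mean per-token log-likelihood. The Python sketch below shows that standard definition, assuming natural-log probabilities; how sigilant-sweep obtains the log-probabilities from each engine is not shown here.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean(log p)) over the tokens of the evaluation corpus."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has perplexity near 4.
print(perplexity([math.log(0.25)] * 8))
```

Lower is better, and values are only comparable when the corpus and tokenizer are the same across runs.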

Prompt corpus note:

  • Prompt and corpus files in prompts/ are evaluation assets for this harness.
  • They are for relative config comparison, not a standard external evaluation set.

Verification and reproducibility

  • Keep raw artifacts with reported tables (sigilant_results.json, sigilant_terminal.txt).
  • Re-run top candidates with --only-config before final selection:

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --configs 16 \
  --trials 3 \
  --only-config "Q4_K_M,16384,k16v16,long"

  • Separate infra/control-plane failures from model/runtime failures.
  • Treat PPL as a ranking proxy within comparable runs.

PPL corpus note:

  • The default PPL corpus is lightweight and coarse.
  • Close winners may need higher trials and/or a larger domain-specific corpus.

Boundary:

  • OSS sigilant-sweep: config ranking, runtime metrics, and lightweight smoke triage.
  • For broader capability/safety validation on production workloads, use Sigilant Optimizer.

Score profiles

  • balanced: 40% TPS + 20% TTFT + 40% PPL
  • latency: 50% TPS + 30% TTFT + 20% PPL
  • quality: 30% TPS + 20% TTFT + 50% PPL

If PPL is unavailable, TPS/TTFT weights are renormalized automatically.
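
The profile weights and the renormalization step can be sketched in Python as follows. The weights come from the list above; everything else is an assumption for illustration: the inputs are taken to be per-metric values already normalized to [0, 1] with higher meaning better (the normalization itself is not documented here), and renormalization is assumed to be proportional. effective_weights and composite are hypothetical helpers.

```python
PROFILES = {
    "balanced": {"tps": 0.40, "ttft": 0.20, "ppl": 0.40},
    "latency": {"tps": 0.50, "ttft": 0.30, "ppl": 0.20},
    "quality": {"tps": 0.30, "ttft": 0.20, "ppl": 0.50},
}


def effective_weights(profile, have_ppl=True):
    """Return profile weights, rescaled to sum to 1 if PPL is dropped."""
    weights = dict(PROFILES[profile])
    if not have_ppl:
        weights.pop("ppl")
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def composite(normalized, profile="balanced"):
    """Weighted sum (0 to 100) of metrics normalized to [0, 1], higher is better."""
    weights = effective_weights(profile, have_ppl="ppl" in normalized)
    return 100.0 * sum(weights[name] * normalized[name] for name in weights)


print(effective_weights("balanced", have_ppl=False))
print(composite({"tps": 1.0, "ttft": 0.5}, profile="latency"))
```

Under these assumptions, dropping PPL from the balanced profile leaves TPS at two thirds and TTFT at one third of the score.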


License

Apache 2.0. See LICENSE.
