Open-source LLM inference sweep — TPS, TTFT, ITL, PPL across 16 configs on your hardware.

sigilant-sweep

Open-source LLM inference sweep. Measure TPS, TTFT, ITL, and PPL across 16 configurations on your own hardware — local GPU, Modal, or RunPod.

sigilant-sweep · Mistral-7B-Instruct-v0.3 · RTX 4090 24GB · llama.cpp · 16 configs

Config                                      TPS     TTFT    ITL     PPL    Score
──────────────────────────────────────────────────────────────────────────────────
Q5_K_M · ctx:16384 · kv:f16   · b:4        53.3    612ms   19.2ms  8.44   91  ← best
Q5_K_M · ctx:8192  · kv:f16   · b:4        53.1    609ms   19.1ms  8.44   89
Q4_K_M · ctx:16384 · kv:f16   · b:4        56.2    591ms   18.1ms  8.71   87
Q4_K_M · ctx:8192  · kv:f16   · b:4        55.8    594ms   18.3ms  8.71   85
... 12 more configs

Best config:  Q5_K_M · ctx:16384 · kv:f16 · b:4

PPL is a quality proxy, not production validation.

! Agent safety NOT evaluated.
  Structural JSON, tool calling, hallucination resistance,
  and prompt injection are not covered by this sweep.

  → sigilantlabs.com/optimize

Install

# Base (lightweight CLI + reporting)
pip install sigilant-sweep

# Hugging Face integration only
pip install 'sigilant-sweep[hf]'

# With llama.cpp
pip install 'sigilant-sweep[llama]'

# With llama.cpp + CUDA acceleration
CMAKE_ARGS="-DGGML_CUDA=on" pip install 'sigilant-sweep[llama]'

# With vLLM (Linux + CUDA only)
pip install 'sigilant-sweep[vllm]'

# With Modal cloud backend
pip install 'sigilant-sweep[modal]'

# With RunPod cloud backend
pip install 'sigilant-sweep[runpod]'

# Everything
pip install 'sigilant-sweep[all]'

Quick start

# 1. Check hardware and credentials
sigilant-sweep setup

# 2. Show what's detected on this machine
sigilant-sweep info

# 3. Run a sweep (local GPU, llama.cpp)
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3

# 4. Save results to JSON
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --json

Quick wow path (2 minutes)

sigilant-sweep run \
  --model Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  --backend modal \
  --engine llama.cpp \
  --hardware l4 \
  --score-profile balanced \
  --agent-smoke

You get:

- ranked configs with a deterministic winner
- baseline delta line (speed and latency uplift)
- sigilant_results.json, sigilant_summary.md, sigilant_frontier.svg
- smoke diagnosis (model_limited vs harness_limited vs mixed)

Confidence guardrails:

- The default is `--trials 12`, chosen for more stable rankings out of the box.
- Override `--trials` for faster, cheaper runs (lower) or deeper ones (higher).
- Artifacts include the confidence inputs: top-2 gap and a variance proxy.
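
As a rough illustration of what these two confidence inputs could look like, the sketch below computes a top-2 gap from composite scores and a variance proxy from per-trial samples. The function names, the relative form of the gap, and the choice of coefficient of variation as the proxy are all assumptions for illustration; the README does not say how sigilant-sweep defines them.

```python
from statistics import mean, pstdev


def top2_gap(scores):
    """Relative gap between the best and second-best scores (hypothetical definition)."""
    if len(scores) < 2:
        raise ValueError("need at least two configs to compare")
    best, second = sorted(scores, reverse=True)[:2]
    return (best - second) / best if best else 0.0


def variance_proxy(samples):
    """Coefficient of variation (population stdev / mean) of per-trial samples.

    Assumption: the tool's actual variance proxy may be computed differently.
    """
    m = mean(samples)
    return pstdev(samples) / m if m else 0.0
```

With the sample scores shown earlier (91 and 89 for the top two configs), `top2_gap([91, 89, 87, 85])` is about 0.022. A gap that small, combined with a high variance proxy, would suggest the winner is not well separated and more trials may help.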

Hardware options

| Flag | Where it runs |
|---|---|
| `--backend local` | Your machine (default) |
| `--backend modal` | Modal cloud (your account) |
| `--backend runpod` | RunPod cloud (your account) |

| `--hardware` value | GPU | VRAM |
|---|---|---|
| `auto` | auto-detect | - |
| `a10g` | NVIDIA A10G | 24 GB |
| `a100` | NVIDIA A100 | 40 GB |
| `h100` | NVIDIA H100 | 80 GB |
| `l4` | NVIDIA L4 | 24 GB |
| `t4` | NVIDIA T4 | 16 GB |
| `rtx4090` | RTX 4090 | 24 GB |
| `rtx3090` | RTX 3090 | 24 GB |
| `rtxa6000` | RTX A6000 | 48 GB |

Engine options

| Flag | Supported backends | Notes |
|---|---|---|
| `--engine llama.cpp` | `local`, `modal`, `runpod` | GGUF-based flow |
| `--engine vllm` | `local`, `modal` | Linux + CUDA required |

Full CLI reference

sigilant-sweep run [OPTIONS]

  --model      -m    HuggingFace repo ID or local .gguf path   [required]
  --backend    -b    local | modal | runpod                     [default: local]
  --engine     -e    llama.cpp | vllm                           [default: llama.cpp]
  --hardware         GPU target (see table above)               [default: auto]
  --params-b         Model size in billions (for VRAM estimate) [default: 7.0]
  --configs          Max number of configs to sweep             [default: 16]
  --confidence-target  low | medium | high                      [default: medium] (reporting only)
  --score-profile      balanced | latency | quality             [default: balanced]
  --trials             Trials per config                        [default: 12]
  --json             Also write results to sigilant_results.json

sigilant-sweep setup    Check credentials for all backends (interactive)
sigilant-sweep info     Show detected hardware and installed engines
sigilant-sweep --version
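
The `--params-b` option feeds a VRAM estimate. The README does not describe the formula, so the following is only a back-of-the-envelope sketch of how such an estimate might work: weight memory is roughly parameter count times bits per weight. The function names, the `headroom` parameter, and the treatment of the table's GB figures as GiB are assumptions, and the sketch ignores KV cache and activation memory beyond a flat headroom fraction.

```python
def estimate_weight_gib(params_b, bits_per_weight):
    """Approximate memory for model weights alone, in GiB.

    params_b: model size in billions of parameters (as passed to --params-b).
    bits_per_weight: average bits per weight for the chosen quantization.
    """
    total_bytes = params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_b, bits_per_weight, vram_gb, headroom=0.2):
    """True if weights fit while leaving a `headroom` fraction of VRAM free.

    Assumption: headroom stands in for KV cache, activations, and runtime overhead.
    """
    return estimate_weight_gib(params_b, bits_per_weight) <= vram_gb * (1 - headroom)
```

For example, a 7B model at 16 bits per weight needs about 13 GiB for weights, so `fits(7.0, 16, 24)` is true while `fits(7.0, 16, 16)` is false under the default 20% headroom.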

Cloud backend setup

Modal

pip install 'sigilant-sweep[modal]'
modal token new          # saves credentials to ~/.modal.toml
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --backend modal --hardware a10g

RunPod

pip install 'sigilant-sweep[runpod]'
export RUNPOD_API_KEY=<your-key>
sigilant-sweep deploy --backend runpod     # builds + deploys worker image (one-time)
export SIGILANT_RUNPOD_ENDPOINT_ID=<printed-endpoint-id>
sigilant-sweep run --model mistralai/Mistral-7B-Instruct-v0.3 --backend runpod --engine llama.cpp --hardware rtx4090

What this measures

| Metric | Description |
|---|---|
| TPS | Output tokens per second |
| TTFT | Time to first token (ms) |
| ITL | Inter-token latency (ms) |
| PPL | Perplexity on a fixed corpus; a lightweight quality proxy |
| Score | Sigilant composite (preset-based): balanced, latency, or quality profile |
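
To make the timing metrics concrete, here is a minimal sketch that derives TTFT, mean ITL, and TPS from token arrival timestamps. The function name and the choice to count only decode throughput in TPS are assumptions; sigilant-sweep may measure these differently (for example, across batched requests).

```python
def latency_metrics(request_start, token_times):
    """Derive TTFT and mean ITL (ms) plus TPS from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("no tokens were generated")
    # Time to first token: delay between sending the request and the first token.
    ttft_ms = (token_times[0] - request_start) * 1000.0
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0
    # Assumption: TPS is decode throughput, i.e. tokens after the first / decode time.
    decode_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms, "tps": tps}
```

Calling `latency_metrics(0.0, [0.5, 0.52, 0.54, 0.56])` gives a TTFT of about 500 ms, an ITL of about 20 ms, and roughly 50 tokens per second.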

What this does NOT measure

- Tool calling correctness
- Structured JSON / schema output validity
- Hallucination resistance
- Prompt injection resistance
- Long-context retrieval (NIAH)

PPL catches gross quantization degradation. It does not validate production agent safety.

vLLM status

Implemented:
- local vLLM sweep
- Modal vLLM sweep (the HF model is downloaded once at run start and reused throughout the sweep)
Not implemented yet:
- RunPod vLLM backend
- vLLM agent smoke

PPL corpus quality note:

The current PPL corpus is intentionally lightweight and should be treated as a coarse proxy.
When the top configs are close, a small or synthetic corpus may not separate them.
Use more trials for stability, and treat PPL as directional unless you swap in a larger, domain-representative corpus.
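
For reference, perplexity is the exponential of the mean negative log-likelihood per token, which is why a short corpus gives a noisy estimate: the mean is taken over few tokens. The sketch below uses the standard definition; the function name and the assumption that the engine exposes per-token natural-log probabilities are illustrative, not a description of how sigilant-sweep computes PPL.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); lower is better.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.5 to every token has a perplexity of 2. Differences like 8.44 vs 8.71 in the sample output are small on this scale, so they are best read as directional.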

Boundary:

- OSS sigilant-sweep: fast config recommendation and lightweight smoke triage.
- Paid Sigilant Optimizer: full safety/quality gates, long-context reliability, and deployment-grade certification.

Score profiles

- balanced: 40% TPS + 20% TTFT + 40% PPL
- latency: 50% TPS + 30% TTFT + 20% PPL
- quality: 30% TPS + 20% TTFT + 50% PPL

If PPL is unavailable, TPS/TTFT weights are renormalized automatically.
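
The weighting and renormalization described above can be sketched as follows. The weights come from the profile list; everything else is an assumption for illustration: the function name, the 0 to 100 output scale, and the requirement that each metric is already normalized to [0, 1] with higher meaning better (so TTFT and PPL would need to be inverted first). How sigilant-sweep normalizes raw metrics is not documented here.

```python
PROFILES = {
    "balanced": {"tps": 0.40, "ttft": 0.20, "ppl": 0.40},
    "latency": {"tps": 0.50, "ttft": 0.30, "ppl": 0.20},
    "quality": {"tps": 0.30, "ttft": 0.20, "ppl": 0.50},
}


def composite_score(normalized, profile="balanced"):
    """Weighted sum over available metrics, scaled to 0-100.

    `normalized` maps metric name -> value in [0, 1], higher is better
    (hypothetical input format). Metrics that are missing or None are
    dropped and the remaining weights are renormalized to sum to 1.
    """
    weights = PROFILES[profile]
    available = {k: w for k, w in weights.items() if normalized.get(k) is not None}
    total = sum(available.values())
    if total == 0:
        raise ValueError("no metrics available to score")
    return 100.0 * sum(w / total * normalized[k] for k, w in available.items())
```

Under the balanced profile with PPL missing, the TPS and TTFT weights become 0.4/0.6 and 0.2/0.6, so `composite_score({"tps": 1.0, "ttft": 0.5, "ppl": None})` is about 83.3.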

License

Apache 2.0 — see LICENSE.

Release history

- 0.1.14 (May 28, 2026)
- 0.1.13 (May 28, 2026)
- 0.1.12 (May 27, 2026)
- 0.1.9 (May 27, 2026)
- 0.1.8 (May 27, 2026)
- 0.1.7 (May 27, 2026)
- 0.1.6 (May 24, 2026)
- 0.1.5 (May 24, 2026)
- 0.1.4 (May 24, 2026)
- 0.1.3 (May 24, 2026)
- 0.1.2 (May 24, 2026)
- 0.1.1 (May 16, 2026)
- 0.1.0 (May 16, 2026), this version
