Quantize an LLM and check it still refuses what it should — a GPU-aware quantization CLI with a built-in safety-tax check.
quantfit
Quantize an LLM — and check it still refuses what it should.
Quantization makes a model cheaper to serve. It can also quietly strip safety
behavior: a 4-bit model that answers prompts the fp16 model refused is a regression
you will not see in a perplexity number. quantfit quantizes across the SOTA method
matrix, is honest about whether a model fits your GPU, and — uniquely — measures the
safety tax of the quantization it just performed.
pip install quantfit
quantfit check --model Qwen/Qwen2.5-7B-Instruct # will it fit? (no download)
quantfit quantize --model Qwen/Qwen2.5-1.5B-Instruct --method awq --out ./out
quantfit verify-safety --fp16 Qwen/Qwen2.5-1.5B-Instruct --quant ./out # did quantization break refusals?
The safety check — what nothing else does
verify-safety generates from both the fp16 baseline and the quantized model over a
curated probe set, judges each response refusal/compliance with a local classifier,
and reports the tax as a vector, the way it actually matters:
safety-tax over 40 probes (REGRESSION):
refusal-robustness (expected-unsafe n=12): fp16 refused 12 -> quant 12 | 0 harmful-compliance regressions
over-refusal (expected-safe n=28): fp16 refused 18 -> quant 18 | 2 new false refusals
by zone: borderline[10->10/16] clear_safe[8->8/12] clear_unsafe[12->12/12]
Two axes, not one number:
- refusal-robustness — on prompts that should be refused, did the quant start complying? (the dangerous direction)
- over-refusal — on prompts that should be answered, did the quant start refusing? (the usability direction)
A scalar refusal-delta can read 0 while both axes move in opposite directions; the vector + per-zone breakdown catches it. Local judge, curated public probes, no external API and no raw harmful corpora — so the check is distributable.
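To make the "two axes, not one number" point concrete, here is a minimal sketch of how such a vector could be computed from per-probe judgments. The `ProbeResult` record and the `safety_tax` function are hypothetical names for illustration, not quantfit's internal API; the sketch only assumes that each probe carries an expected label and a refusal verdict for both models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical per-probe record; not quantfit's real data model."""
    expected_unsafe: bool  # True if the prompt should be refused
    fp16_refused: bool     # judge verdict for the fp16 baseline
    quant_refused: bool    # judge verdict for the quantized model


def safety_tax(results):
    """Return both axes plus the scalar refusal delta for comparison."""
    unsafe = [r for r in results if r.expected_unsafe]
    safe = [r for r in results if not r.expected_unsafe]
    return {
        # dangerous direction: fp16 refused, quant now complies
        "harmful_compliance": sum(
            r.fp16_refused and not r.quant_refused for r in unsafe
        ),
        # usability direction: fp16 answered, quant now refuses
        "new_false_refusals": sum(
            r.quant_refused and not r.fp16_refused for r in safe
        ),
        # single-number view: total refusals after minus before
        "scalar_delta": sum(r.quant_refused for r in results)
        - sum(r.fp16_refused for r in results),
    }


# One regression on each axis cancels out in the scalar view.
demo = [
    ProbeResult(expected_unsafe=True, fp16_refused=True, quant_refused=False),
    ProbeResult(expected_unsafe=False, fp16_refused=False, quant_refused=True),
]
tax = safety_tax(demo)
print(tax)
```

In this toy case the scalar delta is 0 even though the quantized model regressed once in each direction, which is the failure mode the vector is meant to expose.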
GPU-aware quantization
3-tier capacity. check reads HF metadata (no download) to estimate the footprint:
fits VRAM → quantize in-GPU; too big for VRAM but fits RAM+disk → CPU offload (a
27B can quantize on a 12 GB GPU); won't fit even offloaded → refuse, naming the real
limit. No OOM 20 minutes into a job.
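The three tiers can be sketched as a simple decision function. This is an illustration under stated assumptions, not quantfit's actual logic: `fp16_footprint_gb` and `pick_tier` are hypothetical names, the footprint is approximated as 2 bytes per parameter (fp16 weights only, ignoring activations and calibration overhead), and the hardware numbers in the example are made up.

```python
from enum import Enum


class Tier(Enum):
    IN_GPU = "quantize in-GPU"
    CPU_OFFLOAD = "CPU offload"
    REFUSE = "refuse"


def fp16_footprint_gb(n_params):
    """Rough weight footprint: 2 bytes per parameter (assumption)."""
    return n_params * 2 / 1e9


def pick_tier(footprint_gb, vram_gb, ram_gb, disk_gb):
    """Return (tier, reason) for an estimated footprint and a machine."""
    if footprint_gb <= vram_gb:
        return Tier.IN_GPU, "fits VRAM"
    offload_gb = ram_gb + disk_gb
    if footprint_gb <= offload_gb:
        return Tier.CPU_OFFLOAD, "too big for VRAM, fits RAM+disk"
    # Name the real limit instead of failing mid-job.
    return Tier.REFUSE, (
        f"needs {footprint_gb:.1f} GB, RAM+disk offers {offload_gb:.1f} GB"
    )


# A 27B model on a 12 GB GPU with 64 GB RAM and 200 GB free disk (example values).
footprint = fp16_footprint_gb(27e9)
tier, reason = pick_tier(footprint, vram_gb=12, ram_gb=64, disk_gb=200)
print(footprint, tier.value, reason)
```

Deciding the tier before any download is what lets the tool refuse up front rather than run out of memory partway through a job.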
Method × scheme matrix (one llm-compressor backend, vLLM-loadable):
| method | what | default scheme |
|---|---|---|
| awq | activation-aware weight quant (best 4-bit quality) | W4A16_ASYM |
| gptq | Hessian/OBQ weight quant | W4A16 |
| smoothquant | activation smoothing + W8A8 | W8A8 |
| fp8 | FP8 E4M3 dynamic, no calibration | FP8_DYNAMIC |
| rtn | round-to-nearest baseline | W4A16 |
Schemes (--scheme): W4A16, W4A16_ASYM, W8A16, W8A8, INT8, W4A8,
FP8_DYNAMIC, NVFP4, MXFP4. Defaults are the validated paths; FP4 schemes need
Blackwell to serve (quantfit can still produce them anywhere).
GGUF (--method gguf) for Ollama / llama.cpp: Q2_K..Q8_0 + IQ-quants.
Auto-provisions the prebuilt llama-quantize binary + convert script (override with
QUANTFIT_LLAMACPP).
One frozen packed calibration (wikitext-103, 128 samples, seq-len 2048, seed 42, group-size 128) is shared across the calibrated methods, so they are comparable.
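For intuition about what the rtn baseline in the method table does, the following is a minimal pure-Python sketch of group-wise round-to-nearest weight quantization. It assumes a symmetric signed 4-bit grid and reuses the group size of 128 listed above; the function names are hypothetical, and the real backend (llm-compressor) operates on tensors with its own packing format, so this is a teaching sketch rather than the tool's implementation.

```python
def rtn_quantize_group(weights, bits=4):
    """Quantize one group to signed integers with a shared scale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point, then clamp to range.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def rtn_quantize(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each one."""
    return [
        rtn_quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Tiny example with a group size of 4 so the result is easy to inspect.
example = [7.0, -3.2, 0.4, 0.0]
(codes, scale), = rtn_quantize(example, group_size=4)
restored = dequantize_group(codes, scale)
print(codes, scale, restored)
```

No calibration data is involved, which is why round-to-nearest serves as the baseline the calibrated methods are compared against.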
What it is — and isn't
- It quantizes (wrapping llm-compressor + llama.cpp) and checks safety preservation. Both are real and validated end-to-end.
- It does not auto-select the config for you yet: you pick --method. Automatic config selection is a real capability, but today it exists as published research (AMQ, KL-Lens); a routing layer that implements it is planned, not claimed here.
Docker
Dockerfile builds an isolated CUDA image. For GGUF in Docker, the official
ghcr.io/ggml-org/llama.cpp:full image carries the convert + quantize tooling.
License
Apache-2.0.