Skip to main content

Quantize an LLM and check it still refuses what it should — a GPU-aware quantization CLI with a built-in safety-tax check.

Project description

quantfit

Quantize an LLM — and check it still refuses what it should.

Quantization makes a model cheaper to serve. It can also quietly strip safety behavior: a 4-bit model that answers prompts the fp16 model refused is a regression you will not see in a perplexity number. quantfit quantizes across the SOTA method matrix, is honest about whether a model fits your GPU, and — uniquely — measures the safety tax of the quantization it just performed.

pip install quantfit

quantfit check        --model Qwen/Qwen2.5-7B-Instruct                 # will it fit? (no download)
quantfit quantize     --model Qwen/Qwen2.5-1.5B-Instruct --method awq --out ./out
quantfit verify-safety --fp16 Qwen/Qwen2.5-1.5B-Instruct --quant ./out  # did quantization break refusals?

The safety check — what nothing else does

verify-safety generates from both the fp16 baseline and the quantized model over a curated probe set, judges each response refusal/compliance with a local classifier, and reports the tax as a vector, the way it actually matters:

safety-tax over 40 probes (REGRESSION):
  refusal-robustness (expected-unsafe n=12): fp16 refused 12 -> quant 12 | 0 harmful-compliance regressions
  over-refusal       (expected-safe   n=28): fp16 refused 18 -> quant 18 | 2 new false refusals
  by zone: borderline[10->10/16] clear_safe[8->8/12] clear_unsafe[12->12/12]

Two axes, not one number:

  • refusal-robustness — on prompts that should be refused, did the quant start complying? (the dangerous direction)
  • over-refusal — on prompts that should be answered, did the quant start refusing? (the usability direction)

A scalar refusal-delta can read 0 while both axes move in opposite directions; the vector + per-zone breakdown catches it. Local judge, curated public probes, no external API and no raw harmful corpora — so the check is distributable.

GPU-aware quantization

3-tier capacity. check reads HF metadata (no download) to estimate the footprint: fits VRAM → quantize in-GPU; too big for VRAM but fits RAM+disk → CPU offload (a 27B can quantize on a 12 GB GPU); won't fit even offloaded → refuse, naming the real limit. No OOM 20 minutes into a job.

Method × scheme matrix (one llm-compressor backend, vLLM-loadable):

method what default scheme
awq activation-aware weight quant (best 4-bit quality) W4A16_ASYM
gptq Hessian/OBQ weight quant W4A16
smoothquant activation smoothing + W8A8 W8A8
fp8 FP8 E4M3 dynamic, no calibration FP8_DYNAMIC
rtn round-to-nearest baseline W4A16

Schemes (--scheme): W4A16, W4A16_ASYM, W8A16, W8A8, INT8, W4A8, FP8_DYNAMIC, NVFP4, MXFP4. Defaults are the validated paths; FP4 schemes need Blackwell to serve (quantfit can still produce them anywhere).

GGUF (--method gguf) for Ollama / llama.cpp: Q2_K..Q8_0 + IQ-quants. Auto-provisions the prebuilt llama-quantize binary + convert script (override with QUANTFIT_LLAMACPP).

One frozen packed calibration (wikitext-103, 128 samples, seq-len 2048, seed 42, group-size 128) is shared across the calibrated methods, so they are comparable.

What it is — and isn't

  • It quantizes (wrapping llm-compressor + llama.cpp) and checks safety preservation. Both are real and validated end-to-end.
  • It does not auto-select the config for you yet — you pick --method. Automatic config selection is a real capability, but it is published research (AMQ, KL-Lens); a routing layer that implements it is planned, not claimed here.

Docker

Dockerfile builds an isolated CUDA image. For GGUF in Docker, the official ghcr.io/ggml-org/llama.cpp:full image carries the convert + quantize tooling.

License

Apache-2.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quantfit-0.1.0.tar.gz (25.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

quantfit-0.1.0-py3-none-any.whl (25.4 kB view details)

Uploaded Python 3

File details

Details for the file quantfit-0.1.0.tar.gz.

File metadata

  • Download URL: quantfit-0.1.0.tar.gz
  • Upload date:
  • Size: 25.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quantfit-0.1.0.tar.gz
Algorithm Hash digest
SHA256 41fc4ee07dce3b6ca1caab127d686805ceca27c30d6d109acb46ce71d5ca784e
MD5 acbc2fd09bf38c1676dd8fba44cab171
BLAKE2b-256 460e3ef9073aa350f01f91f97caf9390669ce1dee6187f74aa0fcaa7041c0307

See more details on using hashes here.

Provenance

The following attestation bundles were made for quantfit-0.1.0.tar.gz:

Publisher: publish.yml on Sahil170595/quantfit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file quantfit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: quantfit-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 25.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quantfit-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a58ccd031651c9a757ad6e461a3a3d9a7ffe6b34201f27d02148cc89813b3231
MD5 2bdcbfcc8b55665f12e01b9643d0816d
BLAKE2b-256 857682f8168c6b88da4d4f7ecc5183980860ae834d461990dd5cd3b1f2553969

See more details on using hashes here.

Provenance

The following attestation bundles were made for quantfit-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Sahil170595/quantfit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page