Quantize an LLM and check it still refuses what it should — a GPU-aware quantization CLI with a built-in safety-tax check.

Maintainers

SahilKadadekar

Project description

quantfit

Quantize an LLM — and check it still refuses what it should.

Quantization makes a model cheaper to serve. It can also quietly strip safety behavior: a 4-bit model that answers prompts the fp16 model refused is a regression you will not see in a perplexity number. quantfit quantizes with the current standard methods (AWQ, GPTQ, SmoothQuant, FP8, RTN, GGUF), tells you honestly whether a model fits your GPU, and, uniquely, measures the safety tax of the quantization it just performed.

pip install quantfit

quantfit check        --model Qwen/Qwen2.5-7B-Instruct                 # will it fit? (no download)
quantfit quantize     --model Qwen/Qwen2.5-1.5B-Instruct --method awq --out ./out
quantfit verify-safety --fp16 Qwen/Qwen2.5-1.5B-Instruct --quant ./out  # did quantization break refusals?

The safety check — what nothing else does

verify-safety generates responses from both the fp16 baseline and the quantized model over a curated probe set, labels each response as a refusal or a compliance with a local classifier, and reports the tax as a vector rather than a single number:

safety-tax over 40 probes (REGRESSION):
  refusal-robustness (expected-unsafe n=12): fp16 refused 12 -> quant 12 | 0 harmful-compliance regressions
  over-refusal       (expected-safe   n=28): fp16 refused 18 -> quant 18 | 2 new false refusals
  by zone: borderline[10->10/16] clear_safe[8->8/12] clear_unsafe[12->12/12]

Two axes, not one number:

refusal-robustness — on prompts that should be refused, did the quant start complying? (the dangerous direction)
over-refusal — on prompts that should be answered, did the quant start refusing? (the usability direction)

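A scalar refusal-delta can read 0 while the two axes move in opposite directions (for example, one lost refusal on an unsafe prompt offset by one new refusal on a safe prompt); the vector and the per-zone breakdown catch it. The check uses a local judge and curated public probes, with no external API and no raw harmful corpora, so it is distributable.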
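To make the two-axis idea concrete, here is a minimal Python sketch of how a scalar refusal delta can hide movement on both axes. It is not quantfit's implementation: the `Probe` record, the `safety_tax` function, and the field names are hypothetical, and the four probes are made-up illustrative data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One paired judgment (hypothetical record, not quantfit's schema)."""
    expected_unsafe: bool  # True: the prompt should be refused
    fp16_refused: bool     # judge's label for the fp16 response
    quant_refused: bool    # judge's label for the quantized response


def safety_tax(probes):
    """Return the scalar delta next to the two per-axis regression counts."""
    unsafe = [p for p in probes if p.expected_unsafe]
    safe = [p for p in probes if not p.expected_unsafe]
    return {
        # Net change in total refusals: the number that can mislead.
        "scalar_refusal_delta": sum(p.quant_refused for p in probes)
        - sum(p.fp16_refused for p in probes),
        # Unsafe prompts the fp16 model refused and the quant answered.
        "harmful_compliance_regressions": sum(
            p.fp16_refused and not p.quant_refused for p in unsafe
        ),
        # Safe prompts the fp16 model answered and the quant refused.
        "new_false_refusals": sum(
            p.quant_refused and not p.fp16_refused for p in safe
        ),
    }


# Illustrative data only: one flip in each direction, two unchanged probes.
probes = [
    Probe(expected_unsafe=True, fp16_refused=True, quant_refused=False),
    Probe(expected_unsafe=False, fp16_refused=False, quant_refused=True),
    Probe(expected_unsafe=True, fp16_refused=True, quant_refused=True),
    Probe(expected_unsafe=False, fp16_refused=False, quant_refused=False),
]
result = safety_tax(probes)
print(result)
```

Counting per-prompt flips rather than net totals is the design choice that matters here: the total number of refusals is unchanged, yet each axis records one regression.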

GPU-aware quantization

3-tier capacity. check reads HF metadata (no download) to estimate the footprint: fits VRAM → quantize in-GPU; too big for VRAM but fits RAM+disk → CPU offload (a 27B can quantize on a 12 GB GPU); won't fit even offloaded → refuse, naming the real limit. No OOM 20 minutes into a job.
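The three tiers can be sketched as a small decision function. This is an illustration under stated assumptions, not quantfit's code: `plan_capacity`, `fp16_footprint_gb`, the 1.2 overhead factor, and the hardware sizes in the examples are all hypothetical, and the real footprint estimate is derived from HF metadata in a way this page does not describe.

```python
from enum import Enum


class Plan(Enum):
    IN_GPU = "quantize in-GPU"
    CPU_OFFLOAD = "CPU offload"
    REFUSE = "refuse"


def fp16_footprint_gb(n_params_billion, overhead=1.2):
    """Rough fp16 footprint: 2 bytes per parameter times an assumed overhead."""
    return n_params_billion * 2 * overhead


def plan_capacity(footprint_gb, vram_gb, ram_gb, disk_gb):
    """Pick a tier and name the limit that decided it."""
    if footprint_gb <= vram_gb:
        return Plan.IN_GPU, "fits VRAM"
    if footprint_gb <= ram_gb + disk_gb:
        return Plan.CPU_OFFLOAD, "exceeds VRAM, fits RAM+disk"
    available = ram_gb + disk_gb
    return Plan.REFUSE, (
        f"needs about {footprint_gb:.0f} GB, "
        f"only {available:.0f} GB of RAM+disk available"
    )


# Hypothetical machines, for illustration only.
small = plan_capacity(fp16_footprint_gb(7), vram_gb=24, ram_gb=64, disk_gb=200)
large = plan_capacity(fp16_footprint_gb(27), vram_gb=12, ram_gb=64, disk_gb=200)
huge = plan_capacity(fp16_footprint_gb(400), vram_gb=12, ram_gb=64, disk_gb=200)
print(small[0].value, "|", large[0].value, "|", huge[0].value)
```

Returning the reason together with the tier mirrors the "naming the real limit" behavior: the caller learns which resource ran out before any work starts.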

Method × scheme matrix (one llm-compressor backend, vLLM-loadable):

method	what	default scheme
`awq`	activation-aware weight quant (best 4-bit quality)	W4A16_ASYM
`gptq`	Hessian/OBQ weight quant	W4A16
`smoothquant`	activation smoothing + W8A8	W8A8
`fp8`	FP8 E4M3 dynamic, no calibration	FP8_DYNAMIC
`rtn`	round-to-nearest baseline	W4A16

Schemes (--scheme): W4A16, W4A16_ASYM, W8A16, W8A8, INT8, W4A8, FP8_DYNAMIC, NVFP4, MXFP4. Defaults are the validated paths; the FP4 schemes (NVFP4, MXFP4) need a Blackwell GPU to serve (quantfit can still produce them on any hardware).

GGUF (--method gguf) for Ollama / llama.cpp: Q2_K..Q8_0 + IQ-quants. Auto-provisions the prebuilt llama-quantize binary + convert script (override with QUANTFIT_LLAMACPP).

One frozen packed calibration (wikitext-103, 128 samples, seq-len 2048, seed 42, group-size 128) is shared across the calibrated methods, so they are comparable.
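One common way to build a packed, seeded calibration set is to concatenate the tokenized documents, cut the stream into fixed-length blocks, and draw a fixed number of blocks with a seeded RNG. The sketch below shows that idea only; `pack_calibration` is a hypothetical helper, the page does not say how quantfit packs or samples, and the toy token lists stand in for a tokenized wikitext-103.

```python
import random


def pack_calibration(token_streams, n_samples=128, seq_len=2048, seed=42):
    """Pack token lists into seq_len blocks and draw n_samples reproducibly."""
    # Concatenate all documents into one stream (document order is preserved).
    flat = [tok for doc in token_streams for tok in doc]
    # Cut into full blocks; a trailing partial block is dropped.
    blocks = [
        flat[i:i + seq_len]
        for i in range(0, len(flat) - seq_len + 1, seq_len)
    ]
    if len(blocks) < n_samples:
        raise ValueError(
            f"only {len(blocks)} full blocks available, need {n_samples}"
        )
    # A private RNG keeps the draw independent of global random state.
    rng = random.Random(seed)
    return rng.sample(blocks, n_samples)


# Toy data: 20 "documents" of 7 token ids each (140 tokens in total).
docs = [list(range(i * 10, i * 10 + 7)) for i in range(20)]
first = pack_calibration(docs, n_samples=4, seq_len=8, seed=42)
second = pack_calibration(docs, n_samples=4, seq_len=8, seed=42)
print(len(first), len(first[0]), first == second)
```

Because every block has the same length and the draw depends only on the seed, two methods calibrated this way see identical inputs, which is what makes their results comparable.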

What it is — and isn't

It quantizes (wrapping llm-compressor + llama.cpp) and checks safety preservation. Both are real and validated end-to-end.
It does not auto-select the config for you yet: you pick --method. Automatic config selection exists as published research (AMQ, KL-Lens); a routing layer that implements it is planned, not claimed here.

Docker

Dockerfile builds an isolated CUDA image. For GGUF in Docker, the official ghcr.io/ggml-org/llama.cpp:full image carries the convert + quantize tooling.

License

Apache-2.0.

Release history

This version

0.1.0

Jun 27, 2026

Download files

Source Distribution

quantfit-0.1.0.tar.gz (25.3 kB)

Uploaded Jun 27, 2026 Source

Built Distribution

quantfit-0.1.0-py3-none-any.whl (25.4 kB)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file quantfit-0.1.0.tar.gz.

File metadata

Download URL: quantfit-0.1.0.tar.gz
Upload date: Jun 27, 2026
Size: 25.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quantfit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`41fc4ee07dce3b6ca1caab127d686805ceca27c30d6d109acb46ce71d5ca784e`
MD5	`acbc2fd09bf38c1676dd8fba44cab171`
BLAKE2b-256	`460e3ef9073aa350f01f91f97caf9390669ce1dee6187f74aa0fcaa7041c0307`

Provenance

The following attestation bundles were made for quantfit-0.1.0.tar.gz:

Publisher: publish.yml on Sahil170595/quantfit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: quantfit-0.1.0.tar.gz
- Subject digest: 41fc4ee07dce3b6ca1caab127d686805ceca27c30d6d109acb46ce71d5ca784e
- Sigstore transparency entry: 1974095377
- Sigstore integration time: Jun 27, 2026
Source repository:
- Permalink: Sahil170595/quantfit@b1618f5eff151fa0f5a77c4077fae0c986c17e0e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Sahil170595
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b1618f5eff151fa0f5a77c4077fae0c986c17e0e
- Trigger Event: release

File details

Details for the file quantfit-0.1.0-py3-none-any.whl.

File metadata

Download URL: quantfit-0.1.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 25.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quantfit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a58ccd031651c9a757ad6e461a3a3d9a7ffe6b34201f27d02148cc89813b3231`
MD5	`2bdcbfcc8b55665f12e01b9643d0816d`
BLAKE2b-256	`857682f8168c6b88da4d4f7ecc5183980860ae834d461990dd5cd3b1f2553969`

Provenance

The following attestation bundles were made for quantfit-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Sahil170595/quantfit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: quantfit-0.1.0-py3-none-any.whl
- Subject digest: a58ccd031651c9a757ad6e461a3a3d9a7ffe6b34201f27d02148cc89813b3231
- Sigstore transparency entry: 1974095475
- Sigstore integration time: Jun 27, 2026
Source repository:
- Permalink: Sahil170595/quantfit@b1618f5eff151fa0f5a77c4077fae0c986c17e0e
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Sahil170595
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b1618f5eff151fa0f5a77c4077fae0c986c17e0e
- Trigger Event: release
