Production-ready quantization for large language and multimodal models

Pare

Quantize any LLM in one line. Switch between GPTQ, AWQ, SmoothQuant, and RTN by changing a config field — same API, same model, same output format.

Pick your trade-off

WikiText-2 perplexity (PPL ↓), A40 46 GB:

Method	Llama-2-7B	Llama-3-8B	Qwen2.5-7B
FP16 baseline	5.47	6.14	6.85
RTN INT8	5.48 (+0.01)	6.14 (+0.01)	6.85 (+0.00)
SmoothQuant INT8	5.58 (+0.11)	6.25 (+0.11)	6.96 (+0.11)
AWQ INT4	5.67 (+0.20)	6.67 (+0.53)	7.13 (+0.28)
GPTQ INT4	5.74 (+0.27)	8.75 (+2.61)	7.04 (+0.19)

Throughput on Llama-2-7B (BS=1): FP16 33 tok/s · RTN/SmoothQuant ~2.3 tok/s · AWQ/GPTQ ~1.2 tok/s †

† Quantized figures use the default dequantize-on-the-fly path. With the optional Triton kernel, per-layer latency is 8.8× lower at BS=1 and 2.8× lower at BS=4.
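
For readers unfamiliar with the metric: perplexity is the exponential of the mean per-token negative log-likelihood, and the bracketed numbers in the table are differences from the FP16 row. The sketch below is plain Python for illustration only; the function names are made up here and are not part of Pare.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_delta(quantized_ppl, baseline_ppl):
    """Degradation relative to the FP16 baseline, rounded to two decimals."""
    return round(quantized_ppl - baseline_ppl, 2)

# Figures taken from the Llama-3-8B column of the table above.
print(ppl_delta(6.67, 6.14))  # AWQ INT4 vs FP16
print(ppl_delta(8.75, 6.14))  # GPTQ INT4 vs FP16
```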

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from pare import quantize, QuantConfig
from pare.calibration.data import load_wikitext2_calibration

model     = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
calib     = load_wikitext2_calibration(tokenizer, n_samples=128, seq_len=2048)
# or use your own: a list of tokenized tensors of shape (seq_len,)

# Default is AWQ. Change scheme= to switch methods.
config = QuantConfig(bits=4, scheme="awq", group_size=128)   # ← swap to "gptq", "rtn", "smoothquant"
model  = quantize(model, config, calibration_data=calib, device="cuda")
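
If you bring your own calibration data, the comment above asks for a list of token tensors of shape (seq_len,). One possible way to get there is to cut a long token stream into fixed-length windows. The helper below is a hypothetical sketch in plain Python, not a Pare function; each returned window would still need to be wrapped with `torch.tensor(window)` before being passed as calibration data.

```python
def chunk_token_ids(token_ids, seq_len, n_samples):
    """Split one long token stream into up to n_samples windows of seq_len ids.

    A trailing remainder shorter than seq_len is dropped so that every
    window has exactly the same length.
    """
    if seq_len <= 0 or n_samples <= 0:
        raise ValueError("seq_len and n_samples must be positive")
    windows = []
    for start in range(0, len(token_ids) - seq_len + 1, seq_len):
        windows.append(token_ids[start:start + seq_len])
        if len(windows) == n_samples:
            break
    return windows

stream = list(range(10))  # stand-in for tokenizer output
print(chunk_token_ids(stream, seq_len=4, n_samples=128))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```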

Save and reload:

from pare import save_quantized, load_quantized

save_quantized(model, "llama2-awq-int4/")
# [pare] Saved 224 quantized layers to llama2-awq-int4  (3821 MB)

from transformers import AutoConfig, AutoModelForCausalLM
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model  = AutoModelForCausalLM.from_config(config)   # architecture only, no weights
model  = load_quantized(model, "llama2-awq-int4/")

Installation

pip install pare-quant                   # core
pip install "pare-quant[all]"            # + transformers, datasets, Triton kernel

Python ≥ 3.11 · PyTorch ≥ 2.1

Methods

`scheme=`	Calibration	Quality	When to use
`"awq"` ★	Yes	★★★★	Default. Best robustness across architectures; recommended starting point
`"gptq"`	Yes	★★★★★	Highest quality when used with `act_order=True`; can underperform AWQ without it
`"smoothquant"`	Yes	★★★★	INT8 W+A (weights and activations); within about 0.11 PPL of FP16 in the table above; no INT4
`"rtn"`	No	★★★	No calibration needed; good baseline or for NF4/FP8

★ Default: QuantConfig() uses AWQ. Without act_order=True, GPTQ is less robust than AWQ across modern architectures: on Llama-3-8B the gap is 6.67 vs 8.75 PPL, although GPTQ is slightly ahead on Qwen2.5-7B (7.04 vs 7.13). GPTQ with act_order=True is the highest-quality option but requires more tuning.
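
AWQ and SmoothQuant both build on a per-input-channel rescaling that leaves the layer output unchanged: activations are divided by a scale and the matching weight rows are multiplied by it. The sketch below uses the smoothing formula from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1-α), on a tiny example in plain Python. It illustrates the idea only and makes no claim about how Pare computes its scales.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scale s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]

def matvec(x, W):
    """y = x @ W for a row vector x and W of shape (in_features, out_features)."""
    return [sum(x[j] * W[j][k] for j in range(len(x)))
            for k in range(len(W[0]))]

x = [8.0, 0.5]                   # channel 0 is an activation outlier
W = [[0.5, -0.25], [2.0, 1.0]]   # one row per input channel

s = smooth_scales([abs(v) for v in x],
                  [max(abs(v) for v in row) for row in W])

x_smooth = [v / sj for v, sj in zip(x, s)]                   # easier to quantize
W_smooth = [[v * sj for v in row] for row, sj in zip(W, s)]  # absorbs the scale

print(matvec(x, W), matvec(x_smooth, W_smooth))  # the two outputs should agree
```

With α = 0.5 the activation and weight ranges of each channel end up equal, which is why the outlier in channel 0 shrinks from 8.0 to about 2.0.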

AWQ, GPTQ, and RTN support bits=4 or bits=8; SmoothQuant is INT8 only. Use group_size=128 (default) for best INT4 quality.
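
To show what bits and group_size control, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. It assumes asymmetric quantization with one scale and zero point per group; whether Pare uses this exact variant is not stated here, and the function names are illustrative.

```python
def rtn_quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest for one group: (codes, scale, zero_point)."""
    qmax = (1 << bits) - 1
    lo = min(min(weights), 0.0)   # include 0 so it is exactly representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:              # all-zero group
        scale = 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point

def rtn_roundtrip(weights, bits=4, group_size=128):
    """Quantize and dequantize a flat weight list group by group."""
    out = []
    for start in range(0, len(weights), group_size):
        codes, scale, zp = rtn_quantize_group(weights[start:start + group_size], bits)
        out.extend((c - zp) * scale for c in codes)
    return out

w = [0.1, -0.2, 0.3, -0.4, 8.0, -0.1, 0.2, 0.05]   # one outlier at index 4
for gs in (8, 4):
    err = sum(abs(a - b) for a, b in zip(w, rtn_roundtrip(w, group_size=gs)))
    print(f"group_size={gs}: total abs error {err:.3f}")
```

Smaller groups confine an outlier's influence to fewer weights, at the cost of storing more scales and zero points.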

Additional options

act_order=True — Sort quantization by activation magnitude (improves GPTQ quality on modern architectures):

QuantConfig(bits=4, scheme="gptq", group_size=128, act_order=True)
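
In GPTQ, activation ordering means that weight columns (input channels) are quantized in order of decreasing activation magnitude, so the most influential channels are handled first, while the remaining columns can still absorb their error. The sketch below shows only the ordering step in plain Python, using the sum of squared calibration activations per channel as the ranking key. The helper names are hypothetical and this is not Pare's internal code.

```python
def act_order_permutation(calib_activations):
    """Input-channel indices sorted by decreasing sum of squared activations.

    calib_activations: list of rows, one row of channel values per token.
    """
    n_channels = len(calib_activations[0])
    energy = [sum(row[j] ** 2 for row in calib_activations)
              for j in range(n_channels)]
    return sorted(range(n_channels), key=lambda j: energy[j], reverse=True)

def invert_permutation(perm):
    """Inverse permutation, used to restore the original channel order."""
    inv = [0] * len(perm)
    for pos, j in enumerate(perm):
        inv[j] = pos
    return inv

acts = [[0.1, 3.0, -1.0],
        [0.2, -2.0, 0.5]]
perm = act_order_permutation(acts)
print(perm)                      # [1, 2, 0]
print(invert_permutation(perm))  # [2, 0, 1]
```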

Mixed-precision — Automatically promote sensitive layers to higher bits:

QuantConfig(bits=4, scheme="awq", sensitive_bits=8, sensitivity_threshold=0.05)
# [pare] 12 of 224 layers promoted to INT8 based on activation-weighted error
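
One way to picture the promotion rule is as a per-layer threshold test. The sketch below assumes that each layer gets a scalar error score and that layers scoring above sensitivity_threshold are moved to sensitive_bits. Both the scoring and the comparison are assumptions made for illustration, and the layer names are invented.

```python
def assign_bits(layer_errors, bits=4, sensitive_bits=8, sensitivity_threshold=0.05):
    """Map each layer name to a bit width based on its error score.

    layer_errors: dict of layer name -> error score (assumed to be a
    relative, activation-weighted quantization error).
    """
    return {
        name: sensitive_bits if err > sensitivity_threshold else bits
        for name, err in layer_errors.items()
    }

# Hypothetical scores for three layers.
scores = {"layers.0.q_proj": 0.01, "layers.0.down_proj": 0.12, "layers.1.q_proj": 0.03}
plan = assign_bits(scores)
print(plan)
print(sum(b == 8 for b in plan.values()), "of", len(plan), "layers promoted")
```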

NF4 — Normal float 4-bit codebook (QLoRA-compatible base model format):

from pare.core.dtype import QuantDtype
QuantConfig(bits=4, dtype=QuantDtype.NF4, scheme="rtn")
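
For context, NF4 replaces the uniform INT4 grid with 16 fixed levels spaced for normally distributed weights. Each block is divided by its absolute maximum and every value is snapped to the nearest level. The sketch below uses the level table published with QLoRA and is written in plain Python; block size and storage layout in Pare are not specified here, so treat this as an illustration of the codebook idea only.

```python
# The 16 NF4 levels as published with QLoRA (normalized to [-1, 1]).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def nf4_quantize(block):
    """Return (codes, absmax): nearest NF4 level index per absmax-scaled weight."""
    absmax = max(abs(w) for w in block) or 1.0   # guard against all-zero blocks
    codes = [min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
             for w in block]
    return codes, absmax

def nf4_dequantize(codes, absmax):
    """Look up each code and rescale by the block's absmax."""
    return [NF4_LEVELS[c] * absmax for c in codes]

codes, absmax = nf4_quantize([0.5, -0.5, 0.0, 0.25])
print(codes)  # [15, 0, 7, 12]
print(nf4_dequantize(codes, absmax))
```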

FP8 — 8-bit float for A100/H100:

QuantConfig(bits=8, dtype=QuantDtype.FP8_E4M3, scheme="rtn")

Inference speedup (Triton kernel)

The optional Triton INT4 kernel fuses dequantization into the matmul, avoiding materialising the full FP16 weight matrix. Applies to INT4 schemes (AWQ, GPTQ, RTN). Enable per-layer after quantization:

from pare.layers.linear import QuantizedLinear

for m in model.modules():
    if isinstance(m, QuantizedLinear):
        m.use_kernel = True
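
To make "dequantize on the fly" concrete: INT4 weights are commonly stored two per byte and expanded back to floats only at the moment they are needed. The sketch below shows that packing and unpacking in plain Python. The low-nibble-first layout and the function names are assumptions for illustration and may differ from Pare's storage format and from what the Triton kernel does internally.

```python
def pack_int4(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("expected an even number of codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))

def unpack_dequantize(packed, scale, zero_point):
    """Expand packed bytes back to floats: (code - zero_point) * scale."""
    out = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            out.append((code - zero_point) * scale)
    return out

packed = pack_int4([1, 15, 8, 0])
print(packed.hex())                                         # f108
print(unpack_dequantize(packed, scale=0.5, zero_point=8))   # [-3.5, 3.5, 0.0, -4.0]
```

A fused kernel performs the unpack step inside the matmul loop, so the full FP16 matrix never has to be written to memory.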

Batch size	Without kernel	With kernel	Speedup
1 (decode)	2.09 ms/layer	0.24 ms/layer	8.8×
4	2.18 ms/layer	0.78 ms/layer	2.8×
16	2.66 ms/layer	3.18 ms/layer	0.8×

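Requires pip install "triton>=3.0" (quoted so the shell does not treat > as a redirect). At batch size 16 the kernel is slower than the unfused path (0.8×), so it mainly benefits small-batch decoding.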

Hardware

Task or feature	Minimum
Quantizing a 7B model	20 GB VRAM (layerwise strategy peaks at ~2 GB)
RTN / GPTQ / AWQ / NF4	Any CUDA GPU
SmoothQuant W+A	Any CUDA GPU
FP8	PyTorch ≥ 2.1 (A100 via software; H100 native)
Triton kernel	CUDA GPU + `triton ≥ 3.0`

Citation

@misc{moslem2026pare,
  author = {Moslem, Yasmin},
  title  = {Pare: Production-ready quantization for large language and multimodal models},
  year   = {2026},
  url    = {https://github.com/TinyAdapt/Pare},
}

License: Apache 2.0

Release history

Version	Released
0.1.3	Jul 3, 2026
0.1.2	Jul 3, 2026
0.1.1	Jul 1, 2026
0.1.0	Jun 29, 2026 (this version)
