Production-ready quantization for large language and multimodal models

Pare

Quantize any LLM in one line. Switch between GPTQ, AWQ, SmoothQuant, and RTN by changing a config field.


Benchmarks

WikiText-2 perplexity (PPL ↓), A40 46 GB:

| Method | Llama-3.1-8B | Qwen2.5-7B | OLMo-3-7B |
|---|---|---|---|
| FP16 baseline | 6.24 | 6.85 | 9.92 |
| RTN INT8 | 6.25 (+0.01) | 6.85 (+0.00) | 9.92 (+0.00) |
| GPTQ INT4 | 11.10 (+4.86) ‡ | 7.04 (+0.19) | 10.21 (+0.29) |
| AWQ INT4 | 6.77 (+0.53) | 7.13 (+0.28) | 10.36 (+0.44) |
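
For reference, the perplexity figures are assumed to follow the standard definition: the exponential of the mean negative log-likelihood per token over the evaluation text. The exact tokenization, context length, and stride used for WikiText-2 are not stated here, so treat this as the generic formula rather than a description of Pare's evaluation script.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right) \right)
```

The parenthesised deltas are the difference from the FP16 row in the same column, for example 6.77 − 6.24 = +0.53 for AWQ INT4 on Llama-3.1-8B.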

Zero-shot accuracy — 6-task average (LAMBADA, PIQA, WinoGrande, OpenBookQA, RTE, COPA) ↑:

| Method | Llama-3.1-8B | Qwen2.5-7B | OLMo-3-7B |
|---|---|---|---|
| FP16 baseline | 73.22 | 74.13 | 69.57 |
| RTN INT8 | 73.00 (−0.22) | 74.09 (−0.04) | 69.57 (0.00) |
| GPTQ INT4 | 71.65 (−1.57) | 73.39 (−0.74) | 69.42 (−0.15) |
| AWQ INT4 | 70.69 (−2.53) | 73.93 (−0.20) | 69.57 (0.00) |

Throughput at BS=1 (tok/s), dequantize-on-the-fly †:

| Method | Llama-3.1-8B | Qwen2.5-7B | OLMo-3-7B |
|---|---|---|---|
| FP16 | 25.8 | 32.4 | 25.1 |
| RTN INT8 | 2.1 | 2.3 | 2.3 |
| GPTQ INT4 | 1.1 | 1.2 | 1.2 |
| AWQ INT4 | 1.1 | 1.2 | 1.2 |

† With the optional Triton kernel, per-layer latency improves by 8.8× at BS=1 and 2.8× at BS=4; at BS=16 the kernel is slower (0.8×).

‡ GPTQ on Llama-3.1-8B is sensitive to column ordering. With act_order=True, PPL improves from 11.10 to 6.54 (+0.30 vs. FP16), while 6-task accuracy drops from 71.65 to 70.05. Qwen2.5-7B and OLMo-3-7B are largely unaffected (PPL 7.04 to 7.02 and 10.21 to 10.16, respectively).


Installation

pip install pare-quant                   # latest
pip install pare-quant==0.1.0           # pin to specific version
pip install "pare-quant[all]"            # + transformers, datasets, Triton kernel

Python ≥ 3.11 · PyTorch ≥ 2.1


Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from pare import quantize, QuantConfig
from pare.calibration.data import load_wikitext2_calibration

model     = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
calib     = load_wikitext2_calibration(tokenizer, n_samples=128, seq_len=2048)
# or use your own: a list of 1-D token-ID tensors, each of shape (seq_len,)

# Default is AWQ. Change scheme= to switch methods.
config = QuantConfig(bits=4, scheme="awq", group_size=128)   # ← swap to "gptq", "rtn", "smoothquant"
model  = quantize(model, config, calibration_data=calib, device="cuda")
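
If you bring your own calibration data, one simple way to get fixed-length samples is to tokenize a long text once and cut the token IDs into non-overlapping windows. The sketch below is a hypothetical helper, not part of Pare: the name make_calibration_chunks and its parameters are assumptions, and it works on plain Python lists so it runs without any dependencies.

```python
import random


def make_calibration_chunks(token_ids, seq_len=2048, n_samples=128, seed=0):
    """Cut one long token-ID sequence into fixed-length calibration samples.

    Hypothetical helper (not a Pare API). Returns up to n_samples lists,
    each containing exactly seq_len token IDs.
    """
    n_chunks = len(token_ids) // seq_len
    if n_chunks == 0:
        raise ValueError("need at least seq_len tokens")
    # Non-overlapping windows; the trailing remainder is dropped.
    chunks = [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]
    # Shuffle with a fixed seed so the selected subset is reproducible.
    random.Random(seed).shuffle(chunks)
    return chunks[:n_samples]
```

With a Hugging Face tokenizer, the IDs would typically come from tokenizer(text)["input_ids"], and each chunk can be wrapped with torch.tensor(chunk) to get the (seq_len,) tensors described above. Whether Pare expects a particular dtype or device for these tensors is not stated here.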

Save and reload:

from pare import save_quantized, load_quantized

save_quantized(model, "qwen25-awq-int4/")
# [pare] Saved 224 quantized layers to qwen25-awq-int4  (3821 MB)

from transformers import AutoConfig, AutoModelForCausalLM
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")
model  = AutoModelForCausalLM.from_config(config)   # architecture only, no weights
model  = load_quantized(model, "qwen25-awq-int4/")

Methods

| scheme= | Calibration | Quality | When to use |
|---|---|---|---|
| "awq" | Yes | ★★★★ | Default. Best robustness across architectures; recommended starting point |
| "gptq" | Yes | ★★★★ | Matches AWQ on Qwen2.5-7B; architecture-agnostic, works correctly across pre- and post-norm models |
| "smoothquant" | Yes | ★★★★ | INT8 W+A; closest to FP16 PPL; no INT4 |
| "rtn" | No | ★★★ | No calibration needed; good baseline or for NF4/FP8 |

★ Default: QuantConfig() uses AWQ. On Qwen2.5-7B, AWQ has the best INT4 zero-shot accuracy (−0.20 points vs. the FP16 baseline, compared with −0.74 for GPTQ), although GPTQ has slightly lower perplexity there (7.04 vs. 7.13). GPTQ is architecture-agnostic and is recommended when the target model's architecture is uncertain. For GPTQ on Llama-3.x architectures, act_order=True is recommended; it reduces PPL from 11.10 to 6.54 on Llama-3.1-8B. On Qwen2.5 and OLMo-3 the effect is negligible.

SmoothQuant is INT8-only; the other schemes support bits=4 or bits=8. Use group_size=128 (default) for best INT4 quality.
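
To make bits and group_size concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It illustrates the general technique (min/max asymmetric RTN on a flat list of weights), not Pare's implementation; the function names and the choice of storing a per-group minimum instead of an integer zero-point are assumptions.

```python
def rtn_quantize_group(weights, bits=4):
    """Quantize one group with min/max (asymmetric) round-to-nearest."""
    qmax = (1 << bits) - 1                        # 15 for INT4, 255 for INT8
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # constant group: any scale works
    # Map each weight onto the integer grid 0..qmax and clamp for safety.
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def rtn_quantize(weights, bits=4, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        rtn_quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def rtn_dequantize(groups):
    """Rebuild approximate weights: w ~ code * scale + lo, group by group."""
    return [q * scale + lo for codes, scale, lo in groups for q in codes]
```

Each group stores its own scale and minimum, so a smaller group_size tracks local weight ranges more closely at the cost of more metadata. The reconstruction error per weight is at most half of that group's scale.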

Additional options

act_order=True: quantize weight columns in order of decreasing activation magnitude (GPTQ only). This lowers perplexity substantially on Llama-3.1-8B and has negligible effect on Qwen2.5-7B and OLMo-3-7B:

QuantConfig(bits=4, scheme="gptq", group_size=128, act_order=True)
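
As a rough illustration of what column reordering means, the sketch below ranks input channels by summed squared activation (the diagonal of X^T X, which is the statistic used in the GPTQ paper's reference implementation) and returns the permutation together with its inverse. Whether Pare uses exactly this statistic is an assumption, and act_order_permutation is a hypothetical name.

```python
def act_order_permutation(activations):
    """Return (perm, inv) for processing columns by decreasing activation energy.

    activations: list of calibration rows, each with one value per input channel.
    perm[k] is the original index of the k-th column to quantize;
    inv[j] is the position of original column j after permutation.
    """
    n_cols = len(activations[0])
    # Diagonal of X^T X: summed squared activation per input channel.
    energy = [sum(row[j] ** 2 for row in activations) for j in range(n_cols)]
    perm = sorted(range(n_cols), key=lambda j: energy[j], reverse=True)
    inv = [0] * n_cols
    for pos, j in enumerate(perm):
        inv[j] = pos
    return perm, inv
```

The weight columns are quantized in perm order, so the channels with the largest activations are handled first, and inv restores the original layout afterwards.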

Mixed-precision — Automatically promote sensitive layers to higher bits:

QuantConfig(bits=4, scheme="awq", sensitive_bits=8, sensitivity_threshold=0.05)
# [pare] 12 of 224 layers promoted to INT8 based on activation-weighted error
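
The promotion rule can be pictured as a per-layer threshold test. The sketch below is only an illustration under stated assumptions: the layer names and error values are made up, the error metric is whatever "activation-weighted error" means inside Pare (not defined here), and the strict greater-than comparison is a guess.

```python
def assign_bits(layer_errors, bits=4, sensitive_bits=8, sensitivity_threshold=0.05):
    """Map each layer name to a bit width based on its measured error.

    layer_errors: dict of layer name -> error score from a trial quantization.
    Layers whose error exceeds the threshold get sensitive_bits; others keep bits.
    """
    return {
        name: sensitive_bits if err > sensitivity_threshold else bits
        for name, err in layer_errors.items()
    }


# Hypothetical error scores for three layers.
plan = assign_bits({"q_proj": 0.01, "o_proj": 0.05, "down_proj": 0.09})
promoted = [name for name, b in plan.items() if b == 8]
```

Lowering sensitivity_threshold promotes more layers to INT8, trading model size for accuracy.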

NF4 — Normal float 4-bit codebook (QLoRA-compatible base model format):

from pare.core.dtype import QuantDtype
QuantConfig(bits=4, dtype=QuantDtype.NF4, scheme="rtn")
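
NF4 replaces the uniform integer grid with a fixed 16-entry codebook. The sketch below shows the idea for a single block: normalize by the block's absolute maximum, then pick the nearest codebook entry. The codebook values are the ones published with QLoRA; the function names are hypothetical, and the block size and storage layout Pare uses are not specified here.

```python
# NF4 codebook as published in the QLoRA paper (16 levels in [-1, 1]).
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def nf4_quantize_block(weights):
    """Return (codes, absmax): 4-bit codebook indices plus the block scale."""
    absmax = max(abs(w) for w in weights) or 1.0  # all-zero block: avoid /0
    codes = [
        min(range(16), key=lambda k: abs(NF4_CODEBOOK[k] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def nf4_dequantize_block(codes, absmax):
    """Look up each code and rescale by the block's absmax."""
    return [NF4_CODEBOOK[k] * absmax for k in codes]
```

Because the levels are denser near zero, NF4 spends more of its 16 codes where normally distributed weights are concentrated.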

FP8: 8-bit float (E4M3). Native on H100; runs in software on A100:

QuantConfig(bits=8, dtype=QuantDtype.FP8_E4M3, scheme="rtn")

Inference speedup (Triton kernel)

The optional Triton INT4 kernel fuses dequantization into the matmul, avoiding materialising the full FP16 weight matrix. Applies to INT4 schemes (AWQ, GPTQ, RTN). Enable per-layer after quantization:

from pare.layers.linear import QuantizedLinear

for m in model.modules():
    if isinstance(m, QuantizedLinear):
        m.use_kernel = True

| Batch size | Without kernel | With kernel | Speedup |
|---|---|---|---|
| 1 (decode) | 2.09 ms/layer | 0.24 ms/layer | 8.8× |
| 4 | 2.18 ms/layer | 0.78 ms/layer | 2.8× |
| 16 | 2.66 ms/layer | 3.18 ms/layer | 0.8× |

Requires Triton ≥ 3.0: pip install "triton>=3.0" (quote the requirement so the shell does not treat > as a redirect).
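
Since the kernel helps at small batch sizes and hurts at BS=16, it can be worth toggling it per workload. The sketch below is a hypothetical convenience wrapper, not a Pare API. It uses hasattr instead of isinstance(m, QuantizedLinear) so that it runs without Pare installed, and the cut-off of 4 is an assumption taken from the measured rows above; the real crossover between BS=4 and BS=16 was not measured.

```python
def set_kernel(model, enabled):
    """Set use_kernel on every module that exposes it; return how many changed."""
    changed = 0
    for module in model.modules():
        if hasattr(module, "use_kernel"):
            module.use_kernel = enabled
            changed += 1
    return changed


def kernel_pays_off(batch_size, max_batch=4):
    """Heuristic: enable the kernel only up to the largest batch size that won."""
    return batch_size <= max_batch
```

A serving loop could call set_kernel(model, kernel_pays_off(batch_size)) whenever the batch size changes.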


Hardware

| Use case | Minimum |
|---|---|
| Quantizing a 7B model | 20 GB VRAM (layerwise strategy peaks at ~2 GB) |
| RTN / GPTQ / AWQ / NF4 | Any CUDA GPU |
| SmoothQuant W+A | Any CUDA GPU |
| FP8 | PyTorch ≥ 2.1 (A100 via software; H100 native) |
| Triton kernel | CUDA GPU + triton ≥ 3.0 |

References

The methods implemented in Pare are from the following papers:

  • GPTQ — Frantar, Ashkboos, Hoefler, Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023. arXiv:2210.17323
  • AWQ — Lin et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024. arXiv:2306.00978
  • SmoothQuant — Xiao, Lin, Seznec, Wu, Demouth, Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023. arXiv:2211.10438
  • KIVI — Liu, Yuan, Jin, Zhong, Xu, Braverman, Chen, Hu. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. ICML 2024. arXiv:2402.02750
  • NF4 / QLoRA — Dettmers, Pagnoni, Holtzman, Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314

Citation

@misc{moslem2026pare,
  author = {Moslem, Yasmin},
  title  = {Pare: Production-ready quantization for large language models},
  year   = {2026},
  url    = {https://github.com/TinyAdapt/Pare},
}

