Skip to main content

ParoQuant — Pairwise Rotation Quantization for LLMs

Project description

ParoQuant

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Paper Blog Models PyPI

State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rotations to suppress weight outliers, closing the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX).

Quick Start

Installation

# NVIDIA GPU (CUDA 12.9)
pip install "paroquant[vllm]"

# NVIDIA GPU (CUDA 13.0)
pip install "paroquant[vllm]" "vllm==0.19.1" \
  --extra-index-url https://wheels.vllm.ai/0.19.1/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130

# Apple Silicon
pip install "paroquant[mlx]"

Pick a model from our Hugging Face collection:

export MODEL=z-lab/Qwen3.5-4B-PARO

Interactive Chat

python -m paroquant.cli.chat --model $MODEL

OpenAI-Compatible API Server

For vLLM, you can directly use vllm serve to serve ParoQuant models:

vllm serve $MODEL --port 8000

For other frameworks:

python -m paroquant.cli.serve --model $MODEL --port 8000

For MLX, add --vlm if you wish to load the VLM components and use the model's multimodal features. For vLLM, VLM components are loaded by default and can be skipped with the server argument --language-model-only.

Docker (NVIDIA GPU)

[!NOTE] The following commands map the local cache directory to the container in order to persist kernel cache across runs. Remove -v ... to disable this behaviour.

# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:chat --model $MODEL

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve --model $MODEL

Models

All models are available on Hugging Face. Swap the model name in the commands above to try any of them.

Gemma 4

Model Checkpoint
gemma-4-31B-it z-lab/gemma-4-31B-it-PARO
gemma-4-E2B-it z-lab/gemma-4-E2B-it-PARO

Qwen3.6

Model Checkpoint
Qwen3.6-27B z-lab/Qwen3.6-27B-PARO

Qwen3.5

Model Checkpoint
Qwen3.5-0.8B z-lab/Qwen3.5-0.8B-PARO
Qwen3.5-2B z-lab/Qwen3.5-2B-PARO
Qwen3.5-4B z-lab/Qwen3.5-4B-PARO
Qwen3.5-9B z-lab/Qwen3.5-9B-PARO
Qwen3.5-27B z-lab/Qwen3.5-27B-PARO
Qwen3.5-35B-A3B z-lab/Qwen3.5-35B-A3B-PARO

Qwen3

Model Checkpoint
Qwen3-0.6B z-lab/Qwen3-0.6B-PARO
Qwen3-1.7B z-lab/Qwen3-1.7B-PARO
Qwen3-4B z-lab/Qwen3-4B-PARO
Qwen3-8B z-lab/Qwen3-8B-PARO
Qwen3-14B z-lab/Qwen3-14B-PARO

Llama

Model Checkpoint
Llama-2-7B z-lab/Llama-2-7b-hf-PARO
Llama-3-8B z-lab/Meta-Llama-3-8B-PARO
Llama-3.1-8B-Instruct z-lab/Llama-3.1-8B-Instruct-PARO

Want a model that's not listed? Open an issue and let us know.

Reproduction

[!NOTE] The main branch of this repository is under active development, and reproducibility is not guaranteed. Please use the legacy branch to reproduce results from the paper.

Quantize Your Own Model

git clone https://github.com/z-lab/paroquant && cd paroquant
pip install -e ".[optim,eval]"

# 1. Optimize rotation parameters
experiments/optimize/4bit.sh Qwen/Qwen3-8B

# 2. Export to HF checkpoint (--mode real for INT4, --mode pseudo for FP16)
python -m paroquant.cli.convert \
  --model Qwen/Qwen3-8B \
  --result-dir output/Qwen3-8B \
  --output-path models/Qwen3-8B-PARO

Docker Images

Image Purpose
ghcr.io/z-lab/paroquant:chat Interactive chat
ghcr.io/z-lab/paroquant:chat-cu129 Interactive chat (CUDA 12.9)
ghcr.io/z-lab/paroquant:serve OpenAI-compatible API server
ghcr.io/z-lab/paroquant:latest Optimization & evaluation
ghcr.io/z-lab/paroquant:eval Reasoning task evaluation

Citation

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paroquant-0.1.14.tar.gz (52.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

paroquant-0.1.14-py3-none-any.whl (62.9 kB view details)

Uploaded Python 3

File details

Details for the file paroquant-0.1.14.tar.gz.

File metadata

  • Download URL: paroquant-0.1.14.tar.gz
  • Upload date:
  • Size: 52.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paroquant-0.1.14.tar.gz
Algorithm Hash digest
SHA256 63d89c32396c2fd6fa2badcba847c15c0ec271e4e820241f8d4f8574edaf03e7
MD5 2f92107dbfdf5b98d01a79b78974afad
BLAKE2b-256 a584ffd00ca8cf9c1b823ee9d1fe565aa75dcdcc1391e0c1bdb73188bbca4cfa

See more details on using hashes here.

Provenance

The following attestation bundles were made for paroquant-0.1.14.tar.gz:

Publisher: publish-to-pypi.yml on z-lab/paroquant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file paroquant-0.1.14-py3-none-any.whl.

File metadata

  • Download URL: paroquant-0.1.14-py3-none-any.whl
  • Upload date:
  • Size: 62.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paroquant-0.1.14-py3-none-any.whl
Algorithm Hash digest
SHA256 96ba6ec97a3eb00a2f37aee90b3a912b1febf38ac3777a40fa69e59eb874c99a
MD5 0e08ea65605613e1bf8b47af0bf8a76d
BLAKE2b-256 b8c7ac69eedbb31c278e6c152c1f83719c54d23c0d234244d64d18ae504c78d8

See more details on using hashes here.

Provenance

The following attestation bundles were made for paroquant-0.1.14-py3-none-any.whl:

Publisher: publish-to-pypi.yml on z-lab/paroquant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page