
VPTQ (Vector Post-Training Quantization) is a novel Post-Training Quantization method.

Project description

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

TL;DR

Vector Post-Training Quantization (VPTQ) is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2-bit). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.

  • Better accuracy at 1-2 bits (405B @ <2 bits, 70B @ 2 bits)
  • Lightweight quantization algorithm: only ~17 hours to quantize the 405B Llama-3.1 model
  • Agile quantization inference: low decode overhead, best throughput, and best TTFT (time to first token)

News

Installation

Dependencies

  • python 3.10+
  • torch >= 2.2.0
  • transformers >= 4.44.0
  • accelerate >= 0.33.0
  • datasets (latest)

Install VPTQ on your machine

Recommended: to save the time of building the package yourself, please install VPTQ directly from the latest Release:

https://github.com/microsoft/VPTQ/releases
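
Prebuilt wheels for this release are also published on PyPI (see the Built Distributions listed further down this page), so installing the packaged build should simply be:

pip install vptq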

[The following build-from-source steps are only needed if a Release package is not available for your environment]

Preparation steps that might be needed: Set up CUDA PATH.

export PATH=/usr/local/cuda-12/bin/:$PATH  # set dependent on your environment

Building from source will take several minutes to compile the CUDA kernels; please be patient. The current compilation targets SM 7.0, 7.5, 8.0, 8.6, and 9.0 to reduce compilation time. You can set TORCH_CUDA_ARCH_LIST to your specific architecture.
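
For example, if you only need kernels for a single GPU architecture (an RTX 4090 is compute capability 8.9; adjust for your own card), restricting the list should shorten the build before running the pip install below:

export TORCH_CUDA_ARCH_LIST="8.9"  # build only for your GPU architecture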

pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation

Example: run Llama 3.1 70B on an RTX 4090 (24 GB, ~2 bits) in real time.


VPTQ is an ongoing project. If the open-source community is interested in optimizing and expanding VPTQ, please feel free to submit an issue or DM.


Evaluation

Models from Open Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community VPTQ-community provides models based on the technical report and quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):

  • Model Naming Convention: The model's name encodes the vector length $v$, the codebook (lookup table) size, and the residual codebook size. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is quantized from "Meta-Llama-3.1-70B-Instruct" with:

    • Vector Length: 8
    • Number of Centroids: 65536 (2^16)
    • Number of Residual Centroids: 256 (2^8)
  • Equivalent Bitwidth Calculation:

    • Index: log2(65536) = 16 bits per vector, i.e. 16 / 8 = 2 bits per weight
    • Residual Index: log2(256) = 8 bits per vector, i.e. 8 / 8 = 1 bit per weight
    • Total Bitwidth: 2 + 1 = 3 bits per weight
  • Model Size Estimation: 70B parameters * 3 bits / 8 bits per byte = 26.25 GB (see the Python sketch after this list)


  • Note: This estimate does not include the size of the codebook (lookup table), other parameter overheads, and the padding overhead for storing indices. For the detailed calculation method, please refer to Tech Report Appendix C.2.
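
As a small, self-contained illustration of the arithmetic above (the helper function name is hypothetical, and codebook, parameter, and padding overheads are ignored, as noted), the estimate can be reproduced in Python:

import math

def estimate_bits_per_weight(vector_len, num_centroids, num_res_centroids=0):
    # bits for the main index, amortized over the vector length
    bits = math.log2(num_centroids) / vector_len
    # a residual codebook adds a second, smaller index per vector
    if num_res_centroids > 1:
        bits += math.log2(num_res_centroids) / vector_len
    return bits

# "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft": v=8, k=65536, residual=256
bpw = estimate_bits_per_weight(8, 65536, 256)  # 2 + 1 = 3 bits per weight
size_gb = 70e9 * bpw / 8 / 1e9                 # ~26.25 GB, excluding codebooks
print(f"{bpw:.2f} bits/weight, ~{size_gb:.2f} GB")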

| Model Series | Collections | (Estimated) Bits per weight |
| --- | --- | --- |
| Llama 3.1 Nemotron 70B Instruct HF | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits, 1.5 bits |
| Llama 3.1 8B Instruct | HF 🤗 | 4 bits, 3.5 bits, 3 bits, 2.3 bits |
| Llama 3.1 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2.25 bits, 2 bits (1), 2 bits (2), 1.93 bits, 1.875 bits, 1.75 bits |
| Llama 3.1 405B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits, 1.875 bits, 1.625 bits, 1.5 bits (1), 1.5 bits (2), 1.43 bits, 1.375 bits |
| Mistral Large Instruct 2407 (123B) | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.75 bits, 1.625 bits, 1.5 bits |
| Qwen 2.5 7B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 14B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 32B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 72B Instruct | HF 🤗 | 4 bits, 3 bits, 2.38 bits, 2.25 bits (1), 2.25 bits (2), 2 bits (1), 2 bits (2), 1.94 bits |
| Reproduced from the tech report | HF 🤗 | Results from the open-source community, for reference only; please use them responsibly |
| Hessian and Inverse Hessian Matrix | HF 🤗 | Collected from RedPajama-Data-1T-Sample, following Quip# |

Language Generation Example

To generate text using the pre-trained model, you can use the following code snippet:

The model VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft (~2 bits) is provided by the open-source community. The repository cannot guarantee the performance of this model.

python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"

[Demo GIF: Llama 3.1 70B generation]

Terminal Chatbot Example

Launching a chatbot (note that you must use a chat model for this to work):

python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat

[Demo GIF: Llama 3.1 70B chat]

Python API Example

Using the Python API:

import vptq
import transformers

# Load the tokenizer and the VPTQ-quantized model from the Hugging Face Hub
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft", device_map='auto')

# Tokenize a prompt, generate up to 100 new tokens, and decode the result
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Gradio Web App Example

An environment variable controls whether a public share link is created: export SHARE_LINK=1.

python -m vptq.app

Tech Report


Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
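
To make the lookup-table idea concrete, here is a minimal sketch of vector-quantized dequantization followed by a plain GEMM. This is not VPTQ's actual kernel or storage layout; all shapes, names, and the use of a residual codebook are illustrative only, mirroring the naming convention described above.

import torch

# Illustrative sizes: vector length 8, 2^16 main centroids, 2^8 residual centroids
v, k, k_res = 8, 65536, 256
out_features, in_features = 1024, 4096
num_vectors = out_features * in_features // v

codebook = torch.randn(k, v)                        # main lookup table of centroid vectors
res_codebook = torch.randn(k_res, v)                # residual lookup table (finer correction)
idx = torch.randint(0, k, (num_vectors,))           # one 16-bit index per weight vector
res_idx = torch.randint(0, k_res, (num_vectors,))   # one 8-bit residual index per weight vector

# Dequantization is just two table lookups plus a reshape ...
weight = (codebook[idx] + res_codebook[res_idx]).reshape(out_features, in_features)

# ... followed by an ordinary GEMM; fusing the lookup into the GEMM is the
# kernel-fusion item mentioned in the Road Map below.
x = torch.randn(2, in_features)
y = x @ weight.T
print(y.shape)  # torch.Size([2, 1024])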

Read the tech report at Tech Report and the arXiv Paper.

Early Results from Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better results with well-chosen parameters, especially in terms of model accuracy and inference speed.

| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem (GB) | cost/h↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2 |
| LLaMA-2 7B | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48 | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03 | 3.2 |
| LLaMA-2 13B | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31 | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| LLaMA-2 70B | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |

Road Map

  • Merge the quantization algorithm into the public repository.
  • Contribute the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp, exllama).
  • Improve the implementation of the inference kernel (e.g., CUDA, ROCm, Triton) and apply kernel fusion by combining dequantization (lookup) and Linear (GEMM) to enhance inference performance.
  • TBC

Project main members:

  • Yifei Liu (@lyf-00)
  • Jicheng Wen (@wejoncy)
  • Yang Wang (@YangWang92)

Acknowledgement

  • We thank James Hensman for his crucial insights into the error analysis related to Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
  • We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

Publication

EMNLP 2024 Main

@inproceedings{
  vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and
          Jicheng Wen and
          Yang Wang and
          Shengyu Ye and
          Li Lyna Zhang and
          Ting Cao and
          Cheng Li and
          Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
}

Star History

Star History Chart


Limitation of VPTQ

  • ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before you use it.
  • ⚠️ This repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm, but the repository cannot guarantee the performance of those models.
  • ⚠️ We are not able to test all potential applications and domains, and we cannot guarantee the accuracy and effectiveness of VPTQ across other tasks or scenarios.
  • ⚠️ Our tests are all based on English texts; other languages are not included in the current testing.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • vptq-0.0.3-cp312-cp312-manylinux1_x86_64.whl (24.8 MB), uploaded for CPython 3.12
  • vptq-0.0.3-cp311-cp311-manylinux1_x86_64.whl (24.8 MB), uploaded for CPython 3.11
  • vptq-0.0.3-cp310-cp310-manylinux1_x86_64.whl (24.8 MB), uploaded for CPython 3.10
  • vptq-0.0.3-cp39-cp39-manylinux1_x86_64.whl (24.8 MB), uploaded for CPython 3.9
  • vptq-0.0.3-cp38-cp38-manylinux1_x86_64.whl (24.8 MB), uploaded for CPython 3.8

File details

Hashes for vptq-0.0.3-cp312-cp312-manylinux1_x86_64.whl:

  • SHA256: 4c621e2dfee9b3bb2e6fcf85eb3f3ecac2cf37f1ec86b8533aedf769ede3aaa4
  • MD5: 5cb64446aa77cf513ea901d8b1337d01
  • BLAKE2b-256: a77c306ea0672a44e647428453fc0d612a7177bcdb026c91906e1d1aebc5ee93

Hashes for vptq-0.0.3-cp311-cp311-manylinux1_x86_64.whl:

  • SHA256: 53b181753aad9991f04d44f66d7dea44a1c959e9e38d7fba89b64172f11d6430
  • MD5: 12cc18432519cafc33ca4b8bd4ef4f97
  • BLAKE2b-256: a49028d81e0d198395e9b80366f2d49fca619b6c4058f62e82a64a1a416615ab

Hashes for vptq-0.0.3-cp310-cp310-manylinux1_x86_64.whl:

  • SHA256: b566b729f6053f58a538867d09b68ffb7d9489dcf2a607698c1d4182aca36174
  • MD5: 1c873a084e01ac4d5ed58c4c8b48e541
  • BLAKE2b-256: 3f102b7e7a1da4e7dfd095d2907d3671fefe6d6ac500781abc76a8cdc8dc45b0

Hashes for vptq-0.0.3-cp39-cp39-manylinux1_x86_64.whl:

  • SHA256: 8d824c56bdc173d150b57c77cb49657de2419fc19d76fb55e41edb8224ba3dd8
  • MD5: 23f53f2a345d03a63439a62870e3ecd2
  • BLAKE2b-256: 2f0b225844d4008c47a0749f8c541a68ee1dce6b1d5cd49c7564816737ae7c29

Hashes for vptq-0.0.3-cp38-cp38-manylinux1_x86_64.whl:

  • SHA256: 67d865517ff959820326c5fe98010d280498193d1314e1391304f67ffbd1de83
  • MD5: c205f1b5e6e7d0b22e12cfe69f0e9aaa
  • BLAKE2b-256: 2030d307a42718b2b080e3b6302049853686d311e53f9e7abfb587feac14478a
