
A library for compressing large language models using the latest research in the field, covering both training-aware and post-training techniques. The library is designed to be flexible and easy to use on top of PyTorch and Hugging Face Transformers, allowing for quick experimentation.

Project description

LLM Compressor

llm-compressor is an easy-to-use library for optimizing models for deployment with vllm, including:

  • Comprehensive set of quantization algorithms including weight-only and activation quantization
  • Seamless integration with Hugging Face models and repositories
  • safetensors-based file format compatible with vllm

LLM Compressor Flow

Supported Formats

  • Mixed Precision: W4A16, W8A16
  • Activation Quantization: W8A8 (int8 and fp8)
  • 2:4 Semi-structured Sparsity
  • Unstructured Sparsity
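
Each format is selected through the scheme string passed to a quantization modifier in the recipe (the Quick Tour below uses W4A16). As a rough sketch, assuming the GPTQModifier interface shown later also accepts the other scheme names from this list (check the library docs for the exact set):

from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Weight-only int8 quantization (W8A16); swap the scheme string to pick a different format.
recipe = GPTQModifier(scheme="W8A16", targets="Linear", ignore=["lm_head"])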

Supported Algorithms

  • PTQ (Post Training Quantization)
  • GPTQ
  • SmoothQuant
  • SparseGPT
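
These algorithms are expressed as modifiers that can be combined in a single recipe. Below is a minimal sketch of pairing SmoothQuant with GPTQ for int8 W8A8 quantization; it assumes a SmoothQuantModifier under llmcompressor.modifiers.smoothquant and that oneshot accepts a list of modifiers (see the repository examples for the exact interface):

from llmcompressor.modifiers.quantization.gptq import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes to int8 weights and activations (W8A8); lm_head stays in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]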

Installation

llm-compressor can be installed from source via a git clone and a local pip install.

git clone https://github.com/vllm-project/llm-compressor.git
pip install -e llm-compressor
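
Since this release is also published to PyPI, the packaged wheel can alternatively be installed directly:

pip install llmcompressor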

Quick Tour

The following snippets show a minimal example: 4-bit weight-only quantization of TinyLlama/TinyLlama-1.1B-Chat-v1.0 via GPTQ, followed by inference on the compressed checkpoint. Note that the model can be swapped for any local or remote HF-compatible checkpoint, and the recipe may be changed to target different quantization algorithms or formats.

Compression

Compression is easily applied by selecting an algorithm (GPTQ) and calling the oneshot API.

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Configure the GPTQ algorithm - quantize Linear layer weights to 4 bits (W4A16), ignoring lm_head
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# Apply the GPTQ algorithm, using the open_platypus dataset for calibration.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="llama-compressed-quickstart",
    overwrite_output_dir=True,
    max_seq_length=2048,
    num_calibration_samples=512,
)
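
After oneshot completes, the output directory should contain a compressed, safetensors-based checkpoint (per the format note above). A quick sanity check, with the exact file names depending on the model and library version:

import os

# List the saved checkpoint; expect a config.json, tokenizer files,
# and one or more .safetensors weight shards.
print(sorted(os.listdir("llama-compressed-quickstart")))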

Inference with vLLM

The checkpoint is ready to run with vLLM (after installing it via pip install vllm).

from vllm import LLM

model = LLM("llama-compressed-quickstart")
output = model.generate("I love 4 bit models because")
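
generate returns a list of vLLM RequestOutput objects; the generated text can be read from the first completion, and sampling can be controlled with SamplingParams, for example:

from vllm import SamplingParams

# Cap the completion length and print the generated text for the prompt above.
outputs = model.generate("I love 4 bit models because", SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)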

End-to-End Examples

The llm-compressor library provides a rich feature set for model compression. Examples and documentation for a few key flows are available in the project repository.

If you have any questions or requests, open an issue and we will add an example or documentation.

Contribute

We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.

Download files

Source Distribution: llmcompressor-0.1.0.tar.gz (156.0 kB)

Built Distribution: llmcompressor-0.1.0-py3-none-any.whl (204.0 kB)
