The inference kernels for LeanQuant models.

LeanQuant

Overview

This package provides efficient inference kernels for running non-uniformly quantized LeanQuant models on CUDA-enabled GPUs. LeanQuant is a scalable and accurate quantization algorithm that compresses large language models by 4-8x while maintaining competitive performance.
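
As a rough illustration of where a 4-8x figure can come from, the sketch below assumes 16-bit baseline weights and 4-, 3-, or 2-bit quantized weights, and ignores the storage overhead of quantization grids and other metadata. These bit widths and the helper name compression_ratio are assumptions for illustration, not a statement of which formats the package supports.

```python
def compression_ratio(baseline_bits: int, quantized_bits: int) -> float:
    """Ideal size ratio, ignoring grid/metadata overhead (an assumption)."""
    if baseline_bits <= 0 or quantized_bits <= 0:
        raise ValueError("bit widths must be positive")
    return baseline_bits / quantized_bits


# Assumed 16-bit baseline; the quantized bit widths are illustrative.
for bits in (4, 3, 2):
    print(f"{bits}-bit: {compression_ratio(16, bits):.2f}x")
```

Real checkpoints will compress somewhat less than this ideal ratio, since the grid values and any unquantized layers also take space.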

Installation

Ensure your GPU driver supports CUDA 11 or CUDA 12. You can check with the command nvidia-smi | grep CUDA, which prints the highest CUDA version the installed driver supports.

To install:

# For CUDA 11.x (quotes keep shells such as zsh from expanding the brackets)
pip install "leanquant[cuda11]"

# For CUDA 12.x
pip install "leanquant[cuda12]"
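
If you would rather script the choice, the following is a minimal sketch that reads the "CUDA Version" field from the nvidia-smi header and maps it to one of the two extras above. The helper names cuda_major_from_smi and pick_extra are hypothetical and are not part of the leanquant package. Note that nvidia-smi reports the highest CUDA version the driver supports, not necessarily an installed toolkit.

```python
import re
import subprocess
from typing import Optional


def cuda_major_from_smi(smi_output: str) -> Optional[int]:
    """Extract the major CUDA version from nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return int(match.group(1)) if match else None


def pick_extra(smi_output: str) -> Optional[str]:
    """Map the reported CUDA major version to a leanquant extra."""
    major = cuda_major_from_smi(smi_output)
    if major in (11, 12):
        return f"cuda{major}"
    return None  # version not found, or not CUDA 11/12


if __name__ == "__main__":
    try:
        output = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        output = ""  # nvidia-smi missing or failed
    extra = pick_extra(output)
    if extra:
        print(f'pip install "leanquant[{extra}]"')
    else:
        print("Could not detect CUDA 11 or 12 via nvidia-smi.")
```

The script only prints the suggested command; it does not install anything.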

Models

Quantized LeanQuant models are available for download on our HuggingFace page: huggingface.co/LeanQuant

Technical Details

LeanQuant quantizes weights onto a loss-error-aware, non-uniform grid. Our technique:

  • Achieves high compression ratios: Reduces model size by 4-8x while maintaining competitive performance
  • Preserves model quality: Maintains performance comparable to full-precision models on benchmarks
  • Optimizes GPU execution: Provides custom CUDA kernels designed specifically for the non-uniform quantization format
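
To make "non-uniform quantization format" concrete: in a non-uniform scheme, each weight is stored as a small integer code that indexes into a table of floating-point grid values, rather than being mapped through a linear scale and offset. The pure-Python sketch below illustrates that idea with a hypothetical 2-bit grid. The function names, grid values, and packing layout are assumptions for illustration and do not describe the package's actual kernels or storage format.

```python
from typing import List, Sequence


def pack_2bit(codes: Sequence[int]) -> bytes:
    """Pack 2-bit codes (values 0..3), four per byte, low bits first."""
    if any(not 0 <= c <= 3 for c in codes):
        raise ValueError("codes must be in 0..3")
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        out.append(byte)
    return bytes(out)


def dequantize_2bit(packed: bytes, grid: Sequence[float], n: int) -> List[float]:
    """Unpack n 2-bit codes and look each one up in the grid."""
    if len(grid) != 4:
        raise ValueError("a 2-bit grid needs exactly 4 entries")
    weights = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        weights.append(grid[code])
    return weights


# Hypothetical non-uniform grid: the points are not evenly spaced.
grid = [-0.9, -0.1, 0.05, 0.6]
codes = [0, 3, 1, 2, 2, 1]
packed = pack_2bit(codes)
restored = dequantize_2bit(packed, grid, len(codes))
print(len(packed), restored)
```

A GPU kernel would perform the same table lookup in parallel, typically fused with the matrix multiplication, which is why a non-uniform format needs dedicated kernels.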

The algorithm allocates quantization precision according to parameter sensitivity, so that precision is concentrated on the parameters where quantization error matters most.
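
One generic way to build a grid that respects parameter sensitivity is weighted k-means over the weight values, where each value's pull on its grid point is scaled by a sensitivity score. The sketch below illustrates that general idea only: the sensitivity values are made up, the function name weighted_kmeans_1d is hypothetical, and it is not presented as the exact procedure from the paper.

```python
from typing import List, Sequence


def weighted_kmeans_1d(
    values: Sequence[float],
    sensitivities: Sequence[float],
    init_grid: Sequence[float],
    iters: int = 20,
) -> List[float]:
    """Lloyd's algorithm in 1-D with per-value sample weights."""
    if len(values) != len(sensitivities):
        raise ValueError("values and sensitivities must have equal length")
    grid = list(init_grid)
    for _ in range(iters):
        sums = [0.0] * len(grid)
        mass = [0.0] * len(grid)
        for v, s in zip(values, sensitivities):
            # Assign each value to its nearest grid point.
            k = min(range(len(grid)), key=lambda j: abs(v - grid[j]))
            sums[k] += s * v
            mass[k] += s
        # Move each grid point to the sensitivity-weighted mean of its
        # members; grid points with no members stay where they are.
        new_grid = [
            sums[k] / mass[k] if mass[k] > 0 else grid[k]
            for k in range(len(grid))
        ]
        if new_grid == grid:
            break  # converged
        grid = new_grid
    return grid


values = [0.0, 1.0, 10.0]
# Uniform sensitivities give the ordinary k-means result.
print(weighted_kmeans_1d(values, [1.0, 1.0, 1.0], [0.0, 10.0]))  # [0.5, 10.0]
# A higher (made-up) sensitivity on 1.0 pulls the first grid point toward it.
print(weighted_kmeans_1d(values, [1.0, 3.0, 1.0], [0.0, 10.0]))  # [0.75, 10.0]
```

With the higher sensitivity, the value 1.0 is reconstructed with error 0.25 instead of 0.5, at the cost of a larger error on the less sensitive value 0.0.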

For a comprehensive explanation of our methodology and benchmark results, please refer to our research paper: https://openreview.net/forum?id=ISqx8giekS

Citation

If you find LeanQuant useful in your research or applications, please consider citing our work:

@inproceedings{
    zhang2025leanquant,
    title={LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid},
    author={Tianyi Zhang and Anshumali Shrivastava},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=ISqx8giekS}
}

Download files

Source Distribution

leanquant-0.1.1.tar.gz (5.3 kB view details)

Uploaded Source

Built Distribution

leanquant-0.1.1-py3-none-any.whl (5.6 kB view details)

Uploaded Python 3

File details

Details for the file leanquant-0.1.1.tar.gz.

File metadata

  • Download URL: leanquant-0.1.1.tar.gz
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for leanquant-0.1.1.tar.gz
Algorithm Hash digest
SHA256 5b269964faa392adffdd07f07528debecbe43ba27d9b64f0ae149a6450068c35
MD5 a04c2040ee4121ce3462c66d88b5bea3
BLAKE2b-256 e70eee91cad8fff5cdc9c543f17eba09c80a374ebcdf5836662dfbeb737d4f57

File details

Details for the file leanquant-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: leanquant-0.1.1-py3-none-any.whl
  • Size: 5.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for leanquant-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 ba123232c54910ae7985cbf3d6109133292d42fca7b72710535454760a961b61
MD5 4f44fbe76dfef27b1b8b286b94d8a414
BLAKE2b-256 83212839ba4a349c27a971ecae978986949a0180f06c1ceec806df012f4e3e13
