Efficiently run models quantized with SpQR
Project description
Efficient SpQR Inference Kernel
Note: This repository contains the single-batch inference kernel for models quantized via the SpQR algorithm. It targets a specific configuration: 16x16 tiles and 3-bit weights, alongside unstructured sparsity. The compression algorithm is detailed in the research paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression".
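To give a feel for the 16x16 tile and 3-bit configuration mentioned above, the following sketch computes how many tiles a weight matrix splits into and how many bytes its 3-bit codes occupy. It is back-of-the-envelope arithmetic only: it ignores scales, zero points, and the sparse outliers, and the function names and layer shape are illustrative assumptions, not part of spqr_quant.

```python
TILE = 16  # tile edge length from the note above
BITS = 3   # bits per quantized weight from the note above


def tile_grid(rows: int, cols: int, tile: int = TILE) -> tuple[int, int]:
    """Number of tiles along each dimension, rounding partial tiles up."""
    return (rows + tile - 1) // tile, (cols + tile - 1) // tile


def dense_code_bytes(rows: int, cols: int, bits: int = BITS) -> int:
    """Bytes needed for the quantized codes alone (no scales, zeros, or outliers)."""
    return (rows * cols * bits + 7) // 8


if __name__ == "__main__":
    rows, cols = 4096, 4096  # hypothetical layer shape
    print("tiles:", tile_grid(rows, cols))
    print("3-bit code bytes:", dense_code_bytes(rows, cols))
    print("fp16 bytes:", rows * cols * 2)
```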
Installation
Packages
To install spqr_quant, run the following.
pip install -e .
Pre-Requisites for Running the Conversion Scripts, Tests and Benchmarks
In order to run the benchmark and test suite you need to build the sources used by these scripts. You can do so by running the following command:
/bin/bash scripts/build.sh
which simply runs the setup.py script.
Conversion From Legacy to Optimized SpQR Storage
SpQR produces tensors stored in int8 (the legacy tensors). Before you can run the efficient inference kernels, you must convert these tensors into the optimized storage format used by the CUDA kernel. To do so, run the following script:
usage: convert_legacy_model_format.py [-h] --base_model BASE_MODEL --legacy_model_path LEGACY_MODEL_PATH [--sparse_strategy {csr,ptcsr,optimize_latency}] [--save_pt SAVE_PT] [--save_per_layer SAVE_PER_LAYER]
options:
-h, --help show this help message and exit
--base_model BASE_MODEL
path or name of the unquantized model
--legacy_model_path LEGACY_MODEL_PATH
path to legacy model
--sparse_strategy {csr,ptcsr,optimize_latency}
Sparse strategy storage. Options: csr, ptcsr, auto.
CSR - Compressed Sparse Rows
PTCSR - Alternative storage format
optimize_latency - Use the current GPU to determine the optimal storage format to reduce kernel latency
--save_pt SAVE_PT Save the converted quantized .pt model here
--save_per_layer SAVE_PER_LAYER
Save the converted quantized m
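For readers unfamiliar with the csr option, the sketch below shows generic Compressed Sparse Rows storage in plain Python, together with a matrix-vector product over it. It illustrates the textbook format only; the layout the CUDA kernel actually uses, and the PTCSR variant, are not described here and may differ. All names are illustrative.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to (values, col_indices, row_offsets)."""
    values, col_indices, row_offsets = [], [], [0]
    for row in dense:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(col)
        # row_offsets[i + 1] marks where row i ends in `values`
        row_offsets.append(len(values))
    return values, col_indices, row_offsets


def csr_matvec(values, col_indices, row_offsets, x):
    """Compute y = A @ x while touching only the stored nonzeros."""
    y = []
    for r in range(len(row_offsets) - 1):
        start, end = row_offsets[r], row_offsets[r + 1]
        y.append(sum(values[k] * x[col_indices[k]] for k in range(start, end)))
    return y


if __name__ == "__main__":
    # Hypothetical outlier matrix: mostly zeros, a few stored entries.
    outliers = [
        [0.0, 2.0, 0.0],
        [0.0, 0.0, 0.0],
        [1.5, 0.0, -1.0],
    ]
    vals, cols, offs = to_csr(outliers)
    print(vals, cols, offs)
    print(csr_matvec(vals, cols, offs, [1.0, 2.0, 3.0]))
```

An empty row costs nothing beyond one repeated offset, which is why this format suits unstructured sparsity.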
Inference
The file inference_demo.py demos the functionality of this inference kernel in the context of
running end-to-end model inference. Below is a description of how to launch it.
usage: inference_demo.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH] [--compressed_model_path COMPRESSED_MODEL_PATH] --execution_mode {0,1}
options:
-h, --help show this help message and exit
--pretrained_model_path PRETRAINED_MODEL_PATH
Path to the model to the pretrained model
--compressed_model_path COMPRESSED_MODEL_PATH
Path to the compressed .pt model
--execution_mode {0,1}
If set to 0, will evaluate the dense pretrained model. If set to 1, will evaluate the spqr-quantized model
This script also reports the mean, median, and minimum time of the forward() passes, as well as the total inference execution time.
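A timing summary of this kind can be reproduced for any callable with the standard library. This is a generic sketch, not the code inference_demo.py uses; time_calls, summarize, and the stand-in workload are hypothetical, and the total here is simply the sum of the measured calls. For GPU work, the device should also be synchronized before reading the clock (for example with torch.cuda.synchronize()), since CUDA kernels launch asynchronously.

```python
import statistics
import time


def time_calls(fn, repeats=10):
    """Run fn() `repeats` times and return per-call durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Mean, median, minimum, and sum of a list of durations."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "total": sum(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a model forward pass.
    stats = summarize(time_calls(lambda: sum(range(10_000))))
    print(stats)
```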
Hugging Face Conversion
To convert a model into a Hugging Face compatible format, use the convert_to_hf.py script:
usage: convert_to_hf.py [-h] [--model MODEL] [--config_path CONFIG_PATH] [--in_path_pt IN_PATH_PT] [--out_path OUT_PATH] [--save_safetensors] [--trust_remote_code] [--load_model] [--save_tokenizer]
options:
-h, --help show this help message and exit
--model MODEL Path to the model to base config on, as in AutoConfig.from_pretrained()
--config_path CONFIG_PATH
Path to the model to base config on, as in AutoConfig.from_pretrained()
--in_path_pt IN_PATH_PT
Path of the checkpoint to convert
--out_path OUT_PATH Path to save HF compatible checkpoint to
--save_safetensors Whether to save in safetensors format
--trust_remote_code Whether to trust remote code
--load_model Whether to load model
--save_tokenizer Whether to save tokenizer
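As an illustration, the sketch below assembles one possible invocation from Python using only flags that appear in the usage text above. The script location and all three paths are hypothetical placeholders, and the helper name is made up for this example; adjust them to your checkout.

```python
import shlex
import sys


def build_convert_to_hf_cmd(model, in_path_pt, out_path, save_safetensors=True):
    """Assemble an argument list using only flags from the usage text."""
    cmd = [
        sys.executable, "convert_to_hf.py",  # assumed script location
        "--model", model,
        "--in_path_pt", in_path_pt,
        "--out_path", out_path,
    ]
    if save_safetensors:
        cmd.append("--save_safetensors")
    return cmd


if __name__ == "__main__":
    cmd = build_convert_to_hf_cmd(
        model="path/to/base_model",        # hypothetical path
        in_path_pt="path/to/checkpoint.pt",  # hypothetical path
        out_path="path/to/hf_checkpoint",  # hypothetical path
    )
    # Only prints the command; use subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```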
Benchmarks (matvec kernel)
In order to run the matvec benchmark suite, one should run:
bench_spqr.py [-h] --tensor_path TENSOR_PATH [--ptcsr_path PTCSR_PATH] [--output_path OUTPUT_PATH]
options:
-h, --help show this help message and exit
--tensor_path TENSOR_PATH
Path to folder containing the tensors of the form model_path/ 0/ tensor0 tensor1
--ptcsr_path PTCSR_PATH
Path to folder containing the tensors of the form model_path/ 0/ tensor0 tensor1
--output_path OUTPUT_PATH
Path to results *.csv file.
Make sure that <tensor_path> and the optional <ptcsr_path> each point to a folder containing quantized matrices produced by the convert_legacy_model_format.py script.
Use <cuda_device_id> to set the CUDA device during the benchmark. The script writes the results to <results_output>.
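Before launching the benchmark, it can help to check that a folder matches the model_path/ 0/ tensor0 tensor1 layout named in the help text. The sketch below lists the files found in each layer subfolder. The helper name is hypothetical, and the layout is inferred from the help text alone, so treat it as an assumption.

```python
import tempfile
from pathlib import Path


def list_layer_tensors(tensor_path):
    """Map each layer subfolder name to the sorted file names inside it."""
    root = Path(tensor_path)
    return {
        layer.name: sorted(f.name for f in layer.iterdir() if f.is_file())
        for layer in sorted(root.iterdir())
        if layer.is_dir()
    }


if __name__ == "__main__":
    # Build a throwaway folder that mimics the layout from the help text.
    with tempfile.TemporaryDirectory() as tmp:
        for layer in ("0", "1"):
            layer_dir = Path(tmp) / layer
            layer_dir.mkdir()
            for name in ("tensor0", "tensor1"):
                (layer_dir / name).touch()
        print(list_layer_tensors(tmp))
```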
Tests
To run the unit tests, execute:
python3 tests/test.py
File details
Details for the file spqr_quant-0.1.0.tar.gz.
File metadata
- Download URL: spqr_quant-0.1.0.tar.gz
- Upload date:
- Size: 21.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 90b35fa51ce2ac319547c40bbc75ebb44e5c183fbcc886a3f520e8ec39f6adc8 |
| MD5 | 581ec0be3241718d05e203da2945c0cd |
| BLAKE2b-256 | 4c0121c3160d948b9b1920310bb65271abe16a9e91e7825ca94904a271a98bf2 |
File details
Details for the file spqr_quant-0.1.0-py3-none-any.whl.
File metadata
- Download URL: spqr_quant-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a5b99a26f11e6af1d4fbf710f7a305c1f9cc9b118949f52e15e082b64aa97078 |
| MD5 | 0228eef3aa50ba17afe162abaa1238e0 |
| BLAKE2b-256 | 0f69395fe88be005448e1008407fa0d6aeccbcad2cd19252be2d4a34df74b0ba |