Efficiently run models quantized with SPQR

Efficient SpQR Inference Kernel

Note: This repository contains the single-batch inference kernel for models quantized with the SpQR algorithm. It targets a specific configuration: 16x16 tiles and 3-bit weights, alongside unstructured sparsity. The compression algorithm is detailed in the research paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression".
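
To give a feel for the 16x16, 3-bit configuration, here is a rough storage estimate for one tile. It is only a sketch: the function name is hypothetical, and the count covers the raw weight codes alone, leaving out scales, zero points and the sparse outliers that the real format also stores.

```python
# Back-of-the-envelope storage for one tile of quantized weights.
# Counts only the raw weight codes; scales, zero points and sparse
# outliers are deliberately left out (assumption of this sketch).

def tile_payload_bytes(rows: int = 16, cols: int = 16, bits: int = 3) -> float:
    """Bytes needed to store rows*cols weight codes at `bits` bits each."""
    return rows * cols * bits / 8


quantized = tile_payload_bytes()        # 16 * 16 * 3 / 8 bytes
fp16 = tile_payload_bytes(bits=16)      # the same tile in half precision
print(quantized, fp16, fp16 / quantized)
```

The ratio printed last is an upper bound on the saving for the weight codes; the metadata and outliers reduce it in practice.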

Installation

Packages

To install spqr_quant, run the following.

pip install -e .
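
After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata. This is a minimal sketch using only the standard library; the helper name is hypothetical, and it assumes the distribution is registered under the name spqr_quant.

```python
from importlib import metadata


def installed_version(dist_name: str = "spqr_quant"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version())
```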

Pre-Requisites for Running the Conversion Scripts, Tests and Benchmarks

In order to run the benchmark and test suite you need to build the sources used by these scripts. You can do so by running the following command:

/bin/bash scripts/build.sh

which simply runs the setup.py script.

Conversion From Legacy to Optimized SPQR Storage

SpQR produces tensors stored in int8 (legacy tensors). To run the efficient inference kernels, these tensors must first be converted into the optimized storage format used by the CUDA kernel. To do so, run the following script:

usage: convert_legacy_model_format.py [-h] --base_model BASE_MODEL --legacy_model_path LEGACY_MODEL_PATH [--sparse_strategy {csr,ptcsr,optimize_latency}] [--save_pt SAVE_PT] [--save_per_layer SAVE_PER_LAYER]

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        path or name of the unquantized model
  --legacy_model_path LEGACY_MODEL_PATH
                        path to legacy model
  --sparse_strategy {csr,ptcsr,optimize_latency}
                        Sparse strategy storage. Options: csr, ptcsr, optimize_latency. CSR - Compressed Sparse Rows; PTCSR - Alternative storage format; optimize_latency - Use the current GPU to determine the optimal storage format to reduce
                        kernel latency
  --save_pt SAVE_PT     Save the converted quantized .pt model here
  --save_per_layer SAVE_PER_LAYER
                        Save the converted quantized m
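
The csr option above refers to Compressed Sparse Rows. As an illustration of the general technique, the sketch below builds the three CSR arrays from a small dense matrix and uses them for a matrix-vector product. It is plain Python with hypothetical function names and does not reflect the actual memory layout used by the CUDA kernel or by the PTCSR variant.

```python
def dense_to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # offset at which the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x while touching only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


dense = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 4.0],
]
values, col_idx, row_ptr = dense_to_csr(dense)
print(values, col_idx, row_ptr)
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))
```

An empty row costs only one repeated entry in row_ptr, which is why CSR suits matrices with few, irregularly placed nonzeros.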

Inference

The file inference_demo.py demos the functionality of this inference kernel in the context of running end-to-end model inference. Below is a description of how to launch it.

usage: inference_demo.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH] [--compressed_model_path COMPRESSED_MODEL_PATH] --execution_mode {0,1}

options:
  -h, --help            show this help message and exit
  --pretrained_model_path PRETRAINED_MODEL_PATH
                        Path to the pretrained model
  --compressed_model_path COMPRESSED_MODEL_PATH
                        Path to the compressed .pt model
  --execution_mode {0,1}
                        If set to 0, will evaluate the dense pretrained model. If set to 1, will evaluate the spqr-quantized model

This script also reports the mean, median and minimum time of the forward() passes, as well as the total inference execution time.
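
The statistics mentioned above can be reproduced for any callable with the standard library. This is a sketch with hypothetical helper names, not the measurement code used by inference_demo.py; the workload here is a stand-in for a forward() call.

```python
import statistics
import time


def time_calls(fn, repeats: int = 10):
    """Call fn() `repeats` times and return the per-call durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Mean, median, minimum and total of a list of durations."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "total": sum(durations),
    }


# Stand-in workload; a real run would time the model's forward() instead.
stats = summarize(time_calls(lambda: sum(range(10_000)), repeats=5))
print(stats)
```

When timing GPU work from PyTorch, the device should be synchronized (for example with torch.cuda.synchronize()) before reading the clock, since kernel launches are asynchronous.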

Hugging Face Conversion

To convert a model into a Hugging Face compatible format, use the convert_to_hf.py script:

usage: convert_to_hf.py [-h] [--model MODEL] [--config_path CONFIG_PATH] [--in_path_pt IN_PATH_PT] [--out_path OUT_PATH] [--save_safetensors] [--trust_remote_code] [--load_model] [--save_tokenizer]

options:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model to base config on, as in AutoConfig.from_pretrained()
  --config_path CONFIG_PATH
                        Path to the model to base config on, as in AutoConfig.from_pretrained()
  --in_path_pt IN_PATH_PT
                        Path of the checkpoint to convert
  --out_path OUT_PATH   Path to save HF compatible checkpoint to
  --save_safetensors    Whether to save in safetensors format
  --trust_remote_code   Whether to trust remote code
  --load_model          Whether to load model
  --save_tokenizer      Whether to save tokenizer
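
When the conversion is driven from another Python program, the argument list can be assembled programmatically. The sketch below only builds the command and does not execute it; the helper name, the default script location and the example paths are assumptions, while the flags are the ones listed in the usage text above.

```python
import sys


def convert_to_hf_command(in_path_pt, out_path, model=None, config_path=None,
                          save_safetensors=False, trust_remote_code=False,
                          load_model=False, save_tokenizer=False,
                          script="convert_to_hf.py"):
    """Build an argv list for convert_to_hf.py (script path is an assumption)."""
    cmd = [sys.executable, script, "--in_path_pt", in_path_pt, "--out_path", out_path]
    if model is not None:
        cmd += ["--model", model]
    if config_path is not None:
        cmd += ["--config_path", config_path]
    # Boolean switches are appended only when enabled.
    switches = (
        ("--save_safetensors", save_safetensors),
        ("--trust_remote_code", trust_remote_code),
        ("--load_model", load_model),
        ("--save_tokenizer", save_tokenizer),
    )
    cmd += [flag for flag, enabled in switches if enabled]
    return cmd


# Hypothetical paths, for illustration only.
cmd = convert_to_hf_command("model.pt", "hf_out", save_safetensors=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.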

Benchmarks (matvec kernel)

In order to run the matvec benchmark suite, one should run:

bench_spqr.py [-h] --tensor_path TENSOR_PATH [--ptcsr_path PTCSR_PATH] [--output_path OUTPUT_PATH]

options:
  -h, --help            show this help message and exit
  --tensor_path TENSOR_PATH
                        Path to folder containing the tensors of the form model_path/ 0/ tensor0 tensor1
  --ptcsr_path PTCSR_PATH
                        Path to folder containing the tensors of the form model_path/ 0/ tensor0 tensor1
  --output_path OUTPUT_PATH
                        Path to results *.csv file.

Make sure that <tensor_path> and the optional <ptcsr_path> point to folders containing quantized matrices produced by the convert_legacy_model_format.py script. Use <cuda_device_id> to set the CUDA device during the benchmark. The script writes the results to <output_path>.
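
Before launching the benchmark it can help to check that the tensor folder has the expected shape. The sketch below assumes, based on the help text, a layout of numbered layer directories each holding tensor files (model_path/0/tensor0, model_path/0/tensor1, and so on); the helper name is hypothetical, and the demo builds a throwaway folder rather than reading real converted tensors.

```python
import tempfile
from pathlib import Path


def list_layer_tensors(tensor_path):
    """Map each numeric layer directory to the sorted file names inside it."""
    layers = {}
    for layer_dir in Path(tensor_path).iterdir():
        if layer_dir.is_dir() and layer_dir.name.isdigit():
            layers[int(layer_dir.name)] = sorted(
                p.name for p in layer_dir.iterdir() if p.is_file()
            )
    return dict(sorted(layers.items()))


# Build a tiny sample tree that mimics the assumed layout, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    for layer, names in {"0": ["tensor0", "tensor1"], "1": ["tensor0"]}.items():
        layer_dir = Path(tmp) / layer
        layer_dir.mkdir()
        for name in names:
            (layer_dir / name).touch()
    layout = list_layer_tensors(tmp)

print(layout)
```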

Tests

In order to run the unit tests, simply execute:

python3 tests/test.py

Release history

0.2.0 (this version): Dec 2, 2024

0.1.0: Nov 27, 2024

Download files

Source Distribution

spqr_quant-0.2.0.tar.gz (20.3 kB)

Uploaded Dec 2, 2024 Source

Built Distribution

spqr_quant-0.2.0-py3-none-any.whl (20.6 kB)

Uploaded Dec 2, 2024 Python 3

File details

Details for the file spqr_quant-0.2.0.tar.gz.

File metadata

Download URL: spqr_quant-0.2.0.tar.gz
Upload date: Dec 2, 2024
Size: 20.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for spqr_quant-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`11e47bb495dbb7a3d832c05a46c76bf9185542d6a466533cfe3c14c2059337c2`
MD5	`c081402f53d2c8654cb80e6af82986ed`
BLAKE2b-256	`49d7fee9dfd2dadf1a4959ec7b4010eea866c810d14cac5f7645acd8a0606957`
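
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names are hypothetical, and the expected value is the source distribution digest copied from the table.

```python
import hashlib

# SHA256 digest of spqr_quant-0.2.0.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "11e47bb495dbb7a3d832c05a46c76bf9185542d6a466533cfe3c14c2059337c2"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected: str = EXPECTED_SDIST_SHA256) -> bool:
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling matches_expected("spqr_quant-0.2.0.tar.gz") on the downloaded archive should return True if the file is intact; the wheel can be checked the same way by passing its digest as the second argument.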

File details

Details for the file spqr_quant-0.2.0-py3-none-any.whl.

File metadata

Download URL: spqr_quant-0.2.0-py3-none-any.whl
Upload date: Dec 2, 2024
Size: 20.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for spqr_quant-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`407372f87dccd41d5f93d17b52137912e15a14a4ab28f52b6e74482c0956aff4`
MD5	`cef92d62b458538b0a0e36e00671a4ee`
BLAKE2b-256	`187fc05258cf7c31fde5e4c1f1dbdb35ed2741fc69e8511e19d6b58ea61886ad`
