Skip to main content

Library for utilization of compressed safetensors of neural network models

Project description

compressed_tensors

This repository extends a safetensors format to efficiently store sparse and/or quantized tensors on disk. compressed-tensors format supports multiple compression types to minimize the disk space and facilitate the tensor manipulation.

Motivation

Reduce disk space by saving sparse tensors in a compressed format

The compressed format stores the data much more efficiently by taking advantage of two properties of tensors:

  • Sparse tensors -> due to a large number of entries that are equal to zero.
  • Quantized -> due to their low precision representation.

Introduce an elegant interface to save/load compressed tensors

The library provides the user with the ability to compress/decompress tensors. The properties of tensors are defined by human-readable configs, allowing the users to understand the compression format at a quick glance.

Installation

Pip

pip install compressed-tensors

From source

git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .

Getting started

Saving/Loading Compressed Tensors (Bitmask Compression)

The function save_compressed uses the compression_format argument to apply compression to tensors. The function load_compressed reverses the process: converts the compressed weights on disk to decompressed weights in device memory.

from compressed_tensors import save_compressed, load_compressed, BitmaskConfig
from torch import Tensor
from typing import Dict

# the example BitmaskConfig method efficiently compresses 
# tensors with large number of zero entries 
compression_config = BitmaskConfig()

tensors: Dict[str, Tensor] = {"tensor_1": Tensor(
    [[0.0, 0.0, 0.0], 
     [1.0, 1.0, 1.0]]
)}
# compress tensors using BitmaskConfig compression format (save them efficiently on disk)
save_compressed(tensors, "model.safetensors", compression_format=compression_config.format)

# decompress tensors (load_compressed returns a generator for memory efficiency)
decompressed_tensors = {}
for tensor_name, tensor in load_compressed("model.safetensors", compression_config = compression_config):
    decompressed_tensors[tensor_name] = tensor

Saving/Loading Compressed Models (Bitmask Compression)

We can apply bitmask compression to a whole model. For more detailed example see example directory.

from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig
from transformers import AutoModelForCausalLM

model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

original_state_dict = model.state_dict()

compression_config = BitmaskConfig()

# save compressed model weights
save_compressed_model(model, "compressed_model.safetensors", compression_format=compression_config.format)

# load compressed model weights (`dict` turns generator into a dictionary)
state_dict = dict(load_compressed("compressed_model.safetensors", compression_config))

For more in-depth tutorial on bitmask compression, refer to the notebook.

Saving a Compressed Model with PTQ

We can use compressed-tensors to run basic post training quantization (PTQ) and save the quantized model compressed on disk

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype="auto")

config = QuantizationConfig.parse_file("./examples/bit_packing/int4_config.json")
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_loader = DataLoader(tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator())

with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        sample = {key: value.to(device) for key,value in sample.items()}
        _ = model(**sample)

        if idx >= 512:
            break

model.apply(freeze_module_quantization)
model.apply(compress_quantized_weights)

output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compressor = ModelCompressor(quantization_config=config)
compressed_state_dict = compressor.compress(model)
model.save_pretrained(output_dir, state_dict=compressed_state_dict)

For more in-depth tutorial on quantization compression, refer to the notebook.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

compressed-tensors-0.4.0.tar.gz (45.7 kB view details)

Uploaded Source

Built Distribution

compressed_tensors-0.4.0-py3-none-any.whl (75.8 kB view details)

Uploaded Python 3

File details

Details for the file compressed-tensors-0.4.0.tar.gz.

File metadata

  • Download URL: compressed-tensors-0.4.0.tar.gz
  • Upload date:
  • Size: 45.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.10

File hashes

Hashes for compressed-tensors-0.4.0.tar.gz
Algorithm Hash digest
SHA256 217cfea0f20d1a9346b45b11554b9fd909997b50e81cf4c43b17d06b63a19434
MD5 46b0c4e7a652ceca35084113870e9017
BLAKE2b-256 4f249b665a5461151cf42a19d907e314e1c26352b442d5c104b4b326c4265cf5

See more details on using hashes here.

File details

Details for the file compressed_tensors-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: compressed_tensors-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 75.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.10

File hashes

Hashes for compressed_tensors-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f40756ffffcc2256eb3ccf4a9d0dabe197bb067c60bb7654f642085f7c67e4d3
MD5 ec9ede18aa0bf77fa8217a6d966317b4
BLAKE2b-256 fc680e796b66a8fa23000c9b5957c94c69f2aa84e80cb0d6856df5630f7f9961

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page