Skip to main content

Library for utilization of compressed safetensors of neural network models

Project description

compressed-tensors

The compressed-tensors library extends the safetensors format, providing a versatile and efficient way to store and manage compressed tensor data. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.

Why compressed-tensors?

As model compression becomes increasingly important for efficient deployment of LLMs, the landscape of quantization and compression techniques has become increasingly fragmented. Each method often comes with its own storage format and loading procedures, making it challenging to work with multiple techniques or switch between them. compressed-tensors addresses this by providing a single, extensible format that can represent a wide variety of compression schemes.

  • Unified Checkpoint Format: Supports various compression schemes in a single, consistent format.
  • Wide Compatibility: Works with popular quantization methods like GPTQ, SmoothQuant, and FP8. See llm-compressor
  • Flexible Quantization Support:
    • Weight-only quantization (e.g., W4A16, W8A16, WnA16)
    • Activation quantization (e.g., W8A8)
    • KV cache quantization
    • Non-uniform schemes (different layers can be quantized in different ways!)
  • Sparsity Support: Handles both unstructured and semi-structured (e.g., 2:4) sparsity patterns.
  • Open-Source Integration: Designed to work seamlessly with Hugging Face models and PyTorch.

This allows developers and researchers to easily experiment with composing different quantization methods, simplify model deployment pipelines, and reduce the overhead of supporting multiple compression formats in inference engines.

Installation

From PyPI

Stable release:

pip install compressed-tensors

Nightly release:

pip install compressed-tensors-nightly

From Source

git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .

Getting started

Saving/Loading Compressed Tensors (Bitmask Compression)

The function save_compressed uses the compression_format argument to apply compression to tensors. The function load_compressed reverses the process: converts the compressed weights on disk to decompressed weights in device memory.

from compressed_tensors import save_compressed, load_compressed, BitmaskConfig
from torch import Tensor
from typing import Dict

# the example BitmaskConfig method efficiently compresses 
# tensors with large number of zero entries 
compression_config = BitmaskConfig()

tensors: Dict[str, Tensor] = {"tensor_1": Tensor(
    [[0.0, 0.0, 0.0], 
     [1.0, 1.0, 1.0]]
)}
# compress tensors using BitmaskConfig compression format (save them efficiently on disk)
save_compressed(tensors, "model.safetensors", compression_format=compression_config.format)

# decompress tensors (load_compressed returns a generator for memory efficiency)
decompressed_tensors = {}
for tensor_name, tensor in load_compressed("model.safetensors", compression_config = compression_config):
    decompressed_tensors[tensor_name] = tensor

Saving/Loading Compressed Models (Bitmask Compression)

We can apply bitmask compression to a whole model. For more detailed example see example directory.

from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig
from transformers import AutoModelForCausalLM

model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

original_state_dict = model.state_dict()

compression_config = BitmaskConfig()

# save compressed model weights
save_compressed_model(model, "compressed_model.safetensors", compression_format=compression_config.format)

# load compressed model weights (`dict` turns generator into a dictionary)
state_dict = dict(load_compressed("compressed_model.safetensors", compression_config))

For more in-depth tutorial on bitmask compression, refer to the notebook.

Saving a Compressed Model with PTQ

We can use compressed-tensors to run basic post training quantization (PTQ) and save the quantized model compressed on disk

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype="auto")

config = QuantizationConfig.parse_file("./examples/bit_packing/int4_config.json")
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_loader = DataLoader(tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator())

with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        sample = {key: value.to(device) for key,value in sample.items()}
        _ = model(**sample)

        if idx >= 512:
            break

model.apply(freeze_module_quantization)
model.apply(compress_quantized_weights)

output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compressor = ModelCompressor(quantization_config=config)
compressed_state_dict = compressor.compress(model)
model.save_pretrained(output_dir, state_dict=compressed_state_dict)

For more in-depth tutorial on quantization compression, refer to the notebook.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

compressed_tensors_nightly-0.9.3.20250404.tar.gz (66.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

File details

Details for the file compressed_tensors_nightly-0.9.3.20250404.tar.gz.

File metadata

File hashes

Hashes for compressed_tensors_nightly-0.9.3.20250404.tar.gz
Algorithm Hash digest
SHA256 d0dfa4c2a823a39f534c6f3bf0f73a5b68355d3bf230b8fd6b0a5068e92f5c9f
MD5 dd367ac2dd9bef870fec1de011e750e7
BLAKE2b-256 ea82888fde823959b8eeed6e139f2f2a9449da5d651ac67fcc11a90b9cd6d9e3

See more details on using hashes here.

File details

Details for the file compressed_tensors_nightly-0.9.3.20250404-py3-none-any.whl.

File metadata

File hashes

Hashes for compressed_tensors_nightly-0.9.3.20250404-py3-none-any.whl
Algorithm Hash digest
SHA256 31dde48894ee16e0385636110adce63d90dd098d626e2c15735f10c1575aafab
MD5 65061116acefb238525199e11f0b2e30
BLAKE2b-256 8b069ccc5cf150e8aa9aa6d1fcc25b15847424c27d217116499bab894dafcf40

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page