Library for utilization of compressed safetensors of neural network models

compressed-tensors

The compressed-tensors library extends the safetensors format, providing a versatile and efficient way to store and manage compressed tensor data. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.

Why compressed-tensors?

As model compression becomes more important for deploying LLMs efficiently, the landscape of quantization and compression techniques has grown fragmented. Each method often comes with its own storage format and loading procedure, which makes it hard to combine techniques or switch between them. compressed-tensors addresses this with a single, extensible format that can represent a wide variety of compression schemes.

  • Unified Checkpoint Format: Supports various compression schemes in a single, consistent format.
  • Wide Compatibility: Works with popular quantization methods like GPTQ, SmoothQuant, and FP8. See llm-compressor.
  • Flexible Quantization Support:
    • Weight-only quantization (e.g., W4A16, W8A16, WnA16)
    • Activation quantization (e.g., W8A8)
    • KV cache quantization
    • Non-uniform schemes (different layers can be quantized in different ways!)
  • Sparsity Support: Handles both unstructured and semi-structured (e.g., 2:4) sparsity patterns.
  • Open-Source Integration: Designed to work seamlessly with Hugging Face models and PyTorch.

This allows developers and researchers to easily experiment with composing different quantization methods, simplify model deployment pipelines, and reduce the overhead of supporting multiple compression formats in inference engines.
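
To make the idea of composing schemes concrete, here is a minimal pure-Python sketch that applies 2:4 sparsity and then symmetric 4-bit weight quantization to a small list of weights. The helper names (`prune_2_of_4`, `quantize_symmetric`, `dequantize`) are hypothetical and written only for illustration; they are not part of the compressed-tensors API, and the library's own algorithms and storage layout may differ.

```python
from typing import List, Tuple


def prune_2_of_4(weights: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every group of four (2:4 sparsity)."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def quantize_symmetric(weights: List[float], num_bits: int = 4) -> Tuple[List[int], float]:
    """Quantize one group to signed num_bits integers that share a single scale."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (num_bits - 1))    # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_of_4(weights)          # half of each group of four becomes zero
q, scale = quantize_symmetric(sparse)   # zeros stay exactly zero after quantization
restored = dequantize(q, scale)
print(sparse, q, scale, restored, sep="\n")
```

Because the quantization is symmetric, pruned entries map to the integer 0, so the sparsity pattern survives quantization. This is one reason the two schemes can share a single checkpoint format.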

Installation

From PyPI

Stable release:

pip install compressed-tensors

Nightly release:

pip install --pre compressed-tensors

From Source

git clone https://github.com/vllm-project/compressed-tensors
cd compressed-tensors
pip install -e .
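
Whichever install route is used, it can help to confirm which version ended up in the environment. The snippet below is a small sketch that uses only the standard library; `installed_version` is a hypothetical helper name, and the distribution name is taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "compressed-tensors") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found:
    print(f"compressed-tensors version: {found}")
else:
    print("compressed-tensors is not installed in this environment")
```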

Getting started

Saving a Compressed Model with PTQ

We can use compressed-tensors to run basic post-training quantization (PTQ) and save the quantized model to disk in compressed form:

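import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, DefaultDataCollator

# NOTE: these import paths are assumed from the package layout and may differ between releases.
from compressed_tensors.compressors import ModelCompressor
from compressed_tensors.quantization import (
    QuantizationConfig,
    QuantizationStatus,
    apply_quantization_config,
    compress_quantized_weights,
    freeze_module_quantization,
)

device = "cuda:0"
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")

# Attach the quantization scheme and put the model into calibration mode.
config = QuantizationConfig.parse_file("./examples/bit_packing/int4_config.json")
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

# Build a small calibration set from tokenized text.
dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_loader = DataLoader(tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator())

# Run forward passes only; no gradients are needed for calibration.
with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        sample = {key: value.to(device) for key, value in sample.items()}
        _ = model(**sample)

        # Stop after roughly 512 calibration samples.
        if idx >= 512:
            break

# Freeze the calibrated quantization parameters, then convert the weights.
model.apply(freeze_module_quantization)
model.apply(compress_quantized_weights)

# Compress the model and save it to disk.
output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compressor = ModelCompressor.from_pretrained_model(model)
compressor.compress_model(model)
model.save_pretrained(output_dir)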
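
The example above loads a config from a `bit_packing` directory and writes to a directory named `..._packed_quantize`. As a rough illustration of what packing 4-bit values can look like, the sketch below stores eight signed 4-bit integers in each 32-bit word. The helpers `pack_int4` and `unpack_int4` are hypothetical, and the nibble order chosen here is an assumption for illustration; the layout compressed-tensors actually writes to disk may differ.

```python
from typing import List


def pack_int4(values: List[int]) -> List[int]:
    """Pack signed 4-bit integers (-8..7) into unsigned 32-bit words, eight per word."""
    packed: List[int] = []
    for start in range(0, len(values), 8):
        word = 0
        for offset, v in enumerate(values[start:start + 8]):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in 4 bits")
            # Keep the low four bits (two's complement) and shift into place.
            word |= (v & 0xF) << (4 * offset)
        packed.append(word)
    return packed


def unpack_int4(packed: List[int], count: int) -> List[int]:
    """Recover the first `count` signed 4-bit integers from packed 32-bit words."""
    values: List[int] = []
    for word in packed:
        for offset in range(8):
            nibble = (word >> (4 * offset)) & 0xF
            # Nibbles 8..15 represent the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


q = [7, 0, 0, -5, 0, 2, -2, 0, 3]
words = pack_int4(q)
roundtrip = unpack_int4(words, len(q))
print([hex(w) for w in words], roundtrip)
```

Packing this way stores each weight in 4 bits instead of 16, so the weight payload is about a quarter of its 16-bit size, before counting the scales and other metadata that a quantized checkpoint also needs.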
