compressed-tensors
A library for storing and working with compressed safetensors of neural network models.
The compressed-tensors library extends the safetensors format, providing a versatile and efficient way to store and manage compressed tensor data. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.
Why compressed-tensors?
As model compression becomes increasingly important for the efficient deployment of LLMs, the landscape of quantization and compression techniques has grown highly fragmented.
Each method often comes with its own storage format and loading procedures, making it challenging to work with multiple techniques or switch between them.
compressed-tensors addresses this by providing a single, extensible format that can represent a wide variety of compression schemes.
- Unified Checkpoint Format: Supports various compression schemes in a single, consistent format.
- Wide Compatibility: Works with popular quantization methods like GPTQ, SmoothQuant, and FP8. See llm-compressor for producing models in this format.
- Flexible Quantization Support:
- Weight-only quantization (e.g., W4A16, W8A16, WnA16)
- Activation quantization (e.g., W8A8)
- KV cache quantization
- Non-uniform schemes (different layers can be quantized in different ways!)
- Sparsity Support: Handles both unstructured and semi-structured (e.g., 2:4) sparsity patterns.
- Open-Source Integration: Designed to work seamlessly with Hugging Face models and PyTorch.
This allows developers and researchers to easily experiment with composing different quantization methods, simplify model deployment pipelines, and reduce the overhead of supporting multiple compression formats in inference engines.
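To make the storage savings concrete, here is a minimal sketch of the idea behind packed 4-bit weight formats such as W4A16: two 4-bit values share one byte, halving the footprint relative to int8 storage. This is an illustrative example only, not the library's actual packing code; the function names are hypothetical.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack an even-length array of unsigned 4-bit ints (0..15) into bytes."""
    assert values.ndim == 1 and values.size % 2 == 0
    lo = values[0::2] & 0x0F           # even positions -> low nibble
    hi = values[1::2] & 0x0F           # odd positions  -> high nibble
    return (lo | (hi << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the original 4-bit values from the packed byte array."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F          # low nibble back to even positions
    out[1::2] = (packed >> 4) & 0x0F   # high nibble back to odd positions
    return out

w = np.array([0, 15, 7, 8, 3, 12], dtype=np.uint8)
packed = pack_int4(w)                  # 3 bytes instead of 6
assert np.array_equal(unpack_int4(packed), w)
```

Real checkpoint formats additionally store per-group scales and zero points alongside the packed weights, but the size argument is the same.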
Installation
From PyPI
Stable release:
pip install compressed-tensors
Nightly release:
pip install --pre compressed-tensors
From Source
git clone https://github.com/vllm-project/compressed-tensors
cd compressed-tensors
pip install -e .
Getting started
Saving a Compressed Model with PTQ
We can use compressed-tensors to run basic post-training quantization (PTQ) and save the quantized model compressed on disk:
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, DefaultDataCollator

from compressed_tensors import ModelCompressor
from compressed_tensors.quantization import (
    QuantizationConfig,
    QuantizationStatus,
    apply_quantization_config,
    compress_quantized_weights,
    freeze_module_quantization,
)

device = "cuda:0"
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")

# Attach quantization parameters to the model and put it in calibration mode
config = QuantizationConfig.parse_file("./examples/bit_packing/int4_config.json")
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

# Tokenize a calibration dataset
dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_loader = DataLoader(tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator())

# Run calibration forward passes to collect quantization statistics
with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        sample = {key: value.to(device) for key, value in sample.items()}
        _ = model(**sample)

        if idx >= 512:
            break

# Freeze the observers and compress the quantized weights
model.apply(freeze_module_quantization)
model.apply(compress_quantized_weights)

# Save the compressed checkpoint
output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compressor = ModelCompressor.from_pretrained_model(model)
compressor.compress_model(model)
model.save_pretrained(output_dir)
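For intuition about what the calibration loop above is estimating, the sketch below shows symmetric round-to-nearest quantization in plain NumPy: a scale is derived from the observed dynamic range, values are rounded to a narrow integer grid, and dequantization recovers an approximation. The function names are hypothetical and this is not the library's implementation.

```python
import numpy as np

def calibrate_scale(values: np.ndarray, num_bits: int = 8) -> float:
    """Pick a symmetric scale so the largest magnitude maps to the int range."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    return float(np.abs(values).max()) / qmax

def quantize(x: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Round-to-nearest quantization onto the signed integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
s = calibrate_scale(x)
xq = quantize(x, s)
# Round-to-nearest keeps the reconstruction error within half a scale step
assert np.abs(dequantize(xq, s) - x).max() <= s
```

The calibration passes in the PTQ example serve the same purpose: observers watch real activations flowing through the model so that good scales (and zero points, for asymmetric schemes) can be chosen before the weights are frozen and compressed.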
File details
Details for the file compressed_tensors-0.15.0.1.tar.gz.
File metadata
- Download URL: compressed_tensors-0.15.0.1.tar.gz
- Upload date:
- Size: 229.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a8e93054e8a5ec49c980b09ed36c4c1249b4a8ee167920a8e461c4da26e78d99 |
| MD5 | f772cb6e069c8c4962adbb6635b48feb |
| BLAKE2b-256 | 411bc3c4a98ec5f2727656336f07a0c35862195c310d8eb0b2fa5b4be6848680 |
Provenance
The following attestation bundles were made for compressed_tensors-0.15.0.1.tar.gz:
Publisher: compressed-tensors-upload.yml on neuralmagic/llm-compressor-testing
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: compressed_tensors-0.15.0.1.tar.gz
- Subject digest: a8e93054e8a5ec49c980b09ed36c4c1249b4a8ee167920a8e461c4da26e78d99
- Sigstore transparency entry: 1271592980
- Sigstore integration time:
- Permalink: neuralmagic/llm-compressor-testing@a2789492c12f83718a78b2f01004eee725adab6b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/neuralmagic
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: compressed-tensors-upload.yml@a2789492c12f83718a78b2f01004eee725adab6b
- Trigger Event: workflow_dispatch
File details
Details for the file compressed_tensors-0.15.0.1-py3-none-any.whl.
File metadata
- Download URL: compressed_tensors-0.15.0.1-py3-none-any.whl
- Upload date:
- Size: 194.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1b1f322e82e475715e242bad46925a304ea8e5c98b5055a15b8eb22fb6bfea9 |
| MD5 | 4edf6516bbdc3bf3038b8918c3f77ce2 |
| BLAKE2b-256 | a85293833dc1610e017ac5b7dcd59b8304d8ef67d1114c2d124e728a2cbbea12 |
Provenance
The following attestation bundles were made for compressed_tensors-0.15.0.1-py3-none-any.whl:
Publisher: compressed-tensors-upload.yml on neuralmagic/llm-compressor-testing
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: compressed_tensors-0.15.0.1-py3-none-any.whl
- Subject digest: e1b1f322e82e475715e242bad46925a304ea8e5c98b5055a15b8eb22fb6bfea9
- Sigstore transparency entry: 1271592995
- Sigstore integration time:
- Permalink: neuralmagic/llm-compressor-testing@a2789492c12f83718a78b2f01004eee725adab6b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/neuralmagic
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: compressed-tensors-upload.yml@a2789492c12f83718a78b2f01004eee725adab6b
- Trigger Event: workflow_dispatch