GPU inference for losslessly compressed (DFloat11) Large Language Models
DFloat11: Lossless LLM Compression for Efficient GPU Inference
DFloat11 is a lossless compression framework that reduces the size of Large Language Models (LLMs) by approximately 30% while preserving bit-for-bit identical outputs to the original model. It enables efficient GPU inference on resource-constrained hardware without sacrificing accuracy.
📰 News
- [05/05/2025] The `dfloat11` pip package has been upgraded to `v0.2.0`! We have made the following important changes:
  - We added support for Qwen 3, Gemma 3, and Phi 4!
  - The GPU decompression kernel is now 20-40% faster! We achieved this by improving thread occupancy and implementing many other optimizations.
  - The DFloat11 models are now stored in safetensors format for better safety and loading performance.
  - When using a DFloat11 model, only the compressed model is downloaded, not the original model.
📦 Installation
Requires a CUDA-compatible GPU and PyTorch installed.
pip install dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install dfloat11[cuda11]
🔍 How It Works
DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.
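To make the idea concrete, here is a minimal CPU-only sketch of why entropy-coding the exponent field saves space. It is not the DFloat11 implementation: it assumes synthetic Gaussian weights, converts to BFloat16 by truncation rather than rounding, and builds a plain Huffman code with the standard library. The helper names `bf16_exponent`, `huffman_code_lengths`, and `average_bits_per_weight` are hypothetical. In this sketch the sign and mantissa bits (8 in total) are stored unchanged and only the 8 exponent bits are replaced by variable-length codes.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncation to BFloat16."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16        # BFloat16 keeps the top 16 bits of float32
    return (bits16 >> 7) & 0xFF  # layout: 1 sign | 8 exponent | 7 mantissa


def huffman_code_lengths(freqs: Counter) -> dict:
    """Return the Huffman code length (in bits) of every symbol in freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items are (weight, tie_breaker, {symbol: depth_in_tree}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def average_bits_per_weight(weights) -> float:
    """Sign (1) + mantissa (7) stay raw; the exponent is Huffman-coded."""
    exps = Counter(bf16_exponent(w) for w in weights)
    lengths = huffman_code_lengths(exps)
    total = sum(exps.values())
    avg_exp_bits = sum(exps[s] * lengths[s] for s in exps) / total
    return 1 + 7 + avg_exp_bits


random.seed(0)
# Assumption: synthetic weights, not taken from any real checkpoint.
synthetic = [random.gauss(0.0, 0.02) for _ in range(50_000)]
bits = average_bits_per_weight(synthetic)
print(f"average bits per weight: {bits:.2f} (BFloat16 uses 16)")
```

Real checkpoints have their own exponent distributions, so treat the printed figure as illustrative only.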
Key benefits:
- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model’s outputs are bit-for-bit identical to those of the original model.
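The batch-size argument above can be illustrated with a toy cost model. Everything in it is an assumption made for illustration: the function `slowdown`, the linear growth of BF16 forward time with batch size, and the default constants, which are chosen only so that the ratio at batch size 1 matches the roughly 2× figure quoted above. It is not a measurement.

```python
def slowdown(batch_size, t_decompress=1.0, t_base=1.0, t_per_sample=0.05):
    """Ratio of DFloat11 to BF16 forward-pass time under a toy linear model.

    t_decompress: fixed decompression cost per forward pass (hypothetical unit)
    t_base:       BF16 forward time at batch size 1 (hypothetical unit)
    t_per_sample: extra BF16 time per additional sample (hypothetical unit)
    """
    t_bf16 = t_base + t_per_sample * (batch_size - 1)
    # Decompression is paid once per pass, regardless of batch size.
    return (t_bf16 + t_decompress) / t_bf16


for b in (1, 8, 64, 256):
    print(f"batch_size={b:>3}: {slowdown(b):.2f}x slower than BF16")
```

Because the decompression term does not grow with the batch, its share of the total time shrinks as the batch gets larger.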
🚀 Quick Start
- Install the `dfloat11` pip package. See Installation.
- Run the following code in Python, which automatically downloads the DFloat11 `Qwen3-8B` model and generates a response.
import torch
from dfloat11 import DFloat11Model
from transformers import AutoTokenizer

# Only the compressed DFloat11 checkpoint is downloaded.
model_id = "DFloat11/Qwen3-8B-DF11"
model = DFloat11Model.from_pretrained(model_id, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

prompt = "Question: What is a binary tree and its applications? Answer:"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

# Gradients are not needed for generation.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
    )

print(tokenizer.batch_decode(output, skip_special_tokens=True))
- Replace the `model_id` in the script above with any pre-compressed model in the Model Hub.
🏎️ Benchmarking Performance
To test the speed and memory consumption of a DFloat11 LLM during inference:
CUDA_VISIBLE_DEVICES=0 python inference.py \
--model_name_or_path DFloat11/Qwen3-8B-DF11 \
--prompt "Question: What is a binary tree and its applications? Answer:" \
--num_tokens 512 \
--batch_size 1
💡 Tip: If you specify multiple CUDA devices (e.g., `CUDA_VISIBLE_DEVICES=0,1`), the model will be automatically distributed across them using 🤗 Accelerate's `device_map="auto"`.
Arguments
- `--model_name_or_path`: HuggingFace name or local path of the DFloat11 model (e.g., `DFloat11/Qwen3-8B-DF11`). See the Model Hub section for a list of available DFloat11 models.
- `--bf16`: (Optional) Turn on this flag when passing a BFloat16 model to `--model_name_or_path`
- `--prompt`: Input prompt string for text generation
- `--num_tokens`: Number of new tokens to generate per sample
- `--batch_size`: Number of prompts to process in parallel
- `--seed`: (Optional) Random seed for reproducible results
Output
The script prints:
- Generated responses
- Total decoding latency
- Tokens per second (throughput)
- GPU memory usage (allocated and peak)
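For readers who want to collect similar numbers in their own scripts, the following is a minimal sketch of how latency and throughput can be derived. It is not the code of `inference.py`: the helper `measure_throughput` and the `generate_fn` callable are hypothetical, and a short `time.sleep` stands in for a real `model.generate` call so the snippet runs without a GPU.

```python
import time


def measure_throughput(generate_fn, batch_size, num_tokens):
    """Time one call to generate_fn and return (latency_s, tokens_per_s).

    Assumes generate_fn produces num_tokens new tokens for each of the
    batch_size prompts and blocks until generation has finished.
    """
    start = time.perf_counter()
    generate_fn()
    latency = time.perf_counter() - start
    return latency, batch_size * num_tokens / latency


# Stand-in workload; replace with a closure around model.generate(...).
latency, tps = measure_throughput(lambda: time.sleep(0.05),
                                  batch_size=1, num_tokens=512)
print(f"latency: {latency:.3f} s, throughput: {tps:.1f} tokens/s")

# With PyTorch on a CUDA device, allocated and peak memory can be read via
# torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated().
```

When timing real GPU work, calling `torch.cuda.synchronize()` before reading the clock helps ensure that queued kernels have finished.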
📚 Model Hub
| Model | DFloat11 Link |
|---|---|
| Qwen 3 32B | DFloat11/Qwen3-32B-DF11 |
| Qwen 3 14B | DFloat11/Qwen3-14B-DF11 |
| Qwen 3 8B | DFloat11/Qwen3-8B-DF11 |
| Qwen 3 4B | DFloat11/Qwen3-4B-DF11 |
| Phi 4 Reasoning Plus | DFloat11/Phi-4-reasoning-plus-DF11 |
| Gemma 3 27B Instruct | DFloat11/gemma-3-27b-it-DF11 |
| Gemma 3 12B Instruct | DFloat11/gemma-3-12b-it-DF11 |
| Gemma 3 4B Instruct | DFloat11/gemma-3-4b-it-DF11 |
| DeepSeek R1 Distill Qwen 32B | DFloat11/DeepSeek-R1-Distill-Qwen-32B-DF11 |
| DeepSeek R1 Distill Qwen 14B | DFloat11/DeepSeek-R1-Distill-Qwen-14B-DF11 |
| Discover more models on our HF page! | ... |
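As a rough guide to how much GPU memory the weights of these checkpoints need, the sketch below applies the approximately 30% size reduction stated above to a nominal parameter count. The function `weight_memory_gb` is hypothetical, the 0.70 ratio is an approximation, the parameter count is simply the nominal figure from the model name, and activations and the KV cache are not included.

```python
def weight_memory_gb(num_params, compression_ratio=0.70):
    """Return (bf16_gb, dfloat11_gb) for the weights alone, in decimal GB.

    BFloat16 stores 2 bytes per parameter; compression_ratio is the assumed
    DFloat11 size relative to BFloat16 (about 70%, i.e. a ~30% reduction).
    """
    bf16_gb = num_params * 2 / 1e9
    return bf16_gb, bf16_gb * compression_ratio


# Nominal 8B-parameter model (the exact parameter count will differ).
bf16_gb, df11_gb = weight_memory_gb(8e9)
print(f"BF16: {bf16_gb:.1f} GB, DFloat11 (approx.): {df11_gb:.1f} GB")
```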
🔗 Links
👉 Explore pre-compressed DFloat11 models ready to use on HuggingFace: https://huggingface.co/DFloat11
📂 Official Code Repository: https://github.com/LeanModels/DFloat11
🧠 Contributions
This work is brought to you by the team at Rice University and xMAD.ai.
The GPU kernel was designed and implemented by Tianyi Zhang.
📚 Citation
If you found our work useful or interesting, please consider citing our paper:
@article{zhang2025dfloat11,
title={70\% Size, 100\% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float},
author={Zhang, Tianyi and Sui, Yang and Zhong, Shaochen and Chaudhary, Vipin and Hu, Xia and Shrivastava, Anshumali},
journal={arXiv preprint arXiv:2504.11651},
year={2025}
}