# DFastLLM

**🚀 High-Performance Inference Engine for Diffusion Language Models**

Quick Start • Features • Performance • Documentation • Contributing
## 🎯 What is DFastLLM?
DFastLLM is a production-ready inference engine optimized for Diffusion Language Models (LLaDA, Dream, MDLM). Unlike autoregressive models, which generate tokens one at a time, diffusion LLMs generate many tokens in parallel through iterative denoising, yielding up to ~4x higher throughput in the benchmarks below.
```
Traditional LLM:  Token → Token → Token → Token   (sequential)
Diffusion LLM:    [████████] → [████████] → Done! (parallel)
```
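At a high level, each denoising step runs one forward pass over the whole sequence and commits the most confident predictions. A toy sketch of the idea (not DFastLLM's actual implementation; `MASK_ID` and `toy_model` are stand-ins):

```python
import torch

MASK_ID = 0  # hypothetical mask token id for this toy example

def denoise_step(logits, x, frac=0.5):
    """Fill in the most confident fraction of still-masked positions in parallel."""
    masked = x == MASK_ID
    logits = logits.clone()
    logits[..., MASK_ID] = float("-inf")        # never predict the mask token itself
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)              # per-position confidence and argmax token
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    k = max(1, int(masked.sum().item() * frac)) # how many positions to commit this step
    idx = conf.flatten().topk(k).indices
    x = x.clone()
    x.view(-1)[idx] = pred.view(-1)[idx]
    return x

def toy_model(x):
    # Stand-in for a real bidirectional denoiser: random logits over a 100-token vocab.
    return torch.randn(*x.shape, 100)

x = torch.zeros(1, 16, dtype=torch.long)        # an all-[MASK] canvas
while (x == MASK_ID).any():
    x = denoise_step(toy_model(x), x)           # each step commits many tokens at once
print(x)
```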
## ⚡ Quick Start

### Installation

```bash
pip install dfastllm
```

### Generate Text
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

# Create engine and generate
engine = DiffusionEngine(model, tokenizer)
output = engine.generate("What is artificial intelligence?", max_tokens=64)
print(output)
```
### With Quantization (2-3x Memory Savings)

```bash
pip install "dfastllm[quantization]"
```

```python
from dfastllm import load_quantized_model

# Load the 8B model in ~6 GB instead of ~16.8 GB (see Memory Usage below)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")
```
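The quantized model should drop into the same engine shown in the Quick Start. A minimal sketch, assuming `load_quantized_model` returns a `DiffusionEngine`-compatible model:

```python
from transformers import AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Assumption: the quantized model plugs into DiffusionEngine like the BF16 one.
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)
engine = DiffusionEngine(model, tokenizer)
print(engine.generate("Summarize diffusion LLMs in one sentence.", max_tokens=48))
```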
### Batch Processing (Up to 4x Throughput)

```python
prompts = ["What is AI?", "What is ML?", "What is DL?", "What is NLP?"]
outputs = engine.generate(prompts, max_tokens=64)
# ~3x throughput at batch size 4; 1,000+ tok/s at batch size 8 (see Performance below)
```
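For a rough throughput number on your own hardware, you can time a batched call directly. A sketch, assuming the `engine` from the Quick Start and treating `max_tokens` as the number of tokens actually generated (an upper bound):

```python
import time

prompts = ["What is AI?"] * 8
start = time.perf_counter()
outputs = engine.generate(prompts, max_tokens=64)
elapsed = time.perf_counter() - start
# Upper-bound estimate: assumes every prompt produced all 64 tokens.
print(f"~{len(prompts) * 64 / elapsed:.0f} tok/s")
```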
## 🔥 Features
| Feature | Description | Status |
|---|---|---|
| Diffusion Generation | Parallel token unmasking | ✅ |
| Batch Processing | Process multiple requests | ✅ |
| INT4/INT8 Quantization | 2-4x memory reduction | ✅ |
| torch.compile | JIT compilation for 2x speedup | ✅ |
| FlashAttention | Memory-efficient attention | ✅ |
| Multi-GPU | Tensor parallelism | ✅ |
| OpenAI API | Drop-in compatible server | ✅ |
| Streaming | Real-time token streaming | ✅ |
| CUDA Graphs | Zero-overhead inference | ✅ |
| Kubernetes | Production deployment | ✅ |
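The torch.compile and CUDA Graphs rows refer to standard PyTorch machinery; this README doesn't show how DFastLLM wires them in internally, but the manual equivalent on the Quick Start model is a one-liner:

```python
import torch

# mode="reduce-overhead" compiles the model and replays CUDA graphs, removing
# per-step kernel-launch overhead; the first few calls are slow while compiling.
model = torch.compile(model, mode="reduce-overhead")
```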
## 📊 Performance
Benchmarked on NVIDIA L40S (46GB) with LLaDA-8B:
| Batch Size | Throughput | Latency (amortized, 64 tokens) | Speedup |
|---|---|---|---|
| 1 | 265 tok/s | 241 ms | 1.0x |
| 2 | 484 tok/s | 132 ms | 1.8x |
| 4 | 786 tok/s | 81 ms | 3.0x |
| 8 | 1,056 tok/s | 61 ms | 4.0x |
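Speedup is throughput relative to batch size 1 (e.g., 1,056 / 265 ≈ 4.0x).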
### Memory Usage
| Configuration | Memory | Notes |
|---|---|---|
| BF16 | 16.8 GB | Default |
| INT8 | ~10 GB | 1.7x reduction |
| INT4 | ~6 GB | 2.8x reduction |
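Reduction factors are relative to the BF16 baseline (e.g., 16.8 / 6 ≈ 2.8x for INT4).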
## 🐳 Docker

```bash
# GPU image
docker run --gpus all -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:gpu

# CPU image
docker run -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:latest
```
## ☸️ Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfastllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dfastllm
  template:
    metadata:
      labels:
        app: dfastllm
    spec:
      containers:
        - name: dfastllm
          image: ghcr.io/dfastllm-project/dfastllm:gpu
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
```
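For a quick test without defining a Service, `kubectl port-forward deployment/dfastllm 8000:8000` exposes the pod locally.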
## 🌐 OpenAI-Compatible API

Start the server:

```bash
dfastllm-serve --model GSAI-ML/LLaDA-8B-Instruct --port 8000
```
Use it with the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    stream=True,
)
for chunk in response:
    # delta.content can be None on role/stop chunks, so guard it
    print(chunk.choices[0].delta.content or "", end="")
```
## 🛠️ Supported Models
| Model | Parameters | Status |
|---|---|---|
| LLaDA-8B-Instruct | 8B | ✅ Full Support |
| LLaDA-8B-Base | 8B | ✅ Full Support |
| Dream | 7B | ⚠️ Experimental |
| MDLM | Various | ⚠️ Experimental |
## 📚 Documentation
## 🤝 Contributing

We welcome contributions! Here's how to get started:

```bash
# Clone the repo
git clone https://github.com/dfastllm-project/dfastllm.git
cd dfastllm

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check dfastllm/
black --check dfastllm/
```

See CONTRIBUTING.md for detailed guidelines.
## 📄 License
Apache 2.0 - See LICENSE for details.
## 🙏 Acknowledgments
- LLaDA - The primary diffusion LLM we support
- HuggingFace Transformers - Model loading infrastructure
- PyTorch - Deep learning framework
Made with ❤️ by the DFastLLM Team