
High-performance inference engine for Diffusion Language Models - 3x faster with advanced optimizations

Project description

DFastLLM

🚀 High-Performance Inference Engine for Diffusion Language Models


Quick Start · Features · Performance · Documentation · Contributing


🎯 What is DFastLLM?

DFastLLM is a production-ready inference engine optimized for Diffusion Language Models (LLaDA, Dream, MDLM). Unlike autoregressive models that generate tokens sequentially, diffusion LLMs generate multiple tokens in parallel through iterative denoising — enabling massive throughput gains.

Traditional LLM:     Token → Token → Token → Token (sequential)
Diffusion LLM:       [████████] → [████████] → Done! (parallel)
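
To make the contrast concrete, here is a toy sketch of one denoising step (not DFastLLM's actual implementation): the model scores every masked position at once, and the most confident predictions are committed in parallel.

import torch

def toy_unmask_step(logits, tokens, mask_id, k):
    """Commit the k most confident masked positions in one step.

    logits: (seq_len, vocab) model scores for every position
    tokens: (seq_len,) current sequence; masked slots hold mask_id
    """
    conf, pred = logits.softmax(dim=-1).max(dim=-1)  # best token + confidence per slot
    masked = tokens == mask_id
    conf = conf.masked_fill(~masked, -1.0)           # only consider still-masked slots
    top = conf.topk(min(k, int(masked.sum())))       # most confident of those
    tokens = tokens.clone()
    tokens[top.indices] = pred[top.indices]          # several tokens finalize at once
    return tokens

Looping this step until no masks remain gives the parallel behavior sketched above; a production engine layers noise schedules, remasking, and caching on top.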

⚡ Quick Start

Installation

pip install dfastllm

Generate Text

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

# Create engine and generate
engine = DiffusionEngine(model, tokenizer)
output = engine.generate("What is artificial intelligence?", max_tokens=64)
print(output)

With Quantization (2x Memory Savings)

pip install "dfastllm[quantization]"

from dfastllm import load_quantized_model

# Load the 8B model in ~6 GB instead of ~16 GB (see Memory Usage below)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")
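
A quantized model should drop into the same engine as the BF16 path. The sketch below assumes load_quantized_model returns an ordinary module that DiffusionEngine accepts:

from transformers import AutoTokenizer
from dfastllm import load_quantized_model
from dfastllm.engine.diffusion import DiffusionEngine

tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")

engine = DiffusionEngine(model, tokenizer)  # same API as the BF16 path above
print(engine.generate("What is artificial intelligence?", max_tokens=64))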

Batch Processing (Up to 4x Throughput)

prompts = ["What is AI?", "What is ML?", "What is DL?", "What is NLP?"]
outputs = engine.generate(prompts, max_tokens=64)
# ~3x throughput at batch size 4; see the Performance table below
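
To measure throughput on your own hardware, a minimal timing harness (assuming engine.generate returns one completion string per prompt) might look like:

import time

prompts = ["What is AI?"] * 8  # batch size 8, as in the benchmark table below
start = time.perf_counter()
outputs = engine.generate(prompts, max_tokens=64)
elapsed = time.perf_counter() - start

# Count generated tokens with the same tokenizer the engine uses
n_tokens = sum(len(tokenizer(text)["input_ids"]) for text in outputs)
print(f"{n_tokens / elapsed:,.0f} tok/s over {elapsed:.2f} s")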

🔥 Features

| Feature | Description |
| --- | --- |
| Diffusion Generation | Parallel token unmasking |
| Batch Processing | Process multiple requests |
| INT4/INT8 Quantization | 2-4x memory reduction |
| torch.compile | JIT compilation for 2x speedup (sketch below) |
| FlashAttention | Memory-efficient attention |
| Multi-GPU | Tensor parallelism |
| OpenAI API | Drop-in compatible server |
| Streaming | Real-time token streaming |
| CUDA Graphs | Zero-overhead inference |
| Kubernetes | Production deployment |
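
For the torch.compile row, one way to opt in is to compile the model before building the engine. This is a hedged sketch (the project may expose its own flag instead), reusing model and tokenizer from the Quick Start:

import torch
from dfastllm.engine.diffusion import DiffusionEngine

# JIT-compile the forward pass; the first call is slow while kernels build,
# and later denoising steps reuse them. "reduce-overhead" also uses CUDA graphs.
compiled_model = torch.compile(model, mode="reduce-overhead")
engine = DiffusionEngine(compiled_model, tokenizer)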

📊 Performance

Benchmarked on NVIDIA L40S (46GB) with LLaDA-8B:

| Batch Size | Throughput | Latency | Speedup |
| --- | --- | --- | --- |
| 1 | 265 tok/s | 241 ms | 1.0x |
| 2 | 484 tok/s | 132 ms | 1.8x |
| 4 | 786 tok/s | 81 ms | 3.0x |
| 8 | 1,056 tok/s | 61 ms | 4.0x |

Memory Usage

| Configuration | Memory | Notes |
| --- | --- | --- |
| BF16 | 16.8 GB | Default |
| INT8 | ~10 GB | 1.7x reduction |
| INT4 | ~6 GB | 2.8x reduction |
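
The BF16 row matches simple weight arithmetic; the quantized rows sit above the raw weight estimate because activations, buffers, and per-group quantization scales add overhead:

params = 8e9  # LLaDA-8B parameter count
for name, bytes_per_param in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# BF16 -> ~16 GB, in line with the 16.8 GB measured above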

🐳 Docker

# GPU image
docker run --gpus all -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:gpu

# CPU image
docker run -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:latest
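
Once a container is up, a quick smoke test from Python (assuming the server exposes the standard /v1/models listing route, as most OpenAI-compatible servers do):

import requests

# Query the container started above
r = requests.get("http://localhost:8000/v1/models", timeout=10)
r.raise_for_status()
print(r.json())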

☸️ Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfastllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dfastllm
  template:
    metadata:
      labels:
        app: dfastllm
    spec:
      containers:
      - name: dfastllm
        image: ghcr.io/dfastllm-project/dfastllm:gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000

🌐 OpenAI-Compatible API

Start the server:

dfastllm-serve --model GSAI-ML/LLaDA-8B-Instruct --port 8000

Use with OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)
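
The same endpoint serves ordinary non-streaming requests as well; drop the stream flag and read the full message:

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
)
print(response.choices[0].message.content)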

🛠️ Supported Models

| Model | Parameters | Status |
| --- | --- | --- |
| LLaDA-8B-Instruct | 8B | ✅ Full Support |
| LLaDA-8B-Base | 8B | ✅ Full Support |
| Dream | 7B | ⚠️ Experimental |
| MDLM | Various | ⚠️ Experimental |

📚 Documentation


🤝 Contributing

We welcome contributions! Here's how to get started:

# Clone the repo
git clone https://github.com/dfastllm-project/dfastllm.git
cd dfastllm

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check dfastllm/
black --check dfastllm/

See CONTRIBUTING.md for detailed guidelines.


📄 License

Apache 2.0 - See LICENSE for details.


🙏 Acknowledgments


Made with ❤️ by the DFastLLM Team



Download files

Download the file for your platform.

Source Distribution

dfastllm-0.0.3.tar.gz (163.2 kB)

Uploaded Source

Built Distribution


dfastllm-0.0.3-py3-none-any.whl (186.3 kB)

Uploaded Python 3

File details

Details for the file dfastllm-0.0.3.tar.gz.

File metadata

  • Download URL: dfastllm-0.0.3.tar.gz
  • Upload date:
  • Size: 163.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for dfastllm-0.0.3.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | a908fbf7b600659cb915d864ef0bf1961a08aa1a5a4c80cc328f44ae247e1abc |
| MD5 | 8c9590bf81a8b8e6abfb8b4fb4aa98ed |
| BLAKE2b-256 | 55c0ac4c6119f3ea06fd319376344ead6b609f6db9d09a67bfe51a2b3f7f1bcf |


File details

Details for the file dfastllm-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: dfastllm-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 186.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for dfastllm-0.0.3-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | a7e100bbbb518a016896bd25e4a60d06dbc9978522b03128b700352c52fae445 |
| MD5 | a7bd678b3a7ff6fe37b4d3001d018e98 |
| BLAKE2b-256 | 10de6a64d23616b8e23585ba34bcea162fb9620a42979d40861de63749b57a99 |

