
DFastLLM

🚀 High-Performance Inference Engine for Diffusion Language Models


🎯 What is DFastLLM?

DFastLLM is a production-ready inference engine optimized for Diffusion Language Models (LLaDA, Dream, MDLM). Unlike autoregressive models that generate tokens sequentially, diffusion LLMs generate multiple tokens in parallel through iterative denoising — enabling massive throughput gains.

Traditional LLM:     Token → Token → Token → Token (sequential)
Diffusion LLM:       [████████] → [████████] → Done! (parallel)
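
In miniature, the denoising loop looks like this. This is a toy sketch in which random picks stand in for model confidence scores; the real engine ranks masked positions by the model's predictions.

```python
import random

MASK = "<mask>"

def toy_denoise(length=8, per_step=4, seed=0):
    """Start fully masked, then unmask several positions in parallel
    each step, instead of generating one token at a time."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # A real model would unmask the highest-confidence positions;
        # here we just pick at random.
        for i in rng.sample(masked, min(per_step, len(masked))):
            tokens[i] = f"tok{i}"
        steps += 1
    return tokens, steps

tokens, steps = toy_denoise()
# 8 positions at 4 unmaskings per step finish in 2 steps,
# versus 8 steps for sequential generation.
```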

⚡ Quick Start

Installation

pip install dfastllm

Generate Text

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

# Create engine and generate
engine = DiffusionEngine(model, tokenizer)
output = engine.generate("What is artificial intelligence?", max_tokens=64)
print(output)

With Quantization (2-3x Memory Savings)

Install the optional extra:

pip install dfastllm[quantization]

Then load a quantized model:

from dfastllm import load_quantized_model

# Load the 8B model in ~6 GB instead of ~16 GB (see Memory Usage below)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")

Batch Processing (4x Throughput)

prompts = ["What is AI?", "What is ML?", "What is DL?", "What is NLP?"]
outputs = engine.generate(prompts, max_tokens=64)
# ~780 tok/s for a batch of 4 on an L40S (see Performance below)
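
Why batching helps: every denoising step pays a large fixed cost (reading the model weights) that a batch amortizes across sequences. A rough cost model, with made-up millisecond figures purely for illustration:

```python
def est_throughput(batch_size, fixed_ms=50.0, per_seq_ms=10.0, toks_per_step=4):
    """Tokens/sec under a toy cost model: each denoising step pays a
    fixed weight-read cost plus a small per-sequence compute cost.
    The numbers are illustrative, not measured."""
    step_ms = fixed_ms + per_seq_ms * batch_size
    return batch_size * toks_per_step * 1000.0 / step_ms

# Throughput grows sub-linearly but substantially with batch size,
# matching the shape of the measured numbers in the Performance section.
```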

🔥 Features

Feature                  Description
Diffusion Generation     Parallel token unmasking
Batch Processing         Process multiple requests together
INT4/INT8 Quantization   2-4x memory reduction
torch.compile            JIT compilation for 2x speedup
FlashAttention           Memory-efficient attention
Multi-GPU                Tensor parallelism
OpenAI API               Drop-in compatible server
Streaming                Real-time token streaming
CUDA Graphs              Zero-overhead inference
Kubernetes               Production deployment

📊 Performance

Benchmarked on NVIDIA L40S (46GB) with LLaDA-8B:

Batch Size   Throughput    Latency   Speedup
1              265 tok/s    241 ms    1.0x
2              484 tok/s    132 ms    1.8x
4              786 tok/s     81 ms    3.0x
8            1,056 tok/s     61 ms    4.0x

Memory Usage

Configuration   Memory    Notes
BF16            16.8 GB   Default
INT8            ~10 GB    1.7x reduction
INT4            ~6 GB     2.8x reduction
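
As a sanity check, the weight-only footprint follows directly from parameter count times bytes per parameter; the measured totals above are higher because activations and runtime overhead come on top:

```python
# Weight-only memory for an 8B-parameter model at several precisions.
PARAMS = 8e9
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

weights_gb = {name: PARAMS * b / 1024**3 for name, b in BYTES_PER_PARAM.items()}
# bf16 -> ~14.9 GB, int8 -> ~7.5 GB, int4 -> ~3.7 GB of weights alone
```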

🐳 Docker

# GPU image
docker run --gpus all -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:gpu

# CPU image
docker run -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:latest

☸️ Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfastllm
spec:
  replicas: 1
  selector:            # required by apps/v1 Deployments
    matchLabels:
      app: dfastllm
  template:
    metadata:
      labels:
        app: dfastllm
    spec:
      containers:
      - name: dfastllm
        image: ghcr.io/dfastllm-project/dfastllm:gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000

🌐 OpenAI-Compatible API

Start the server:

dfastllm-serve --model GSAI-ML/LLaDA-8B-Instruct --port 8000

Use with OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk's delta content can be None
        print(delta, end="")

🛠️ Supported Models

Model               Parameters   Status
LLaDA-8B-Instruct   8B           ✅ Full Support
LLaDA-8B-Base       8B           ✅ Full Support
Dream               7B           ⚠️ Experimental
MDLM                Various      ⚠️ Experimental

📚 Documentation


🤝 Contributing

We welcome contributions! Here's how to get started:

# Clone the repo
git clone https://github.com/dfastllm-project/dfastllm.git
cd dfastllm

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check dfastllm/
black --check dfastllm/

See CONTRIBUTING.md for detailed guidelines.


📄 License

Apache 2.0 - See LICENSE for details.


🙏 Acknowledgments


Made with ❤️ by the DFastLLM Team
