
High-performance inference engine for Diffusion Language Models - 3x faster with advanced optimizations

Project description

DFastLLM

🚀 High-Performance Inference Engine for Diffusion Language Models


Quick Start · Features · Performance · Documentation · Contributing


🎯 What is DFastLLM?

DFastLLM is a production-ready inference engine optimized for Diffusion Language Models (LLaDA, Dream, MDLM). Unlike autoregressive models that generate tokens sequentially, diffusion LLMs generate multiple tokens in parallel through iterative denoising — enabling massive throughput gains.

Traditional LLM:     Token → Token → Token → Token (sequential)
Diffusion LLM:       [████████] → [████████] → Done! (parallel)
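
To make the contrast concrete, here is a toy sketch of one denoising step (not DFastLLM's actual implementation): the model scores every masked position at once, and the most confident predictions are committed in parallel.

import torch

def toy_unmask_step(logits, tokens, mask_id, k):
    """Commit the k most confident masked positions in one step.

    logits: (seq_len, vocab) model scores for every position
    tokens: (seq_len,) current sequence; masked slots hold mask_id
    """
    conf, pred = logits.softmax(dim=-1).max(dim=-1)  # best token + confidence per slot
    masked = tokens == mask_id
    conf = conf.masked_fill(~masked, -1.0)           # only consider still-masked slots
    top = conf.topk(min(k, int(masked.sum())))       # most confident of those
    tokens = tokens.clone()
    tokens[top.indices] = pred[top.indices]          # several tokens finalize at once
    return tokens

Looping this step until no masks remain gives the parallel behavior sketched above; a production engine layers noise schedules, remasking, and caching on top.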

⚡ Quick Start

Installation

pip install dfastllm

Generate Text

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

# Create engine and generate
engine = DiffusionEngine(model, tokenizer)
output = engine.generate("What is artificial intelligence?", max_tokens=64)
print(output)

With Quantization (2x Memory Savings)

pip install "dfastllm[quantization]"

from dfastllm import load_quantized_model

# Load the 8B model in ~6 GB instead of ~16 GB (see Memory Usage below)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")
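
A quantized model should drop into the same engine as the BF16 path. The sketch below assumes load_quantized_model returns an ordinary module that DiffusionEngine accepts:

from transformers import AutoTokenizer
from dfastllm import load_quantized_model
from dfastllm.engine.diffusion import DiffusionEngine

tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")

engine = DiffusionEngine(model, tokenizer)  # same API as the BF16 path above
print(engine.generate("What is artificial intelligence?", max_tokens=64))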

Batch Processing (Up to 4x Throughput)

prompts = ["What is AI?", "What is ML?", "What is DL?", "What is NLP?"]
outputs = engine.generate(prompts, max_tokens=64)
# ~3x throughput at batch size 4; see the Performance table below
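
To measure throughput on your own hardware, a minimal timing harness (assuming engine.generate returns one completion string per prompt) might look like:

import time

prompts = ["What is AI?"] * 8  # batch size 8, as in the benchmark table below
start = time.perf_counter()
outputs = engine.generate(prompts, max_tokens=64)
elapsed = time.perf_counter() - start

# Count generated tokens with the same tokenizer the engine uses
n_tokens = sum(len(tokenizer(text)["input_ids"]) for text in outputs)
print(f"{n_tokens / elapsed:,.0f} tok/s over {elapsed:.2f} s")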

🔥 Features

| Feature | Description |
| --- | --- |
| Diffusion Generation | Parallel token unmasking |
| Batch Processing | Process multiple requests |
| INT4/INT8 Quantization | 2-4x memory reduction |
| torch.compile | JIT compilation for 2x speedup (sketch below) |
| FlashAttention | Memory-efficient attention |
| Multi-GPU | Tensor parallelism |
| OpenAI API | Drop-in compatible server |
| Streaming | Real-time token streaming |
| CUDA Graphs | Zero-overhead inference |
| Kubernetes | Production deployment |
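
For the torch.compile row, one way to opt in is to compile the model before building the engine. This is a hedged sketch (the project may expose its own flag instead), reusing model and tokenizer from the Quick Start:

import torch
from dfastllm.engine.diffusion import DiffusionEngine

# JIT-compile the forward pass; the first call is slow while kernels build,
# and later denoising steps reuse them. "reduce-overhead" also uses CUDA graphs.
compiled_model = torch.compile(model, mode="reduce-overhead")
engine = DiffusionEngine(compiled_model, tokenizer)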

📊 Performance

Benchmarked on NVIDIA L40S (46GB) with LLaDA-8B:

| Batch Size | Throughput | Latency | Speedup |
| --- | --- | --- | --- |
| 1 | 265 tok/s | 241 ms | 1.0x |
| 2 | 484 tok/s | 132 ms | 1.8x |
| 4 | 786 tok/s | 81 ms | 3.0x |
| 8 | 1,056 tok/s | 61 ms | 4.0x |

Memory Usage

| Configuration | Memory | Notes |
| --- | --- | --- |
| BF16 | 16.8 GB | Default |
| INT8 | ~10 GB | 1.7x reduction |
| INT4 | ~6 GB | 2.8x reduction |
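
The BF16 row matches simple weight arithmetic; the quantized rows sit above the raw weight estimate because activations, buffers, and per-group quantization scales add overhead:

params = 8e9  # LLaDA-8B parameter count
for name, bytes_per_param in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# BF16 -> ~16 GB, in line with the 16.8 GB measured above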

🐳 Docker

# GPU image
docker run --gpus all -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:gpu

# CPU image
docker run -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:latest
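
Once a container is up, a quick smoke test from Python (assuming the server exposes the standard /v1/models listing route, as most OpenAI-compatible servers do):

import requests

# Query the container started above
r = requests.get("http://localhost:8000/v1/models", timeout=10)
r.raise_for_status()
print(r.json())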

☸️ Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfastllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dfastllm
  template:
    metadata:
      labels:
        app: dfastllm
    spec:
      containers:
      - name: dfastllm
        image: ghcr.io/dfastllm-project/dfastllm:gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000

🌐 OpenAI-Compatible API

Start the server:

dfastllm-serve --model GSAI-ML/LLaDA-8B-Instruct --port 8000

Use with OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)
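
The same endpoint serves ordinary non-streaming requests as well; drop the stream flag and read the full message:

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
)
print(response.choices[0].message.content)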

🛠️ Supported Models

| Model | Parameters | Status |
| --- | --- | --- |
| LLaDA-8B-Instruct | 8B | ✅ Full Support |
| LLaDA-8B-Base | 8B | ✅ Full Support |
| Dream | 7B | ⚠️ Experimental |
| MDLM | Various | ⚠️ Experimental |

📚 Documentation


🤝 Contributing

We welcome contributions! Here's how to get started:

# Clone the repo
git clone https://github.com/dfastllm-project/dfastllm.git
cd dfastllm

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check dfastllm/
black --check dfastllm/

See CONTRIBUTING.md for detailed guidelines.


📄 License

Apache 2.0 - See LICENSE for details.


🙏 Acknowledgments


Made with ❤️ by the DFastLLM Team



Download files

Download the file for your platform.

Source Distribution

dfastllm-0.0.3.tar.gz (163.2 kB)

Uploaded Source

Built Distribution


dfastllm-0.0.3-py3-none-any.whl (186.3 kB)

Uploaded Python 3

File details

Details for the file dfastllm-0.0.3.tar.gz.

File metadata

  • Download URL: dfastllm-0.0.3.tar.gz
  • Upload date:
  • Size: 163.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for dfastllm-0.0.3.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | a908fbf7b600659cb915d864ef0bf1961a08aa1a5a4c80cc328f44ae247e1abc |
| MD5 | 8c9590bf81a8b8e6abfb8b4fb4aa98ed |
| BLAKE2b-256 | 55c0ac4c6119f3ea06fd319376344ead6b609f6db9d09a67bfe51a2b3f7f1bcf |


File details

Details for the file dfastllm-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: dfastllm-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 186.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for dfastllm-0.0.3-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | a7e100bbbb518a016896bd25e4a60d06dbc9978522b03128b700352c52fae445 |
| MD5 | a7bd678b3a7ff6fe37b4d3001d018e98 |
| BLAKE2b-256 | 10de6a64d23616b8e23585ba34bcea162fb9620a42979d40861de63749b57a99 |

