DFastLLM

🚀 High-Performance Inference Engine for Diffusion Language Models



🎯 What is DFastLLM?

DFastLLM is a production-ready inference engine optimized for Diffusion Language Models (LLaDA, Dream, MDLM). Unlike autoregressive models, which generate one token at a time, diffusion LLMs generate multiple tokens in parallel through iterative denoising, enabling substantially higher throughput (see Performance below).

Traditional LLM:     Token → Token → Token → Token (sequential)
Diffusion LLM:       [████████] → [████████] → Done! (parallel)
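
Concretely, each denoising step runs one forward pass over the whole partially masked sequence and commits the most confident predictions, so a 64-token answer can finish in far fewer than 64 model calls. Below is a toy sketch of one confidence-based unmasking step; the mask id, vocabulary size, and shapes are made up for illustration, and this is not DFastLLM's actual implementation:

import torch

MASK_ID = 32000  # hypothetical mask id, outside the 32k toy vocabulary

def unmask_step(logits, ids, k):
    # Commit the k most confident predictions among still-masked positions
    conf, pred = logits.softmax(dim=-1).max(dim=-1)
    conf = conf.masked_fill(ids != MASK_ID, -1.0)  # ignore already-filled slots
    top = conf.topk(min(k, int((ids == MASK_ID).sum()))).indices
    ids = ids.clone()
    ids[top] = pred[top]
    return ids

# 16 masked positions, 4 unmasked per step -> 4 forward passes instead of 16
ids = torch.full((16,), MASK_ID)
while (ids == MASK_ID).any():
    logits = torch.randn(16, 32000)  # stand-in for a model forward pass
    ids = unmask_step(logits, ids, k=4)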

⚡ Quick Start

Installation

pip install dfastllm

Generate Text

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

# Create engine and generate
engine = DiffusionEngine(model, tokenizer)
output = engine.generate("What is artificial intelligence?", max_tokens=64)
print(output)

With Quantization (2x Memory Savings)

pip install "dfastllm[quantization]"

from dfastllm import load_quantized_model

# Load the 8B model in ~6 GB (INT4) instead of ~16.8 GB (BF16); see Memory Usage below
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")
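
The returned model should drop straight into the engine from the Quick Start. A short sketch, assuming load_quantized_model yields a model that DiffusionEngine accepts as-is:

from transformers import AutoTokenizer
from dfastllm import load_quantized_model
from dfastllm.engine.diffusion import DiffusionEngine

# INT4 weights trade a little accuracy for ~2.8x less memory (see table below)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

engine = DiffusionEngine(model, tokenizer)
print(engine.generate("What is artificial intelligence?", max_tokens=64))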

Batch Processing (4x Throughput)

prompts = ["What is AI?", "What is ML?", "What is DL?", "What is NLP?"]
outputs = engine.generate(prompts, max_tokens=64)
# ~790 tok/s at batch size 4; 1,000+ tok/s at batch size 8 (see Performance below)
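
To check throughput on your own hardware, here is a quick hand-rolled measurement. It assumes every sequence generates the full max_tokens, so treat the result as an upper bound:

import time

prompts = ["What is AI?"] * 8  # batch size 8
start = time.perf_counter()
outputs = engine.generate(prompts, max_tokens=64)
elapsed = time.perf_counter() - start

print(f"~{64 * len(prompts) / elapsed:.0f} tok/s")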

🔥 Features

| Feature | Description |
| --- | --- |
| Diffusion Generation | Parallel token unmasking |
| Batch Processing | Process multiple requests |
| INT4/INT8 Quantization | 2-4x memory reduction |
| torch.compile | JIT compilation for 2x speedup |
| FlashAttention | Memory-efficient attention |
| Multi-GPU | Tensor parallelism |
| OpenAI API | Drop-in compatible server |
| Streaming | Real-time token streaming |
| CUDA Graphs | Zero-overhead inference |
| Kubernetes | Production deployment |
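
torch.compile and FlashAttention can also be switched on through the standard PyTorch and Transformers knobs when loading the model yourself. A sketch (DFastLLM may expose its own configuration for these; flash_attention_2 requires the flash-attn package and a model implementation that supports it):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
)
model = torch.compile(model)  # first call pays a one-time compilation cost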

📊 Performance

Benchmarked on NVIDIA L40S (46GB) with LLaDA-8B:

| Batch Size | Throughput | Latency | Speedup |
| --- | --- | --- | --- |
| 1 | 265 tok/s | 241 ms | 1.0x |
| 2 | 484 tok/s | 132 ms | 1.8x |
| 4 | 786 tok/s | 81 ms | 3.0x |
| 8 | 1,056 tok/s | 61 ms | 4.0x |

Memory Usage

| Configuration | Memory | Notes |
| --- | --- | --- |
| BF16 | 16.8 GB | Default |
| INT8 | ~10 GB | 1.7x reduction |
| INT4 | ~6 GB | 2.8x reduction |
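
These figures are roughly what a weights-only estimate predicts; the remainder is activations, buffers, and layers kept in higher precision:

params = 8e9  # LLaDA-8B
for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{params * bits / 8 / 1024**3:.1f} GiB of weights")
# BF16: ~14.9 GiB, INT8: ~7.5 GiB, INT4: ~3.7 GiB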

🐳 Docker

# GPU image
docker run --gpus all -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:gpu

# CPU image
docker run -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:latest
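
Both images start the OpenAI-compatible server (see below) on port 8000.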

☸️ Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfastllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dfastllm
  template:
    metadata:
      labels:
        app: dfastllm
    spec:
      containers:
      - name: dfastllm
        image: ghcr.io/dfastllm-project/dfastllm:gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
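
Apply it with `kubectl apply -f deployment.yaml`; for a quick local test you can reach the pod with `kubectl port-forward deploy/dfastllm 8000:8000`, while production traffic would normally go through a Service or Ingress.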

🌐 OpenAI-Compatible API

Start the server:

dfastllm-serve --model GSAI-ML/LLaDA-8B-Instruct --port 8000

Use with OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    stream=True,
)

for chunk in response:
    # delta.content is None on role/finish chunks, so guard before printing
    print(chunk.choices[0].delta.content or "", end="", flush=True)
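
For non-streaming use, drop stream=True and read the complete message instead:

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
)
print(response.choices[0].message.content)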

🛠️ Supported Models

| Model | Parameters | Status |
| --- | --- | --- |
| LLaDA-8B-Instruct | 8B | ✅ Full Support |
| LLaDA-8B-Base | 8B | ✅ Full Support |
| Dream | 7B | ⚠️ Experimental |
| MDLM | Various | ⚠️ Experimental |


🤝 Contributing

We welcome contributions! Here's how to get started:

# Clone the repo
git clone https://github.com/dfastllm-project/dfastllm.git
cd dfastllm

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check dfastllm/
black --check dfastllm/

See CONTRIBUTING.md for detailed guidelines.


📄 License

Apache 2.0 - See LICENSE for details.



Made with ❤️ by the DFastLLM Team
