Ultra-fast parallel training and inference for language models
Project description
Parallel-LLM: Ultra-Fast Parallel Training & Inference
Parallel-LLM is a production-ready, cross-platform library for training and inference of language models with revolutionary parallel token generation. Generate all tokens at once instead of one-by-one using our hybrid diffusion-energy architecture.
๐ Cross-Platform Support: Works seamlessly on Windows, Linux, and macOS with graceful degradation for optional dependencies. One-command installation works everywhere!
๐ Key Features
Training
- Full Parallelism: Data + Tensor + Pipeline + Expert parallelism
- FSDP2: PyTorch's latest fully sharded data parallel with DTensor
- DeepSpeed ZeRO: Stages 1, 2, 3 with CPU offloading
- Flash Attention 3: Up to 75% GPU utilization on H100
- torch.compile: Automatic kernel fusion and optimization
- Mixed Precision: FP16, BF16, FP8 support
- Gradient Checkpointing: Selective activation checkpointing
Inference
- Parallel Generation: Generate 64+ tokens simultaneously
- 1.5-3ร Faster: Compared to autoregressive decoding
- Paged KV Cache: Memory-efficient attention like vLLM
- CUDA Graphs: Zero CPU overhead
- Continuous Batching: Dynamic request handling
- Speculative Decoding: Draft model verification
Multimodal
- Vision-Language Models: CLIP-style contrastive learning
- Cross-Modal Fusion: Attention-based alignment
- Unified Architecture: Single model for text + vision
๐ฆ Installation
๐ One-Command Installation (Cross-Platform)
pip install parallel-llm
This single command works on Windows, Linux, and macOS! The installer automatically detects your platform and installs the appropriate PyTorch version.
๐ ๏ธ Advanced Installation
For more control or if the one-command install fails:
# Download and run the cross-platform installer
curl -fsSL https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py | python3
# Or download and run locally
wget https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py
python install_parallel_llm.py
Or manually:
# Step 1: Install PyTorch (platform-specific)
# Windows:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # CUDA
# OR
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu # CPU only
# Linux:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # CUDA
# OR
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu # CPU only
# macOS:
pip install torch torchvision torchaudio # CPU/MPS support included
# Step 2: Install Parallel-LLM
pip install parallel-llm
Optional Dependencies
Install with specific features (all cross-platform where possible):
# GPU acceleration (may not be available on all platforms)
pip install parallel-llm[gpu]
# Distributed training (may not be available on all platforms)
pip install parallel-llm[distributed]
# Multimodal models (cross-platform)
pip install parallel-llm[multimodal]
# Inference optimization (may not be available on all platforms)
pip install parallel-llm[inference]
# Logging and monitoring (cross-platform)
pip install parallel-llm[logging]
# Dataset utilities (cross-platform)
pip install parallel-llm[datasets]
# Development tools (cross-platform)
pip install parallel-llm[dev]
# Install everything
pip install parallel-llm[all]
From Source
git clone https://github.com/furqan-y-khan/parallel-llm
cd parallel-llm
pip install -e .
Requirements
- Python >= 3.9
- PyTorch >= 2.2.0 (automatically installed with platform-specific version)
- No CUDA required - works on CPU-only systems
- Optional: CUDA >= 11.8 for GPU acceleration
- Optional: 16GB+ GPU memory recommended for full functionality
๐ฅ Examples
๐ Quick Start Examples
All examples are available in the examples/ directory and include cross-platform compatibility checks.
1. Text Generation (Unimodal Inference)
File: examples/inference_unimodal.py
Demonstrates parallel text generation using the DiffusionTransformer architecture.
cd examples
python inference_unimodal.py
Features:
- Parallel token generation (64 tokens simultaneously)
- GPT-2 tokenizer integration
- Adaptive refinement based on confidence scores
- CUDA graphs for maximum performance
2. Image Captioning (Multimodal Inference)
File: examples/inference_multimodal.py
Shows how to generate captions for images using multimodal models.
cd examples
python inference_multimodal.py
Features:
- Vision-language understanding
- ViT image encoder integration
- Cross-modal attention fusion
- COCO dataset image processing
3. Language Model Training (Unimodal Training)
File: examples/train_unimodal.py
Complete distributed training setup for text-only language models.
cd examples
python train_unimodal.py
Features:
- FSDP (Fully Sharded Data Parallel)
- Mixed precision training (BF16/FP16)
- Gradient checkpointing
- WikiText-2 dataset integration
- Distributed training with NCCL
4. Vision-Language Training (Multimodal Training)
File: examples/train_multimodal.py
Training multimodal models that understand both text and images.
cd examples
python train_multimodal.py
Features:
- Contrastive learning (CLIP-style)
- Cross-attention fusion
- Image-text pair processing
- Gradient checkpointing for memory efficiency
๐ Code Examples
Basic Text Generation
from parallel_llm import DiffusionTransformer, ModelConfig, ParallelGenerator, GenerationConfig
# Configure model
config = ModelConfig(
vocab_size=50257, # GPT-2 vocabulary
hidden_size=1024,
num_hidden_layers=12,
num_attention_heads=16,
use_flash_attention=True,
)
# Create model
model = DiffusionTransformer(config)
# Configure generation
gen_config = GenerationConfig(
max_new_tokens=128,
num_parallel_tokens=64, # Generate 64 tokens at once!
num_refinement_steps=5,
temperature=0.8,
top_k=50,
)
# Create generator
generator = ParallelGenerator(
model=model,
config=gen_config,
use_kv_cache=True,
use_cuda_graphs=True
)
# Generate text
prompt = "The future of AI is"
generated_tokens = generator.generate(tokenizer.encode(prompt))
generated_text = tokenizer.decode(generated_tokens[0])
Multimodal Image Understanding
from parallel_llm import DiffusionTransformer, MultimodalConfig
from transformers import AutoImageProcessor, AutoTokenizer
# Configure multimodal model
config = MultimodalConfig(
vocab_size=50257,
vision_encoder="vit",
image_size=224,
patch_size=16,
fusion_type="cross_attention",
use_contrastive=True,
)
# Create model
model = DiffusionTransformer(config)
# Process image and text
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Load and process image
image = Image.open("path/to/image.jpg")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
# Prepare text prompt
text = "Describe this image:"
input_ids = tokenizer.encode(text, return_tensors="pt")
# Generate caption
outputs = generator.generate(
input_ids=input_ids,
pixel_values=pixel_values
)
caption = tokenizer.decode(outputs[0])
Distributed Training Setup
from parallel_llm import DiffusionTransformer, TrainingConfig, DistributedTrainer
from torch.utils.data import DataLoader
# Configure training
train_config = TrainingConfig(
output_dir="./checkpoints",
num_train_steps=50000,
batch_size=8,
learning_rate=3e-4,
warmup_steps=1000,
use_fsdp=True, # Fully Sharded Data Parallel
fsdp_sharding_strategy="full",
mixed_precision="bf16",
gradient_checkpointing=True,
logging_steps=10,
save_steps=1000,
)
# Create model and trainer
model = DiffusionTransformer(model_config)
trainer = DistributedTrainer(
model=model,
train_config=train_config,
model_config=model_config,
train_dataloader=train_dataloader,
)
# Train (supports multi-GPU, multi-node)
trainer.train()
๐ง Platform-Specific Notes
Linux (Recommended for full functionality)
# Install with all GPU features
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install parallel-llm[gpu,distributed,inference]
Windows/macOS (CPU-only or limited GPU)
# CPU-only installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install parallel-llm[multimodal,logging]
# macOS with MPS (Metal Performance Shaders)
pip install torch torchvision torchaudio # Includes MPS support
๐ฏ Running Examples on Different Platforms
All examples include automatic platform detection and provide helpful guidance:
- On Linux with CUDA: Full functionality with GPU acceleration
- On Windows/macOS: CPU-only mode with clear instructions to switch to Linux
- Missing dependencies: Graceful degradation with installation guidance
Each example checks for required dependencies and provides platform-specific installation instructions if something is missing.
๐ฅ๏ธ Command Line Interface
Parallel-LLM includes CLI tools for easy training and inference:
# Train a model
parallel-llm-train --config config.yaml --output-dir ./checkpoints
# Run inference
parallel-llm-infer --model-path ./checkpoints/model.bin --prompt "Hello world"
Compatibility Module
The library includes a cross-platform compatibility module:
from parallel_llm import compat
# Check PyTorch CUDA availability
cuda_ok, cuda_msg = compat.check_pytorch_cuda()
print(f"CUDA: {cuda_msg}")
# Get optimal device
device, device_msg = compat.get_optimal_device()
print(f"Using: {device_msg}")
# Get platform-specific installation instructions
print(compat.get_installation_instructions())
๐๏ธ Architecture
Hybrid Diffusion-Energy Framework
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Input: [MASK] [MASK] [MASK] ... [MASK] โ
โโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Diffusion Transformer โ
โ (Bidirectional Attention) โ
โโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Multi-Token Predictions โ
โ With Confidence Scores โ
โโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Energy-Based Refinement โ
โ (Sequence-Level Scoring) โ
โโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Adaptive Masking โ
โ (Keep high-confidence) โ
โโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ
โ
Output: All tokens generated
Key Innovations
- Masked Diffusion: Start with all [MASK] tokens, iteratively refine
- Bidirectional Attention: Each token sees entire context
- Confidence-Based Masking: Adaptively accept high-confidence predictions
- Energy Model: Global sequence coherence checking
- Parallel Decoding: 64+ tokens per forward pass
๐ Performance
Speed Comparison (Llama-7B equivalent)
| Method | Tokens/sec | Speedup |
|---|---|---|
| Autoregressive (HF) | 25 | 1.0ร |
| vLLM | 45 | 1.8ร |
| Parallel-LLM | 75 | 3.0ร |
Memory Efficiency
| Batch Size | Standard | Parallel-LLM |
|---|---|---|
| 1 | 16 GB | 12 GB |
| 8 | 128 GB | 48 GB |
| 32 | OOM | 96 GB |
๐ ๏ธ Advanced Features
Distributed Training
# Launch with torchrun
torchrun --nproc_per_node=8 train.py \
--use-fsdp \
--fsdp-sharding-strategy full \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1
Custom Kernels
from parallel_llm.kernels import fused_attention, parallel_decode
# Use optimized Triton kernels
output = fused_attention(query, key, value, use_flash=True)
# Parallel token decoding
tokens = parallel_decode(logits, num_parallel=64)
Quantization
from parallel_llm.quantization import quantize_model
# Quantize to INT8 or FP8
model = quantize_model(model, precision="fp8")
๐ Documentation
๐ Guides
- Examples - Complete working examples for all use cases
- Training Guide - Distributed training setup
- Inference Guide - Parallel generation optimization
- Multimodal Guide - Vision-language models
- Performance Tuning - Optimization techniques
๐ง API Reference
- Core API - Model configurations and architectures
- Training API - Distributed training components
- Inference API - Parallel generation systems
- Utilities - Data loading and processing
- Compatibility - Cross-platform support
๐ Quick References
- Installation Script - Automated cross-platform installer
- PyPI Package - Package information
- GitHub Repository - Source code
๐ค Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
๐ License
Apache 2.0 License. See LICENSE for details.
๐ Acknowledgments
Built on research and technologies from:
Core Technologies
- PyTorch - Deep learning framework
- Transformers (Hugging Face) - Model architectures
- Accelerate (Hugging Face) - Distributed training utilities
Research Papers & Methods
- FlashAttention (Dao et al.) - Efficient attention computation
- Diffusion Language Models - Parallel generation techniques
- DeepSpeed ZeRO (Microsoft) - Memory-efficient training
- vLLM (UC Berkeley) - High-throughput inference
- PyTorch FSDP (Meta) - Distributed data parallel
Datasets & Models
- GPT-2 (OpenAI) - Base model architecture
- ViT (Google) - Vision transformer
- CLIP (OpenAI) - Vision-language understanding
- WikiText & COCO - Training datasets
๐ Contact & Support
- Email: furqan@lastappstanding.com
- GitHub Issues: Report bugs & request features
- Discussions: Community forum
Getting Help
- Check the examples in the
examples/directory - Read the documentation linked above
- Search existing issues on GitHub
- Open a new issue if needed
๐ Community
If you find this project useful, please:
- โญ Star the repository
- ๐ Report any issues you encounter
- ๐ก Suggest new features or improvements
- ๐ค Contribute code or documentation
๐ Project Stats
- Version: 0.4.6
- Python: 3.9+
- Platforms: Windows, Linux, macOS
- License: Apache 2.0
- Status: Active Development
Citation
@software{parallel_llm_2025,
title = {Parallel-LLM: Ultra-Fast Parallel Training and Inference for Language Models},
author = {Khan, Furqan and Last App Standing Team},
year = {2025},
url = {https://github.com/furqan-y-khan/parallel-llm},
version = {0.4.6}
}
@article{parallel_generation_2025,
title = {Parallel Token Generation: Diffusion-Based Language Model Inference},
author = {Khan, Furqan},
journal = {arXiv preprint},
year = {2025},
note = {Parallel-LLM library implementation}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file parallel_llm-0.4.6.tar.gz.
File metadata
- Download URL: parallel_llm-0.4.6.tar.gz
- Upload date:
- Size: 58.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0b4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8822a238a6b115a7368b0a3c203ce060bd118521abf0b84ed3685fa31814655c
|
|
| MD5 |
7bf3e4546e83c5e028d8f29f0e0375df
|
|
| BLAKE2b-256 |
d9e1441f5ff272d82fc2c8d879c4710b40a02fe7fc831b66639c1f2da24dcc23
|
File details
Details for the file parallel_llm-0.4.6-py3-none-any.whl.
File metadata
- Download URL: parallel_llm-0.4.6-py3-none-any.whl
- Upload date:
- Size: 35.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0b4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9d2b3feae661c00261077c09b5c547e7cfade5ccdd088b87d9eb745e1167c7aa
|
|
| MD5 |
7a67b57230e90240dc28a9b64156d848
|
|
| BLAKE2b-256 |
f212ef0f36c2508e7805b33c98f8c0d3f69e2dcc8ddaa92205cc40a6c268d2c8
|