Ultra-fast parallel training and inference for language models

Parallel-LLM: Ultra-Fast Parallel Training & Inference

Parallel-LLM is a production-ready, cross-platform library for training and inference of language models with parallel token generation. Instead of decoding one token at a time, its hybrid diffusion-energy architecture predicts a whole block of tokens at once and refines them over a few steps.

🚀 Cross-Platform Support: Works seamlessly on Windows, Linux, and macOS with graceful degradation for optional dependencies. One-command installation works everywhere!

🚀 Key Features

Training

Full Parallelism: Data + Tensor + Pipeline + Expert parallelism
FSDP2: PyTorch's latest fully sharded data parallel with DTensor
DeepSpeed ZeRO: Stages 1, 2, 3 with CPU offloading
Flash Attention 3: Up to 75% of peak FLOPs on H100 (figure reported by the FlashAttention-3 authors)
torch.compile: Automatic kernel fusion and optimization
Mixed Precision: FP16, BF16, FP8 support
Gradient Checkpointing: Selective activation checkpointing

Inference

Parallel Generation: Generate 64+ tokens simultaneously
1.5-3× Faster: Compared to autoregressive decoding
Paged KV Cache: Memory-efficient attention like vLLM
CUDA Graphs: Minimal CPU kernel-launch overhead
Continuous Batching: Dynamic request handling
Speculative Decoding: Draft model verification

Multimodal

Vision-Language Models: CLIP-style contrastive learning
Cross-Modal Fusion: Attention-based alignment
Unified Architecture: Single model for text + vision

📦 Installation

🚀 One-Command Installation (Cross-Platform)

pip install parallel-llm

This single command works on Windows, Linux, and macOS! The installer automatically detects your platform and installs the appropriate PyTorch version.

🛠️ Advanced Installation

For more control or if the one-command install fails:

# Download and run the cross-platform installer
curl -fsSL https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py | python3

# Or download and run locally
wget https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py
python3 install_parallel_llm.py

Or manually:

# Step 1: Install PyTorch (platform-specific)
# Windows and Linux (same commands):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
# OR
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu    # CPU only

# macOS:
pip install torch torchvision torchaudio  # CPU/MPS support included

# Step 2: Install Parallel-LLM
pip install parallel-llm

Optional Dependencies

Install with specific features (all cross-platform where possible):

# Quote the package name so shells such as zsh do not try to expand the brackets

# GPU acceleration (may not be available on all platforms)
pip install "parallel-llm[gpu]"

# Distributed training (may not be available on all platforms)
pip install "parallel-llm[distributed]"

# Multimodal models (cross-platform)
pip install "parallel-llm[multimodal]"

# Inference optimization (may not be available on all platforms)
pip install "parallel-llm[inference]"

# Logging and monitoring (cross-platform)
pip install "parallel-llm[logging]"

# Dataset utilities (cross-platform)
pip install "parallel-llm[datasets]"

# Development tools (cross-platform)
pip install "parallel-llm[dev]"

# Install everything
pip install "parallel-llm[all]"

From Source

git clone https://github.com/furqan-y-khan/parallel-llm
cd parallel-llm
pip install -e .

Requirements

Python >= 3.9
PyTorch >= 2.2.0 (automatically installed with platform-specific version)
No CUDA required - works on CPU-only systems
Optional: CUDA >= 11.8 for GPU acceleration
Optional: 16GB+ GPU memory recommended for full functionality
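
The requirements above can be checked before installing. The following is a minimal sketch that uses only the standard library plus an optional PyTorch import; `check_environment` is a hypothetical helper name for illustration and is not part of Parallel-LLM.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9), min_torch=(2, 2)):
    """Report whether the interpreter and (optional) PyTorch meet the stated minimums."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_ok": False,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        # torch.__version__ looks like "2.2.0+cu121"; compare major/minor only.
        major, minor = (int(part) for part in torch.__version__.split(".")[:2])
        report["torch_ok"] = (major, minor) >= min_torch
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A `cuda_available: False` result is not an error, since the library states that it also runs on CPU-only systems.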

🔥 Examples

🚀 Quick Start Examples

All examples are available in the examples/ directory and include cross-platform compatibility checks.

1. Text Generation (Unimodal Inference)

File: examples/inference_unimodal.py

Demonstrates parallel text generation using the DiffusionTransformer architecture.

cd examples
python inference_unimodal.py

Features:

Parallel token generation (64 tokens simultaneously)
GPT-2 tokenizer integration
Adaptive refinement based on confidence scores
CUDA graphs for maximum performance

2. Image Captioning (Multimodal Inference)

File: examples/inference_multimodal.py

Shows how to generate captions for images using multimodal models.

cd examples
python inference_multimodal.py

Features:

Vision-language understanding
ViT image encoder integration
Cross-modal attention fusion
COCO dataset image processing

3. Language Model Training (Unimodal Training)

File: examples/train_unimodal.py

Complete distributed training setup for text-only language models.

cd examples
python train_unimodal.py

Features:

FSDP (Fully Sharded Data Parallel)
Mixed precision training (BF16/FP16)
Gradient checkpointing
WikiText-2 dataset integration
Distributed training with NCCL

4. Vision-Language Training (Multimodal Training)

File: examples/train_multimodal.py

Training multimodal models that understand both text and images.

cd examples
python train_multimodal.py

Features:

Contrastive learning (CLIP-style)
Cross-attention fusion
Image-text pair processing
Gradient checkpointing for memory efficiency

📖 Code Examples

Basic Text Generation

from transformers import AutoTokenizer

from parallel_llm import DiffusionTransformer, ModelConfig, ParallelGenerator, GenerationConfig

# Load the GPT-2 tokenizer; its vocabulary size matches vocab_size below
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Configure model
config = ModelConfig(
    vocab_size=50257,  # GPT-2 vocabulary
    hidden_size=1024,
    num_hidden_layers=12,
    num_attention_heads=16,
    use_flash_attention=True,
)

# Create model (randomly initialized; load trained weights for meaningful output)
model = DiffusionTransformer(config)

# Configure generation
gen_config = GenerationConfig(
    max_new_tokens=128,
    num_parallel_tokens=64,  # Generate 64 tokens at once!
    num_refinement_steps=5,
    temperature=0.8,
    top_k=50,
)

# Create generator
generator = ParallelGenerator(
    model=model,
    config=gen_config,
    use_kv_cache=True,
    use_cuda_graphs=True
)

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")  # shape (1, prompt_length)
generated_tokens = generator.generate(input_ids)
generated_text = tokenizer.decode(generated_tokens[0])

Multimodal Image Understanding

from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer

from parallel_llm import DiffusionTransformer, GenerationConfig, MultimodalConfig, ParallelGenerator

# Configure multimodal model
config = MultimodalConfig(
    vocab_size=50257,
    vision_encoder="vit",
    image_size=224,
    patch_size=16,
    fusion_type="cross_attention",
    use_contrastive=True,
)

# Create model and generator
model = DiffusionTransformer(config)
generator = ParallelGenerator(model=model, config=GenerationConfig(max_new_tokens=64))

# Process image and text
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load and process image (convert to RGB so the ViT processor gets 3 channels)
image = Image.open("path/to/image.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Prepare text prompt
text = "Describe this image:"
input_ids = tokenizer.encode(text, return_tensors="pt")

# Generate caption
outputs = generator.generate(
    input_ids=input_ids,
    pixel_values=pixel_values
)
caption = tokenizer.decode(outputs[0])

Distributed Training Setup

from parallel_llm import DiffusionTransformer, DistributedTrainer, ModelConfig, TrainingConfig
from torch.utils.data import DataLoader

# Configure model (same settings as the text generation example)
model_config = ModelConfig(
    vocab_size=50257,
    hidden_size=1024,
    num_hidden_layers=12,
    num_attention_heads=16,
)

# Configure training
train_config = TrainingConfig(
    output_dir="./checkpoints",
    num_train_steps=50000,
    batch_size=8,
    learning_rate=3e-4,
    warmup_steps=1000,
    use_fsdp=True,  # Fully Sharded Data Parallel
    fsdp_sharding_strategy="full",
    mixed_precision="bf16",
    gradient_checkpointing=True,
    logging_steps=10,
    save_steps=1000,
)

# Wrap your tokenized data; train_dataset is a placeholder for any torch Dataset
train_dataloader = DataLoader(train_dataset, batch_size=8)

# Create model and trainer
model = DiffusionTransformer(model_config)
trainer = DistributedTrainer(
    model=model,
    train_config=train_config,
    model_config=model_config,
    train_dataloader=train_dataloader,
)

# Train (supports multi-GPU, multi-node)
trainer.train()

🔧 Platform-Specific Notes

Linux (Recommended for full functionality)

# Install with all GPU features
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "parallel-llm[gpu,distributed,inference]"

Windows/macOS (CPU-only or limited GPU)

# CPU-only installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install "parallel-llm[multimodal,logging]"

# macOS with MPS (Metal Performance Shaders)
pip install torch torchvision torchaudio  # Includes MPS support

🎯 Running Examples on Different Platforms

All examples include automatic platform detection and provide helpful guidance:

On Linux with CUDA: Full functionality with GPU acceleration
On Windows/macOS: CPU-only mode with clear instructions to switch to Linux
Missing dependencies: Graceful degradation with installation guidance

Each example checks for required dependencies and provides platform-specific installation instructions if something is missing.

🖥️ Command Line Interface

Parallel-LLM includes CLI tools for easy training and inference:

# Train a model
parallel-llm-train --config config.yaml --output-dir ./checkpoints

# Run inference
parallel-llm-infer --model-path ./checkpoints/model.bin --prompt "Hello world"

Compatibility Module

The library includes a cross-platform compatibility module:

from parallel_llm import compat

# Check PyTorch CUDA availability
cuda_ok, cuda_msg = compat.check_pytorch_cuda()
print(f"CUDA: {cuda_msg}")

# Get optimal device
device, device_msg = compat.get_optimal_device()
print(f"Using: {device_msg}")

# Get platform-specific installation instructions
print(compat.get_installation_instructions())
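
For readers curious what device selection of this kind usually involves, here is a minimal sketch of a CUDA, then MPS, then CPU fallback. It is not the implementation of `compat.get_optimal_device()`; `pick_device` is a hypothetical name, and only standard PyTorch calls are used.

```python
import importlib.util


def pick_device():
    """Return (device, message), preferring CUDA, then Apple MPS, then CPU."""
    if importlib.util.find_spec("torch") is None:
        return "cpu", "PyTorch not installed; defaulting to CPU"
    import torch

    if torch.cuda.is_available():
        return "cuda", f"CUDA GPU: {torch.cuda.get_device_name(0)}"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps", "Apple Metal (MPS) backend"
    return "cpu", "No GPU backend found; using CPU"


if __name__ == "__main__":
    device, message = pick_device()
    print(f"Using {device}: {message}")
```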

🏗️ Architecture

Hybrid Diffusion-Energy Framework

┌─────────────────────────────────────────┐
│  Input: [MASK] [MASK] [MASK] ... [MASK] │
└───────────────┬─────────────────────────┘
                ↓
    ┌───────────────────────────┐
    │  Diffusion Transformer     │
    │  (Bidirectional Attention) │
    └───────────┬───────────────┘
                ↓
    ┌───────────────────────────┐
    │  Multi-Token Predictions   │
    │  With Confidence Scores    │
    └───────────┬───────────────┘
                ↓
    ┌───────────────────────────┐
    │  Energy-Based Refinement   │
    │  (Sequence-Level Scoring)  │
    └───────────┬───────────────┘
                ↓
    ┌───────────────────────────┐
    │  Adaptive Masking          │
    │  (Keep high-confidence)    │
    └───────────┬───────────────┘
                ↓
    Output: All tokens generated

Key Innovations

Masked Diffusion: Start with all [MASK] tokens, iteratively refine
Bidirectional Attention: Each token sees entire context
Confidence-Based Masking: Adaptively accept high-confidence predictions
Energy Model: Global sequence coherence checking
Parallel Decoding: 64+ tokens per forward pass
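
The masked-diffusion loop with confidence-based masking described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the library's decoder: `toy_model` is a random stand-in for the diffusion transformer, the `MASK` sentinel and the `threshold` value are invented for the example, and the energy-based refinement stage is omitted.

```python
import random

MASK = -1  # sentinel id for a masked position (illustrative, not the library's value)


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a bidirectional denoiser: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode_sketch(length, vocab_size=100, steps=5, threshold=0.7, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length  # start fully masked
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break  # everything was accepted early
        predictions = toy_model(tokens, vocab_size, rng)  # all positions in one pass
        last_step = step == steps - 1
        for i in masked:
            token, confidence = predictions[i]
            # Keep confident predictions; on the final step accept whatever is left.
            if last_step or confidence >= threshold:
                tokens[i] = token
    return tokens


if __name__ == "__main__":
    print(parallel_decode_sketch(length=16))
```

Each pass proposes a token for every position at once, which is where the parallelism comes from; the number of passes is bounded by `steps` rather than by the sequence length.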

📊 Performance

Speed Comparison (Llama-7B equivalent)

Method	Tokens/sec	Speedup
Autoregressive (HF)	25	1.0×
vLLM	45	1.8×
Parallel-LLM	75	3.0×
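
The Speedup column is each method's throughput divided by the autoregressive baseline. The short check below reproduces it from the Tokens/sec column; the numbers are copied from the table and nothing is measured here.

```python
# Tokens/sec values copied from the table above.
throughput = {"Autoregressive (HF)": 25, "vLLM": 45, "Parallel-LLM": 75}

baseline = throughput["Autoregressive (HF)"]
speedup = {name: round(tps / baseline, 1) for name, tps in throughput.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor}x")
# Autoregressive (HF): 1.0x
# vLLM: 1.8x
# Parallel-LLM: 3.0x
```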

Memory Efficiency

Batch Size	Standard	Parallel-LLM
1	16 GB	12 GB
8	128 GB	48 GB
32	OOM	96 GB
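
One way to read this table is per sequence in the batch: the Standard column grows linearly at 16 GB per sequence, while the Parallel-LLM column grows more slowly. The sketch below only does that arithmetic on the table's figures; the variable names are illustrative.

```python
# (batch size, Standard GB, Parallel-LLM GB); None marks the out-of-memory row.
rows = [(1, 16, 12), (8, 128, 48), (32, None, 96)]

# Memory per sequence for Parallel-LLM at each batch size.
per_sample_gb = {batch: parallel / batch for batch, _, parallel in rows}

# Reduction factor relative to Standard, where Standard fits in memory.
reduction = {
    batch: standard / parallel
    for batch, standard, parallel in rows
    if standard is not None
}

for batch, _, parallel in rows:
    factor = f"{reduction[batch]:.2f}x" if batch in reduction else "n/a (Standard is OOM)"
    print(f"batch={batch:>2}  per-sequence={per_sample_gb[batch]:.1f} GB  reduction={factor}")
```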

🛠️ Advanced Features

Distributed Training

# Launch with torchrun
torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --fsdp-sharding-strategy full \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1
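
When a script is launched this way, torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker process. A training script typically reads them as sketched below; `read_torchrun_env` is a hypothetical helper, and how `train.py` in this project consumes these values is not shown in the source.

```python
import os


def read_torchrun_env(env=os.environ):
    """Read the rank variables torchrun exports; defaults describe a single-process run."""
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    info = read_torchrun_env()
    print(f"process {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

With `--nproc_per_node=8` on a single node, `world_size` is 8 and `local_rank` ranges from 0 to 7, which is usually the index of the GPU each process should use.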

Custom Kernels

from parallel_llm.kernels import fused_attention, parallel_decode

# Use optimized Triton kernels
output = fused_attention(query, key, value, use_flash=True)

# Parallel token decoding
tokens = parallel_decode(logits, num_parallel=64)
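
As a reference point for what an attention kernel computes, here is plain scaled dot-product attention, softmax(QK^T / sqrt(d)) V, written in pure Python. It is a readable baseline for checking results on tiny inputs, not the Triton kernel behind `fused_attention`, and `attention_reference` is a hypothetical name.

```python
import math


def attention_reference(query, key, value):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors (no masking, no batching)."""
    d = len(query[0])
    output = []
    for q in query:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [1.0, 0.0]]
    v = [[2.0, 0.0], [4.0, 0.0]]
    print(attention_reference(q, k, v))  # equal scores, so the two value rows are averaged
```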

Quantization

from parallel_llm.quantization import quantize_model

# Quantize to INT8 or FP8
model = quantize_model(model, precision="fp8")
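
To make the INT8 case concrete, the sketch below shows symmetric per-tensor quantization on a plain list of floats. It illustrates the general technique only; how `quantize_model` chooses scales or handles FP8 is not described in the source, and both function names here are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8: map [-max|x|, max|x|] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats; the error is at most half a quantization step."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    print(q, scale)
    print(restored)
```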

📚 Documentation

📖 Guides

Examples - Complete working examples for all use cases
Training Guide - Distributed training setup
Inference Guide - Parallel generation optimization
Multimodal Guide - Vision-language models
Performance Tuning - Optimization techniques

🔧 API Reference

Core API - Model configurations and architectures
Training API - Distributed training components
Inference API - Parallel generation systems
Utilities - Data loading and processing
Compatibility - Cross-platform support

📋 Quick References

Installation Script - Automated cross-platform installer
PyPI Package - Package information
GitHub Repository - Source code

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

📄 License

Apache 2.0 License. See LICENSE for details.

🙏 Acknowledgments

Built on research and technologies from:

Core Technologies

PyTorch - Deep learning framework
Transformers (Hugging Face) - Model architectures
Accelerate (Hugging Face) - Distributed training utilities

Research Papers & Methods

FlashAttention (Dao et al.) - Efficient attention computation
Diffusion Language Models - Parallel generation techniques
DeepSpeed ZeRO (Microsoft) - Memory-efficient training
vLLM (UC Berkeley) - High-throughput inference
PyTorch FSDP (Meta) - Fully sharded data parallel training

Datasets & Models

GPT-2 (OpenAI) - Base model architecture
ViT (Google) - Vision transformer
CLIP (OpenAI) - Vision-language understanding
WikiText & COCO - Training datasets

📞 Contact & Support

Email: furqan@lastappstanding.com
GitHub Issues: Report bugs & request features
Discussions: Community forum

Getting Help

Check the examples in the examples/ directory
Read the documentation linked above
Search existing issues on GitHub
Open a new issue if needed

🌟 Community

If you find this project useful, please:

⭐ Star the repository
🐛 Report any issues you encounter
💡 Suggest new features or improvements
🤝 Contribute code or documentation

📊 Project Stats

Version: 0.4.6
Python: 3.9+
Platforms: Windows, Linux, macOS
License: Apache 2.0
Status: Active Development

Citation

@software{parallel_llm_2025,
  title = {Parallel-LLM: Ultra-Fast Parallel Training and Inference for Language Models},
  author = {Khan, Furqan and Last App Standing Team},
  year = {2025},
  url = {https://github.com/furqan-y-khan/parallel-llm},
  version = {0.4.6}
}

@article{parallel_generation_2025,
  title = {Parallel Token Generation: Diffusion-Based Language Model Inference},
  author = {Khan, Furqan},
  journal = {arXiv preprint},
  year = {2025},
  note = {Parallel-LLM library implementation}
}

Release history

0.6.26

Nov 22, 2025

0.6.25

Nov 22, 2025

0.6.24

Nov 22, 2025

0.6.23

Nov 22, 2025

0.6.21

Nov 22, 2025

0.6.20

Nov 22, 2025

0.6.19

Nov 22, 2025

0.6.18

Nov 22, 2025

0.6.17

Nov 22, 2025

0.6.16

Nov 22, 2025

0.6.15

Nov 22, 2025

0.6.14

Nov 22, 2025

0.6.13

Nov 22, 2025

0.6.12

Nov 22, 2025

0.6.11

Nov 22, 2025

0.6.10

Nov 22, 2025

0.6.9

Nov 22, 2025

0.6.8

Nov 22, 2025

0.6.7

Nov 22, 2025

0.6.6

Nov 22, 2025

0.6.5

Nov 22, 2025

0.6.4

Nov 22, 2025

0.6.2

Nov 22, 2025

0.6.1

Nov 22, 2025

0.6.0

Nov 21, 2025

0.5.6

Nov 21, 2025

0.5.5

Nov 21, 2025

0.5.2

Nov 21, 2025

0.5.1

Nov 21, 2025

0.5.0

Nov 21, 2025

0.4.9

Nov 21, 2025

0.4.8

Nov 21, 2025

0.4.7

Nov 21, 2025

This version

0.4.6

Nov 21, 2025

0.4.5

Nov 21, 2025

0.4.2

Nov 21, 2025

0.4.1

Nov 21, 2025

0.4.0

Nov 21, 2025

0.3.0

Nov 21, 2025

0.2.0

Nov 21, 2025

0.1.1

Nov 17, 2025

0.1.0

Nov 17, 2025

Download files

Source Distribution

parallel_llm-0.4.6.tar.gz (58.4 kB)

Uploaded Nov 21, 2025 Source

Built Distribution

parallel_llm-0.4.6-py3-none-any.whl (35.6 kB)

Uploaded Nov 21, 2025 Python 3

File details

Details for the file parallel_llm-0.4.6.tar.gz.

File metadata

Download URL: parallel_llm-0.4.6.tar.gz
Upload date: Nov 21, 2025
Size: 58.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0b4

File hashes

Hashes for parallel_llm-0.4.6.tar.gz
Algorithm	Hash digest
SHA256	`8822a238a6b115a7368b0a3c203ce060bd118521abf0b84ed3685fa31814655c`
MD5	`7bf3e4546e83c5e028d8f29f0e0375df`
BLAKE2b-256	`d9e1441f5ff272d82fc2c8d879c4710b40a02fe7fc831b66639c1f2da24dcc23`

File details

Details for the file parallel_llm-0.4.6-py3-none-any.whl.

File metadata

Download URL: parallel_llm-0.4.6-py3-none-any.whl
Upload date: Nov 21, 2025
Size: 35.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0b4

File hashes

Hashes for parallel_llm-0.4.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d2b3feae661c00261077c09b5c547e7cfade5ccdd088b87d9eb745e1167c7aa`
MD5	`7a67b57230e90240dc28a9b64156d848`
BLAKE2b-256	`f212ef0f36c2508e7805b33c98f8c0d3f69e2dcc8ddaa92205cc40a6c268d2c8`
