Ultra-fast parallel training and inference for language models

Parallel-LLM: Ultra-Fast Parallel Training & Inference

Parallel-LLM is a production-ready, cross-platform library for training and inference of language models with revolutionary parallel token generation. Generate all tokens at once instead of one-by-one using our hybrid diffusion-energy architecture.

🚀 Cross-Platform Support: Works seamlessly on Windows, Linux, and macOS with graceful degradation for optional dependencies. One-command installation works everywhere!

🚀 Key Features

Training

  • Full Parallelism: Data + Tensor + Pipeline + Expert parallelism
  • FSDP2: PyTorch's latest fully sharded data parallel with DTensor
  • DeepSpeed ZeRO: Stages 1, 2, 3 with CPU offloading
  • Flash Attention 3: Up to 75% of peak FLOPs utilization on H100 (the figure reported for FlashAttention-3 in FP16)
  • torch.compile: Automatic kernel fusion and optimization
  • Mixed Precision: FP16, BF16, FP8 support
  • Gradient Checkpointing: Selective activation checkpointing
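
For a sense of what the ZeRO stages listed above save, the sketch below applies the per-GPU model-state estimates from the ZeRO paper for mixed-precision Adam training: 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state per parameter. The function name and arguments are illustrative and are not part of Parallel-LLM; activations and temporary buffers are ignored.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0 (no sharding), 1, 2, or 3")
    weights = 2.0 * num_params   # FP16 parameters
    grads = 2.0 * num_params     # FP16 gradients
    optim = 12.0 * num_params    # FP32 master weights, momentum, variance
    if stage >= 1:               # Stage 1 shards optimizer state
        optim /= num_gpus
    if stage >= 2:               # Stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:               # Stage 3 also shards parameters
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

With 7.5B parameters on 64 GPUs this reproduces the paper's worked example: roughly 120 GB unsharded, falling to under 2 GB per GPU at stage 3.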

Inference

  • Parallel Generation: Generate 64+ tokens simultaneously
  • 1.5-3× Faster: Compared to autoregressive decoding
  • Paged KV Cache: Memory-efficient attention like vLLM
  • CUDA Graphs: Captures kernel launches once and replays them, cutting per-step CPU launch overhead
  • Continuous Batching: Dynamic request handling
  • Speculative Decoding: Draft model verification

Multimodal

  • Vision-Language Models: CLIP-style contrastive learning
  • Cross-Modal Fusion: Attention-based alignment
  • Unified Architecture: Single model for text + vision
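
The CLIP-style contrastive objective mentioned above can be sketched in a few lines of dependency-free Python. This is the standard symmetric cross-entropy over an image-text similarity matrix, written for clarity rather than speed; the function names are illustrative and the library's own loss may differ. The default temperature of 0.07 is the initial value used by CLIP.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    image_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

Correctly paired embeddings give a loss near zero, while swapping the captions drives it up sharply.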

📦 Installation

🚀 One-Command Installation (Cross-Platform)

pip install parallel-llm

This single command works on Windows, Linux, and macOS! The installer automatically detects your platform and installs the appropriate PyTorch version.

๐Ÿ› ๏ธ Advanced Installation

For more control or if the one-command install fails:

# Download and run the cross-platform installer
curl -fsSL https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py | python3

# Or download and run locally
wget https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py
python install_parallel_llm.py

Or manually:

# Step 1: Install PyTorch (platform-specific)
# Windows or Linux:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # CUDA
# OR
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu    # CPU only

# macOS:
pip install torch torchvision torchaudio  # CPU/MPS support included

# Step 2: Install Parallel-LLM
pip install parallel-llm

Optional Dependencies

Install with specific features (all cross-platform where possible):

# Quote the extras so shells such as zsh (the macOS default) do not expand the brackets

# GPU acceleration (may not be available on all platforms)
pip install "parallel-llm[gpu]"

# Distributed training (may not be available on all platforms)
pip install "parallel-llm[distributed]"

# Multimodal models (cross-platform)
pip install "parallel-llm[multimodal]"

# Inference optimization (may not be available on all platforms)
pip install "parallel-llm[inference]"

# Logging and monitoring (cross-platform)
pip install "parallel-llm[logging]"

# Dataset utilities (cross-platform)
pip install "parallel-llm[datasets]"

# Development tools (cross-platform)
pip install "parallel-llm[dev]"

# Install everything
pip install "parallel-llm[all]"

From Source

git clone https://github.com/furqan-y-khan/parallel-llm
cd parallel-llm
pip install -e .

Requirements

  • Python >= 3.9
  • PyTorch >= 2.2.0 (automatically installed with platform-specific version)
  • No CUDA required - works on CPU-only systems
  • Optional: CUDA >= 11.8 for GPU acceleration
  • Optional: 16GB+ GPU memory recommended for full functionality

🔥 Examples

🚀 Quick Start Examples

All examples are available in the examples/ directory and include cross-platform compatibility checks.

1. Text Generation (Unimodal Inference)

File: examples/inference_unimodal.py

Demonstrates parallel text generation using the DiffusionTransformer architecture.

cd examples
python inference_unimodal.py

Features:

  • Parallel token generation (64 tokens simultaneously)
  • GPT-2 tokenizer integration
  • Adaptive refinement based on confidence scores
  • CUDA graphs for maximum performance

2. Image Captioning (Multimodal Inference)

File: examples/inference_multimodal.py

Shows how to generate captions for images using multimodal models.

cd examples
python inference_multimodal.py

Features:

  • Vision-language understanding
  • ViT image encoder integration
  • Cross-modal attention fusion
  • COCO dataset image processing

3. Language Model Training (Unimodal Training)

File: examples/train_unimodal.py

Complete distributed training setup for text-only language models.

cd examples
python train_unimodal.py

Features:

  • FSDP (Fully Sharded Data Parallel)
  • Mixed precision training (BF16/FP16)
  • Gradient checkpointing
  • WikiText-2 dataset integration
  • Distributed training with NCCL

4. Vision-Language Training (Multimodal Training)

File: examples/train_multimodal.py

Training multimodal models that understand both text and images.

cd examples
python train_multimodal.py

Features:

  • Contrastive learning (CLIP-style)
  • Cross-attention fusion
  • Image-text pair processing
  • Gradient checkpointing for memory efficiency

📖 Code Examples

Basic Text Generation

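from transformers import AutoTokenizer
from parallel_llm import DiffusionTransformer, ModelConfig, ParallelGenerator, GenerationConfig

# Load the GPT-2 tokenizer used below to encode the prompt and decode the output
tokenizer = AutoTokenizer.from_pretrained("gpt2")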

# Configure model
config = ModelConfig(
    vocab_size=50257,  # GPT-2 vocabulary
    hidden_size=1024,
    num_hidden_layers=12,
    num_attention_heads=16,
    use_flash_attention=True,
)

# Create model
model = DiffusionTransformer(config)

# Configure generation
gen_config = GenerationConfig(
    max_new_tokens=128,
    num_parallel_tokens=64,  # Generate 64 tokens at once!
    num_refinement_steps=5,
    temperature=0.8,
    top_k=50,
)

# Create generator
generator = ParallelGenerator(
    model=model,
    config=gen_config,
    use_kv_cache=True,
    use_cuda_graphs=True
)

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")  # batch of shape (1, seq_len)
generated_tokens = generator.generate(input_ids)
generated_text = tokenizer.decode(generated_tokens[0])
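
One rough way to read the generation settings above is to count model calls. The sketch below assumes each block of num_parallel_tokens positions is refined for num_refinement_steps forward passes; that scheduling is an assumption made for illustration, not something this README specifies, and the helper function is hypothetical.

```python
import math


def forward_passes(max_new_tokens: int, num_parallel_tokens: int,
                   num_refinement_steps: int) -> int:
    """Model calls needed if every block of tokens gets a fixed number of passes."""
    blocks = math.ceil(max_new_tokens / num_parallel_tokens)
    return blocks * num_refinement_steps


parallel_calls = forward_passes(128, 64, 5)
autoregressive_calls = 128  # one forward pass per generated token
print(f"parallel: {parallel_calls} calls, autoregressive: {autoregressive_calls} calls")
```

Fewer calls do not translate one-to-one into wall-clock speedup, since each parallel pass processes many positions at once.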

Multimodal Image Understanding

from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer
from parallel_llm import DiffusionTransformer, MultimodalConfig, ParallelGenerator, GenerationConfig

# Configure multimodal model
config = MultimodalConfig(
    vocab_size=50257,
    vision_encoder="vit",
    image_size=224,
    patch_size=16,
    fusion_type="cross_attention",
    use_contrastive=True,
)

# Create model
model = DiffusionTransformer(config)

# Create generator (same constructor as in the text generation example;
# other GenerationConfig fields are left at their defaults)
generator = ParallelGenerator(model=model, config=GenerationConfig(max_new_tokens=64))

# Load the image processor and tokenizer
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load and process image
image = Image.open("path/to/image.jpg")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Prepare text prompt
text = "Describe this image:"
input_ids = tokenizer.encode(text, return_tensors="pt")

# Generate caption
outputs = generator.generate(
    input_ids=input_ids,
    pixel_values=pixel_values
)
caption = tokenizer.decode(outputs[0])

Distributed Training Setup

from parallel_llm import DiffusionTransformer, ModelConfig, TrainingConfig, DistributedTrainer
from torch.utils.data import DataLoader

# Configure model (same fields as in the text generation example)
model_config = ModelConfig(
    vocab_size=50257,
    hidden_size=1024,
    num_hidden_layers=12,
    num_attention_heads=16,
)

# Configure training
train_config = TrainingConfig(
    output_dir="./checkpoints",
    num_train_steps=50000,
    batch_size=8,
    learning_rate=3e-4,
    warmup_steps=1000,
    use_fsdp=True,  # Fully Sharded Data Parallel
    fsdp_sharding_strategy="full",
    mixed_precision="bf16",
    gradient_checkpointing=True,
    logging_steps=10,
    save_steps=1000,
)

# train_dataset is a placeholder for your own tokenized dataset (not defined here)
train_dataloader = DataLoader(train_dataset, batch_size=8)

# Create model and trainer
model = DiffusionTransformer(model_config)
trainer = DistributedTrainer(
    model=model,
    train_config=train_config,
    model_config=model_config,
    train_dataloader=train_dataloader,
)

# Train (supports multi-GPU, multi-node)
trainer.train()

🔧 Platform-Specific Notes

Linux (Recommended for full functionality)

# Install with all GPU features
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "parallel-llm[gpu,distributed,inference]"

Windows/macOS (CPU-only or limited GPU)

# CPU-only installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install "parallel-llm[multimodal,logging]"

# macOS with MPS (Metal Performance Shaders)
pip install torch torchvision torchaudio  # Includes MPS support

🎯 Running Examples on Different Platforms

All examples include automatic platform detection and provide helpful guidance:

  • On Linux with CUDA: Full functionality with GPU acceleration
  • On Windows/macOS: CPU-only mode with clear instructions to switch to Linux
  • Missing dependencies: Graceful degradation with installation guidance

Each example checks for required dependencies and provides platform-specific installation instructions if something is missing.
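
A dependency check of this kind can be sketched with the standard library alone. The helper names below are hypothetical and this is not the code the examples ship with; it only shows one way to detect missing optional packages without importing them. Note that the import name passed in may differ from the name used with pip.

```python
import importlib.util
import platform


def check_optional_dependencies(packages):
    """Return {import_name: bool} without actually importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def describe_environment(packages):
    """Build a short report with an install hint for each missing package."""
    lines = [f"Platform: {platform.system()}"]
    for name, installed in check_optional_dependencies(packages).items():
        if installed:
            lines.append(f"{name}: available")
        else:
            lines.append(f"{name}: missing, try `pip install {name}`")
    return "\n".join(lines)


status = check_optional_dependencies(["json", "a_package_that_does_not_exist_xyz"])
print(describe_environment(["json", "a_package_that_does_not_exist_xyz"]))
```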

๐Ÿ–ฅ๏ธ Command Line Interface

Parallel-LLM includes CLI tools for easy training and inference:

# Train a model
parallel-llm-train --config config.yaml --output-dir ./checkpoints

# Run inference
parallel-llm-infer --model-path ./checkpoints/model.bin --prompt "Hello world"

Compatibility Module

The library includes a cross-platform compatibility module:

from parallel_llm import compat

# Check PyTorch CUDA availability
cuda_ok, cuda_msg = compat.check_pytorch_cuda()
print(f"CUDA: {cuda_msg}")

# Get optimal device
device, device_msg = compat.get_optimal_device()
print(f"Using: {device_msg}")

# Get platform-specific installation instructions
print(compat.get_installation_instructions())
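
For readers curious what device selection of this kind involves, here is a minimal sketch using only public PyTorch calls. It is not the implementation of compat.get_optimal_device; the function name pick_device is hypothetical, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device():
    """Return (device_name, message), preferring CUDA, then MPS, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu", "PyTorch is not installed; defaulting to CPU"
    if torch.cuda.is_available():
        return "cuda", f"CUDA GPU: {torch.cuda.get_device_name(0)}"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps", "Apple Metal (MPS) backend"
    return "cpu", "No GPU backend found; using CPU"


device, message = pick_device()
print(f"Using {device}: {message}")
```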

๐Ÿ—๏ธ Architecture

Hybrid Diffusion-Energy Framework

┌─────────────────────────────────────────┐
│  Input: [MASK] [MASK] [MASK] ... [MASK] │
└───────────────┬─────────────────────────┘
                ↓
    ┌────────────────────────────┐
    │  Diffusion Transformer     │
    │  (Bidirectional Attention) │
    └───────────┬────────────────┘
                ↓
    ┌────────────────────────────┐
    │  Multi-Token Predictions   │
    │  With Confidence Scores    │
    └───────────┬────────────────┘
                ↓
    ┌────────────────────────────┐
    │  Energy-Based Refinement   │
    │  (Sequence-Level Scoring)  │
    └───────────┬────────────────┘
                ↓
    ┌────────────────────────────┐
    │  Adaptive Masking          │
    │  (Keep high-confidence)    │
    └───────────┬────────────────┘
                ↓
    Output: All tokens generated

Key Innovations

  1. Masked Diffusion: Start with all [MASK] tokens, iteratively refine
  2. Bidirectional Attention: Each token sees entire context
  3. Confidence-Based Masking: Adaptively accept high-confidence predictions
  4. Energy Model: Global sequence coherence checking
  5. Parallel Decoding: 64+ tokens per forward pass
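
The masked-diffusion loop described by these points can be sketched as follows. A random stand-in replaces the real denoiser and the energy-model step is omitted, so this only illustrates the control flow of confidence-based unmasking; all names are hypothetical and this is not the library's decoding code.

```python
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_model(tokens, vocab_size, rng):
    """Stand-in denoiser: one (predicted_token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_parallel_decode(length, vocab_size, num_steps, threshold=0.7, seed=0):
    """Start fully masked; each step accepts predictions above the threshold."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break  # everything was accepted early
        predictions = toy_model(tokens, vocab_size, rng)
        last_step = step == num_steps - 1
        for i in masked:
            token, confidence = predictions[i]
            # Keep confident predictions; on the final step accept the rest too
            if confidence >= threshold or last_step:
                tokens[i] = token
    return tokens


decoded = masked_parallel_decode(length=16, vocab_size=100, num_steps=5)
print(decoded)
```

Positions accepted in an early step stay fixed, so later steps only spend predictions on the tokens that are still uncertain.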

📊 Performance

Speed Comparison (Llama-7B equivalent)

Method                 Tokens/sec   Speedup
Autoregressive (HF)    25           1.0×
vLLM                   45           1.8×
Parallel-LLM           75           3.0×

Memory Efficiency

Batch Size   Standard   Parallel-LLM
1            16 GB      12 GB
8            128 GB     48 GB
32           OOM        96 GB

๐Ÿ› ๏ธ Advanced Features

Distributed Training

# Launch with torchrun
torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --fsdp-sharding-strategy full \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1
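
When a script is launched this way, torchrun passes each worker its position through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The helper below is a small illustrative sketch (the function name is hypothetical) showing how a training script might read them, with single-process defaults when run without torchrun.

```python
import os


def read_torchrun_env(environ=os.environ):
    """Read the rank variables torchrun sets, defaulting to a single process."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global worker index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index on this node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total worker count
    }


info = read_torchrun_env()
print(f"worker {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```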

Custom Kernels

from parallel_llm.kernels import fused_attention, parallel_decode

# Use optimized Triton kernels
output = fused_attention(query, key, value, use_flash=True)

# Parallel token decoding
tokens = parallel_decode(logits, num_parallel=64)

Quantization

from parallel_llm.quantization import quantize_model

# Quantize to INT8 or FP8
model = quantize_model(model, precision="fp8")
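
As background on what INT8 quantization does to a weight tensor, here is a dependency-free sketch of symmetric per-tensor quantization. It illustrates the general technique only and makes no claim about how quantize_model works internally; the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [code * scale for code in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes, scale, restored)
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is why small weights such as 0.003 round to zero here.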

📚 Documentation

📖 Guides

🔧 API Reference

📋 Quick References

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

📄 License

Apache 2.0 License. See LICENSE for details.

🙏 Acknowledgments

Built on research and technologies from:

Core Technologies

  • PyTorch - Deep learning framework
  • Transformers (Hugging Face) - Model architectures
  • Accelerate (Hugging Face) - Distributed training utilities

Research Papers & Methods

  • FlashAttention (Dao et al.) - Efficient attention computation
  • Diffusion Language Models - Parallel generation techniques
  • DeepSpeed ZeRO (Microsoft) - Memory-efficient training
  • vLLM (UC Berkeley) - High-throughput inference
  • PyTorch FSDP (Meta) - Fully sharded data parallel training

Datasets & Models

  • GPT-2 (OpenAI) - Base model architecture
  • ViT (Google) - Vision transformer
  • CLIP (OpenAI) - Vision-language understanding
  • WikiText & COCO - Training datasets

📞 Contact & Support

Getting Help

  1. Check the examples in the examples/ directory
  2. Read the documentation linked above
  3. Search existing issues on GitHub
  4. Open a new issue if needed

🌟 Community

If you find this project useful, please:

  • ⭐ Star the repository
  • 🐛 Report any issues you encounter
  • 💡 Suggest new features or improvements
  • 🤝 Contribute code or documentation

📊 Project Stats

  • Version: 0.4.6
  • Python: 3.9+
  • Platforms: Windows, Linux, macOS
  • License: Apache 2.0
  • Status: Active Development

Citation

@software{parallel_llm_2025,
  title = {Parallel-LLM: Ultra-Fast Parallel Training and Inference for Language Models},
  author = {Khan, Furqan and Last App Standing Team},
  year = {2025},
  url = {https://github.com/furqan-y-khan/parallel-llm},
  version = {0.4.6}
}
@article{parallel_generation_2025,
  title = {Parallel Token Generation: Diffusion-Based Language Model Inference},
  author = {Khan, Furqan},
  journal = {arXiv preprint},
  year = {2025},
  note = {Parallel-LLM library implementation}
}
