Skip to main content

Ultra-fast parallel training and inference for language models

Project description

Parallel-LLM: Ultra-Fast Parallel Training & Inference

PyPI version License Python 3.9+

Parallel-LLM is a production-ready library for training and inference of language models with revolutionary parallel token generation. Generate all tokens at once instead of one-by-one using our hybrid diffusion-energy architecture.

๐Ÿš€ Key Features

Training

  • Full Parallelism: Data + Tensor + Pipeline + Expert parallelism
  • FSDP2: PyTorch's latest fully sharded data parallel with DTensor
  • DeepSpeed ZeRO: Stages 1, 2, 3 with CPU offloading
  • Flash Attention 3: Up to 75% GPU utilization on H100
  • torch.compile: Automatic kernel fusion and optimization
  • Mixed Precision: FP16, BF16, FP8 support
  • Gradient Checkpointing: Selective activation checkpointing

Inference

  • Parallel Generation: Generate 64+ tokens simultaneously
  • 1.5-3ร— Faster: Compared to autoregressive decoding
  • Paged KV Cache: Memory-efficient attention like vLLM
  • CUDA Graphs: Zero CPU overhead
  • Continuous Batching: Dynamic request handling
  • Speculative Decoding: Draft model verification

Multimodal

  • Vision-Language Models: CLIP-style contrastive learning
  • Cross-Modal Fusion: Attention-based alignment
  • Unified Architecture: Single model for text + vision

๐Ÿ“ฆ Installation

๐Ÿš€ One-Command Installation (Cross-Platform)

pip install parallel-llm

This single command works on Windows, Linux, and macOS! The installer automatically detects your platform and installs the appropriate PyTorch version.

๐Ÿ› ๏ธ Advanced Installation

For more control or if the one-command install fails:

# Download and run the cross-platform installer
curl -fsSL https://raw.githubusercontent.com/furqan-y-khan/parallel-llm/main/install_parallel_llm.py | python3

Or manually:

# Step 1: Install PyTorch (platform-specific)
# Windows:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # CUDA
# OR
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu    # CPU only

# Linux:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # CUDA
# OR
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu    # CPU only

# macOS:
pip install torch torchvision torchaudio  # CPU/MPS support included

# Step 2: Install Parallel-LLM
pip install parallel-llm

Optional Dependencies

Install with specific features (all cross-platform where possible):

# GPU acceleration (may not be available on all platforms)
pip install parallel-llm[gpu]

# Distributed training (may not be available on all platforms)
pip install parallel-llm[distributed]

# Multimodal models (cross-platform)
pip install parallel-llm[multimodal]

# Inference optimization (may not be available on all platforms)
pip install parallel-llm[inference]

# Logging and monitoring (cross-platform)
pip install parallel-llm[logging]

# Dataset utilities (cross-platform)
pip install parallel-llm[datasets]

# Development tools (cross-platform)
pip install parallel-llm[dev]

# Install everything
pip install parallel-llm[all]

From Source

git clone https://github.com/furqan-y-khan/parallel-llm
cd parallel-llm
pip install -e .

Requirements

  • Python >= 3.9
  • PyTorch >= 2.2.0 (automatically installed with platform-specific version)
  • No CUDA required - works on CPU-only systems
  • Optional: CUDA >= 11.8 for GPU acceleration
  • Optional: 16GB+ GPU memory recommended for full functionality

๐Ÿ”ฅ Quick Start

Training a Unimodal LLM

import torch
from parallel_llm import DiffusionTransformer, ModelConfig, TrainingConfig, DistributedTrainer

# Configure model
model_config = ModelConfig(
    vocab_size=50257,
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=16,
    use_flash_attention=True,
)

# Create model
model = DiffusionTransformer(model_config)

# Configure training
train_config = TrainingConfig(
    batch_size=8,
    learning_rate=3e-4,
    use_fsdp=True,
    fsdp_sharding_strategy="full",
    mixed_precision="bf16",
    use_torch_compile=True,
    torch_compile_mode="max-autotune",
)

# Create trainer
trainer = DistributedTrainer(
    model=model,
    train_config=train_config,
    model_config=model_config,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
)

# Train!
trainer.train()

Parallel Generation (Inference)

from parallel_llm import ParallelGenerator, GenerationConfig

# Configure generation
gen_config = GenerationConfig(
    max_new_tokens=512,
    temperature=1.0,
    num_refinement_steps=5,
    confidence_threshold=0.9,
)

# Create generator
generator = ParallelGenerator(model, gen_config, use_cuda_graphs=True)

# Generate (all 512 tokens in ~5 forward passes!)
prompt = torch.tensor([[1, 2, 3, 4, 5]])  # Your prompt tokens
generated = generator.generate(prompt)

print(f"Generated {generated.shape[1]} tokens")

Multimodal Training

from parallel_llm import MultimodalModel, MultimodalConfig

# Configure multimodal model
config = MultimodalConfig(
    # Text config
    vocab_size=50257,
    hidden_size=2048,
    num_hidden_layers=24,

    # Vision config
    vision_encoder="clip",
    image_size=224,
    patch_size=16,
    vision_hidden_size=1024,

    # Fusion
    fusion_type="cross_attention",
    use_contrastive=True,
)

# Create model
model = MultimodalModel(config)

# Train with image-text pairs
# ... (similar to unimodal training)

๐Ÿ—๏ธ Architecture

Hybrid Diffusion-Energy Framework

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Input: [MASK] [MASK] [MASK] ... [MASK] โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚  Diffusion Transformer     โ”‚
    โ”‚  (Bidirectional Attention) โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚  Multi-Token Predictions   โ”‚
    โ”‚  With Confidence Scores    โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚  Energy-Based Refinement   โ”‚
    โ”‚  (Sequence-Level Scoring)  โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚  Adaptive Masking          โ”‚
    โ”‚  (Keep high-confidence)    โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                โ†“
    Output: All tokens generated

Key Innovations

  1. Masked Diffusion: Start with all [MASK] tokens, iteratively refine
  2. Bidirectional Attention: Each token sees entire context
  3. Confidence-Based Masking: Adaptively accept high-confidence predictions
  4. Energy Model: Global sequence coherence checking
  5. Parallel Decoding: 64+ tokens per forward pass

๐Ÿ“Š Performance

Speed Comparison (Llama-7B equivalent)

Method Tokens/sec Speedup
Autoregressive (HF) 25 1.0ร—
vLLM 45 1.8ร—
Parallel-LLM 75 3.0ร—

Memory Efficiency

Batch Size Standard Parallel-LLM
1 16 GB 12 GB
8 128 GB 48 GB
32 OOM 96 GB

๐Ÿ› ๏ธ Advanced Features

Distributed Training

# Launch with torchrun
torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --fsdp-sharding-strategy full \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1

Custom Kernels

from parallel_llm.kernels import fused_attention, parallel_decode

# Use optimized Triton kernels
output = fused_attention(query, key, value, use_flash=True)

# Parallel token decoding
tokens = parallel_decode(logits, num_parallel=64)

Quantization

from parallel_llm.quantization import quantize_model

# Quantize to INT8 or FP8
model = quantize_model(model, precision="fp8")

๐Ÿ“š Documentation

๐Ÿค Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

๐Ÿ“„ License

Apache 2.0 License. See LICENSE for details.

๐Ÿ™ Acknowledgments

Built on research from:

  • FlashAttention (Dao et al.)
  • Diffusion Language Models (various)
  • DeepSpeed ZeRO (Microsoft)
  • vLLM (UC Berkeley)
  • PyTorch FSDP (Meta)

๐Ÿ“ž Contact

๐ŸŒŸ Star History

If you find this project useful, please give it a star! โญ

Citation

@software{parallel_llm,
  title = {Parallel-LLM: Ultra-Fast Parallel Training and Inference},
  author = {Last App Standing Team},
  year = {2025},
  url = {https://github.com/furqan-y-khan/parallel-llm}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parallel_llm-0.4.5.tar.gz (53.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parallel_llm-0.4.5-py3-none-any.whl (33.3 kB view details)

Uploaded Python 3

File details

Details for the file parallel_llm-0.4.5.tar.gz.

File metadata

  • Download URL: parallel_llm-0.4.5.tar.gz
  • Upload date:
  • Size: 53.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.0b4

File hashes

Hashes for parallel_llm-0.4.5.tar.gz
Algorithm Hash digest
SHA256 6bba536743c8433f0481b653ed7c2a777b34bd823a0832059f7b2ed84a52c7fb
MD5 84c4a85d4febb39c41365a465df6b4bc
BLAKE2b-256 9512cd6f460d04616aa6c6720e0716aaa9e5911b8581027beedea4b2ac763a15

See more details on using hashes here.

File details

Details for the file parallel_llm-0.4.5-py3-none-any.whl.

File metadata

  • Download URL: parallel_llm-0.4.5-py3-none-any.whl
  • Upload date:
  • Size: 33.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.0b4

File hashes

Hashes for parallel_llm-0.4.5-py3-none-any.whl
Algorithm Hash digest
SHA256 91074631b57de3e0596d8ff5f64641a547b8f046c1e739c9f36ae70da80f2dde
MD5 c36e5005eff52ca8f9c1605febd1d7fb
BLAKE2b-256 b523f25b171a36401301840f581bc4b7f10ee56e1701634ba56279e463005427

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page