Skip to main content

Lightning Thunder is a source-to-source compiler for PyTorch, enabling PyTorch programs to run on different hardware accelerators and graph compilers.

Project description

Give your PyTorch models superpowers ⚡

Thunder Thunder

 

Source-to-source compiler for PyTorch. Fast. Understandable. Extensible.


Thunder makes optimizing PyTorch models easy, augmenting them with custom kernels, fusions, quantization, distributed strategies, and more.

For end users, Thunder comes with plugins that provide model speed-ups out of the box, for optimal utilization of last generation hardware.

For performance experts, Thunder is the most ergonomic framework for understanding, modifying, and optimizing AI models through composable transformations.

✅ Run PyTorch 40% faster   ✅ Quantization                ✅ Kernel fusion        
✅ Training recipes         ✅ FP4/FP6/FP8 precision       ✅ Distributed TP/PP/DP 
✅ Inference recipes        ✅ Ready for NVIDIA Blackwell  ✅ CUDA Graphs          
✅ LLMs, non LLMs and more  ✅ Custom Triton kernels       ✅ Compose all the above

license CI testing General checks Documentation Status pre-commit.ci status

 

 

Thunder

Quick start

Install Thunder via pip (more options):

pip install torch==2.6.0 torchvision==0.21 nvfuser-cu124-torch26

pip install lightning-thunder
Advanced install options

Blackwell support

For Blackwell you'll need CUDA 12.8

pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
pip install --pre nvfuser-cu128 --extra-index-url https://pypi.nvidia.com

pip install lightning-thunder

Install additional executors

These are optional, feel free to mix and match

# cuDNN SDPA
pip install nvidia-cudnn-frontend

# Float8 support (this will compile from source, be patient)
pip install "transformer_engine[pytorch]"

Install Thunder bleeding edge

pip install git+https://github.com/Lightning-AI/lightning-thunder.git@main

Install Thunder for development

git clone https://github.com/Lightning-AI/lightning-thunder.git
cd lightning-thunder
pip install -e .

Hello world

Define a function or a torch module:

import torch.nn as nn

model = nn.Sequential(nn.Linear(2048, 4096), nn.ReLU(), nn.Linear(4096, 64))

Optimize it with Thunder:

import thunder
import torch

thunder_model = thunder.compile(model)

x = torch.randn(64, 2048)

y = thunder_model(x)

assert torch.testing.assert_close(y, model(x))

Examples

Speed up LLM training

Install LitGPT (without updating other dependencies)

pip install --no-deps 'litgpt[all]'

and run

import thunder
import torch
import litgpt

with torch.device("cuda"):
    model = litgpt.GPT.from_name("Llama-3.2-1B").to(torch.bfloat16)

thunder_model = thunder.compile(model)

inp = torch.ones((1, 2048), device="cuda", dtype=torch.int64)

out = thunder_model(inp)
out.sum().backward()

Speed up HuggingFace BERT inference

Install Hugging Face Transformers (recommended version is 4.50.2 and above)

pip install -U transformers

and run

import thunder
import torch
import transformers

model_name = "bert-large-uncased"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

with torch.device("cuda"):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    )
    model.requires_grad_(False)
    model.eval()

    inp = tokenizer(["Hello world!"], return_tensors="pt")

thunder_model = thunder.compile(model)

out = thunder_model(**inp)
print(out)

Speed up HuggingFace DeepSeek R1 distill inference

Install Hugging Face Transformers (recommended version is 4.50.2 and above)

pip install -U transformers

and run

import torch
import transformers
import thunder

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

with torch.device("cuda"):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    )
    model.requires_grad_(False)
    model.eval()

    inp = tokenizer(["Hello world! Here's a long story"], return_tensors="pt")

thunder_model = thunder.compile(model)

out = thunder_model.generate(
    **inp, do_sample=False, cache_implementation="static", max_new_tokens=100
)
print(out)

To get an idea of the speedups, just run

python examples/quickstart/hf_llm.py

Here what you get on a L4 machine from Lightning Studio:

Eager: 2273.22ms
Thunder: 1254.39ms

81% faster 🏎️! Quite the speedup ⚡️

Speed up Vision Transformer inference

import thunder
import torch
import torchvision as tv

with torch.device("cuda"):
    model = tv.models.vit_b_16()
    model.requires_grad_(False)
    model.eval()

    inp = torch.randn(128, 3, 224, 224)

out = model(inp)

thunder_model = thunder.compile(model)

out = thunder_model(inp)

Benchmarking HF models

The script examples/quickstart/hf_benchmarks.py demonstrates how to benchmark a model for text generation, forward pass, forward pass with loss, and a full forward + backward computation.

On an H100 with torch=2.7.0 and nvfuser-cu126-torch27, running deepseek-ai/DeepSeek-R1-Distill-Llama-1.5B, the thunder executors (NVFuser and torch.compile) achieve the following speedups:

Text generation:
Thunder (nvfuser): 3.36× faster
Thunder (torch.compile): 3.42× faster

Forward pass:
Thunder (nvfuser): 1.51× faster
Thunder (torch.compile): 1.63× faster

Forward pass + loss:
Thunder (nvfuser): 1.55× faster
Thunder (torch.compile): 1.64× faster

Forward + backward:
Thunder (nvfuser): 1.51× faster
Thunder (torch.compile): 1.69× faster

Plugins

Plugins are a way to apply optimizations to a model, such as parallelism and quantization.

Thunder comes with a few plugins included of the box, but it's easy to write new ones.

  • scale up with distributed strategies with DDP, FSDP, TP ()
  • optimize numerical precision with FP8, MXFP8
  • save memory with quantization
  • reduce latency with CUDAGraphs
  • debugging and profiling

For example, in order to reduce CPU overheads via CUDAGraphs you can add "reduce-overhead" to the plugins= argument of thunder.compile:

thunder_model = thunder.compile(model, plugins="reduce-overhead")

This may or may not make a big difference. The point of Thunder is that you can easily swap optimizations in and out and explore the best combination for your setup.

How it works

Thunder works in three stages:

  1. ⚡️ It acquires your model by interpreting Python bytecode and producing a straight-line Python program

  2. ️⚡️ It transforms the computation trace to make it distributed, change precision

  3. ⚡️ It routes parts of the trace for execution

    • fusion (NVFuser, torch.compile)
    • specialized libraries (e.g. cuDNN SDPA, TransformerEngine)
    • custom Triton and CUDA kernels
    • PyTorch eager operations

 

Thunder

 

This is how the trace looks like for a simple MLP:

import thunder
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(), nn.Linear(2048, 256))

thunder_model = thunder.compile(model)
y = thunder_model(torch.randn(4, 1024))

print(thunder.last_traces(thunder_model)[-1])

This is the acquired trace, ready to be transformed and executed:

def computation(input, t_0_bias, t_0_weight, t_2_bias, t_2_weight):
# input: "cuda:0 f32[4, 1024]"
# t_0_bias: "cuda:0 f32[2048]"
# t_0_weight: "cuda:0 f32[2048, 1024]"
# t_2_bias: "cuda:0 f32[256]"
# t_2_weight: "cuda:0 f32[256, 2048]"
t3 = ltorch.linear(input, t_0_weight, t_0_bias) # t3: "cuda:0 f32[4, 2048]"
t6 = ltorch.relu(t3, False) # t6: "cuda:0 f32[4, 2048]"
t10 = ltorch.linear(t6, t_2_weight, t_2_bias) # t10: "cuda:0 f32[4, 256]"
return (t10,)

Note how Thunder's intermediate representation is just (a subset of) Python!

Performance

Thunder is fast. Here are the speed-ups obtained on a pre-training task using LitGPT on H100 and B200 hardware, relative to PyTorch eager.

Thunder

Community

Thunder is an open source project, developed in collaboration with the community with significant contributions from NVIDIA.

💬 Get help on Discord 📋 License: Apache 2.0

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lightning_thunder-0.2.5.dev20250727.tar.gz (618.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lightning_thunder-0.2.5.dev20250727-py3-none-any.whl (955.2 kB view details)

Uploaded Python 3

File details

Details for the file lightning_thunder-0.2.5.dev20250727.tar.gz.

File metadata

File hashes

Hashes for lightning_thunder-0.2.5.dev20250727.tar.gz
Algorithm Hash digest
SHA256 9d82f8330bd506c170d0fdad705434a3c11c5a1bcdd12a46e5e5ce3a1ec1d4d6
MD5 98b7ef55cfb47babc825e55c112826c0
BLAKE2b-256 55744c8630827c48ee9d661bef6860739c45bb8fc2ec4845135d1f476bd17edb

See more details on using hashes here.

File details

Details for the file lightning_thunder-0.2.5.dev20250727-py3-none-any.whl.

File metadata

File hashes

Hashes for lightning_thunder-0.2.5.dev20250727-py3-none-any.whl
Algorithm Hash digest
SHA256 49bb194c2d74c4db51f6955c4e10644399705901b07f812c8ed6b900e5917a2e
MD5 ede0efd31ee045d5e626f4fde3dea951
BLAKE2b-256 626ff45017c72d3fd887e00628f765056526d9b4207d7c28b8b9aa519dc21795

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page