Fused GPU kernel generation from mathematical specifications
MOLTEN
Math melted into fused GPU kernels
Fused CUDA Kernel Generation from Mathematical Specifications
Why "Molten"? Molten metal is fluid, fused, white-hot — separate elements merged into one continuous pour. That's kernel fusion: separate operations melted together into a single GPU kernel, eliminating every memory round-trip between them. The math goes in fluid. The kernel comes out solid.
The Problem
Every new model architecture needs custom CUDA. RMSNorm, RoPE, GQA, MoE routing — each requires hand-written, hand-fused kernels. Teams of 50+ engineers, weeks per kernel. Triton helps but still needs tile loops. TVM needs schedules. Nobody takes raw math and emits fused kernels.
How Molten Works
Write math. Get fused CUDA.
```python
from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

# Build the graph: RMSNorm = rms_reduce + divide + scale
g = DataflowGraph("fused_rmsnorm")
x = g.add_input("x", TensorShape([2048, 5120]))
w = g.add_input("w", TensorShape([5120]))
out = g.rms_norm(x, w, "norm")
g.add_output(out)

# Compile: 3 ops fused into 1 CUDA kernel
compiler = ZeroCompiler()
kernels = compiler.compile(g)
compiler.save(kernels, "output/")
```
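For reference, the three steps named in the comment above (reduce, divide, scale) can be written out in plain Python. This is only a sketch of the math, not Molten's generated code: the function name `rms_norm_reference` and the `eps` parameter are assumptions for illustration, and the value of `eps` Molten uses is not stated here.

```python
import math

def rms_norm_reference(x, w, eps=1e-6):
    """Row-wise RMSNorm: y = x / sqrt(mean(x^2) + eps) * w (eps is an assumed parameter)."""
    out = []
    for row in x:
        # Step 1 (reduce): mean of squares over the hidden dimension.
        mean_sq = sum(v * v for v in row) / len(row)
        # Step 2 (divide): normalize by the root mean square.
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        # Step 3 (scale): apply the per-channel weight.
        out.append([v * inv_rms * wi for v, wi in zip(row, w)])
    return out

# Tiny example: 2 rows, hidden size 2.
x = [[3.0, 4.0], [1.0, 1.0]]
w = [1.0, 2.0]
y = rms_norm_reference(x, w, eps=0.0)
```

Unfused, each of the three steps would read its input from global memory and write its result back. A fused kernel keeps `mean_sq` and `inv_rms` on-chip and writes only `y`.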
Molten:
- Builds a dataflow graph from operation definitions
- Discovers fusion opportunities across arbitrary boundaries
- Generates CUDA with tiling, shared memory, vectorized access
- JIT compiles and caches via `torch.utils.cpp_extension`
```
Python function
      │
FX Tracer → DataflowGraph
      │
Optimizer (constant fold, identity elim)
      │
Fusion Engine (elementwise, matmul+epilog, reduction)
      │
Code Generator → .cu files
```
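To make the optimizer and fusion stages concrete, here is a toy sketch on a linear chain of op names. It is not Molten's implementation: `eliminate_identity`, `fuse_elementwise_chain`, and the `ELEMENTWISE` set are hypothetical names, and the sketch covers only the elementwise rule, not matmul+epilog or reduction fusion.

```python
# Hypothetical set of op names treated as elementwise in this toy model.
ELEMENTWISE = {"add", "mul", "gelu", "silu"}

def eliminate_identity(ops):
    """Optimizer step: drop ops that do nothing."""
    return [op for op in ops if op != "identity"]

def fuse_elementwise_chain(ops):
    """Fusion step: group consecutive elementwise ops into one kernel each."""
    groups = []
    for op in ops:
        # Extend the current group only if both it and the new op are elementwise.
        if op in ELEMENTWISE and groups and groups[-1][0] in ELEMENTWISE:
            groups[-1].append(op)
        else:
            groups.append([op])
    return groups

chain = ["matmul", "identity", "add", "gelu", "mul"]
kernels = fuse_elementwise_chain(eliminate_identity(chain))
# kernels == [["matmul"], ["add", "gelu", "mul"]]
```

Each inner list stands for one generated kernel. A real fusion engine works on a graph rather than a list and must also check shapes and data dependencies.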
Install
```bash
pip install -e ".[dev]"
```
Quick Start
```python
from molten import ZeroCompiler
from molten.ir import DataflowGraph, OpType, TensorShape

# Define the computation
g = DataflowGraph("gelu_add")
x = g.add_input("x", TensorShape([4, 512]))
bias = g.add_input("bias", TensorShape([512]))
gelu = g.add_op(OpType.GELU, [x], "gelu")
out = g.add(gelu, bias, "add")
g.add_output(out)

# Compile: 2 ops -> 1 fused kernel
compiler = ZeroCompiler(verbose=True)
kernels = compiler.compile(g)
compiler.save(kernels, "output/")

# JIT compile and run
from molten.runtime import MoltenRuntime
runtime = MoltenRuntime()
compiled = runtime.compile(kernels[0])
# input_tensor is not defined in this snippet; supply a tensor matching graph input "x" (shape [4, 512])
result = compiled(input_tensor)
```
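When checking the output of a fused kernel like this one, a slow reference for the same math is handy. The sketch below assumes the exact erf-based GELU; whether Molten emits the exact form or the tanh approximation is not stated here, and `gelu_add_reference` is a hypothetical helper name.

```python
import math

def gelu(v):
    """Exact GELU: 0.5 * v * (1 + erf(v / sqrt(2)))."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def gelu_add_reference(x, bias):
    """Compute gelu(x) + bias, broadcasting bias across the rows of x."""
    return [[gelu(v) + b for v, b in zip(row, bias)] for row in x]

# Tiny stand-in for the [4, 512] input and [512] bias used above.
x = [[0.0, 1.0], [-1.0, 2.0]]
bias = [0.5, -0.5]
ref = gelu_add_reference(x, bias)
```

Comparing a kernel's output against a reference like this with a small tolerance is one way to run the kind of correctness check reported in the benchmarks section.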
Programmatic API
```python
from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

g = DataflowGraph("my_kernel")
x = g.add_input("x", TensorShape([4, 512]))
w = g.add_input("w", TensorShape([512]))
normed = g.rms_norm(x, w, "norm")
g.add_output(normed)

compiler = ZeroCompiler(verbose=True)
kernels = compiler.compile(g)
compiler.save(kernels, "output/")
```
Fusion Rules
| Pattern | Result | Savings |
|---|---|---|
| Elementwise → Elementwise | 1 kernel | -1 memory round-trip per op |
| MatMul → Bias → Activation | 1 kernel | -2 round-trips |
| RMSNorm (reduce + normalize) | 1 kernel | -1 intermediate buffer |
| Softmax (max + exp + sum + div) | 1 kernel | -3 round-trips |
| Chain of N elementwise ops | 1 kernel | -(N-1) round-trips |
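The last row can be turned into a rough global-memory traffic estimate. This is a back-of-the-envelope sketch under stated assumptions: every op is unary, reads one full tensor and writes one full tensor, uses fp32 elements, and gets no help from caches. `elementwise_chain_traffic` is a hypothetical helper, not part of Molten.

```python
def elementwise_chain_traffic(num_ops, num_elements, bytes_per_element=4):
    """Return (unfused_bytes, fused_bytes) of global-memory traffic for a unary op chain."""
    tensor_bytes = num_elements * bytes_per_element
    # Unfused: every op reads its input and writes its output.
    unfused = num_ops * 2 * tensor_bytes
    # Fused: one read of the input, one write of the final output.
    fused = 2 * tensor_bytes
    return unfused, fused

# Chain of 3 elementwise ops over a [2048, 5120] fp32 tensor.
unfused, fused = elementwise_chain_traffic(num_ops=3, num_elements=2048 * 5120)
saved = unfused - fused  # equals (N-1) round-trips, each one write plus one read
```

Under these assumptions the fused kernel moves a third of the bytes of the unfused chain, which is where the -(N-1) round-trips figure comes from.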
Benchmarks — RTX 5090
Real numbers. Same session. Correctness validated (19/19 PASS, max error ~1e-6).
Molten-Generated RMSNorm vs PyTorch
| Config | PyTorch Eager | torch.compile | Molten Generated | Speedup vs Eager |
|---|---|---|---|---|
| decode (1,1,5120) | 167.4 us | 127.1 us | 27.6 us | 6.06x |
| prefill (1,2048,5120) | 159.9 us | 95.6 us | 55.0 us | 2.91x |
| long (1,8192,5120) | 792.6 us | 322.6 us | 393.9 us | 2.01x |
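The speedup figures are PyTorch Eager time divided by Molten time. The short script below recomputes that ratio from the timings in the table and also derives the ratio against torch.compile, which the table does not list. Because the listed timings are rounded, the recomputed values may differ from the table in the last digit.

```python
# Timings in microseconds, copied from the table: (eager, torch.compile, molten).
rows = {
    "decode (1,1,5120)": (167.4, 127.1, 27.6),
    "prefill (1,2048,5120)": (159.9, 95.6, 55.0),
    "long (1,8192,5120)": (792.6, 322.6, 393.9),
}

def speedups(eager_us, compiled_us, molten_us):
    """Return (speedup vs eager, speedup vs torch.compile) for one config."""
    return eager_us / molten_us, compiled_us / molten_us

for name, timings in rows.items():
    vs_eager, vs_compile = speedups(*timings)
    print(f"{name}: {vs_eager:.2f}x vs eager, {vs_compile:.2f}x vs torch.compile")
```

For the long config the ratio against torch.compile comes out below 1, since 322.6 us is less than 393.9 us: at that size torch.compile is faster than the Molten-generated kernel.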
Fused RMSNorm+SiLU*Gate (Hand-Written Target)
| Config | PyTorch Eager (3 ops) | torch.compile | Fused CUDA | Speedup vs Eager |
|---|---|---|---|---|
| decode | 207.3 us | 136.6 us | 27.4 us | 7.56x |
| prefill 2048 | 347.0 us | 149.5 us | 96.7 us | 3.59x |
| long 8192 | 1326.5 us | 457.2 us | 403.1 us | 3.29x |
Molten-generated kernels match hand-written CUDA at small sizes. The gap at large sizes (~1.3x) is the next optimization target (vectorized loads, warp-level reduction).
Citation
```bibtex
@article{sharma2026molten,
  title={Molten: Fused GPU Kernel Generation from Mathematical Specifications},
  author={Sharma, Tushar},
  year={2026},
  url={https://github.com/TxsharDev/molten}
}
```
License
Apache-2.0 — Alia Labs