Skip to main content

Fused GPU kernel generation from mathematical specifications

Project description

MOLTEN

Math melted into fused GPU kernels

Fused CUDA Kernel Generation from Mathematical Specifications

GitHub · Paper · Install


Why "Molten"? Molten metal is fluid, fused, white-hot — separate elements merged into one continuous pour. That's kernel fusion: separate operations melted together into a single GPU kernel, eliminating every memory round-trip between them. The math goes in fluid. The kernel comes out solid.


The Problem

Every new model architecture needs custom CUDA. RMSNorm, RoPE, GQA, MoE routing — each requires hand-written, hand-fused kernels. Teams of 50+ engineers, weeks per kernel. Triton helps but still needs tile loops. TVM needs schedules. Nobody takes raw math and emits fused kernels.

How Molten Works

Write math. Get fused CUDA.

from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

# Build the graph: RMSNorm = rms_reduce + divide + scale
g = DataflowGraph("fused_rmsnorm")
x = g.add_input("x", TensorShape([2048, 5120]))
w = g.add_input("w", TensorShape([5120]))
out = g.rms_norm(x, w, "norm")
g.add_output(out)

# Compile: 3 ops fused into 1 CUDA kernel
compiler = ZeroCompiler()
kernels = compiler.compile(g)
compiler.save(kernels, "output/")

Molten:

  1. Builds a dataflow graph from operation definitions
  2. Discovers fusion opportunities across arbitrary boundaries
  3. Generates CUDA with tiling, shared memory, vectorized access
  4. JIT compiles and caches via torch.utils.cpp_extension
Python function
       │
   FX Tracer → DataflowGraph
       │
   Optimizer (constant fold, identity elim)
       │
   Fusion Engine (elementwise, matmul+epilog, reduction)
       │
   Code Generator → .cu files

Install

pip install -e ".[dev]"

Quick Start

from molten import ZeroCompiler
from molten.ir import DataflowGraph, OpType, TensorShape

# Define the computation
g = DataflowGraph("gelu_add")
x = g.add_input("x", TensorShape([4, 512]))
bias = g.add_input("bias", TensorShape([512]))
gelu = g.add_op(OpType.GELU, [x], "gelu")
out = g.add(gelu, bias, "add")
g.add_output(out)

# Compile: 2 ops -> 1 fused kernel
compiler = ZeroCompiler(verbose=True)
kernels = compiler.compile(g)
compiler.save(kernels, "output/")

# JIT compile and run
from molten.runtime import MoltenRuntime
runtime = MoltenRuntime()
compiled = runtime.compile(kernels[0])
result = compiled(input_tensor)

Programmatic API

from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

g = DataflowGraph("my_kernel")
x = g.add_input("x", TensorShape([4, 512]))
w = g.add_input("w", TensorShape([512]))
normed = g.rms_norm(x, w, "norm")
g.add_output(normed)

compiler = ZeroCompiler(verbose=True)
kernels = compiler.compile(g)
compiler.save(kernels, "output/")

Fusion Rules

Pattern Result Savings
Elementwise → Elementwise 1 kernel -1 memory round-trip per op
MatMul → Bias → Activation 1 kernel -2 round-trips
RMSNorm (reduce + normalize) 1 kernel -1 intermediate buffer
Softmax (max + exp + sum + div) 1 kernel -3 round-trips
Chain of N elementwise ops 1 kernel -(N-1) round-trips

Benchmarks — RTX 5090

Real numbers. Same session. Correctness validated (19/19 PASS, max error ~1e-6).

Molten-Generated RMSNorm vs PyTorch

Config PyTorch Eager torch.compile Molten Generated Speedup
decode (1,1,5120) 167.4 us 127.1 us 27.6 us 6.06x
prefill (1,2048,5120) 159.9 us 95.6 us 55.0 us 2.91x
long (1,8192,5120) 792.6 us 322.6 us 393.9 us 2.01x

Fused RMSNorm+SiLU*Gate (Hand-Written Target)

Config PyTorch Eager (3 ops) torch.compile Fused CUDA Speedup
decode 207.3 us 136.6 us 27.4 us 7.56x
prefill 2048 347.0 us 149.5 us 96.7 us 3.59x
long 8192 1326.5 us 457.2 us 403.1 us 3.29x

Molten-generated kernels match hand-written CUDA at small sizes. The gap at large sizes (~1.3x) is the next optimization target (vectorized loads, warp-level reduction).

Citation

@article{sharma2026molten,
  title={Molten: Fused GPU Kernel Generation from Mathematical Specifications},
  author={Sharma, Tushar},
  year={2026},
  url={https://github.com/TxsharDev/molten}
}

License

Apache-2.0 — Alia Labs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

alia_molten-0.1.0.tar.gz (19.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

alia_molten-0.1.0-py3-none-any.whl (21.0 kB view details)

Uploaded Python 3

File details

Details for the file alia_molten-0.1.0.tar.gz.

File metadata

  • Download URL: alia_molten-0.1.0.tar.gz
  • Upload date:
  • Size: 19.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for alia_molten-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4a37a5ea0ccca6dfd0dd3b980dd2c642e20755aea032f80a319599ffae782ed9
MD5 fa6e69b4de61e6084dc43175d79d31de
BLAKE2b-256 c15e8b56d4693061919710f21b659018721a69a612405e7998d994ef0ac1640b

See more details on using hashes here.

File details

Details for the file alia_molten-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: alia_molten-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for alia_molten-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4ff556d74967cd0d6271b8e5f5b664916d58f2d4a7aae6eaf31d50714d3d8a88
MD5 46e78c8230e4b0a116930ba894d0bdd8
BLAKE2b-256 892c84114b658148b3797c15dbb673b8629762ec9cef1dc154f5a436838989b1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page