Fused GPU kernel generation from mathematical specifications

Project description

MOLTEN

Math melted into fused GPU kernels

Fused CUDA Kernel Generation from Mathematical Specifications

Why "Molten"? Molten metal is fluid, fused, white-hot — separate elements merged into one continuous pour. That's kernel fusion: separate operations melted together into a single GPU kernel, eliminating every memory round-trip between them. The math goes in fluid. The kernel comes out solid.

The Problem

Every new model architecture needs custom CUDA. RMSNorm, RoPE, GQA, MoE routing — each requires hand-written, hand-fused kernels. Teams of 50+ engineers, weeks per kernel. Triton helps but still needs tile loops. TVM needs schedules. Nobody takes raw math and emits fused kernels.

How Molten Works

Write math. Get fused CUDA.

from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

# Build the graph: RMSNorm = rms_reduce + divide + scale
g = DataflowGraph("fused_rmsnorm")
x = g.add_input("x", TensorShape([2048, 5120]))
w = g.add_input("w", TensorShape([5120]))
out = g.rms_norm(x, w, "norm")
g.add_output(out)

# Compile: 3 ops fused into 1 CUDA kernel
compiler = ZeroCompiler()
kernels = compiler.compile(g)
compiler.save(kernels, "output/")
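To make the "rms_reduce + divide + scale" decomposition concrete, here is a small pure-Python reference for one row. It is only a sketch of the math, not Molten's generated code. The `eps` value and the helper names `rms_norm_unfused` and `rms_norm_fused` are assumptions made for illustration.

```python
import math

def rms_norm_unfused(x, w, eps=1e-6):
    """Three separate passes, each materializing an intermediate."""
    # Pass 1 (rms_reduce): root mean square of the row
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # Pass 2 (divide): normalize every element into a temporary list
    normed = [v / rms for v in x]
    # Pass 3 (scale): apply the per-channel weight
    return [n * wi for n, wi in zip(normed, w)]

def rms_norm_fused(x, w, eps=1e-6):
    """Same math, but divide and scale share one pass with no temporary list."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)]

row = [1.0, -2.0, 3.0, -4.0]
weight = [0.5, 1.0, 1.5, 2.0]
unfused = rms_norm_unfused(row, weight)
fused = rms_norm_fused(row, weight)
```

Both functions return the same values up to floating-point rounding. The fused form is what a single kernel would compute without writing the normalized intermediate back to memory.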

Molten:

Builds a dataflow graph from operation definitions
Discovers fusion opportunities across arbitrary boundaries
Generates CUDA with tiling, shared memory, vectorized access
JIT compiles and caches via torch.utils.cpp_extension

Python function
       │
   FX Tracer → DataflowGraph
       │
   Optimizer (constant fold, identity elim)
       │
   Fusion Engine (elementwise, matmul+epilog, reduction)
       │
   Code Generator → .cu files
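The fusion stage in the pipeline above can be pictured with a toy pass over a linear list of ops. This is a simplified sketch, not Molten's fusion engine: it only merges runs of consecutive elementwise ops, and the `Op` class, the `ELEMENTWISE` set and `fuse_elementwise_chains` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Op kinds treated as elementwise in this toy example (an assumption)
ELEMENTWISE = {"add", "mul", "div", "gelu", "silu"}

@dataclass
class Op:
    name: str
    kind: str

def fuse_elementwise_chains(ops):
    """Group each run of consecutive elementwise ops into one kernel."""
    groups, current = [], []
    for op in ops:
        if op.kind in ELEMENTWISE:
            current.append(op.name)       # extend the current fused group
        else:
            if current:                   # a non-elementwise op ends the run
                groups.append(current)
                current = []
            groups.append([op.name])      # non-elementwise op stays on its own
    if current:
        groups.append(current)
    return groups

chain = [
    Op("mm", "matmul"),
    Op("bias", "add"),
    Op("act", "gelu"),
    Op("red", "reduce"),
    Op("scale", "mul"),
]
fused_groups = fuse_elementwise_chains(chain)
```

Here `bias` and `act` land in one group, so five ops become four kernels. A real engine works on a graph rather than a list and also handles the matmul+epilog and reduction patterns named in the diagram, which this sketch leaves out.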

Install

pip install alia-molten

For development, from a source checkout:

pip install -e ".[dev]"

Quick Start

from molten import ZeroCompiler
from molten.ir import DataflowGraph, OpType, TensorShape

# Define the computation
g = DataflowGraph("gelu_add")
x = g.add_input("x", TensorShape([4, 512]))
bias = g.add_input("bias", TensorShape([512]))
gelu = g.add_op(OpType.GELU, [x], "gelu")
out = g.add(gelu, bias, "add")
g.add_output(out)

# Compile: 2 ops -> 1 fused kernel
compiler = ZeroCompiler(verbose=True)
kernels = compiler.compile(g)
compiler.save(kernels, "output/")

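# JIT compile and run (needs PyTorch and a CUDA device)
import torch
from molten.runtime import MoltenRuntime

runtime = MoltenRuntime()
compiled = runtime.compile(kernels[0])
# Example input matching the shape of "x"
input_tensor = torch.randn(4, 512, device="cuda")
result = compiled(input_tensor)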
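To check a result from the `gelu_add` kernel, it helps to have a plain reference for the same math. The sketch below assumes the exact erf-based GELU and assumes `bias` is broadcast across rows; neither detail is confirmed here, and `gelu_add_reference` is a hypothetical helper name.

```python
import math

def gelu(v):
    # Exact erf-based GELU; the tanh approximation is another common variant
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def gelu_add_reference(x_rows, bias):
    """Reference for the fused computation: out[i][j] = gelu(x[i][j]) + bias[j]."""
    return [[gelu(v) + b for v, b in zip(row, bias)] for row in x_rows]

x_rows = [[0.0, 1.0], [-1.0, 2.0]]
bias = [0.1, -0.1]
expected = gelu_add_reference(x_rows, bias)
```

Comparing a kernel's output against a reference like this, within a small tolerance, is the usual way to validate a fused kernel.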

Programmatic API

from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

g = DataflowGraph("my_kernel")
x = g.add_input("x", TensorShape([4, 512]))
w = g.add_input("w", TensorShape([512]))
normed = g.rms_norm(x, w, "norm")
g.add_output(normed)

compiler = ZeroCompiler(verbose=True)
kernels = compiler.compile(g)
compiler.save(kernels, "output/")

Fusion Rules

Pattern	Result	Savings
Elementwise → Elementwise	1 kernel	-1 memory round-trip per op
MatMul → Bias → Activation	1 kernel	-2 round-trips
RMSNorm (reduce + normalize)	1 kernel	-1 intermediate buffer
Softmax (max + exp + sum + div)	1 kernel	-3 round-trips
Chain of N elementwise ops	1 kernel	-(N-1) round-trips
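The "Chain of N elementwise ops" row can be worked through with a little arithmetic. The sketch below assumes fp32 elements (4 bytes), assumes every intermediate has the full tensor shape, and ignores caching. The function names are illustrative only.

```python
def round_trips_saved(num_ops):
    """Intermediate write+read pairs removed when a chain of elementwise ops is fused."""
    return max(num_ops - 1, 0)

def intermediate_bytes_saved(num_ops, shape, bytes_per_element=4):
    """Global-memory traffic avoided: each removed round-trip is one write plus one read."""
    elements = 1
    for dim in shape:
        elements *= dim
    return round_trips_saved(num_ops) * 2 * elements * bytes_per_element

# Three chained elementwise ops over a [2048, 5120] tensor
saved = intermediate_bytes_saved(num_ops=3, shape=(2048, 5120))
print(f"{round_trips_saved(3)} round-trips, {saved / 2**20:.0f} MiB of traffic avoided")
```

Under these assumptions, fusing three ops over a [2048, 5120] tensor removes 2 round-trips, which is 160 MiB of global-memory traffic.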

Benchmarks — RTX 5090

Real numbers. Same session. Correctness validated (19/19 PASS, max error ~1e-6).

Molten-Generated RMSNorm vs PyTorch

Config	PyTorch Eager	torch.compile	Molten Generated	Speedup vs Eager
decode (1,1,5120)	167.4 us	127.1 us	27.6 us	6.06x
prefill (1,2048,5120)	159.9 us	95.6 us	55.0 us	2.91x
long (1,8192,5120)	792.6 us	322.6 us	393.9 us	2.01x
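The Speedup column in the RMSNorm table above is the PyTorch Eager time divided by the Molten time. The short script below recomputes it from the rounded timings and also shows the ratio against torch.compile, which the table does not list. Because the inputs are rounded, the last digit can differ slightly from the published values.

```python
# Timings in microseconds, copied from the RMSNorm table above
timings = {
    "decode":  {"eager": 167.4, "compile": 127.1, "molten": 27.6},
    "prefill": {"eager": 159.9, "compile": 95.6,  "molten": 55.0},
    "long":    {"eager": 792.6, "compile": 322.6, "molten": 393.9},
}

def speedup(baseline_us, candidate_us):
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us

for name, t in timings.items():
    vs_eager = speedup(t["eager"], t["molten"])
    vs_compile = speedup(t["compile"], t["molten"])
    print(f"{name:8s} vs eager {vs_eager:.2f}x, vs torch.compile {vs_compile:.2f}x")
```

A ratio below 1.0 means the baseline was faster. On these numbers that happens for the long config against torch.compile (322.6 us versus 393.9 us).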

Fused RMSNorm+SiLU*Gate (Hand-Written Target)

Config	PyTorch Eager (3 ops)	torch.compile	Fused CUDA	Speedup vs Eager
decode	207.3 us	136.6 us	27.4 us	7.56x
prefill 2048	347.0 us	149.5 us	96.7 us	3.59x
long 8192	1326.5 us	457.2 us	403.1 us	3.29x

Molten-generated kernels match hand-written CUDA at small sizes. The gap at large sizes (~1.3x) is the next optimization target (vectorized loads, warp-level reduction).

Citation

@article{sharma2026molten,
  title={Molten: Fused GPU Kernel Generation from Mathematical Specifications},
  author={Sharma, Tushar},
  year={2026},
  url={https://github.com/TxsharDev/molten}
}

License

Apache-2.0 — Alia Labs

Release history

This version

0.1.0

Jun 11, 2026

Download files

Source Distribution

alia_molten-0.1.0.tar.gz (19.5 kB)

Uploaded Jun 11, 2026 Source

Built Distribution

alia_molten-0.1.0-py3-none-any.whl (21.0 kB)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file alia_molten-0.1.0.tar.gz.

File metadata

Download URL: alia_molten-0.1.0.tar.gz
Upload date: Jun 11, 2026
Size: 19.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for alia_molten-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4a37a5ea0ccca6dfd0dd3b980dd2c642e20755aea032f80a319599ffae782ed9`
MD5	`fa6e69b4de61e6084dc43175d79d31de`
BLAKE2b-256	`c15e8b56d4693061919710f21b659018721a69a612405e7998d994ef0ac1640b`

File details

Details for the file alia_molten-0.1.0-py3-none-any.whl.

File metadata

Download URL: alia_molten-0.1.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 21.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for alia_molten-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ff556d74967cd0d6271b8e5f5b664916d58f2d4a7aae6eaf31d50714d3d8a88`
MD5	`46e78c8230e4b0a116930ba894d0bdd8`
BLAKE2b-256	`892c84114b658148b3797c15dbb673b8629762ec9cef1dc154f5a436838989b1`

alia-molten 0.1.0
