Plain-English -> verified CUDA kernels via Claude-driven agent loop
cuda-engine
Plain English + a slow PyTorch reference → a verified, benchmarked, annotated CUDA kernel.
cuda-engine is a Python library and CLI that turns a natural-language description and a reference PyTorch function into a CUDA kernel that compiles, matches the reference within tolerance on a real GPU, and benchmarks against torch.compile. It uses Claude (Anthropic) for a 5-stage agent loop (interview → codegen → correctness → performance → polish) with Nsight-driven perf repair and Sonnet→Opus escalation when budgets bust.
Status: pre-1.0. Implementation is feature-complete through M3; v1.0 release gate (M4) is pending. Internal regression suite, eval runner, nightly CI, and torch.compile baseline measurement are all in place. See docs/milestones/M3-evidence.md for the most recent eval results.
What it does
import torch
from cuda_engine import synthesize
def rms_norm(x):
    return x * (x.float().pow(2).mean(dim=-1, keepdim=True) + 1e-5).rsqrt().to(x.dtype)
result = synthesize(
    prompt="Generate a fp16 RMSNorm kernel without gamma over the last dimension.",
    reference=rms_norm,
    target="sm_80",
)
assert result.passed
assert result.correctness.passed # verified vs the reference
assert result.performance.below_target is False # ≥1.0× torch.compile
print(f"Speedup: {result.performance.speedup_vs_torch_compile:.2f}x")
print(f"Kernel: {result.artifacts_dir}/stage5_polish/final/kernel.cu")
Each synthesize() call produces a run directory under ~/.cache/cuda_engine/runs/<run_id>/ containing every prompt sent, every LLM response, every kernel attempt, the final kernel source, the compiled shared object, and the full synthesis trace.
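A small helper can make the run directory easier to navigate. The sketch below assumes only what the text above states: runs live under ~/.cache/cuda_engine/runs/<run_id>/ and the final kernel sits at stage5_polish/final/kernel.cu. The names latest_run_dir and final_kernel_path are hypothetical and not part of the cuda-engine API.

```python
from pathlib import Path
from typing import Optional

# Default cache location described above; pass another root for tests or custom setups.
DEFAULT_RUNS_ROOT = Path.home() / ".cache" / "cuda_engine" / "runs"


def latest_run_dir(runs_root: Path = DEFAULT_RUNS_ROOT) -> Optional[Path]:
    """Return the most recently modified run directory, or None if there is none."""
    if not runs_root.is_dir():
        return None
    run_dirs = [p for p in runs_root.iterdir() if p.is_dir()]
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


def final_kernel_path(run_dir: Path) -> Path:
    """Path of the polished kernel inside a run directory (layout assumed, see above)."""
    return run_dir / "stage5_polish" / "final" / "kernel.cu"
```

Calling final_kernel_path(latest_run_dir()) after a successful run would then point at the same file that the example prints, provided the layout assumption holds.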
Quickstart
Install
Requires Python 3.11+, CUDA 12.x toolchain (nvcc), PyTorch 2.4+, and an A100-class GPU for end-to-end runs.
pip install cuda-engine # post-v1.0 release
# or, from source:
git clone https://github.com/shivnarainms22/Cuda-Engine.git
cd Cuda-Engine
pip install -e ".[dev]"
Set your Anthropic key:
export ANTHROPIC_API_KEY=sk-ant-...
CLI
# Synthesize a single kernel
cuda-engine synthesize \
  --prompt "Generate a fp16 RMSNorm kernel without gamma over the last dimension." \
  --reference path/to/rms_norm.py \
  --target sm_80
# Inspect a previous run
cuda-engine inspect <run_id>
# Run the internal eval suite (30 kernels)
cuda-engine eval --suite internal --out evals/results/2026-05-12 --resume
path/to/rms_norm.py should define either a top-level REFERENCE variable or a top-level reference() function.
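To make the convention concrete, here is a sketch of how a loader could resolve either form. It is an illustration only: load_reference is a hypothetical helper, not the function cuda-engine uses, and the precedence (REFERENCE before reference()) is an assumption.

```python
import importlib.util
from pathlib import Path
from typing import Callable


def load_reference(path: str) -> Callable:
    """Import a Python file and return its REFERENCE variable or reference() function."""
    file = Path(path)
    spec = importlib.util.spec_from_file_location(file.stem, file)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {file}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code

    # Assumed precedence: an explicit REFERENCE variable wins over reference().
    for name in ("REFERENCE", "reference"):
        candidate = getattr(module, name, None)
        if callable(candidate):
            return candidate
    raise AttributeError(f"{file} defines neither REFERENCE nor reference()")
```

Under this reading, a file that defines rms_norm and ends with REFERENCE = rms_norm would satisfy the CLI, as would a file that simply names its function reference.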
Library
from cuda_engine import SynthesisConfig, synthesize
from cuda_engine.config import RetryBudgets
result = synthesize(
    prompt="...",
    reference=my_pytorch_fn,
    target="sm_80",
    config=SynthesisConfig(
        retry_budgets=RetryBudgets(codegen=3, performance=2),
        escalate_to_opus_on_bust=True,
        perf_target_speedup_vs_torch_compile=1.0,
    ),
)
See docs/cost.md for tuning retry budgets to bound API spend.
How it works
prompt + reference.py
           │
           ▼
┌─────────────────────┐
│ Stage 1: Interview  │ → KernelSpec (frozen contract)
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│ Stage 2: Codegen    │ → kernel.cu + compile.log (hard retry budget)
└─────────────────────┘
           │
           ▼
┌─────────────────────┐ fail → repair via Stage 2
│ Stage 3: Correct.   │ ──────────┐
│ HARD GATE           │           │
└─────────────────────┘           │
           │ pass                 ▼
           ▼           (loop until pass or budget exhausted)
┌─────────────────────┐
│ Stage 4: Perf       │ → benchmark vs torch.compile
│ SOFT GATE           │   Nsight-driven repair loop
│                     │   Sonnet → Opus escalation
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│ Stage 5: Polish     │ → annotated kernel.cu (re-verified)
└─────────────────────┘
           │
           ▼
SynthesisResult + run_dir
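The control flow in the diagram can be sketched in a few lines. This is a toy model under stated assumptions, not the engine's orchestrator: the stage callables (codegen, is_correct, speedup_of), the Outcome type, the budget defaults, and the choice to re-check correctness on every performance attempt are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    passed: bool                  # hard gate result (Stage 3)
    below_target: bool = False    # soft gate result (Stage 4)
    kernel: Optional[str] = None  # source of the kernel that ships, if any


def run_pipeline(
    codegen: Callable[[int], str],       # attempt index -> kernel source
    is_correct: Callable[[str], bool],   # Stage 3 check against the reference
    speedup_of: Callable[[str], float],  # Stage 4 benchmark vs the baseline
    codegen_budget: int = 3,
    perf_budget: int = 2,
    perf_target: float = 1.0,
) -> Outcome:
    # Stages 2-3: regenerate until a kernel is correct or the budget is spent.
    kernel = None
    for attempt in range(codegen_budget):
        candidate = codegen(attempt)
        if is_correct(candidate):
            kernel = candidate
            break
    if kernel is None:
        return Outcome(passed=False)  # hard gate: nothing ships

    # Stage 4: try faster variants, keeping only ones that are still correct.
    best, best_speedup = kernel, speedup_of(kernel)
    for attempt in range(perf_budget):
        if best_speedup >= perf_target:
            break
        candidate = codegen(codegen_budget + attempt)
        if is_correct(candidate):
            candidate_speedup = speedup_of(candidate)
            if candidate_speedup > best_speedup:
                best, best_speedup = candidate, candidate_speedup

    # Soft gate: a correct but slow kernel still ships, flagged below_target.
    return Outcome(passed=True, below_target=best_speedup < perf_target, kernel=best)
```

The asymmetry is the point of the sketch: failing correctness returns early with nothing, while missing the performance target only sets a flag on an otherwise valid result.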
- Hard gate (Stage 3): kernels that don't match the reference within tolerance fail outright. No exceptions.
- Soft gate (Stage 4): kernels below the perf target still ship, but with below_target=True and a warning. Stage 4 burns its retry budget on Nsight-driven optimizations, then optionally escalates to Opus.
- Subprocess isolation: all GPU work happens in a child process. Crashes in user kernels (segfaults, illegal memory access, OOM) don't take down the orchestrator.
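The isolation idea can be illustrated with the standard library alone. The sketch below runs a deliberately crashing snippet in a child interpreter and shows that the parent survives and can inspect the exit status. run_isolated is a hypothetical name, and how cuda-engine actually launches and communicates with its child process is not described here.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(code: str, timeout_s: float = 60.0) -> Tuple[bool, int]:
    """Run `code` in a fresh Python process; return (ok, returncode)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # keep the child's stdout/stderr out of the parent
        timeout=timeout_s,    # raises subprocess.TimeoutExpired on a hang
    )
    return proc.returncode == 0, proc.returncode


# A hard abort stands in for a segfault or illegal memory access in a kernel.
ok, returncode = run_isolated("import os; os.abort()")
print(f"child ok={ok}, returncode={returncode}; parent still running")
```

A nonzero return code is all the parent needs in order to record the attempt as failed and move on to the next retry.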
Design document: docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md.
Eval results
The internal regression suite has 30 hand-curated kernels covering elementwise ops, reductions, and simple fused kernels. All speedups are measured on an A100 (sm_80) against the fastest torch.compile mode (best of default / max-autotune / reduce-overhead) at N≈16M, so a win means beating torch.compile at its best.
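As a worked example of the baseline rule, the sketch below picks the fastest of the three torch.compile modes and computes the speedup against it. The timings are made-up illustrative numbers, not measurements, and speedup_vs_best_baseline is a hypothetical helper.

```python
from typing import Dict, Tuple


def speedup_vs_best_baseline(kernel_ms: float, baseline_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return (fastest baseline mode, baseline_time / kernel_time)."""
    best_mode = min(baseline_ms, key=baseline_ms.get)
    return best_mode, baseline_ms[best_mode] / kernel_ms


# Illustrative timings in milliseconds (hypothetical, not from the eval suite).
timings = {"default": 0.50, "max-autotune": 0.40, "reduce-overhead": 0.80}
mode, speedup = speedup_vs_best_baseline(kernel_ms=0.40, baseline_ms=timings)
print(mode, round(speedup, 2))

# Comparing against the slowest mode instead would overstate the same kernel.
inflated = max(timings.values()) / 0.40
print(round(inflated, 2))
```

With these numbers the kernel is at parity against the best mode but would look twice as fast against the slowest one, which is why the choice of baseline mode matters.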
Internal suite — 30/30, A100, 2026-06-01 (M3-evidence.md):
- Functional pass rate: 30/30 (100%).
- Median speedup vs torch.compile: 1.04×; p25: 1.00×.
- fast_1: 24/30 (80%) kernels strictly faster than torch.compile.
- Biggest wins: topk_fp32 12.5× (inductor falls back to a slow sort), masked_mean 2.6×, cumulative_max 1.45×, softmax_lastdim 1.33×. As expected, bandwidth-bound elementwise ops sit at parity (torch.compile is already at the HBM roofline); the wins come from reductions/scans.
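The summary statistics in this list are straightforward to recompute from per-kernel speedups. The sketch below uses a small made-up list (not the suite's data) and assumes fast_1 means the fraction of kernels with speedup strictly above 1.0, as the bullet describes. The percentile method is also an assumption, since the interpolation used by the eval runner is not stated.

```python
import statistics
from typing import Dict, List


def summarize(speedups: List[float]) -> Dict[str, float]:
    """Median, 25th percentile, and fraction of kernels strictly faster than baseline."""
    p25 = statistics.quantiles(speedups, n=4, method="inclusive")[0]
    return {
        "median": statistics.median(speedups),
        "p25": p25,
        "fast_1": sum(s > 1.0 for s in speedups) / len(speedups),
    }


# Hypothetical speedups for six kernels.
print(summarize([0.9, 1.0, 1.0, 1.1, 1.3, 2.6]))
```

A median close to 1.0 alongside a few large outliers, as in this toy list, is the same shape the suite reports: most kernels near parity, with a handful of big wins.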
KernelBench external subset (12 unseen, in-scope level1 ops): 9/9 functional on the kernels run so far (remaining 3 pending a credit top-up).
An earlier baseline bug measured against torch.compile's slowest mode (reduce-overhead) at too-small N, which inflated speedups (one kernel read 9.7× when the honest number is ~parity). Fixed in commit 21f3b2b; all numbers above use the corrected best-mode baseline.
Scope
In scope for v1
- Kernel categories: elementwise + simple fused (RMSNorm, layernorm, GELU/SiLU/sigmoid variants, GLU/SwiGLU/GEGLU fusions, dropout-fused) and reductions/scans (sum, mean, argmax, top-k, prefix-sum, masked-mean).
- Targets: codegen for sm_80/sm_90/sm_100; runtime verification on sm_80 only.
- LLM: Anthropic Claude Sonnet 4.6 default, Opus 4.7 escalation. Prompt caching enabled.
- Eval suites: 30-kernel internal regression + filtered KernelBench subset.
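For readers unfamiliar with the sm_XX naming, each target corresponds to an nvcc architecture option. The sketch below shows one conventional mapping. gencode_flag is a hypothetical helper, and whether cuda-engine passes exactly this option to nvcc is an assumption.

```python
import re

SUPPORTED_TARGETS = ("sm_80", "sm_90", "sm_100")  # codegen targets listed above


def gencode_flag(target: str) -> str:
    """Build an nvcc -gencode option that emits native code for one target."""
    if target not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target: {target!r}")
    cc = re.fullmatch(r"sm_(\d+)", target).group(1)  # e.g. "80" from "sm_80"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


print(gencode_flag("sm_80"))
```

Generating code for a target only requires a toolkit that knows the architecture, whereas verifying it requires matching hardware, which is consistent with runtime verification being limited to sm_80.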
Out of scope for v1
- GEMM, matmul, attention kernels (CUTLASS and FlashAttention dominate; deferred to v2/v3).
- Multi-GPU, multi-node, rack-scale orchestration.
- Formal verification (SMT race-freedom proofs).
- Cross-LLM-provider support (Anthropic-only behind a single seam).
- Backward-pass kernel synthesis, autograd custom ops.
- VS Code / IDE integrations.
Cost
Per-kernel envelope under default config:
| Scenario | USD |
|---|---|
| Happy path | ~$0.10–0.20 |
| Typical with retries | ~$0.15–0.40 |
| Hard kernel | ~$0.30–0.80 |
| With Opus escalation | ~$0.80–2.00 |
Full eval suite (30 kernels): ~$5–20 depending on retries. See docs/cost.md for the per-stage breakdown and the four config knobs to bound spend.
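The suite-level figure can be sanity-checked against the per-kernel table. The sketch below multiplies each scenario's range by a hypothetical number of kernels in that scenario. The mix (15 happy path, 10 with retries, 3 hard, 2 escalated) is invented for illustration and is not a measured distribution.

```python
from typing import Dict, Tuple

# Per-kernel cost ranges in USD, copied from the table above.
ENVELOPE = {
    "happy_path": (0.10, 0.20),
    "typical_with_retries": (0.15, 0.40),
    "hard_kernel": (0.30, 0.80),
    "opus_escalation": (0.80, 2.00),
}


def suite_cost_range(mix: Dict[str, int]) -> Tuple[float, float]:
    """Sum the low and high ends of the envelope over a scenario mix."""
    low = sum(ENVELOPE[name][0] * count for name, count in mix.items())
    high = sum(ENVELOPE[name][1] * count for name, count in mix.items())
    return round(low, 2), round(high, 2)


# Hypothetical mix for a 30-kernel run (not measured).
mix = {"happy_path": 15, "typical_with_retries": 10, "hard_kernel": 3, "opus_escalation": 2}
print(suite_cost_range(mix))
```

With this mix the estimate comes to roughly $5.50 to $13.40, inside the ~$5–20 range quoted for the full suite; a run with more escalations would push toward the upper end.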
Privacy
cuda-engine writes full LLM transcripts and reference source code to ~/.cache/cuda_engine/runs/<run_id>/. No telemetry, no third-party logging. All network traffic is to api.anthropic.com over TLS. See docs/privacy.md for how to keep proprietary references out of artifact directories.
Examples
- examples/notebook.ipynb: Colab quickstart (5 cells).
- examples/web_demo.py: Streamlit live demo.
- examples/kernels/: worked examples with prompt, reference, generated kernel, and synthesis report.
Development
pip install -e ".[dev]"
ruff check src tests evals
mypy src
pytest tests/unit -v
pytest tests/integration -v -m integration # requires CUDA + ANTHROPIC_API_KEY
CI:
- PR workflow (.github/workflows/pr.yml) — unit tests, ruff, mypy on every push/PR.
- Nightly workflow (.github/workflows/nightly.yml) — full integration suite + eval on self-hosted A100, daily cron.
- Pre-release workflow (.github/workflows/eval.yml) — manual trigger, gates v1.0 release.
License
MIT. See LICENSE.
Acknowledgements
Built on top of Anthropic's Claude API, PyTorch's torch.utils.cpp_extension, NVIDIA's CUDA toolkit and Nsight Compute. Internal regression kernels draw inspiration from KernelBench.