Skip to main content

Plain-English -> verified CUDA kernels via Claude-driven agent loop

Project description

cuda-engine

Plain English + a slow PyTorch reference → a verified, benchmarked, annotated CUDA kernel.

cuda-engine is a Python library and CLI that turns a natural-language description and a reference PyTorch function into a CUDA kernel that compiles, matches the reference within tolerance on a real GPU, and benchmarks against torch.compile. It uses Claude (Anthropic) for a 5-stage agent loop (interview → codegen → correctness → performance → polish) with Nsight-driven perf repair and Sonnet→Opus escalation when budgets bust.

Status: pre-1.0. Implementation is feature-complete through M3; v1.0 release gate (M4) is pending. Internal regression suite, eval runner, nightly CI, and torch.compile baseline measurement are all in place. See docs/milestones/M3-evidence.md for the most recent eval results.


What it does

import torch
from cuda_engine import synthesize

def rms_norm(x):
    return x * (x.float().pow(2).mean(dim=-1, keepdim=True) + 1e-5).rsqrt().to(x.dtype)

result = synthesize(
    prompt="Generate a fp16 RMSNorm kernel without gamma over the last dimension.",
    reference=rms_norm,
    target="sm_80",
)

assert result.passed
assert result.correctness.passed                            # verified vs the reference
assert result.performance.below_target is False             # ≥1.0× torch.compile
print(f"Speedup: {result.performance.speedup_vs_torch_compile:.2f}x")
print(f"Kernel: {result.artifacts_dir}/stage5_polish/final/kernel.cu")

Each synthesize() call produces a run directory under ~/.cache/cuda_engine/runs/<run_id>/ containing every prompt sent, every LLM response, every kernel attempt, the final kernel source, the compiled shared object, and the full synthesis trace.


Quickstart

Install

Requires Python 3.11+, CUDA 12.x toolchain (nvcc), PyTorch 2.4+, and an A100-class GPU for end-to-end runs.

pip install cuda-engine    # post-v1.0 release
# or, from source:
git clone https://github.com/shivnarainms22/Cuda-Engine.git
cd Cuda-Engine
pip install -e ".[dev]"

Set your Anthropic key:

export ANTHROPIC_API_KEY=sk-ant-...

CLI

# Synthesize a single kernel
cuda-engine synthesize \
    --prompt "Generate a fp16 RMSNorm kernel without gamma over the last dimension." \
    --reference path/to/rms_norm.py \
    --target sm_80

# Inspect a previous run
cuda-engine inspect <run_id>

# Run the internal eval suite (30 kernels)
cuda-engine eval --suite internal --out evals/results/2026-05-12 --resume

path/to/rms_norm.py should define either a top-level REFERENCE variable or a top-level reference() function.

Library

from cuda_engine import SynthesisConfig, synthesize
from cuda_engine.config import RetryBudgets

result = synthesize(
    prompt="...",
    reference=my_pytorch_fn,
    target="sm_80",
    config=SynthesisConfig(
        retry_budgets=RetryBudgets(codegen=3, performance=2),
        escalate_to_opus_on_bust=True,
        perf_target_speedup_vs_torch_compile=1.0,
    ),
)

See docs/cost.md for tuning retry budgets to bound API spend.


How it works

   prompt + reference.py
            │
            ▼
   ┌─────────────────────┐
   │  Stage 1: Interview │ → KernelSpec (frozen contract)
   └─────────────────────┘
            │
            ▼
   ┌─────────────────────┐
   │  Stage 2: Codegen   │ → kernel.cu + compile.log (hard retry budget)
   └─────────────────────┘
            │
            ▼
   ┌─────────────────────┐    fail → repair via Stage 2
   │  Stage 3: Correct.  │ ──────────┐
   │  HARD GATE          │           │
   └─────────────────────┘           │
            │ pass                   ▼
            ▼                  (loop until pass or budget exhausted)
   ┌─────────────────────┐
   │  Stage 4: Perf      │ → benchmark vs torch.compile
   │  SOFT GATE          │    Nsight-driven repair loop
   │                     │    Sonnet → Opus escalation
   └─────────────────────┘
            │
            ▼
   ┌─────────────────────┐
   │  Stage 5: Polish    │ → annotated kernel.cu (re-verified)
   └─────────────────────┘
            │
            ▼
   SynthesisResult + run_dir
  • Hard gate (Stage 3): kernels that don't match the reference within tolerance fail outright. No exceptions.
  • Soft gate (Stage 4): kernels below the perf target still ship, but with below_target=True and a warning. Stage 4 burns its retry budget on Nsight-driven optimizations, then optionally escalates to Opus.
  • Subprocess isolation: all GPU work happens in a subprocess child. Crashes in user kernels (segfaults, illegal memory access, OOM) don't take down the orchestrator.

Design document: docs/superpowers/specs/2026-04-26-cuda-synthesis-engine-design.md.


Eval results

The internal regression suite has 30 hand-curated kernels covering elementwise ops, reductions, and simple fused kernels. All speedups are measured on an A100 (sm_80) against the fastest torch.compile mode (best of default / max-autotune / reduce-overhead) at N≈16M, so a win means beating torch.compile at its best.

Internal suite — 30/30, A100, 2026-06-01 (M3-evidence.md):

  • Functional pass rate: 30/30 (100%).
  • Median speedup vs torch.compile: 1.04×; p25: 1.00×.
  • fast_1: 24/30 (80%) kernels strictly faster than torch.compile.
  • Biggest wins: topk_fp32 12.5× (inductor falls back to a slow sort), masked_mean 2.6×, cumulative_max 1.45×, softmax_lastdim 1.33×. As expected for bandwidth-bound elementwise ops, those sit at parity (torch.compile is already at the HBM roofline); the wins come from reductions/scans.

KernelBench external subset (12 unseen, in-scope level1 ops): 9/9 functional on the kernels run so far (remaining 3 pending a credit top-up).

An earlier baseline bug measured against torch.compile's slowest mode (reduce-overhead) at too-small N, which inflated speedups (one kernel read 9.7× when the honest number is ~parity). Fixed in commit 21f3b2b; all numbers above use the corrected best-mode baseline.


Scope

In scope for v1

  • Kernel categories: elementwise + simple fused (RMSNorm, layernorm, GELU/SiLU/sigmoid variants, GLU/SwiGLU/GEGLU fusions, dropout-fused) and reductions/scans (sum, mean, argmax, top-k, prefix-sum, masked-mean).
  • Targets: codegen for sm_80 / sm_90 / sm_100; runtime verification on sm_80 only.
  • LLM: Anthropic Claude Sonnet 4.6 default, Opus 4.7 escalation. Prompt caching enabled.
  • Eval suites: 30-kernel internal regression + filtered KernelBench subset.

Out of scope for v1

  • GEMM, matmul, attention kernels (CUTLASS and FlashAttention dominate; deferred to v2/v3).
  • Multi-GPU, multi-node, rack-scale orchestration.
  • Formal verification (SMT race-freedom proofs).
  • Cross-LLM-provider support (Anthropic-only behind a single seam).
  • Backward-pass kernel synthesis, autograd custom ops.
  • VS Code / IDE integrations.

Cost

Per-kernel envelope under default config:

Scenario USD
Happy path ~$0.10–0.20
Typical with retries ~$0.15–0.40
Hard kernel ~$0.30–0.80
With Opus escalation ~$0.80–2.00

Full eval suite (30 kernels): ~$5–20 depending on retries. See docs/cost.md for the per-stage breakdown and the four config knobs to bound spend.


Privacy

cuda-engine writes full LLM transcripts and reference source code to ~/.cache/cuda_engine/runs/<run_id>/. No telemetry, no third-party logging. All network traffic is to api.anthropic.com over TLS. See docs/privacy.md for how to keep proprietary references out of artifact directories.


Examples


Development

pip install -e ".[dev]"
ruff check src tests evals
mypy src
pytest tests/unit -v
pytest tests/integration -v -m integration   # requires CUDA + ANTHROPIC_API_KEY

CI:


License

MIT. See LICENSE.


Acknowledgements

Built on top of Anthropic's Claude API, PyTorch's torch.utils.cpp_extension, NVIDIA's CUDA toolkit and Nsight Compute. Internal regression kernels draw inspiration from KernelBench.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cuda_engine-1.0.0.tar.gz (201.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cuda_engine-1.0.0-py3-none-any.whl (51.8 kB view details)

Uploaded Python 3

File details

Details for the file cuda_engine-1.0.0.tar.gz.

File metadata

  • Download URL: cuda_engine-1.0.0.tar.gz
  • Upload date:
  • Size: 201.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cuda_engine-1.0.0.tar.gz
Algorithm Hash digest
SHA256 30a2a750fbeffea511530744b26c3fbdc651dbd7fe1fdc84194c7ef95559eb33
MD5 aa538a5d0dc6edb81a80c1b889e0c5e9
BLAKE2b-256 366e7fb35956d5877d4908c16f152e65bb3a70fb468f11902489c7058f9f8271

See more details on using hashes here.

File details

Details for the file cuda_engine-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: cuda_engine-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 51.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cuda_engine-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3bb43c7ebf8136a5616067786b7a3d89ec999d85c176ff8cb2a5a0e0567f1aa5
MD5 728ba88a19d9d51702c34ecf06928df7
BLAKE2b-256 133b9a325f20d4be6a1e142dcfa06b97dcf489e8d1536e6361357cf21c65be6e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page