High-level discover/compile/execute API for CUTLASS Python kernels.
CUTLASS Operator API
[!NOTE] CUTLASS Operator API is currently in beta. All interfaces, names, and paths are subject to change.
CUTLASS Operator API is a Python interface for integrating kernels written in CUTLASS Python DSLs (like CuTe DSL) into your code.
While DSLs focus on kernel authoring, CUTLASS Operator API focuses on ease of managing and integrating those kernels into libraries that use CUTLASS.
It views kernels as end-to-end "Operators" that execute an operation (like a GEMM), and provides two things:
- a kernel-agnostic interface for finding operators that support an operation/operands, inspecting their properties, and executing them — the same way regardless of which kernel each operator wraps.
- a registry of officially maintained CUTLASS kernels exposed through that interface.
Example
import cutlass.operators as ops
import torch
A, B, out = (torch.randn(128, 128, device="cuda", dtype=torch.float16) for _ in range(3))
# Arguments express an operation, and the operands to it
args = ops.GemmArguments(A, B, out, accumulator_type=torch.float32)
# Find operators that support our provided arguments, and can run on SM100.
# Returns a list of ``Operator``s that wrap ready-to-compile CuTe DSL kernels.
operators = ops.get_operators(args, target_sm="100")
# JIT compile and execute one of the returned operators using our arguments
operators[0].run(args)
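The example above indexes `operators[0]` directly. A slightly more defensive call site might look like the sketch below. It assumes that `get_operators` returns an empty list when nothing matches; the text does not say so, and the function may raise instead. The helper name `run_first_supported` is hypothetical. It receives the `cutlass.operators` module as a parameter and uses only the two calls shown in the example.

```python
def run_first_supported(ops_module, args, target_sm):
    """Run the first operator that supports `args` on `target_sm`.

    `ops_module` is expected to be `cutlass.operators`. Only
    `get_operators(args, target_sm=...)` and `Operator.run(args)` are used,
    as in the example above. Assumption: an unsupported request yields an
    empty list rather than an exception.
    """
    operators = ops_module.get_operators(args, target_sm=target_sm)
    if not operators:
        raise LookupError(f"no operator supports these arguments on SM{target_sm}")
    operator = operators[0]
    operator.run(args)  # JIT compiles and executes, as in the example
    return operator
```

With `ops` and `args` from the example, the call would be `run_first_supported(ops, args, "100")`.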
Why use it?
Any software that uses kernels has to find kernels that do what it needs, wire up glue code to call them, and maintain the integration as kernels evolve. Without an integration layer, these tasks are manual, error-prone, and repeated for every kernel. CUTLASS Operator API eases each of these tasks, so that as kernels rapidly evolve you don't have to choose between integration churn and adoption inertia.
"Which kernel does what I want?"
Without Operator API, finding the right kernel means reading kernel source code and manually
deducing support — which dtypes, layouts, tile sizes, arch features, etc. each kernel supports. Operator API
provides a simple get_operators(args): express the operation you want to run, and get all the
operators that support it. Each operator also exposes uniform metadata describing
its constraints and design features for more advanced inspection, instead of requiring you to deduce
from source code.
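The discovery model described here can be pictured as filtering a registry by a support predicate. The toy below is not the library's implementation: every name in it (`ToyArgs`, `ToyOperator`, `TOY_REGISTRY`, `toy_get_operators`) is hypothetical, the constraints are invented for illustration, and no GPU code runs. It only shows why the caller never has to read kernel source when each entry carries its own constraints as metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArgs:
    dtype: str  # e.g. "f16"
    m: int
    n: int
    k: int


@dataclass(frozen=True)
class ToyOperator:
    name: str
    dtypes: frozenset  # input dtypes this toy kernel accepts
    tile: tuple        # (tile_m, tile_n); the toy requires M and N to be multiples
    sms: frozenset     # compute capabilities this toy kernel targets

    def supports(self, args, target_sm):
        tile_m, tile_n = self.tile
        return (
            args.dtype in self.dtypes
            and args.m % tile_m == 0
            and args.n % tile_n == 0
            and target_sm in self.sms
        )


# Invented entries; real operators expose richer metadata.
TOY_REGISTRY = [
    ToyOperator("dense_f16_128x128", frozenset({"f16", "bf16"}), (128, 128), frozenset({90, 100})),
    ToyOperator("dense_f16_64x64", frozenset({"f16"}), (64, 64), frozenset({80, 90, 100})),
    ToyOperator("dense_int8_128x128", frozenset({"int8"}), (128, 128), frozenset({100})),
]


def toy_get_operators(args, target_sm):
    """Return every registered operator whose metadata admits `args`."""
    return [op for op in TOY_REGISTRY if op.supports(args, target_sm)]


matches = toy_get_operators(ToyArgs("f16", 128, 128, 64), target_sm=100)
print([op.name for op in matches])  # ['dense_f16_128x128', 'dense_f16_64x64']
```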
"How do I get newer, faster, or fixed kernels?"
Adopting kernels directly carries a maintenance burden: you have to mirror bug fixes and optimizations into local copies
and monitor release notes for relevant new kernels. With Operator API, you integrate once,
not perpetually. New kernels, fixes, and optimizations land in the registry on each release, and you get
them automatically without changing your integration code. Just upgrade nvidia-cutlass-operators.
"How do I call this kernel with my torch tensors?"
Different kernels have different usage conventions — direct usage requires kernel-specific
glue code to convert your framework tensors to the kernel's expected inputs, set performance
options, and run preparation steps. Operator API wraps every kernel with a consistent interface: pass PyTorch (or any
DLPack-compatible) tensors directly into GemmArguments and call operator.run(args). That lets you swap operators without touching your call site.
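Because every operator is driven through the same `run(args)` call, comparing candidates reduces to a loop. The sketch below is one possible way to pick the fastest candidate. `pick_fastest` and its `sync` parameter are hypothetical names, not part of the library. It assumes that `run` may return before the GPU work finishes, so a blocking callable is needed for the timing to be meaningful, and it runs each operator once untimed so that JIT compilation is not counted.

```python
import time


def pick_fastest(operators, args, sync=lambda: None, iters=10):
    """Return (operator, seconds_per_run) for the fastest candidate.

    Uses only `operator.run(args)` from the interface described above.
    `sync` should block until queued GPU work has finished.
    """
    best = None
    for op in operators:
        op.run(args)  # warm-up run: absorbs one-time JIT compilation
        sync()
        start = time.perf_counter()
        for _ in range(iters):
            op.run(args)
        sync()
        elapsed = (time.perf_counter() - start) / iters
        if best is None or elapsed < best[1]:
            best = (op, elapsed)
    if best is None:
        raise ValueError("no operators to compare")
    return best
```

With PyTorch tensors, a call might read `best_op, _ = pick_fastest(ops.get_operators(args, target_sm="100"), args, sync=torch.cuda.synchronize)`. The call site stays the same whichever operator wins.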
It also supports:
- Custom epilogue fusions with ease — pass a plain Python function; Operator API lowers it onto CuTe DSL's Epilogue Fusion Configuration (EFC) framework and fuses it into supported kernels.
- Bring-your-own-kernel — register your own CuTe DSL kernels so you can call them through the same interfaces as pre-bundled ones from CUTLASS, with no separate integration path for in-house kernels.
- Negligible runtime overhead on top of invoking the underlying kernel directly.
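To make "epilogue fusion" concrete, the pure-Python reference below shows the semantics only: the epilogue function is applied to each accumulated value before it is stored, instead of in a separate pass over the output. This is not the Operator API's interface for passing an epilogue, which is not shown in this document; `reference_gemm` and `relu` are hypothetical names used for illustration.

```python
def reference_gemm(a, b, epilogue=lambda acc: acc):
    """Compute epilogue(A @ B) elementwise on nested lists.

    `epilogue` is applied to each accumulated value before it is written,
    which is the effect a fused epilogue has inside a kernel.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            out[i][j] = epilogue(acc)  # fused: no second pass over `out`
    return out


def relu(acc):
    return max(acc, 0.0)


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
print(reference_gemm(a, b, epilogue=relu))  # [[1.0, 0.0], [3.0, 4.0]]
```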
How to use it?
Installation
To use with PyTorch, install the nvidia-cutlass-operators[torch] package:
pip install "nvidia-cutlass-operators[torch]"
Alternatively, choose which dependencies to install:
# Install only nvidia-cutlass-operators core
pip install nvidia-cutlass-operators
# Install all dependencies to develop, run tests, etc.
pip install "nvidia-cutlass-operators[dev]"
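After installing, one quick way to confirm which release is present is to query the distribution metadata with the standard library. The helper name `installed_version` is hypothetical; `importlib.metadata` is part of Python itself, and the distribution name is the one used in the `pip` commands above.

```python
from importlib import metadata


def installed_version(dist_name="nvidia-cutlass-operators"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```

This checks only that the package is installed. It does not verify that a supported GPU or PyTorch is available.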
Next steps
- Basic GEMM tutorial — a guided tour of the core interface concepts to run a minimal GEMM.
- More guided notebook tutorials in examples/.
- Full API reference for cutlass.operators.
Supported and Upcoming Features
CUTLASS Operator API will support a wide range of functionality, configurations, and optimizations, all robustly tested.
Current support:
- Kernel coverage
  - Dense GEMMs (F32, F16, BF16, INT8) for Blackwell, Hopper, Ampere
  - Preferred and fallback cluster shapes
  - Static and dynamic scheduling
  - Block-scaled GEMMs (NVFP4, MXFP4, MXFP8, mixed input precision) for Blackwell
  - Grouped GEMM (Contiguous offset) for Blackwell
  - Low-latency TGV GEMM for Blackwell
- Custom epilogue fusions (e.g. activations, elementwise ops, aux load/store)
- CUDA Graph support
- Native support for PyTorch and other DLPack tensors
- Bring-your-own-kernel
Upcoming support:
- Additional GEMM kernel coverage: Sparsity, performance optimizations, grouped GEMM variants, and more
- Ahead-of-time compilation
- JAX Graph support
- nvMatmulHeuristics support
Community & Feedback
We welcome contributions and feedback from the developer community. You can:
- File bug reports, feature requests, documentation gaps, and upvote roadmap items on the CUTLASS GitHub Issues page.
- Contribute examples, tutorials, or improvements via pull request to the CUTLASS repository.
- Ask questions and share ideas in the #cutlass channel on NVIDIA Developer Discord.
File details
Details for the file nvidia_cutlass_operators-0.1.0-py3-none-any.whl.
File metadata
- Download URL: nvidia_cutlass_operators-0.1.0-py3-none-any.whl
- Upload date:
- Size: 400.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8ae7aee2561de13e1cb6ee66dccf0acbdd6c112ce6a4227f4e118da58c21657 |
| MD5 | ac2e817ed47e9ca0743756e27282ff81 |
| BLAKE2b-256 | b306dbb6e459c9d30b36fc4c1d2778e1ea60626f9af907d03eb08d33c633c07b |