High-level discover/compile/execute API for CUTLASS Python kernels.

CUTLASS Operator API

Note: CUTLASS Operator API is currently in beta. All interfaces, names, and paths are subject to change.

CUTLASS Operator API is a Python interface for integrating kernels written in CUTLASS Python DSLs (like CuTe DSL) into your code.

While the DSLs focus on kernel authoring, CUTLASS Operator API focuses on making those kernels easy to manage and integrate into libraries that use CUTLASS.

It views kernels as end-to-end "Operators" that execute an operation (like a GEMM), and provides two things:

- a kernel-agnostic interface for finding operators that support an operation/operands, inspecting their properties, and executing them, the same way regardless of which kernel each operator wraps.
- a registry of officially maintained CUTLASS kernels exposed through that interface.

Example

import cutlass.operators as ops
import torch

A, B, out = (torch.randn(128, 128, device="cuda", dtype=torch.float16) for _ in range(3))

# Arguments express an operation, and the operands to it
args = ops.GemmArguments(A, B, out, accumulator_type=torch.float32)

# Find operators that support our provided arguments, and can run on SM100.
# Returns a list of ``Operator``s that wrap ready-to-compile CuTe DSL kernels.
operators = ops.get_operators(args, target_sm="100")

# JIT compile and execute one of the returned operators using our arguments
operators[0].run(args)

Why use it?

Any software that uses kernels has to find kernels that do what it needs, write glue code to call them, and maintain that integration as kernels evolve. Without an integration layer, these tasks are manual, error-prone, and repeated for every kernel. CUTLASS Operator API eases each of them, so as kernels rapidly evolve you don't have to choose between integration churn and adoption inertia.

"Which kernel does what I want?"

Without Operator API, finding the right kernel means reading kernel source code and manually deducing support — which dtypes, layouts, tile sizes, arch features, etc. each kernel supports. Operator API provides a simple get_operators(args): express the operation you want to run, and get all the operators that support it. Each operator also exposes uniform metadata describing its constraints and design features for more advanced inspection, instead of requiring you to deduce from source code.
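
To illustrate the idea of discovery through uniform metadata, here is a minimal pure-Python sketch of the pattern. It is not the library's implementation or API: `ToyGemmArgs`, `ToyOperator`, `REGISTRY`, and `toy_get_operators` are hypothetical names, and the support rule (dtype match plus tile divisibility) is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToyGemmArgs:
    """Hypothetical description of an operation (not the real GemmArguments)."""
    m: int
    n: int
    k: int
    dtype: str


@dataclass(frozen=True)
class ToyOperator:
    """Hypothetical wrapper: every kernel exposes the same metadata and methods."""
    name: str
    supported_dtypes: frozenset
    tile_m: int
    impl: Callable[[ToyGemmArgs], str]

    def supports(self, args: ToyGemmArgs) -> bool:
        # Eligible if the dtype is handled and M is a multiple of the tile size.
        return args.dtype in self.supported_dtypes and args.m % self.tile_m == 0

    def run(self, args: ToyGemmArgs) -> str:
        return self.impl(args)


def _make_impl(name: str) -> Callable[[ToyGemmArgs], str]:
    # Each "kernel" just reports what it would have computed.
    return lambda a: f"{name} ran {a.m}x{a.n}x{a.k}"


REGISTRY: List[ToyOperator] = [
    ToyOperator("f16_tile128", frozenset({"f16"}), 128, _make_impl("f16_tile128")),
    ToyOperator("f16_bf16_tile64", frozenset({"f16", "bf16"}), 64, _make_impl("f16_bf16_tile64")),
    ToyOperator("int8_tile128", frozenset({"int8"}), 128, _make_impl("int8_tile128")),
]


def toy_get_operators(args: ToyGemmArgs) -> List[ToyOperator]:
    """Return every registered operator whose metadata admits `args`."""
    return [op for op in REGISTRY if op.supports(args)]


args = ToyGemmArgs(m=128, n=128, k=128, dtype="f16")
candidates = toy_get_operators(args)
result = candidates[0].run(args)
```

The caller never reads kernel source: it states the operation, receives the candidates, and can inspect fields such as `tile_m` on each one before choosing.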

"How do I get newer, faster, or fixed kernels?"

Adopting kernels directly carries a maintenance burden: you must mirror bug fixes and optimizations into local copies and monitor release notes for relevant new kernels. With Operator API, you integrate once, not perpetually. New kernels, fixes, and optimizations land in the registry on each release, and you get them automatically without changing your integration code; you only need to upgrade nvidia-cutlass-operators.

"How do I call this kernel with my torch tensors?"

Different kernels have different usage conventions — direct usage requires kernel-specific glue code to convert your framework tensors to the kernel's expected inputs, set performance options, and run preparation steps. Operator API wraps every kernel with a consistent interface: pass PyTorch (or any DLPack-compatible) tensors directly into GemmArguments and call operator.run(args). That lets you swap operators without touching your call site.
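
Because every operator is invoked the same way, comparing candidates reduces to a loop. The sketch below picks the fastest operator by timing `.run(args)`; it assumes only the `.run(args)` method shown in the example above. `pick_fastest`, `_FakeClock`, and `_FakeOperator` are hypothetical names, and the fake clock exists only to make the demonstration deterministic. On a real GPU, kernel launches are asynchronous, so you would pass something like `torch.cuda.synchronize` as `synchronize`; the warm-up call keeps JIT compilation out of the measurement.

```python
import time
from typing import Any, Callable, Optional, Sequence


def pick_fastest(
    operators: Sequence[Any],
    args: Any,
    synchronize: Callable[[], None] = lambda: None,
    clock: Callable[[], float] = time.perf_counter,
    repeats: int = 5,
) -> Optional[Any]:
    """Return the operator with the lowest mean run time, or None if empty."""
    best_op, best_time = None, float("inf")
    for op in operators:
        op.run(args)          # warm-up: the first run may include JIT compilation
        synchronize()         # wait for outstanding device work before timing
        start = clock()
        for _ in range(repeats):
            op.run(args)
        synchronize()
        elapsed = (clock() - start) / repeats
        if elapsed < best_time:
            best_op, best_time = op, elapsed
    return best_op


# Deterministic stand-ins (NOT library classes): each run advances a fake clock.
class _FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class _FakeOperator:
    def __init__(self, name, cost, clock):
        self.name, self.cost, self.clock = name, cost, clock

    def run(self, args):
        self.clock.now += self.cost


fake_clock = _FakeClock()
fake_ops = [
    _FakeOperator("slow", 3.0, fake_clock),
    _FakeOperator("fast", 1.0, fake_clock),
]
fastest = pick_fastest(fake_ops, args=None, clock=fake_clock)
```

Swapping in a different operator changes which object is selected, not how it is called.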

It also supports:

- Custom epilogue fusions with ease: pass a plain Python function; Operator API lowers it onto CuTe DSL's Epilogue Fusion Configuration (EFC) framework and fuses it into supported kernels.
- Bring-your-own-kernel: register your own CuTe DSL kernels so you can call them through the same interfaces as pre-bundled ones from CUTLASS, with no separate integration path for in-house kernels.
- Negligible runtime overhead on top of invoking the underlying kernel directly.
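
For readers new to epilogue fusion, the following pure-Python sketch shows what "fusing" a user-supplied function means conceptually: the function is applied to each accumulator value just before the single store, rather than in a second pass over the output. This is a toy reference loop, not the Operator API's interface for passing epilogues (which the text above does not spell out); `gemm_with_epilogue` and `relu` are hypothetical names.

```python
from typing import Callable, List

Matrix = List[List[float]]


def gemm_with_epilogue(
    A: Matrix, B: Matrix, epilogue: Callable[[float], float] = lambda acc: acc
) -> Matrix:
    """Compute epilogue(A @ B) elementwise with the epilogue fused into the store."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            # Fused: the epilogue runs on the accumulator before the only store.
            out[i][j] = epilogue(acc)
    return out


def relu(x: float) -> float:
    """A plain Python function used as the epilogue."""
    return max(x, 0.0)


A = [[1.0, -2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [1.0, 1.0]]

fused = gemm_with_epilogue(A, B, relu)

# Unfused equivalent: a plain GEMM followed by a separate elementwise pass.
unfused = [[relu(v) for v in row] for row in gemm_with_epilogue(A, B)]
```

Both paths give the same values; the fused form avoids writing the intermediate result and reading it back.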

How to use it?

Installation

To use with PyTorch, install the nvidia-cutlass-operators[torch] package:

pip install "nvidia-cutlass-operators[torch]"

Alternatively, choose which dependencies to install:

# Install only nvidia-cutlass-operators core
pip install nvidia-cutlass-operators

# Install all dependencies to develop, run tests, etc.
pip install "nvidia-cutlass-operators[dev]"
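
After installing, one way to confirm which version is present is to query the package metadata with the standard library. This sketch uses only `importlib.metadata`; `installed_version` is a hypothetical helper name, and the distribution name is the one used in the pip commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("nvidia-cutlass-operators")
if version is None:
    print("nvidia-cutlass-operators is not installed in this environment")
else:
    print(f"nvidia-cutlass-operators {version} is installed")
```

This checks the installed distribution without importing the package, so it also works on machines without a GPU.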

Next steps

- Basic GEMM tutorial: a guided tour of the core interface concepts to run a minimal GEMM.
- More guided notebook tutorials in examples/.
- Full API reference for cutlass.operators.

Supported and Upcoming Features

CUTLASS Operator API will support a wide range of functionality, configurations, and optimizations, all robustly tested.

Current support:

- Kernel coverage
  - Dense GEMMs (F32, F16, BF16, INT8) for Blackwell, Hopper, Ampere
    - Preferred and fallback cluster shapes
    - Static and dynamic scheduling
  - Block-scaled GEMMs (NVFP4, MXFP4, MXFP8, mixed input precision) for Blackwell
  - Grouped GEMM (Contiguous offset) for Blackwell
  - Low-latency TGV GEMM for Blackwell
- Custom epilogue fusions (e.g. activations, elementwise ops, aux load/store)
- CUDA Graph support
- Native support for PyTorch and other DLPack tensors
- Bring-your-own-kernel

Upcoming support:

- Additional GEMM kernel coverage: Sparsity, performance optimizations, grouped GEMM variants, and more
- Ahead-of-time compilation
- JAX Graph support
- nvMatmulHeuristics support

Community & Feedback

We welcome contributions and feedback from the developer community. You can:

- File bug reports, feature requests, documentation gaps, and upvote roadmap items on the CUTLASS GitHub Issues page.
- Contribute examples, tutorials, or improvements via pull request to the CUTLASS repository.
- Ask questions and share ideas in the #cutlass channel on NVIDIA Developer Discord.

Project details

Version 0.1.0, released Jul 2, 2026.

Download files

No source distribution files are available for this release.

Built Distribution

nvidia_cutlass_operators-0.1.0-py3-none-any.whl (400.7 kB), uploaded Jul 2, 2026, Python 3

File details

Details for the file nvidia_cutlass_operators-0.1.0-py3-none-any.whl.

File metadata

Upload date: Jul 2, 2026
Size: 400.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for nvidia_cutlass_operators-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8ae7aee2561de13e1cb6ee66dccf0acbdd6c112ce6a4227f4e118da58c21657`
MD5	`ac2e817ed47e9ca0743756e27282ff81`
BLAKE2b-256	`b306dbb6e459c9d30b36fc4c1d2778e1ea60626f9af907d03eb08d33c633c07b`
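
If you download the wheel manually, you can compare it against the SHA256 digest listed above before installing. The following is a minimal sketch using only the standard library; `sha256_matches` is a hypothetical helper name, and the demonstration runs on a throwaway temporary file rather than the real wheel.

```python
import hashlib
import os
import tempfile


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the given SHA256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Self-contained demonstration on a temporary file (not the real wheel).
payload = b"example wheel bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

demo_ok = sha256_matches(demo_path, hashlib.sha256(payload).hexdigest())
demo_bad = sha256_matches(demo_path, "0" * 64)
os.remove(demo_path)
```

For the actual file, the call would be `sha256_matches("nvidia_cutlass_operators-0.1.0-py3-none-any.whl", "e8ae7aee2561de13e1cb6ee66dccf0acbdd6c112ce6a4227f4e118da58c21657")`.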
