High-level discover/compile/execute API for CUTLASS Python kernels.

CUTLASS Operator API

[!NOTE] CUTLASS Operator API is currently in beta. All interfaces, names, and paths are subject to change.

CUTLASS Operator API is a Python interface for integrating kernels written in CUTLASS Python DSLs (like CuTe DSL) into your code.

While the DSLs focus on kernel authoring, CUTLASS Operator API focuses on making those kernels easy to manage and integrate into libraries that use CUTLASS.

It views kernels as end-to-end "Operators" that execute an operation (like a GEMM), and provides two things:

  1. a kernel-agnostic interface for finding operators that support an operation/operands, inspecting their properties, and executing them — the same way regardless of which kernel each operator wraps.
  2. a registry of officially maintained CUTLASS kernels exposed through that interface.

Example

import cutlass.operators as ops
import torch

A, B, out = (torch.randn(128, 128, device="cuda", dtype=torch.float16) for _ in range(3))

# Arguments express an operation, and the operands to it
args = ops.GemmArguments(A, B, out, accumulator_type=torch.float32)

# Find operators that support our provided arguments, and can run on SM100.
# Returns a list of ``Operator``s that wrap ready-to-compile CuTe DSL kernels.
operators = ops.get_operators(args, target_sm="100")

# JIT compile and execute one of the returned operators using our arguments
operators[0].run(args)
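
The example above hard-codes target_sm="100". Below is a minimal sketch of deriving that string from the local GPU instead. It assumes target_sm is the compute capability's major and minor digits concatenated, which matches "100" for compute capability 10.0; whether other strings are accepted in this form is an assumption. sm_string and detect_target_sm are hypothetical helpers, not part of the Operator API.

```python
from typing import Optional


def sm_string(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an SM string, e.g. (10, 0) -> "100"."""
    return f"{major}{minor}"


def detect_target_sm() -> Optional[str]:
    """Return the SM string of the current GPU, or None if it cannot be determined."""
    try:
        import torch  # optional dependency; only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # torch reports the compute capability as a (major, minor) tuple
    major, minor = torch.cuda.get_device_capability()
    return sm_string(major, minor)
```

If detect_target_sm() returns a value, it could be passed as the target_sm argument in place of the literal "100".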

Why use it?

Any software that uses kernels has to find kernels that do what it needs, wire up glue code to call them, and maintain the integration as the kernels evolve. Without an integration layer, these tasks are manual, error-prone, and repeated for every kernel. CUTLASS Operator API eases each of them, so that as kernels rapidly evolve you don't have to choose between integration churn and adoption inertia.

"Which kernel does what I want?"

Without Operator API, finding the right kernel means reading kernel source code and manually deducing support — which dtypes, layouts, tile sizes, arch features, etc. each kernel supports. Operator API provides a simple get_operators(args): express the operation you want to run, and get all the operators that support it. Each operator also exposes uniform metadata describing its constraints and design features for more advanced inspection, instead of requiring you to deduce from source code.
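
To make the discovery idea concrete, here is a toy model of the pattern in plain Python. It is not the library's implementation: ToyGemmArgs, ToyOperator, REGISTRY, and toy_get_operators are all hypothetical names, and the metadata fields (dtype, sm) are simplified stand-ins for whatever the real operators expose. The point is only that each operator carries metadata and a support check, so callers filter a registry instead of reading kernel source.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ToyGemmArgs:
    """Hypothetical arguments object: only the property we match on."""
    dtype: str


@dataclass(frozen=True)
class ToyOperator:
    """Hypothetical operator: a name plus uniform metadata describing its constraints."""
    name: str
    supported_dtypes: FrozenSet[str]
    sm: int

    def supports(self, args: ToyGemmArgs, target_sm: int) -> bool:
        # An operator matches when it handles the dtype and targets exactly this SM.
        return args.dtype in self.supported_dtypes and target_sm == self.sm


# A tiny made-up registry; the entries are illustrative, not real kernels.
REGISTRY = [
    ToyOperator("dense_f16_sm80", frozenset({"float16", "bfloat16"}), 80),
    ToyOperator("dense_f16_sm100", frozenset({"float16", "bfloat16"}), 100),
    ToyOperator("dense_int8_sm90", frozenset({"int8"}), 90),
]


def toy_get_operators(args: ToyGemmArgs, target_sm: int) -> List[ToyOperator]:
    """Return every registered operator that supports the given arguments."""
    return [op for op in REGISTRY if op.supports(args, target_sm)]
```

An empty result in this model simply means no registered operator supports the request, which a caller can check before choosing one to run.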

"How do I get newer, faster, or fixed kernels?"

Adopting kernels directly carries a maintenance burden: you have to mirror bug fixes and optimizations into local copies and monitor release notes for new relevant kernels. With Operator API, you integrate once, not perpetually. New kernels, fixes, and optimizations land in the registry on each release, and you get them without changing your integration code by upgrading nvidia-cutlass-operators.

"How do I call this kernel with my torch tensors?"

Different kernels have different usage conventions — direct usage requires kernel-specific glue code to convert your framework tensors to the kernel's expected inputs, set performance options, and run preparation steps. Operator API wraps every kernel with a consistent interface: pass PyTorch (or any DLPack-compatible) tensors directly into GemmArguments and call operator.run(args). That lets you swap operators without touching your call site.
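
Because every operator is invoked the same way, choosing between candidates can be written once. The sketch below times each candidate and keeps the fastest. It relies only on the run(args) method shown earlier; pick_fastest and its sync parameter are hypothetical helpers, not part of the Operator API. For GPU kernels, a synchronization callable such as torch.cuda.synchronize would need to be passed as sync so that the timings cover the actual kernel execution.

```python
import time
from typing import Any, Callable, Optional, Sequence


def pick_fastest(
    operators: Sequence[Any],
    args: Any,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> Any:
    """Return the operator with the lowest mean wall-clock time per run."""
    if not operators:
        raise ValueError("no operators to choose from")
    best, best_time = None, float("inf")
    for op in operators:
        op.run(args)  # warm-up; the first call may include JIT compilation
        if sync is not None:
            sync()
        start = time.perf_counter()
        for _ in range(iters):
            op.run(args)
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        mean = (time.perf_counter() - start) / iters
        if mean < best_time:
            best, best_time = op, mean
    return best
```

The call site stays the same whichever operator wins: the returned object is run with the same args.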

It also supports:

  • Custom epilogue fusions with ease — pass a plain Python function; Operator API lowers it onto CuTe DSL's Epilogue Fusion Configuration (EFC) framework and fuses it into supported kernels.
  • Bring-your-own-kernel — register your own CuTe DSL kernels so you can call them through the same interfaces as pre-bundled ones from CUTLASS, with no separate integration path for in-house kernels.
  • Negligible runtime overhead on top of invoking the underlying kernel directly.
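
As an illustration of the "plain Python function" in the first bullet, an epilogue might look like the sketch below. The function name, its parameter names, and the way it would be attached to the arguments are assumptions and are not taken from the Operator API documentation. It uses only ordinary arithmetic operators, so the same function can be checked on plain numbers.

```python
def linear_combination(accum, C, alpha, beta):
    """Elementwise epilogue: scale the accumulator and add a scaled C operand."""
    return alpha * accum + beta * C
```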

How to use it?

Installation

To use with PyTorch, install the nvidia-cutlass-operators[torch] package:

pip install "nvidia-cutlass-operators[torch]"

Alternatively, choose which dependencies to install:

# Install only nvidia-cutlass-operators core
pip install nvidia-cutlass-operators

# Install all dependencies to develop, run tests, etc.
pip install "nvidia-cutlass-operators[dev]"
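
After installing, one way to confirm which version is present is to query the installed distribution metadata with the standard library. This is a minimal sketch; installed_version is a hypothetical helper, and the only name taken from this page is the distribution name nvidia-cutlass-operators.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("nvidia-cutlass-operators")
    if found is None:
        print("nvidia-cutlass-operators is not installed")
    else:
        print(f"nvidia-cutlass-operators {found}")
```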

Next steps

Supported and Upcoming Features

CUTLASS Operator API will support a wide range of functionality, configurations, and optimizations, all robustly tested.

Current support:

  • Kernel coverage
    • Dense GEMMs (F32, F16, BF16, INT8) for Blackwell, Hopper, Ampere
      • Preferred and fallback cluster shapes
      • Static and dynamic scheduling
    • Block-scaled GEMMs (NVFP4, MXFP4, MXFP8, mixed input precision) for Blackwell
    • Grouped GEMM (Contiguous offset) for Blackwell
    • Low-latency TGV GEMM for Blackwell
  • Custom epilogue fusions (e.g. activations, elementwise ops, aux load/store)
  • CUDA Graph support
  • Native support for PyTorch and other DLPack tensors
  • Bring-your-own-kernel

Upcoming support:

  • Additional GEMM kernel coverage: Sparsity, performance optimizations, grouped GEMM variants, and more
  • Ahead-of-time compilation
  • JAX Graph support
  • nvMatmulHeuristics support

Community & Feedback

We welcome contributions and feedback from the developer community.
