An xDSL-based stencil compiler that generates optimized GPU kernels via NVIDIA cuTile

cuTile Stencil DSL

A Python stencil compiler built on xDSL that generates optimized GPU kernels via NVIDIA cuTile.

Architecture

                         Three-Dialect Compilation Stack
                         ==============================

  @stencil               cutile_stencil.      stencil.apply {       cutile.kernel {       @ct.kernel
  def heat(u,i,j):         access %u [-1,0]     stencil.access        cutile.slice(...)    def heat_kernel():
    return 0.25*(...)      arith.mulf ...        [-1, 0]               cutile.load(...)       ct.load(...)
                           cutile_stencil.       arith.mulf            cutile.store(...)      ct.store(...)
                             yield %res          stencil.return      cutile.host_program {  def launch_heat():
                                                                       cutile.launch(...)     ct.launch(...)
                                                                    }

  Python source       Dialect 1            Dialect 2              Dialect 3           Python source
  (user writes)    (cutile_stencil)     (xDSL stencil)         (cutile_target)       (generated GPU)
       |                 |                   |                       |                     |
       |  AST parser     | normalize pass    |  analysis passes      | emit_python         |
       +---------------->+----------------->+----+--+--+--+-------->+------------------->--+
                                                 |  |  |  |
                                             footprint |  |
                                               tiling -+  |
                                             temporal ----+
                                             boundary, fusion,
                                             multi-GPU, ...

Module Structure

cutile/
|-- frontend/           @stencil decorator, Python AST parser
|-- dialects/           xDSL dialect definitions
|   |-- cutile_stencil/ Dialect 1: mirrors Python syntax
|   |-- (xdsl.stencil)  Dialect 2: standard MLIR stencil (from xDSL, not ours)
|   |-- cutile_target/  Dialect 3: cuTile device + host IR
|   |-- comm/           Communication ops (halo exchange)
|   |-- timestep/       RK time integration
|   +-- layout/         Data layout types
|-- passes/             IR transformation passes
|   |-- analysis/       Footprint, roofline (read-only)
|   |-- tiling.py       Tile size selection
|   |-- temporal.py     Temporal blocking
|   |-- boundary.py     Boundary conditions
|   |-- decompose.py    Multi-GPU domain split
|   +-- halo.py         Halo exchange insertion
|-- lowering/           IR to code
|   |-- normalize.py    Dialect 1 -> Dialect 2 (xDSL stencil)
|   |-- stencil_to_target.py  Dialect 2 -> Dialect 3
|   +-- target_to_python.py   Dialect 3 -> Python source
|-- runtime/            Execution
|   |-- launcher.py     compile() API
|   |-- pipeline.py     Composable PassManager
|   |-- autotune.py     Empirical GPU autotuning
|   +-- communicator.py P2P / NCCL backends
+-- reference/          CPU NumPy reference

Quick Start

from cutile import stencil, compile

@stencil
def heat(u, i, j):
    return 0.25 * (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])

result = compile(heat)
result.emit_to_file("heat_kernel.py")

The @stencil decorator auto-infers ndim=2 and order=2 from the function body. The compile() function runs the full pass pipeline (analysis, tiling, temporal blocking) and generates a cuTile GPU kernel.
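
How the decorator derives these values is not spelled out here. As a rough illustration only, the sketch below uses Python's standard `ast` module to recover the dimensionality and the per-axis halo width from the subscripts in a stencil body. The helper names `infer_footprint` and `_offset` are hypothetical and not part of the cutile API. The sketch assumes the first argument is the field and the remaining arguments are the index variables, and it does not attempt to reproduce how `order` is inferred.

```python
import ast


def _offset(expr, name):
    """Map an index expression to an integer offset: i -> 0, i+1 -> 1, i-1 -> -1."""
    if isinstance(expr, ast.Name) and expr.id == name:
        return 0
    if (isinstance(expr, ast.BinOp)
            and isinstance(expr.left, ast.Name) and expr.left.id == name
            and isinstance(expr.right, ast.Constant)):
        if isinstance(expr.op, ast.Add):
            return expr.right.value
        if isinstance(expr.op, ast.Sub):
            return -expr.right.value
    raise ValueError(f"unsupported index expression: {ast.dump(expr)}")


def infer_footprint(src):
    """Return (ndim, offsets, halo) for a stencil function given as source text.

    Assumption: the signature is (field, index0, index1, ...).
    Requires Python 3.9+, where Subscript.slice is the index expression itself.
    """
    func = ast.parse(src).body[0]
    index_names = [a.arg for a in func.args.args[1:]]
    ndim = len(index_names)

    offsets = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Subscript):
            idx = node.slice
            elts = idx.elts if isinstance(idx, ast.Tuple) else [idx]
            offsets.add(tuple(_offset(e, n) for e, n in zip(elts, index_names)))

    # Halo width per axis is the largest absolute offset along that axis.
    halo = [max(abs(o[d]) for o in offsets) for d in range(ndim)]
    return ndim, sorted(offsets), halo


heat_src = (
    "def heat(u, i, j):\n"
    "    return 0.25 * (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])\n"
)
ndim, offsets, halo = infer_footprint(heat_src)
print(ndim, offsets, halo)
```

For the heat stencil this yields two dimensions, four neighbour offsets, and a halo of one cell per axis, which matches the `halo_widths=[1,1]` attribute in the pipeline example.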

Multi-GPU (one-line change)

result = compile(heat, num_gpus=2)
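
The decomposition and halo-exchange passes themselves are not shown in this description. To illustrate the general idea behind a multi-GPU domain split, here is a CPU-only NumPy sketch that cuts a grid into slabs along axis 0 and copies boundary rows between neighbours. The function names `split_with_halo` and `exchange_halos` are hypothetical, and plain array assignments stand in for whatever the P2P / NCCL backends actually do.

```python
import numpy as np


def split_with_halo(u, num_parts, halo):
    """Split u along axis 0 into slabs, each padded with `halo` ghost rows on both sides."""
    slabs = []
    for rows in np.array_split(np.arange(u.shape[0]), num_parts):
        lo, hi = rows[0], rows[-1] + 1
        slab = np.zeros((hi - lo + 2 * halo,) + u.shape[1:], dtype=u.dtype)
        slab[halo:halo + hi - lo] = u[lo:hi]  # interior rows owned by this part
        slabs.append(slab)
    return slabs


def exchange_halos(slabs, halo):
    """Fill ghost rows of neighbouring slabs from each other's interior rows."""
    for left, right in zip(slabs[:-1], slabs[1:]):
        right[:halo] = left[-2 * halo:-halo]   # left's last interior rows
        left[-halo:] = right[halo:2 * halo]    # right's first interior rows


u = np.arange(8 * 4, dtype=np.float64).reshape(8, 4)
slabs = split_with_halo(u, num_parts=2, halo=1)
exchange_halos(slabs, halo=1)
```

After the exchange, each slab can apply the stencil to its own interior rows without reading from the other slab, which is the property a halo-exchange pass has to establish before every time step.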

Compilation Pipeline Example

Here is the IR at every level for a 2D heat stencil:

Level 1 -- Python source (user writes):

@stencil
def heat(u, i, j):
    return 0.25 * (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])

Level 2 -- Dialect 1 (cuTile Stencil Dialect):

cutile_stencil.func @heat(ndim=2, order=2, dtype="float64") {
  %1 = cutile_stencil.access %0 [-1, 0] {"i", "j"} : f64
  %2 = cutile_stencil.access %0 [1, 0] {"i", "j"} : f64
  %3 = cutile_stencil.access %0 [0, -1] {"i", "j"} : f64
  %4 = cutile_stencil.access %0 [0, 1] {"i", "j"} : f64
  %5 = arith.constant 0.25 : f64
  %6 = arith.addf %1, %2 : f64
  %7 = arith.addf %6, %3 : f64
  %8 = arith.addf %7, %4 : f64
  %9 = arith.mulf %5, %8 : f64
  cutile_stencil.yield %9 : f64
}

Level 3 -- Dialect 2 (xDSL Stencil Dialect -- all passes run here):

func.func @heat() -> !stencil.temp<?x?xf64> {
  stencil.apply() {
    %1 = stencil.access %arg [-1, 0] : !stencil.temp<?x?xf64>
    %2 = stencil.access %arg [1, 0]  : !stencil.temp<?x?xf64>
    %3 = stencil.access %arg [0, -1] : !stencil.temp<?x?xf64>
    %4 = stencil.access %arg [0, 1]  : !stencil.temp<?x?xf64>
    %5 = arith.constant 0.25 : f64
    %9 = arith.mulf %5, ... : f64
    stencil.return %9 : f64
  } attributes {halo_widths=[1,1], tile_sizes=[32,32], bound="memory"}
}
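
The `bound="memory"` attribute reads like the outcome of a roofline-style classification. The following is a minimal sketch of such a check, not the package's analysis pass: `classify_bound` is a hypothetical helper, the hardware peaks are made-up round numbers, and the byte count assumes ideal caching (each grid point is read once and written once).

```python
def classify_bound(flops_per_point, bytes_per_point, peak_flops, peak_bandwidth):
    """Compare arithmetic intensity with machine balance (both in flop/byte)."""
    intensity = flops_per_point / bytes_per_point
    balance = peak_flops / peak_bandwidth
    kind = "memory" if intensity < balance else "compute"
    return kind, intensity


# 2D heat stencil in f64: 3 additions + 1 multiplication per point,
# one 8-byte read and one 8-byte write per point under ideal reuse.
kind, intensity = classify_bound(
    flops_per_point=4,
    bytes_per_point=16,
    peak_flops=10e12,      # hypothetical 10 Tflop/s
    peak_bandwidth=1e12,   # hypothetical 1 TB/s
)
print(kind, intensity)
```

With an intensity of 0.25 flop/byte against a balance of 10 flop/byte on this hypothetical device, the stencil lands well on the memory-bound side.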

Level 4 -- Dialect 3 (cuTile Target IR):

cutile.kernel @heat(tile=[32,32], halo=[1,1]) {
  cutile.bid(0), cutile.bid(1)
  cutile.slice(axis=0, start="HX-1", stop="HX-1+nx")
  cutile.slice(axis=1, start="HY",   stop="HY+ny")    -> u_m1_0
  cutile.load(u_m1_0)
  ...
  cutile.store(out, result)
}
cutile.host_program @launch_heat {
  cutile.launch(heat_kernel, grid, args)
}

Level 5 -- Generated cuTile Python:

@ct.kernel
def heat_kernel(u, output, TX: ConstInt, TY: ConstInt, HX: ConstInt, HY: ConstInt):
    bx, by = ct.bid(0), ct.bid(1)
    u_m1_0 = u.slice(axis=0, start=HX-1, stop=HX-1+nx).slice(axis=1, start=HY, stop=HY+ny)
    t_u_m1_0 = ct.load(u_m1_0, index=(bx, by), shape=(TX, TY))
    ...
    result = 0.25 * (t_u_m1_0 + t_u_p1_0 + t_u_0_m1 + t_u_0_p1)
    ct.store(output, index=(bx, by), tile=result)

def launch_heat(u_in, u_out):
    ct.launch(stream, grid, heat_kernel, (u_in, u_out, TX, TY, HX, HY))
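
To make the slice-then-load structure of the generated kernel easier to follow, here is a CPU emulation in NumPy. It is a sketch under assumptions, not cuTile code: `heat_tiled` is a hypothetical name, the two nested loops stand in for the block indices `bx` and `by`, and partial tiles at the domain edge are simply clamped, which may differ from how cuTile handles them.

```python
import numpy as np


def heat_tiled(u, TX, TY, HX=1, HY=1):
    """Apply the 2D heat stencil tile by tile; u includes HX/HY halo cells per side."""
    nx, ny = u.shape[0] - 2 * HX, u.shape[1] - 2 * HY  # interior extent
    out = u.copy()

    # One shifted interior view per stencil offset (mirrors the chained slices).
    u_m1_0 = u[HX - 1:HX - 1 + nx, HY:HY + ny]
    u_p1_0 = u[HX + 1:HX + 1 + nx, HY:HY + ny]
    u_0_m1 = u[HX:HX + nx, HY - 1:HY - 1 + ny]
    u_0_p1 = u[HX:HX + nx, HY + 1:HY + 1 + ny]
    interior = out[HX:HX + nx, HY:HY + ny]  # view: writes land in `out`

    for bx in range(-(-nx // TX)):          # ceil(nx / TX) blocks along x
        for by in range(-(-ny // TY)):      # ceil(ny / TY) blocks along y
            sx = slice(bx * TX, min((bx + 1) * TX, nx))
            sy = slice(by * TY, min((by + 1) * TY, ny))
            interior[sx, sy] = 0.25 * (
                u_m1_0[sx, sy] + u_p1_0[sx, sy] + u_0_m1[sx, sy] + u_0_p1[sx, sy]
            )
    return out


rng = np.random.default_rng(0)
u = rng.random((70, 45))
tiled = heat_tiled(u, TX=32, TY=32)
```

Because every tile reads only from the shifted views of the input and writes only its own block of the output, the tiles are independent, which is what allows each one to be handed to a separate GPU block.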

Setup

git clone https://github.com/tavakkoliamirmohammad/cutile-stencil-dsl && cd cutile-stencil-dsl
python -m venv venv && source venv/bin/activate

# CPU only (DSL + analysis + codegen)
pip install -e ".[test]"

# With GPU support
pip install -e ".[gpu,test]"

Tests

python -m pytest tests/ -v

258 tests across 5 test files:

Test file	Tests	What it covers
`test_dialects.py`	118	All 5 xDSL dialects: ops, attrs, printers
`test_cutile_new.py`	82	Frontend, passes, lowering, compile API, reference
`test_all_modes_convergence.py`	24	4 modes x 6 stencils (GPU vs CPU)
`test_cutile_gpu_apps.py`	13	FDTD, Gray-Scott, shallow water (GPU)
`test_lowering.py`	21	Code generation unit tests

Examples

python examples/heat_1d.py          # 1D heat equation
python examples/wave_2d.py          # 2D wave (4th-order)
python examples/laplacian_3d.py     # 3D Laplacian
python examples/gray_scott.py       # Reaction-diffusion (2 fields)
python examples/fdtd_maxwell_1d.py  # FDTD Maxwell
python examples/shallow_water.py    # Shallow water (3 fields)
python examples/advection_upwind.py # Upwind advection
python examples/heat_2d_bricked.py  # Bricked memory layout

Benchmarks

# cuTile only
python run_benchmarks.py

# With autotuning
python run_benchmarks.py --autotune

# Compare against JAX/XLA
python run_benchmarks.py --autotune --jax

# Full sweep (all stencils x all modes x all sizes)
python run_full_benchmarks.py

Dependencies

xDSL (>= 0.62) -- Pure Python MLIR framework
NumPy -- CPU reference
cuda-tile + CuPy -- GPU execution (optional)
JAX -- Benchmark comparison (optional)

Release history

0.2.0 (this version) -- Apr 13, 2026
0.1.0 -- Mar 17, 2026

Download files

Source Distribution

cutile_stencil-0.2.0.tar.gz (83.5 kB)

Uploaded Apr 13, 2026 Source

Built Distribution

cutile_stencil-0.2.0-py3-none-any.whl (83.9 kB)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file cutile_stencil-0.2.0.tar.gz.

File metadata

Download URL: cutile_stencil-0.2.0.tar.gz
Upload date: Apr 13, 2026
Size: 83.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cutile_stencil-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`44fbe2eaf1455a55e903f7e5212fe86550edb36a058789e9bec7c4d9bc71b902`
MD5	`4bbcbeb43af2072b64fdadf526be9d32`
BLAKE2b-256	`d8e8bb13373576d16e1f46feb408265180448886251ed5018db037407a94bd9a`

Provenance

The following attestation bundles were made for cutile_stencil-0.2.0.tar.gz:

Publisher: publish.yml on tavakkoliamirmohammad/cutile-stencil-dsl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cutile_stencil-0.2.0.tar.gz
- Subject digest: 44fbe2eaf1455a55e903f7e5212fe86550edb36a058789e9bec7c4d9bc71b902
- Sigstore transparency entry: 1282914756
- Sigstore integration time: Apr 13, 2026
Source repository:
- Permalink: tavakkoliamirmohammad/cutile-stencil-dsl@25079716076143171ccf4d392e39494ea5887768
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/tavakkoliamirmohammad
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@25079716076143171ccf4d392e39494ea5887768
- Trigger Event: push

File details

Details for the file cutile_stencil-0.2.0-py3-none-any.whl.

File metadata

Download URL: cutile_stencil-0.2.0-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 83.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cutile_stencil-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fcb1d5e63c8a5555dc28ddcc9d344cdc84803404340fc6d7de7e8728a5a5d6e5`
MD5	`769a05c5db11d21095c3a5b1ede0e7d9`
BLAKE2b-256	`8d33c1a8958fdf3605dbb8be5d65e1aede2e0ec14b14aba59205e8de24191a77`

Provenance

The following attestation bundles were made for cutile_stencil-0.2.0-py3-none-any.whl:

Publisher: publish.yml on tavakkoliamirmohammad/cutile-stencil-dsl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cutile_stencil-0.2.0-py3-none-any.whl
- Subject digest: fcb1d5e63c8a5555dc28ddcc9d344cdc84803404340fc6d7de7e8728a5a5d6e5
- Sigstore transparency entry: 1282914762
- Sigstore integration time: Apr 13, 2026
Source repository:
- Permalink: tavakkoliamirmohammad/cutile-stencil-dsl@25079716076143171ccf4d392e39494ea5887768
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/tavakkoliamirmohammad
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@25079716076143171ccf4d392e39494ea5887768
- Trigger Event: push

cutile-stencil 0.2.0
