FLIR - ROCm Domain Specific Language for layout algebra (Python + embedded MLIR runtime)
FlyDSL (Flexible layout python DSL)
A Python DSL and an MLIR stack for authoring high-performance GPU kernels with explicit layouts and tiling.
FlyDSL is the Python front‑end of the project: a Flexible Layout Python DSL for expressing tiling, partitioning, data movement, and kernel structure at a high level.
FlyDSL is powered by FLIR (Flexible Layout Intermediate Representation), an end-to-end, MLIR-native compiler stack for GPU kernels. Its core is the flir dialect, a first-class layout IR with explicit algebra and coordinate mapping, plus a composable lowering pipeline to GPU/ROCDL.
Overview
- FlyDSL (Python DSL): author kernels in Python and compile them through FLIR
  - Primary package: `flydsl/` (`flydsl/src/flydsl/`)
  - Kernel examples: `kernels/` (importable as `kernels.*`)
- FLIR (`flir` dialect): the layout IR and compiler foundation
  - Core abstractions: `!flir.shape`, `!flir.stride`, `!flir.layout`, `!flir.coord`
  - Algebra ops: composition/product/divide/partition + coordinate mapping ops
  - Tooling: `flir-opt` for pass testing and IR experimentation
- Embedded MLIR Python runtime (`_mlir`)
  - No external `mlir` python wheel is required: MLIR python bindings are built and staged into `.flir/build/python_packages/flydsl/_mlir` (default; legacy `build/` also works)
  - Python package root: `.flir/build/python_packages/flydsl/`
Repository layout
FlyDSL/
├── scripts/ # helper scripts (build llvm, tests, packaging)
├── flir/ # C++ sources + build scripts (CMake, embedded python bindings)
│ ├── CMakeLists.txt
│ ├── build.sh # build FLIR + python bindings (recommended)
│ ├── include/flir/ # dialect headers + TableGen definitions
│ ├── lib/ # dialect implementation (Dialect/, Transforms/)
│ ├── python_bindings/ # MLIR python bindings + runtime wrappers
│ └── tools/flir-opt/ # flir-opt CLI tool
├── flydsl/ # Python sources (src/flydsl) + python-only docs/reqs
├── tests/ # mlir + python tests/benchmarks
│ ├── mlir/ # MLIR file tests
│ ├── pyir/ # Python IR tests (no GPU required)
│ └── kernels/ # GPU execution tests
└── kernels/ # Python kernels (importable as `kernels.*`)
Getting started
- ROCm: required for GPU execution tests/benchmarks (IR-only tests do not need a GPU).
- Build tools: `cmake`, a C++ compiler, and optionally `ninja` (faster).
- Python: Python 3 + `pip`.
  - `scripts/build_llvm.sh` installs `nanobind`, `numpy`, `pybind11`.
  - `flydsl/requirements.txt` lists auxiliary dependencies (e.g. numpy) used for runtime data initialization and result checking.
Build
A) Build / use an existing llvm-project (MLIR)
If you already have an MLIR build, set:
export MLIR_PATH=/path/to/llvm-project/build
Or use the helper script (clones ROCm llvm-project and builds MLIR):
bash scripts/build_llvm.sh
B) Build FLIR (C++ + embedded python package)
./flir/build.sh
After a successful build, you will have:
- `.flir/build/bin/flir-opt` (default; legacy `build/bin/flir-opt` also works)
- Python package root at: `.flir/build/python_packages/flydsl/`
  - This contains:
    - `flydsl/` (your Python API)
    - `_mlir/` (embedded MLIR python bindings)
Python install
python3 -m pip install -e .
# For development, you can also use:
python setup.py develop
Build a wheel (default output under dist/):
python3 setup.py bdist_wheel
ls dist/
Run tests
bash scripts/run_tests.sh
What run_tests.sh does (high level):
- MLIR file tests: runs `tests/mlir/*.mlir` through `flir-opt --flir-to-standard`
- Python IR tests: runs `tests/pyir/test_*.py` (no GPU required)
- Kernel/GPU execution tests (only if ROCm is detected): runs `tests/kernels/test_*.py`
For the test folder organization, see tests/ (mlir/, pyir/, kernels/).
Troubleshooting
- `flir-opt` not found
  - Run `./flir/build.sh`, or build it explicitly: `cmake --build build --target flir-opt -j$(nproc)`
- Python import issues (`No module named flydsl` / `No module named mlir`)
  - Ensure you are using the embedded package:
    export PYTHONPATH=$(pwd)/build/python_packages/flydsl:$PYTHONPATH
  - Or prefer in-tree sources:
    export PYTHONPATH=$(pwd)/flydsl/src:$(pwd)/.flir/build/python_packages/flydsl:$PYTHONPATH
- MLIR `.so` load errors
  - Add MLIR build lib dir to the loader path:
    export LD_LIBRARY_PATH=$MLIR_PATH/lib:$LD_LIBRARY_PATH
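When diagnosing the import issues listed above, it can help to check what the current interpreter actually sees before changing `PYTHONPATH`. The following is a minimal diagnostic sketch using only the Python standard library; the helper names `module_available` and `import_report` are hypothetical and not part of FlyDSL, and the module names checked (`flydsl`, `_mlir`) are taken from the package layout described in this README.

```python
import importlib.util
import os


def module_available(name):
    """Return True if `name` can be located on the current import path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package or a malformed module spec counts as
        # "not available" for the purposes of this check.
        return False


def import_report(names=("flydsl", "_mlir")):
    """Map each module name to whether it is importable (hypothetical helper)."""
    return {name: module_available(name) for name in names}


if __name__ == "__main__":
    for name, found in import_report().items():
        print(f"{name}: {'found' if found else 'NOT found'}")
    # Show the PYTHONPATH entries so a wrong or missing build directory stands out.
    for entry in os.environ.get("PYTHONPATH", "").split(os.pathsep):
        if entry:
            print(f"PYTHONPATH entry: {entry} (exists: {os.path.isdir(entry)})")
```

If `flydsl` is reported as found but `_mlir` is not, the package root with the embedded bindings is probably missing from `PYTHONPATH`.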
📐 FLIR Layout System
FLIR = Flexible Layout Intermediate Representation.
FLIR introduces a layout system to express complex data mapping patterns on GPUs (tiling, swizzling, vectorization).
Core Abstractions
- Shape: The extent of dimensions (e.g., `(M, N)`).
- Stride: The distance between elements in memory (e.g., `(1, M)` for column-major).
- Layout: A pair of `(Shape, Stride)` that maps a logical Coordinate to a physical linear Index.
Formula: Index = dot(Coord, Stride) = sum(c_i * s_i)
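To make the formula concrete, here is a small pure-Python sketch that evaluates it for flat (non-nested) layouts. It does not use the FlyDSL API; the function name `crd2idx` is borrowed from the operation list in this README only for readability, and taking a bare stride tuple instead of a layout object is an assumption made for this illustration.

```python
def crd2idx(coord, stride):
    """Map a logical coordinate to a linear index: sum(c_i * s_i)."""
    if len(coord) != len(stride):
        raise ValueError("coord and stride must have the same rank")
    return sum(c * s for c, s in zip(coord, stride))


# Shape (8, 16) stored column-major: stride (1, 8).
col_major_stride = (1, 8)
# The same shape stored row-major: stride (16, 1).
row_major_stride = (16, 1)

idx_col = crd2idx((3, 5), col_major_stride)  # 3*1 + 5*8
idx_row = crd2idx((3, 5), row_major_stride)  # 3*16 + 5*1
print(idx_col, idx_row)  # prints "43 53"
```

The same logical coordinate lands at different physical indices depending only on the stride, which is why the layout is kept separate from the data it describes.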
Operations
- Construction: `make_shape`, `make_stride`, `make_layout`, `make_coord`
- Mapping:
  - `crd2idx(coord, layout) -> index`: Convert logical coordinate to physical index.
  - `idx2crd(index, layout) -> coord`: Convert physical index to logical coordinate.
- Inspection: `size`, `cosize`, `rank`
- Algebra:
  - `composition(A, B)`: Compose layouts (A ∘ B).
  - `product(A, B)`: Combine layouts (Logical, Tiled, Blocked, etc.).
  - `divide(A, B)`: Partition layout A by B (Logical, Tiled, etc.).
  - `local_partition(layout, tile, index)`: Slice layout for a specific thread/block.
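The mapping and inspection operations can be illustrated with a pure-Python model of flat layouts. This is a sketch under stated assumptions rather than the FLIR implementation: the function names mirror the list above, but the signatures (plain shape and stride tuples) are chosen for this example, `idx2crd` uses the CuTe-style rule `(index // stride) % shape`, which inverts `crd2idx` only for compact layouts with nonzero strides, and `cosize` is taken to mean one past the largest index the layout can produce.

```python
from itertools import product
from math import prod


def crd2idx(coord, stride):
    """Logical coordinate -> linear index."""
    return sum(c * s for c, s in zip(coord, stride))


def idx2crd(index, shape, stride):
    """Linear index -> logical coordinate (assumes a compact layout)."""
    return tuple((index // d) % n for n, d in zip(shape, stride))


def size(shape):
    """Number of logical elements in the layout's domain."""
    return prod(shape)


def cosize(shape, stride):
    """One past the largest linear index the layout can produce."""
    return 1 + sum((n - 1) * d for n, d in zip(shape, stride))


shape, stride = (8, 16), (1, 8)  # column-major, compact
# Round trip: every coordinate maps to an index and back to itself.
for coord in product(range(shape[0]), range(shape[1])):
    assert idx2crd(crd2idx(coord, stride), shape, stride) == coord

print(size(shape), cosize(shape, stride))  # prints "128 128"
# A padded layout (leading dimension 10) touches a larger index range.
print(size(shape), cosize(shape, (1, 10)))  # prints "128 158"
```

When `cosize` exceeds `size`, the layout leaves gaps in memory, which is typical of padded or strided views.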
Example (MLIR)
func.func @layout_example(%i: index, %j: index) -> index {
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  %c1 = arith.constant 1 : index
  // Create 2D layout (8, 16) with column-major stride (1, 8)
  %shape = flir.make_shape %c8, %c16 : (index, index) -> !flir.shape<(8,16)>
  %stride = flir.make_stride %c1, %c8 : (index, index) -> !flir.stride<(1,8)>
  %layout = flir.make_layout %shape, %stride : (!flir.shape<(8,16)>, !flir.stride<(1,8)>) -> !flir.layout<(8,16):(1,8)>
  // Convert coordinate (i, j) to linear index
  %coord = flir.make_coord %i, %j : (index, index) -> !flir.coord<(?,?)>
  %idx = flir.crd2idx %coord, %layout : (!flir.coord<(?,?)>, !flir.layout<(8,16):(1,8)>) -> index
  return %idx : index
}
🐍 Python API (flydsl)
Python package: `flydsl` (C++/dialect namespace: `flir`).
FLIR provides a high-level Python API for generating kernels.
Layout Construction
from flydsl.dialects.ext import flir
class _LayoutExample(flir.MlirModule):
    @flir.jit
    def layout_ops(self: flir.T.i64):
        # Create Layout (8x16, column-major)
        shape = flir.make_shape(8, 16)
        stride = flir.make_stride(1, 8)
        layout = flir.make_layout(shape, stride)
        # Query layout properties
        total_size = flir.size(shape)
        layout_rank = flir.rank(layout)
        return total_size
Pipeline API
Easy-to-use compilation pipeline:
from flydsl.compiler.pipeline import Pipeline
# Build and run optimization pipeline
pipeline = (
    Pipeline()
    .flir_to_standard()
    .canonicalize()
    .cse()
    .rocdl_attach_target(chip="gfx942")
    # convert-gpu-to-rocdl must run under gpu.module
    .Gpu(Pipeline().convert_gpu_to_rocdl(runtime="HIP"))
    .gpu_to_llvm()
    .lower_to_llvm()
    .gpu_module_to_binary(format="bin")
)
binary_module = pipeline.run(module)
⚙️ Hierarchical Kernel Control
FLIR keeps the tiling hierarchy explicit across block, warp, thread, and instruction scopes:
# Define thread and value layouts
thr_layout = flir.make_ordered_layout((THR_M, THR_N), order=(1, 0))
val_layout = flir.make_ordered_layout((VAL_M, VAL_N), order=(1, 0))
# Create tiled copy with vectorized atoms
copy_atom = flir.make_copy_atom(T.f32(), vector_size=8)
tiled = flir.make_tiled_copy_tv(copy_atom, thr_layout, val_layout,
                                thr_shape=(THR_M, THR_N), val_shape=(VAL_M, VAL_N))
# Partition tensor across blocks and threads
tensor_A = flir.make_tensor(A, shape=(M, N), strides=(N, 1))
tiles = flir.zipped_divide(tensor_A, (THR_M * VAL_M, THR_N * VAL_N))
blk_tile = tiles[(flir.block_idx("y"), flir.block_idx("x"))]
thr_tile = tiled.get_slice(tid_linear).partition_S(blk_tile)
With per-level partitions, you can allocate register fragments, emit predicate masks, and schedule MFMA/vector instructions while retaining full knowledge of the execution hierarchy.
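The block-level partitioning step can be pictured with a pure-Python model. The sketch below does not call FlyDSL; it assumes that dividing a row-major `(M, N)` tensor by a tile shape groups elements into contiguous rectangular tiles indexed by `(block_y, block_x)`, which is the usual CuTe-style interpretation and is not confirmed by this README. The helper names `tile_coords` and `linear_offsets` are hypothetical.

```python
def tile_coords(block_yx, tile_shape):
    """Logical (row, col) coordinates covered by one block's tile."""
    by, bx = block_yx
    tile_m, tile_n = tile_shape
    return [(by * tile_m + i, bx * tile_n + j)
            for i in range(tile_m)
            for j in range(tile_n)]


def linear_offsets(coords, strides):
    """Apply Index = sum(c_i * s_i) to each coordinate."""
    return [r * strides[0] + c * strides[1] for r, c in coords]


M, N = 4, 8
TILE = (2, 4)       # tile shape handled by one block
STRIDES = (N, 1)    # row-major, matching strides=(N, 1) in the snippet above

# Memory offsets touched by block (y=1, x=1).
offsets = linear_offsets(tile_coords((1, 1), TILE), STRIDES)
print(offsets)  # prints "[20, 21, 22, 23, 28, 29, 30, 31]"

# All blocks together cover every element exactly once.
covered = sorted(
    off
    for by in range(M // TILE[0])
    for bx in range(N // TILE[1])
    for off in linear_offsets(tile_coords((by, bx), TILE), STRIDES)
)
assert covered == list(range(M * N))
```

Partitioning a block tile across threads follows the same pattern one level down, with the thread and value layouts deciding which offsets each thread owns.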
🧮 Minimal VecAdd Example
This condensed snippet mirrors tests/kernels/test_vec_add.py, showing how to define GPU kernels with tiled copies and fragments:
import flydsl
from flydsl.dialects.ext import flir
import _mlir.extras.types as T
THREADS, TILE, VEC = 256, 8, 4
class VecAddKernel(flir.MlirModule):
    GPU_MODULE_NAME = "vec_kernels"
    GPU_MODULE_TARGETS = ['#rocdl.target<chip = "gfx942">']

    @flir.kernel
    def vec_add(self: flir.T.i64,
                A: lambda: T.memref(T.dynamic(), T.f32()),
                B: lambda: T.memref(T.dynamic(), T.f32()),
                C: lambda: T.memref(T.dynamic(), T.f32()),
                n: lambda: T.index()):
        tid = flir.thread_idx("x")
        bid = flir.block_idx("x")
        # Define thread/value layouts for tiled copy
        thr_layout = flir.make_ordered_layout((THREADS,), order=(0,))
        val_layout = flir.make_ordered_layout((TILE,), order=(0,))
        copy_atom = flir.make_copy_atom(T.f32(), vector_size=VEC)
        tiled = flir.make_tiled_copy_tv(copy_atom, thr_layout, val_layout,
                                        thr_shape=(THREADS,), val_shape=(TILE,))
        # Partition tensors across blocks and threads
        tensor_A = flir.make_tensor(A, shape=(n,), strides=(1,))
        tiles_A = flir.zipped_divide(tensor_A, (THREADS * TILE,))
        blkA = tiles_A[(bid,)]
        thrA = tiled.get_slice(tid).partition_S(blkA)
        # Load to registers, compute, store
        frgA = flir.make_fragment_like(thrA, T.f32())
        flir.copy(tiled, thrA, frgA)
        # ... repeat for B/C, add, store results

# Compile and run
module = VecAddKernel().module
exe = flydsl.compile(module)
exe(a_dev, b_dev, c_dev, size)
See tests/kernels/test_vec_add.py for the complete implementation with benchmarking.
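For host-side bookkeeping around a kernel like this, two small calculations are usually needed: how many blocks cover `n` elements, and a reference result to compare against. The sketch below is plain Python and independent of FlyDSL; it assumes each block covers `THREADS * TILE` elements, as the `zipped_divide` call in the snippet suggests, and the helper names `grid_size`, `vec_add_reference`, and `results_match` are hypothetical, not taken from the test file.

```python
import math

THREADS, TILE = 256, 8  # same constants as the snippet above


def grid_size(n, threads=THREADS, tile=TILE):
    """Blocks needed when each block covers threads * tile elements (ceiling division)."""
    per_block = threads * tile
    return (n + per_block - 1) // per_block


def vec_add_reference(a, b):
    """Element-wise sum computed on the host, for checking device results."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [x + y for x, y in zip(a, b)]


def results_match(got, expected, rel_tol=1e-6, abs_tol=1e-6):
    """Compare two float sequences within a tolerance."""
    return len(got) == len(expected) and all(
        math.isclose(g, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for g, e in zip(got, expected)
    )


print(grid_size(2048), grid_size(2049))  # prints "1 2"
expected = vec_add_reference([1.0, 2.0, 3.0], [0.5, 0.25, 0.125])
print(results_match([1.5, 2.25, 3.125], expected))  # prints "True"
```

When `n` is not a multiple of `THREADS * TILE`, the last block is only partially filled, so the kernel needs bounds handling (for example predicate masks) for the tail elements.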
✅ Testing Status
| Category | Status | Description |
|---|---|---|
| MLIR Core | ✅ Passing | Type parsing, Op verification, Basic transforms |
| Flir Ops | ✅ Passing | Layout algebra, Coordinate lowering |
| GPU Backend | ✅ Passing | GPU kernel compilation, Shared memory, Vectorization |
| Hardware | ✅ Passing | MFMA (Matrix Fused Multiply-Add) execution on MI300-family GPUs |
Verified Platforms:
- AMD MI300X/MI308X (gfx942), AMD MI350 (gfx950)
- Linux / ROCm 6.x, 7.x
🙏 Acknowledgements
FLIR's design is inspired by ideas from several projects:
- Categorical Foundations for CuTe Layouts – mathematical framework for layout algebra (companion code)
- NVIDIA CUTLASS – CuTe layout algebra concepts (BSD-3-Clause parts only; no EULA-licensed code was referenced)
- ROCm Composable Kernel – tile-based kernel design patterns for AMD GPUs
- Triton – Python DSL for GPU kernel authoring
📄 License
Apache License 2.0