High-performance, fused GPU kernels for PyTorch training workloads
Greyhound
High-performance fused GPU kernels for PyTorch training workloads. Greyhound
implements standalone CuTe DSL kernels, wraps them as PyTorch custom ops, and
exposes them through greyhound.nn.functional plus a small set of nn.Module
wrappers.
Status: Alpha (0.1.x). APIs and performance tradeoffs may change between releases.
What Is Included
| Operation | Functional API | Module API | Notes |
|---|---|---|---|
| Cross-entropy | greyhound.nn.functional.cross_entropy | - | Standalone cross-entropy loss backed by a loss-and-logits-gradient kernel. |
| Chunked linear loss | greyhound.nn.functional.chunked_linear_loss | - | Chunks the final projection for user-provided losses; chunked_linear_cross_entropy adds a fused cross-entropy-and-logits-gradient kernel per logits tile. |
| Causal Conv1D | greyhound.nn.functional.causal_conv1d | GreyhoundCausalConv1d | Depthwise causal short convolution, optional SiLU. |
| Selective log-softmax | greyhound.nn.functional.selective_log_softmax | - | Gathers one log-probability per row without materializing full log-probs. |
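The cross-entropy and selective log-softmax rows both reduce to a row-wise logsumexp, which a plain-Python reference can make concrete. This is only a sketch of the math, not Greyhound's kernel code: the function names are hypothetical, and it stays per-row because the reduction Greyhound applies is not stated here. It illustrates why one pass can yield both the loss and the logits gradient (softmax minus one-hot), and why a single gathered log-probability does not need the full log-prob matrix.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(row)))."""
    m = max(row)
    return m + math.log(sum(math.exp(v - m) for v in row))


def selective_log_softmax_ref(logits, index):
    """One log-probability per row: x[i][index[i]] - logsumexp(x[i])."""
    return [row[j] - logsumexp(row) for row, j in zip(logits, index)]


def cross_entropy_with_grad_ref(logits, targets):
    """Per-row loss and per-row logits gradient (softmax minus one-hot)."""
    losses, grads = [], []
    for row, t in zip(logits, targets):
        lse = logsumexp(row)
        losses.append(lse - row[t])
        grad = [math.exp(v - lse) for v in row]  # softmax probabilities
        grad[t] -= 1.0  # subtract the one-hot target
        grads.append(grad)
    return losses, grads


logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
targets = [0, 2]
losses, grads = cross_entropy_with_grad_ref(logits, targets)
selected = selective_log_softmax_ref(logits, targets)
```

With the same indices, each per-row loss is the negative of the selected log-probability, and each gradient row sums to zero.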
Operations are built in three layers:
- greyhound.kernels: raw CuTe DSL kernels.
- greyhound.ops: torch.library.custom_op and autograd integration.
- greyhound.nn / greyhound.nn.functional: user-facing modules and functions.
Bonus Integrations
greyhound.bonus contains experimental utilities that compose
kernels from other providers. These are useful for trying higher-level training
building blocks without treating them as core Greyhound kernels.
| Utility | API | Provider dependency | Notes |
|---|---|---|---|
| Newton-Schulz orthogonalization | greyhound.bonus.newton_schultz.orthogonalize_via_newton_schulz | quack-kernels | Muon-style zeropower iteration using Quack symmetric GEMM when available. |
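As background on what a Newton-Schulz orthogonalization computes, here is a minimal plain-Python sketch. It uses the classic cubic iteration X <- 1.5 X - 0.5 X Xᵀ X, not the tuned polynomial that Muon-style implementations typically use, and it does not reflect the Greyhound or Quack code paths. The helper names are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_ref(g, steps=20):
    """Classic cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


q = newton_schulz_ref([[2.0, 1.0], [0.0, 1.0]])
gram = matmul(q, transpose(q))  # close to the identity for a full-rank input
```

Each step pushes every singular value of X toward 1 while leaving the singular vectors untouched, so for a full-rank input the result approaches the orthogonal factor of the polar decomposition.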
Requirements
- Python 3.11+
- PyTorch 2.7.1+
- CUDA-capable NVIDIA GPU
- nvidia-cutlass-dsl[cu13] on non-macOS platforms
Some benchmarks and bonus paths compare against optional third-party providers:
liger-kernel, quack-kernels, flash-linear-attention, causal-conv1d,
gram-newton-schulz, and dion.
Installation
From PyPI:
pip install greyhound-kernels
From source:
git clone https://github.com/tyler-romero/greyhound.git
cd greyhound
pip install -e ".[dev]"
With the Quack bonus and benchmark provider:
pip install -e ".[dev,quack]"
With the full local benchmark provider set:
uv sync --extra dev --group thirdparty
With Modal benchmark tooling:
pip install -e ".[dev,modal]"
Quick Start
Functional Ops
import torch

from greyhound.nn.functional import (
    autograd_loss_and_logits_grad,
    causal_conv1d,
    cross_entropy,
    chunked_linear_cross_entropy,
    chunked_linear_loss,
    selective_log_softmax,
)

device = "cuda"

# Chunked final projection with fused cross-entropy.
hidden = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
lm_head = torch.randn(128256, 4096, device=device, dtype=torch.bfloat16)
targets = torch.randint(0, 128256, (4096,), device=device)
loss = chunked_linear_cross_entropy(hidden, lm_head, targets)

# Chunked final projection with a user-provided loss.
regression_target = torch.randn(4096, device=device, dtype=torch.float32)

def mse_chunk(logits, target):
    # Sum-reduced loss over one chunk of logits.
    prediction = logits.float().mean(dim=-1)
    return torch.nn.functional.mse_loss(
        prediction, target, reduction="sum"
    )

mse_loss_and_grad = autograd_loss_and_logits_grad(mse_chunk)
custom_loss = chunked_linear_loss(hidden, lm_head, mse_loss_and_grad, regression_target)

# Standalone cross-entropy on precomputed logits.
logits = torch.randn(16, 128256, device=device, dtype=torch.bfloat16)
labels = torch.randint(0, 128256, (16,), device=device)
ce_loss = cross_entropy(logits, labels)

# One log-probability per row.
index = torch.randint(0, 128256, (16,), device=device)
selected = selective_log_softmax(logits, index)

# Depthwise causal convolution with SiLU.
conv_x = torch.randn(4, 1024, 2048, device=device, dtype=torch.bfloat16)
conv_weight = torch.randn(1024, 4, device=device, dtype=torch.bfloat16)
conv_out = causal_conv1d(conv_x, conv_weight, activation="silu")
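To illustrate the idea behind chunking the final projection, the sketch below projects a few rows at a time and accumulates a sum-reduced loss, so only one chunk of logits exists at once. It is a plain-Python illustration under stated assumptions, not Greyhound's implementation: `chunked_linear_loss_ref` and its `chunk_size` parameter are hypothetical, and the gradient half of the real loss-and-logits-gradient callback is omitted.

```python
def chunked_linear_loss_ref(hidden, weight, loss_fn, target, chunk_size=2):
    """Sum loss_fn over row chunks of hidden @ weight.T.

    Only chunk_size rows of logits are held at any time.
    """
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        rows = hidden[start:start + chunk_size]
        # Project just this chunk: logits[i][v] = dot(rows[i], weight[v]).
        logits = [
            [sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
            for row in rows
        ]
        total += loss_fn(logits, target[start:start + chunk_size])
    return total


def mse_chunk(logits, target):
    """Sum-reduced squared error between each row's mean logit and its target."""
    return sum((sum(row) / len(row) - t) ** 2 for row, t in zip(logits, target))


hidden = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0], [0.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (vocab, hidden), like lm_head
target = [1.0, 0.0, 2.0, -1.0, 3.0]

chunked = chunked_linear_loss_ref(hidden, weight, mse_chunk, target, chunk_size=2)
full = chunked_linear_loss_ref(hidden, weight, mse_chunk, target, chunk_size=len(hidden))
```

A sum-reduced loss adds up across chunks, so `chunked` and `full` agree up to floating-point rounding; a mean-reduced loss would need rescaling by the total row count.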
Module Wrappers
import torch

from greyhound.nn import GreyhoundCausalConv1d

device = "cuda"

conv = GreyhoundCausalConv1d(
    channels=1024,
    kernel_size=4,
    activation="silu",
    device=device,
    dtype=torch.bfloat16,
)
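For readers unfamiliar with depthwise causal convolution, the following plain-Python reference shows the computation for a single sequence. It is a sketch under assumptions rather than Greyhound's kernel: the function name is hypothetical, and the tap order (the last weight multiplies the current timestep, with zero padding on the left) follows the usual convention but is not confirmed by this README.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def causal_conv1d_ref(x, weight, activation=None):
    """Depthwise causal convolution for one sequence.

    x is channels x seq_len; weight is channels x kernel_size.
    Each channel is filtered only by its own row of weights.
    """
    out = []
    for x_c, w_c in zip(x, weight):
        k = len(w_c)
        # Left zero padding means output[t] depends only on inputs at or before t.
        padded = [0.0] * (k - 1) + list(x_c)
        row = [
            sum(w * padded[t + j] for j, w in enumerate(w_c))
            for t in range(len(x_c))
        ]
        out.append([silu(v) for v in row] if activation == "silu" else row)
    return out


x = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 0.0]]
w = [[0.5, 0.5], [1.0, -1.0]]
y = causal_conv1d_ref(x, w)
y_silu = causal_conv1d_ref(x, w, activation="silu")
```

Because of the left padding, changing a later input never alters an earlier output, which is the property that makes the operation safe for autoregressive models.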
Benchmarks
Benchmark entry points live in src/benchmarks:
- causal_conv1d_bench.py
- cross_entropy_with_grad_bench.py
- chunked_linear_cross_entropy_bench.py
- logprobs_bench.py
- newton_schulz_bench.py
Run one locally:
uv run python src/benchmarks/chunked_linear_cross_entropy_bench.py --mode full
Run the default benchmark set on Modal and merge results back into the local CSV:
uv run --extra modal python scripts/run_modal_benchmarks.py --gpu H100
Regenerate plots from CSV data:
uv run python src/benchmarks/plot_from_csv.py
Development
Use uv for local development:
uv sync --extra dev
Common checks:
ruff format .
ruff check .
ty check .
pytest -v --color=yes --doctest-modules src/tests/ src/greyhound/
The project is intentionally typed as a Python package (py.typed). CuTe DSL
kernel files are excluded from ty because their DSL constructs do not map cleanly
to Python static typing.
Documentation
The docs site includes installation notes, API docs, per-kernel pages, and benchmark visualizations:
https://tyler-romero.github.io/greyhound/
Build docs locally:
make docs-build
License
Apache 2.0. See LICENSE for details.