High-performance, fused GPU kernels for PyTorch training workloads

Greyhound

High-performance fused GPU kernels for PyTorch training workloads. Greyhound implements standalone CuTe DSL kernels, wraps them as PyTorch custom ops, and exposes them through greyhound.nn.functional plus a small set of nn.Module wrappers.

Status: Alpha (0.1.x). APIs and performance tradeoffs may change between releases.

What Is Included

| Operation | Functional API | Module API | Notes |
| --- | --- | --- | --- |
| Cross-entropy | greyhound.nn.functional.cross_entropy | - | Standalone cross-entropy loss backed by a loss-and-logits-gradient kernel. |
| Chunked linear loss | greyhound.nn.functional.chunked_linear_loss | - | Chunks the final projection for user-provided losses; chunked_linear_cross_entropy adds a fused cross-entropy-and-logits-gradient kernel per logits tile. |
| Causal Conv1D | greyhound.nn.functional.causal_conv1d | GreyhoundCausalConv1d | Depthwise causal short convolution, optional SiLU. |
| Selective log-softmax | greyhound.nn.functional.selective_log_softmax | - | Gathers one log-probability per row without materializing full log-probs. |
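As a rough illustration of what the cross-entropy and selective log-softmax rows compute, here is a dependency-free reference sketch in plain Python. It is not Greyhound's kernel code. The function names are hypothetical, and the mean reduction is an assumption, since the default reduction is not stated here.

```python
import math


def log_sum_exp(row):
    """Numerically stable log(sum(exp(x))) for one row of logits."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def selective_log_softmax_ref(logits, index):
    """One log-probability per row: logits[i][index[i]] - logsumexp(logits[i])."""
    return [row[j] - log_sum_exp(row) for row, j in zip(logits, index)]


def cross_entropy_with_grad_ref(logits, targets):
    """Mean cross-entropy (assumed reduction) and its gradient w.r.t. the logits."""
    n = len(logits)
    loss = 0.0
    grad = []
    for row, t in zip(logits, targets):
        lse = log_sum_exp(row)
        loss += (lse - row[t]) / n
        # d(loss)/d(logits) = (softmax(row) - one_hot(t)) / n
        g = [math.exp(x - lse) / n for x in row]
        g[t] -= 1.0 / n
        grad.append(g)
    return loss, grad


logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss, grad = cross_entropy_with_grad_ref(logits, targets)
selected = selective_log_softmax_ref(logits, targets)
```

The two quantities are closely related: the mean cross-entropy equals the negated mean of the selected log-probabilities. Computing the loss and the logits gradient in one pass is what lets a fused kernel avoid a separate backward pass over the logits.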

Operations are built in three layers:

  1. greyhound.kernels: raw CuTe DSL kernels.
  2. greyhound.ops: torch.library.custom_op and autograd integration.
  3. greyhound.nn / greyhound.nn.functional: user-facing modules and functions.

Bonus Integrations

greyhound.bonus contains experimental utilities that compose kernels from other providers. These are useful for trying higher-level training building blocks without treating them as core Greyhound kernels.

| Utility | API | Provider dependency | Notes |
| --- | --- | --- | --- |
| Newton-Schulz orthogonalization | greyhound.bonus.newton_schultz.orthogonalize_via_newton_schulz | quack-kernels | Muon-style zeropower iteration using Quack symmetric GEMM when available. |
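For readers unfamiliar with the idea, the sketch below shows the classic cubic Newton-Schulz iteration in plain Python. It is a stand-in for illustration only: it does not use the tuned polynomial coefficients that Muon-style implementations typically use, it does not call Quack, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_cubic(a, steps=20):
    """Approximate the orthogonal polar factor of a full-rank matrix."""
    # Scale so every singular value is at most 1
    # (the Frobenius norm bounds the spectral norm).
    norm = math.sqrt(sum(v * v for row in a for v in row))
    x = [[v / norm for v in row] for row in a]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X pushes each singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
            for xr, yr in zip(x, xxt_x)
        ]
    return x


a = [[2.0, 1.0], [0.0, 1.0]]
q = newton_schulz_cubic(a)
gram = matmul(q, transpose(q))  # close to the identity once converged
```

Each step is made of matrix products, and the X X^T factor is symmetric, which is presumably why a symmetric GEMM is a natural fit for this utility.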

Requirements

  • Python 3.11+
  • PyTorch 2.7.1+
  • CUDA-capable NVIDIA GPU
  • nvidia-cutlass-dsl[cu13] on non-macOS platforms

Some benchmarks and bonus paths compare against optional third-party providers: liger-kernel, quack-kernels, flash-linear-attention, causal-conv1d, gram-newton-schulz, and dion.

Installation

From PyPI:

pip install greyhound-kernels

From source:

git clone https://github.com/tyler-romero/greyhound.git
cd greyhound
pip install -e ".[dev]"

With the Quack bonus and benchmark provider:

pip install -e ".[dev,quack]"

With the full local benchmark provider set:

uv sync --extra dev --group thirdparty

With Modal benchmark tooling:

pip install -e ".[dev,modal]"

Quick Start

Functional Ops

import torch
from greyhound.nn.functional import (
    autograd_loss_and_logits_grad,
    causal_conv1d,
    cross_entropy,
    chunked_linear_cross_entropy,
    chunked_linear_loss,
    selective_log_softmax,
)

device = "cuda"

# Fused final projection + cross-entropy.
# hidden: (tokens, hidden_dim), lm_head: (vocab, hidden_dim), targets: (tokens,)
hidden = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
lm_head = torch.randn(128256, 4096, device=device, dtype=torch.bfloat16)
targets = torch.randint(0, 128256, (4096,), device=device)
loss = chunked_linear_cross_entropy(hidden, lm_head, targets)

# Chunked projection with a user-provided loss, here a summed MSE per chunk.
regression_target = torch.randn(4096, device=device, dtype=torch.float32)

def mse_chunk(logits, target):
    prediction = logits.float().mean(dim=-1)
    return torch.nn.functional.mse_loss(
        prediction, target, reduction="sum"
    )

# Wrap the plain loss function for use with chunked_linear_loss.
mse_loss_and_grad = autograd_loss_and_logits_grad(mse_chunk)
custom_loss = chunked_linear_loss(hidden, lm_head, mse_loss_and_grad, regression_target)

# Standalone cross-entropy on logits that are already materialized.
logits = torch.randn(16, 128256, device=device, dtype=torch.bfloat16)
labels = torch.randint(0, 128256, (16,), device=device)
ce_loss = cross_entropy(logits, labels)

# Log-probability of one chosen index per row.
index = torch.randint(0, 128256, (16,), device=device)
selected = selective_log_softmax(logits, index)

# Depthwise causal convolution with SiLU; conv_weight holds one length-4 filter per channel.
conv_x = torch.randn(4, 1024, 2048, device=device, dtype=torch.bfloat16)
conv_weight = torch.randn(1024, 4, device=device, dtype=torch.bfloat16)
conv_out = causal_conv1d(conv_x, conv_weight, activation="silu")
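To make the chunking idea behind chunked_linear_loss concrete, here is a plain-Python sketch that runs without a GPU. It is a conceptual model, not the library's implementation. The callback signature (logits, targets) -> (loss, grad) and all names are assumptions made for illustration. It uses a summed loss, like the mse_chunk example above, so that per-chunk losses simply add up.

```python
import math


def ce_sum_and_grad(logits, targets):
    """Summed cross-entropy over a chunk and its gradient w.r.t. the chunk's logits."""
    loss, grad = 0.0, []
    for row, t in zip(logits, targets):
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        loss += lse - row[t]
        g = [math.exp(x - lse) for x in row]  # softmax(row)
        g[t] -= 1.0                           # minus one_hot(t)
        grad.append(g)
    return loss, grad


def chunked_linear_loss_ref(hidden, weight, targets, loss_and_grad, chunk_size):
    """Loss and gradients for logits = hidden @ weight^T, one row chunk at a time."""
    n, d, v = len(hidden), len(hidden[0]), len(weight)
    total = 0.0
    grad_hidden = []
    grad_weight = [[0.0] * d for _ in range(v)]
    for start in range(0, n, chunk_size):
        h = hidden[start:start + chunk_size]
        # Logits exist only for this chunk: shape [chunk, vocab].
        logits = [[sum(hr[k] * wr[k] for k in range(d)) for wr in weight] for hr in h]
        loss, g = loss_and_grad(logits, targets[start:start + chunk_size])
        total += loss
        # Fold the logits gradient into the input and weight gradients right away,
        # so the chunk's logits can be discarded before the next chunk.
        for hr, gr in zip(h, g):
            grad_hidden.append([sum(gr[j] * weight[j][k] for j in range(v)) for k in range(d)])
            for j in range(v):
                for k in range(d):
                    grad_weight[j][k] += gr[j] * hr[k]
    return total, grad_hidden, grad_weight


hidden = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0], [0.0, 1.0], [2.0, -0.5]]
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
targets = [0, 2, 1, 1, 0]

chunked = chunked_linear_loss_ref(hidden, weight, targets, ce_sum_and_grad, chunk_size=2)
full = chunked_linear_loss_ref(hidden, weight, targets, ce_sum_and_grad, chunk_size=len(hidden))
```

Both calls produce the same loss and gradients up to floating-point rounding. The chunked call never holds more than chunk_size rows of logits at once, which matters when the vocabulary dimension is as large as in the example above.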

Module Wrappers

import torch
from greyhound.nn import GreyhoundCausalConv1d

device = "cuda"

conv = GreyhoundCausalConv1d(
    channels=1024,
    kernel_size=4,
    activation="silu",
    device=device,
    dtype=torch.bfloat16,
)
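For reference, a depthwise causal convolution with optional SiLU can be written in a few lines of plain Python. This sketch only illustrates the operation's semantics for a single sequence with no batch dimension. The tap ordering (the last filter tap aligned with the current time step) is an assumption, and causal_conv1d_ref is a hypothetical name, not part of Greyhound.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def causal_conv1d_ref(x, weight, activation=None):
    """x: [channels][seq_len], weight: [channels][kernel_size]; one filter per channel."""
    out = []
    for xc, wc in zip(x, weight):
        k = len(wc)
        row = []
        for t in range(len(xc)):
            acc = 0.0
            for j in range(k):
                src = t - (k - 1) + j  # last tap lines up with the current step
                if src >= 0:           # positions before the start count as zero
                    acc += wc[j] * xc[src]
            row.append(silu(acc) if activation == "silu" else acc)
        out.append(row)
    return out


x = [[1.0, 2.0, 3.0, 4.0]]
w = [[0.5, 1.0]]
out = causal_conv1d_ref(x, w)                          # [[1.0, 2.5, 4.0, 5.5]]
out_silu = causal_conv1d_ref(x, w, activation="silu")  # same sums passed through SiLU
```

Depthwise means each channel is filtered independently by its own short filter. Causal means the output at step t depends only on inputs at steps up to and including t.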

Benchmarks

Benchmark entry points live in src/benchmarks:

  • causal_conv1d_bench.py
  • cross_entropy_with_grad_bench.py
  • chunked_linear_cross_entropy_bench.py
  • logprobs_bench.py
  • newton_schulz_bench.py

Run one locally:

uv run python src/benchmarks/chunked_linear_cross_entropy_bench.py --mode full

Run the default benchmark set on Modal and merge results back into the local CSV:

uv run --extra modal python scripts/run_modal_benchmarks.py --gpu H100

Regenerate plots from CSV data:

uv run python src/benchmarks/plot_from_csv.py

Development

Use uv for local development:

uv sync --extra dev

Common checks:

ruff format .
ruff check .
ty check .
pytest -v --color=yes --doctest-modules src/tests/ src/greyhound/

The package ships a py.typed marker, so its type annotations are visible to downstream type checkers. CuTe DSL kernel files are excluded from ty checks because their DSL constructs do not map cleanly onto Python static typing.

Documentation

The docs site includes installation notes, API docs, per-kernel pages, and benchmark visualizations:

https://tyler-romero.github.io/greyhound/

Build docs locally:

make docs-build

License

Apache 2.0. See LICENSE for details.
