GPU-accelerated neural network operations using Vulkan compute shaders
Grilly
Deep learning, well done.
GPU-accelerated neural network framework built on Vulkan compute shaders. Runs on any GPU — AMD, NVIDIA, Intel — no CUDA required. Provides a PyTorch-like nn.Module API backed by 161 SPIR-V shaders and a native C++ dispatch layer.
Alpha software. APIs may change between minor versions. We welcome early adopters and feedback.
Howto Guides: howtos/ (self-contained HTML tutorials)
Quick Start
import numpy as np
from grilly import nn
# Define a model — same patterns as PyTorch
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
# Forward pass
x = np.random.randn(32, 784).astype(np.float32)
logits = model(x)
print(logits.shape) # (32, 10)
# Loss + backward + optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.optim.AdamW(model.parameters(), lr=1e-3)
targets = np.random.randint(0, 10, (32,))
loss = loss_fn(logits, targets)
grad = loss_fn.backward(np.ones_like(loss), logits, targets)
model.zero_grad()
model.backward(grad)
optimizer.step()
Autograd
from grilly import nn
x = nn.Variable(nn.randn(32, 128), requires_grad=True)
layer = nn.Linear(128, 10)
logits = x @ nn.Variable(layer.weight.T) + nn.Variable(layer.bias)
loss = logits.sum()
loss.backward()
print(x.grad.shape) # (32, 128)
Installation
From PyPI
pip install grilly
From Source (with C++ backend)
The C++ backend (grilly_core) is required — it provides the native Vulkan dispatch layer for all GPU operations.
git clone https://github.com/grillcheese-ai/grilly.git
cd grilly
pip install -e ".[dev]"
# Build the C++ backend
cmake -B build -DPYBIND11_FINDPYTHON=ON
cmake --build build --config Release
cp build/Release/grilly_core.*.pyd . # Windows
# cp build/grilly_core.*.so . # Linux
Verify:
python -c "import grilly_core; print('C++ backend OK')"
python -c "import grilly; b = grilly.Compute(); print('GPU:', b.device_name)"
See INSTALL.md for full setup (Vulkan SDK, Ubuntu, CI environments, troubleshooting).
Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.12+ | 3.12 |
| GPU VRAM | 8 GB | 12 GB+ |
| System RAM | 32 GB | 64 GB |
| Vulkan | 1.2+ drivers | Latest drivers |
Supported GPUs: AMD (RX 5000+), NVIDIA (GTX 1060+), Intel (Arc A-series).
Features
PyTorch-like nn.Module API
Standard layers with GPU-accelerated forward and backward passes:
| Category | Modules |
|---|---|
| Linear | Linear, Embedding, Dropout |
| Convolution | Conv1d, Conv2d |
| Recurrent | LSTM, LSTMCell, GRU, GRUCell |
| Pooling | MaxPool2d, AvgPool2d, AdaptiveMaxPool2d, AdaptiveAvgPool2d |
| Normalization | LayerNorm, RMSNorm, BatchNorm1d, BatchNorm2d |
| Activations | ReLU, GELU, SiLU, SwiGLU, GCU, RoSwish, Softmax, Softplus |
| Attention | MultiheadAttention, FlashAttention2, RoPE |
| Loss | MSELoss, CrossEntropyLoss, BCELoss |
| Containers | Sequential, Residual |
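To make the normalization row concrete, here is a NumPy sketch of what an RMSNorm layer computes; the function name and epsilon default are illustrative, not Grilly's API:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root mean square over the last axis, then scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.randn(4, 128).astype(np.float32)
w = np.ones(128, dtype=np.float32)
y = rms_norm(x, w)
print(y.shape)  # (4, 128)
```

After normalization each row has unit root mean square (up to eps), which is what distinguishes RMSNorm from LayerNorm: no mean subtraction, no bias.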
Spiking Neural Networks
Full SNN framework with surrogate gradient training:
- Neuron models: IFNode, LIFNode, ParametricLIFNode
- Surrogate gradients: ATan, Sigmoid, FastSigmoid
- Temporal containers: SeqToANNContainer, MultiStepContainer
- Normalization: BatchNormThroughTime, TemporalEffectiveBatchNorm, NeuNorm
- Synapses: STPSynapse, DualTimescaleSynapse, SynapseFilter
- Attention: SpikingSelfAttention, TemporalWiseAttention, QKAttention
- ANN-to-SNN conversion: Converter, VoltageScaler
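The dynamics these modules are named after can be sketched in plain NumPy: a leaky integrate-and-fire (LIF) update with a hard threshold on the forward pass, and an ATan-style surrogate derivative for the backward pass. Constants and function names here are illustrative; consult the grilly.nn module for the actual API:

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_th=1.0, v_reset=0.0):
    # Leaky integration toward the input, then hard threshold and reset.
    v = v + (x - v) / tau
    spike = (v >= v_th).astype(np.float32)
    v = np.where(spike > 0, v_reset, v)
    return spike, v

def atan_surrogate(v, v_th=1.0, alpha=2.0):
    # Smooth stand-in for d(spike)/dv used during surrogate-gradient training.
    u = np.pi * alpha * (v - v_th) / 2.0
    return alpha / (2.0 * (1.0 + u * u))

v = np.zeros(8, dtype=np.float32)
total = 0.0
for t in range(5):
    spike, v = lif_step(v, np.full(8, 1.5, dtype=np.float32))
    total += spike.sum()
print(total)  # with constant input 1.5, each neuron fires on steps 1 and 3 -> 16.0
```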
Multimodal Fusion
- PerceiverIO — Modality-agnostic input compression
- PerceiverResampler — Flamingo-style visual token resampling
- FlamingoFusion — Cross-attention VLM fusion
- CrossModalAttentionFusion — Bidirectional cross-modal attention
- ImageBindFusion — Joint embedding with contrastive loss
- BottleneckFusion — Multimodal Bottleneck Transformer
- VisionLanguageModel — Complete VLM with visual conditioning
Transformer Components
- Flash Attention 2 (tiled, O(seq) memory)
- Rotary Position Embeddings (RoPE)
- LoRA fine-tuning (LoRALinear, LoRAAttention, LoRAModel)
- Transformer encoder/decoder layers
- Fused operations: SwiGLU FFN, RMSNorm+Linear, QKV projection
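The rotary-embedding idea behind RoPE can be sketched in NumPy. This sketch uses the half-split pairing convention; Grilly's shader may pair dimensions differently, so treat it as math, not API:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq, dim), dim even. Rotate pairs (x1[i], x2[i]) by a
    # position-dependent angle so dot products encode relative position.
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(16, 64).astype(np.float32)
y = rope(x)
# A rotation preserves the norm of every position vector.
print(np.allclose(np.linalg.norm(y, axis=-1), np.linalg.norm(x, axis=-1), atol=1e-4))
```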
Inference Optimizations
- Fused RMSNorm shader (Llama, Gemma)
- Grouped Query Attention (GQA) decode against KV-cache
- INT8 GEMM (weight-only, FP32 accumulation)
- 4-bit block quantization (per-block scale + zero-point)
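The per-block scale + zero-point scheme mentioned above can be sketched in NumPy; the block size, asymmetric range mapping, and function names here are illustrative assumptions, not Grilly's exact kernel:

```python
import numpy as np

def quantize_4bit(w, block=32):
    # Per-block asymmetric quantization to 4-bit codes in [0, 15].
    w = w.reshape(-1, block)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard constant blocks
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, zero):
    return q.astype(np.float32) * scale + zero

w = np.random.randn(4, 64).astype(np.float32)
q, s, z = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, z).reshape(w.shape)
# Round-to-nearest bounds the per-element error by half a scale step.
print(np.max(np.abs(w - w_hat)) <= s.max() / 2 + 1e-6)
```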
Optimizers
AdamW, Adam, SGD, NLMS, NaturalGradient, AutoHypergradientAdamW (OSGM-style auto LR tuning), plus LR schedulers (StepLR, CosineAnnealingLR, ReduceLROnPlateau).
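For reference, a single AdamW update (decoupled weight decay, bias-corrected moments) fits in a few lines of NumPy; hyperparameter defaults here are the common ones and not necessarily Grilly's:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # Decoupled weight decay: shrink the parameter directly, not the gradient.
    p = p * (1 - lr * wd)
    m = b1 * m + (1 - b1) * g            # first moment
    v = b2 * v + (1 - b2) * g * g        # second moment
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

p = np.ones(4, dtype=np.float32)
g = np.full(4, 0.5, dtype=np.float32)
m = np.zeros_like(p)
v = np.zeros_like(p)
p, m, v = adamw_step(p, g, m, v, t=1)
print(p)
```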
Functional API
Stateless functions mirroring torch.nn.functional:
import grilly.functional as F
F.linear(x, weight, bias)
F.relu(x)
F.softmax(x, dim=-1)
F.cross_entropy(logits, targets)
F.flash_attention2(q, k, v)
Autograd
Full computation graph with automatic differentiation:
from grilly.nn import Variable, no_grad, tensor
x = Variable(tensor([1.0, 2.0, 3.0]), requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad) # [2.0, 4.0, 6.0]
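A handy way to sanity-check analytic gradients like the one above is a central-difference numerical gradient; this check is pure NumPy and independent of Grilly's autograd:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-4):
    # Central differences: (f(x + eps) - f(x - eps)) / (2 * eps) per element.
    g = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy(); xp.flat[i] += eps
        xm = x.copy(); xm.flat[i] -= eps
        g.flat[i] = (f(xp) - f(xm)) / (2 * eps)
    return g

x = np.array([1.0, 2.0, 3.0])
g = numerical_grad(lambda z: (z * z).sum(), x)
# d/dx sum(x^2) = 2x, matching the autograd result [2.0, 4.0, 6.0]
print(np.allclose(g, 2 * x, atol=1e-3))  # True
```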
C++ Backend (grilly_core)
The native C++ extension (grilly_core) wraps all Vulkan compute dispatch via pybind11. It provides 16 operation modules:
| Op | Description |
|---|---|
| linear | Dense matrix multiply (GEMM) |
| conv | 2D convolution (im2col + GEMM) |
| activations | ReLU, GELU, SiLU, Tanh |
| layernorm | Layer normalization |
| rmsnorm | Root mean square normalization |
| batchnorm | Batch normalization (2D) |
| attention | Flash Attention 2 |
| attention_ops | RoPE, KV-cache ops |
| embedding | Token + position embeddings |
| pooling | MaxPool2d, AvgPool2d |
| loss | Cross-entropy, MSE, BCE |
| snn | LIF/IF neuron step kernels |
| optimizer | Adam, AdamW, SGD step kernels |
| learning | STDP, Hebbian, EWC |
| kv_cache | Paged KV-cache management |
| swizzle | Memory layout transforms |
Build instructions: see INSTALL.md.
Ecosystem
| Package | Description |
|---|---|
| optimum-grilly | HuggingFace Optimum backend — from_pretrained → Vulkan inference (Llama, Mistral, BERT, GPT-2) |
pip install grilly optimum-grilly
Examples
See examples/ for runnable scripts:
- hello_grilly.py — Autograd forward + backward
- train_mlp.py — Full training loop with AdamW and cross-entropy
- benchmark_gemm.py — GPU vs CPU GEMM throughput
- classifier.py — Simple classifier example
- 13 experimental examples (VSA, MoE, capsules, cognitive control, and more)
Architecture
grilly/
├── backend/ # Vulkan GPU dispatch (core.py, compute.py, pipelines.py, autograd_core.py)
├── cpp/ # C++ pybind11 extension (grilly_core) — 16 native ops
├── nn/ # PyTorch-like nn.Module layers, SNN framework, multimodal fusion
├── functional/ # Stateless F.* API (mirrors torch.nn.functional)
├── optim/ # Optimizers (AdamW, Adam, SGD, NLMS, NaturalGradient, Hypergradient)
├── utils/ # DataLoader, Dataset, HuggingFaceBridge, VulkanTensor, checkpointing
├── shaders/ # 161 GLSL compute shaders
│ └── spv/ # Compiled SPIR-V bytecode
├── experimental/ # Unstable: VSA, MoE routing, temporal reasoning, cognitive controller
├── howtos/ # 8 self-contained HTML tutorials
├── examples/ # Runnable example scripts
└── tests/ # Test suite (1000+ tests)
Design Principles
- Pure Vulkan — no CUDA, no vendor lock-in
- Hardware-agnostic — AMD, NVIDIA, Intel on the same codebase
- C++ dispatch layer — pybind11 extension for low-overhead GPU calls
- Zero-copy GPU memory — VulkanTensor keeps data GPU-resident between ops
- All data is np.float32 — numpy arrays in, numpy arrays out
Environment Variables
| Variable | Description | Default |
|---|---|---|
| VK_GPU_INDEX | Select GPU by index (multi-GPU systems) | 0 |
| GRILLY_DEBUG | Enable debug logging (1 = on) | off |
| ALLOW_CPU_VULKAN | Allow Mesa llvmpipe software Vulkan (CI) | off |
Testing
# All tests (requires Vulkan)
uv run pytest tests/ -v
# CPU-only (no GPU required)
uv run pytest tests/ -m "not gpu" -v
# With coverage
uv run pytest tests/ --cov=. --cov-report=term
# Single test
pytest tests/test_snn.py -k "test_lif"
CI/CD
- CI (on push/PR): Lint (ruff, black), test (CPU-only on Mesa llvmpipe), build
- CD (on GitHub Release): Build and publish to PyPI via Trusted Publishing (OIDC, no API tokens)
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new features
- Run ruff check . and pytest tests/ -v
- Submit a pull request
License
MIT License — see LICENSE for details.
Download files
File details
Details for the file grilly-0.4.6.tar.gz.
File metadata
- Download URL: grilly-0.4.6.tar.gz
- Size: 7.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a5cb69945b2627c3ab145ecb8f01ad57827eb21451704cad57676b6e6644a735 |
| MD5 | 4cee9582dd0403cba92b86617e17f66b |
| BLAKE2b-256 | 23b394eb6de6abc18df4ad728256cacbc501eed13735cd2ffa3a4d0c3873f120 |
Provenance
The following attestation bundles were made for grilly-0.4.6.tar.gz:
Publisher: publish.yml on Grillcheese-AI/grilly
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: grilly-0.4.6.tar.gz
- Subject digest: a5cb69945b2627c3ab145ecb8f01ad57827eb21451704cad57676b6e6644a735
- Sigstore transparency entry: 1108312237
- Permalink: Grillcheese-AI/grilly@c6b96631b4d8f1e544c2fe04249d7d8303364859
- Branch / Tag: refs/tags/0.4.6
- Owner: https://github.com/Grillcheese-AI
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c6b96631b4d8f1e544c2fe04249d7d8303364859
- Trigger Event: release
File details
Details for the file grilly-0.4.6-py3-none-any.whl.
File metadata
- Download URL: grilly-0.4.6-py3-none-any.whl
- Size: 1.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63163ce3b78e5a3fa2275af20bee128ad2ec43d21e716a33322b935d66eeef8f |
| MD5 | 9b91b852cacf7f31828f3e3cefe6fb63 |
| BLAKE2b-256 | 150f0daa48805de03cb9790bdd55404b883b49ae9d9849e1e4a528b01a4b423d |
|
Provenance
The following attestation bundles were made for grilly-0.4.6-py3-none-any.whl:
Publisher: publish.yml on Grillcheese-AI/grilly
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: grilly-0.4.6-py3-none-any.whl
- Subject digest: 63163ce3b78e5a3fa2275af20bee128ad2ec43d21e716a33322b935d66eeef8f
- Sigstore transparency entry: 1108312251
- Permalink: Grillcheese-AI/grilly@c6b96631b4d8f1e544c2fe04249d7d8303364859
- Branch / Tag: refs/tags/0.4.6
- Owner: https://github.com/Grillcheese-AI
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c6b96631b4d8f1e544c2fe04249d7d8303364859
- Trigger Event: release