Implementation of various Transformer Attention mechanisms proposed by frontier LLM labs.
attnhut
A collection of Transformer Attention mechanisms in PyTorch, all in one place.
To hut something is to house it or give it shelter, and attnhut houses attention mechanism implementations. Frontier labs and conference papers keep shipping new ways to do attention (sparse, latent, compressed, corrected), but the code usually lives buried inside a giant training repo or never gets released at all. attnhut collects clean, readable implementations of these so a research lab can import one and try it the same afternoon. The goal is to make it easy to study what the frontier is doing and to build on it together in the open.
Everything is a plain nn.Module. You build it with ints and call it on a
(batch, seq, dim) tensor. No config objects, no framework, no ceremony.
import torch
from attnhut import GroupedQueryAttention
x = torch.randn(2, 128, 512)  # (batch, seq, dim)
attn = GroupedQueryAttention(dim=512, num_heads=8, num_kv_heads=2)
y = attn(x)  # y has the same shape as x: (batch, seq, dim)
These are reference implementations. They are written to be read and to be correct, not to win a kernel benchmark (see Notes). Each mechanism is one short file you can read in a sitting, so you can see what DeepSeek or MiniMax actually do and then take it from there.
Contents
- Install
- Mechanisms
  - MiniMax Sparse Attention
  - Heavily Compressed and Compressed Sparse Attention
  - DeepSeek Sparse Attention
  - Delta Attention
  - Differential Attention
  - Multi-head Latent Attention
  - GQA
  - BigBird
  - Slot Attention
  - MQA
  - Standard
- Notes
- Contributing
- Tests
Install
pip install attnhut
Or with uv.
uv add attnhut
To work on attnhut itself, clone the repo and sync the dev environment.
git clone https://github.com/egmaminta/attnhut.git
cd attnhut
uv sync
Mechanisms
| Module | Idea | Reference |
|---|---|---|
| MiniMaxSparseAttention | top k block selection on GQA | MiniMax, 2026 |
| CompressedSparseAttention | light compression plus selection | DeepSeek-AI, 2026 |
| HeavilyCompressedAttention | heavy KV compression | DeepSeek-AI, 2026 |
| DeepSeekSparseAttention | lightning indexer top k tokens | DeepSeek-AI, 2025 |
| DeltaAttention | correction for sliding window | Willette et al., 2025 |
| DifferentialAttention | two softmax maps subtracted | Ye et al., 2025 |
| MultiHeadLatentAttention | low rank latent KV cache | DeepSeek-AI, 2024 |
| GroupedQueryAttention | key value heads shared in groups | Ainslie et al., 2023 |
| BigBirdAttention | global plus window plus random blocks | Zaheer et al., 2020 |
| SlotAttention | iterative slots that compete | Locatello et al., 2020 |
| MultiQueryAttention | one shared key value head | Shazeer, 2019 |
| StandardAttention | full multi head attention | Vaswani et al., 2017 |
MiniMax Sparse Attention
A cheap index branch scores key blocks with a single shared index key and one index query per GQA group, max pools the token scores into block scores, and keeps the top k blocks per group. The main attention then runs over the selected blocks. Selection is block level.
from attnhut import MiniMaxSparseAttention, msa_index_aux_loss
attn = MiniMaxSparseAttention(dim, num_heads, num_kv_groups, block_size=64, top_k=8)
out, aux = attn(x, return_aux=True)
Hard top k is not differentiable, so the index projections get no gradient from
the forward pass. Add msa_index_aux_loss(aux["block_scores"], aux["attn_weights"], block_size, group_size) to the loss to train the selector.
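The block selection step can be sketched in plain PyTorch. This is a minimal illustration, not the code in attnhut: `select_top_k_blocks` is a hypothetical helper, the index scores are assumed to be already computed with shape (batch, groups, q_len, k_len), and causal masking is left out to keep it short.

```python
import torch
import torch.nn.functional as F


def select_top_k_blocks(token_scores: torch.Tensor, block_size: int, top_k: int) -> torch.Tensor:
    """Hypothetical helper: turn token-level index scores into a block-level keep mask.

    token_scores: (batch, groups, q_len, k_len)
    returns: bool mask of the same shape, True where the key's block was selected.
    """
    b, g, q_len, k_len = token_scores.shape
    num_blocks = (k_len + block_size - 1) // block_size
    pad = num_blocks * block_size - k_len
    # Pad the key axis with -inf so a partial last block still pools correctly.
    padded = F.pad(token_scores, (0, pad), value=float("-inf"))
    # Max pool the token scores into one score per key block.
    block_scores = padded.view(b, g, q_len, num_blocks, block_size).amax(dim=-1)
    # Keep the top k blocks for every query row in every group.
    k = min(top_k, num_blocks)
    top_idx = block_scores.topk(k, dim=-1).indices
    block_mask = torch.zeros_like(block_scores, dtype=torch.bool)
    block_mask.scatter_(-1, top_idx, True)
    # Expand the block mask back to token resolution and drop the padding.
    token_mask = block_mask.unsqueeze(-1).expand(b, g, q_len, num_blocks, block_size)
    return token_mask.reshape(b, g, q_len, num_blocks * block_size)[..., :k_len]
```

The returned mask would then be applied to the main attention scores with `masked_fill` before the softmax, which is the masked dense style described in Notes.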
Heavily Compressed and Compressed Sparse Attention
The DeepSeek V4 hybrid pair. Both pool every few tokens into one KV entry with learned, position-biased softmax weights, then run shared key value MQA over the compressed entries plus a short uncompressed sliding window, with a learnable attention sink. HCA compresses hard and keeps every compressed entry. CSA compresses lightly with overlap and then selects the top k entries with a lightning indexer.
HeavilyCompressedAttention(dim, num_heads, compression_rate=16, window=64)
CompressedSparseAttention(dim, num_heads, compression_rate=4, top_k=16, window=64)
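The pooling step shared by both modules can be sketched as follows. This is an illustrative guess at the shape of the idea, not the attnhut implementation: `SoftmaxPoolCompressor` is a hypothetical name, chunks do not overlap, and the incomplete tail chunk is simply dropped on the assumption that the sliding window covers recent tokens.

```python
import torch
import torch.nn as nn


class SoftmaxPoolCompressor(nn.Module):
    """Hypothetical sketch: pool every `compression_rate` tokens into one entry."""

    def __init__(self, dim: int, compression_rate: int):
        super().__init__()
        self.rate = compression_rate
        # One scalar logit per token, plus a learned bias per position in the chunk.
        self.score = nn.Linear(dim, 1, bias=False)
        self.pos_bias = nn.Parameter(torch.zeros(compression_rate))

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        b, n, d = kv.shape
        n_full = (n // self.rate) * self.rate  # assumption: drop the incomplete tail
        chunks = kv[:, :n_full].reshape(b, n_full // self.rate, self.rate, d)
        logits = self.score(chunks).squeeze(-1) + self.pos_bias
        weights = logits.softmax(dim=-1)  # softmax inside each chunk
        # Weighted sum of the tokens in each chunk gives one compressed entry.
        return (weights.unsqueeze(-1) * chunks).sum(dim=2)
```

With `compression_rate=16` a 70 token sequence becomes 4 compressed entries, and with all logits equal the pooling reduces to a plain mean over each chunk.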
DeepSeek Sparse Attention
A small lightning indexer scores every query key pair with gated ReLU instead of softmax, then each query keeps only its top k keys. Selection is token level, which is the difference from MiniMax block selection.
from attnhut import DeepSeekSparseAttention, dsa_index_aux_loss
attn = DeepSeekSparseAttention(dim, num_heads, top_k=64)
out, aux = attn(x, return_aux=True)
Train the indexer with dsa_index_aux_loss(aux["index_scores"], dense_attn_weights),
a KL warmup term that pulls the indexer scores toward the dense attention distribution.
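A rough sketch of the indexer and the token-level selection, written as standalone functions. The names and shapes here are assumptions for illustration (`lightning_index_scores`, `top_k_token_mask`, one shared index key per token, a per-head gate weight), and the code assumes self-attention with q_len equal to k_len.

```python
import torch


def lightning_index_scores(q_idx: torch.Tensor, k_idx: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of gated ReLU index scores.

    q_idx: (batch, q_len, index_heads, d_idx)
    k_idx: (batch, k_len, d_idx), one shared index key per token
    gate:  (batch, q_len, index_heads), per-head weights for each query
    returns: (batch, q_len, k_len)
    """
    dots = torch.einsum("bqhd,bkd->bqhk", q_idx, k_idx)
    # ReLU instead of softmax, then a gated sum over the index heads.
    return (gate.unsqueeze(-1) * torch.relu(dots)).sum(dim=2)


def top_k_token_mask(index_scores: torch.Tensor, top_k: int, causal: bool = True) -> torch.Tensor:
    """Each query keeps its top k keys; future keys are never kept when causal."""
    b, q_len, k_len = index_scores.shape
    scores = index_scores
    future = torch.ones(q_len, k_len, device=scores.device).triu(1).bool()
    if causal:
        scores = scores.masked_fill(future, float("-inf"))
    idx = scores.topk(min(top_k, k_len), dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, idx, True)
    if causal:
        # Early rows have fewer than top_k valid keys, so drop any future picks.
        mask &= ~future
    return mask
```

Compared with the block mask in MiniMax Sparse Attention, the mask here can keep or drop each key on its own.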
Delta Attention
Sliding window attention shifts the output away from full attention because each row renormalizes over a different key set. Delta Attention runs full attention on every stride-th query, takes the gap between dense and sparse at those anchors, and adds it back to every row in the block. It is training-free, so you can bolt it onto a model you already have.
DeltaAttention(dim, num_heads, window=256, sink=4, stride=16)
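The correction can be sketched for a single head. This is a simplified reading of the description above, not the attnhut code: `delta_corrected` is a hypothetical function, and it assumes the anchor is the first row of each block of `stride` queries.

```python
import math
import torch


def masked_attention(q, k, v, mask):
    """Softmax attention with a boolean mask (True means the key is visible)."""
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    return scores.masked_fill(~mask, float("-inf")).softmax(dim=-1) @ v


def delta_corrected(q, k, v, window: int, sink: int, stride: int):
    """Hypothetical sketch. q, k, v: (batch, seq, head_dim)."""
    n = q.shape[1]
    pos = torch.arange(n, device=q.device)
    causal = pos[None, :] <= pos[:, None]
    local = (pos[:, None] - pos[None, :]) < window
    # Sliding window plus a few always-visible sink tokens at the start.
    sparse_mask = causal & (local | (pos[None, :] < sink))
    sparse_out = masked_attention(q, k, v, sparse_mask)
    # Dense causal attention only at the anchor rows (every stride-th query).
    anchors = pos[::stride]
    dense_out = masked_attention(q[:, anchors], k, v, causal[anchors])
    # Gap between dense and sparse at each anchor, shared with its whole block.
    delta = dense_out - sparse_out[:, anchors]
    return sparse_out + delta.repeat_interleave(stride, dim=1)[:, :n]
```

At the anchor rows the result equals dense attention exactly, and when the window already covers the whole sequence the correction is zero.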
Differential Attention
Each head computes two softmax attention maps and subtracts one from the other, scaled by a learned lambda. The common noise in the two maps cancels, so the head puts less weight on irrelevant context. A per head RMSNorm keeps the magnitude in line. lambda_init grows with depth, so pass the layer index.
DifferentialAttention(dim, num_heads, depth=0, causal=True)
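A small sketch of the two pieces mentioned above. The `lambda_init` schedule follows the formula given in the Differential Transformer paper with a zero-based layer index; whether attnhut uses exactly this schedule is an assumption. `differential_map` is a hypothetical helper that takes lambda as a plain number rather than learning it, and it omits the per head RMSNorm.

```python
import math
import torch


def lambda_init(depth: int) -> float:
    """Schedule from the paper, zero-based depth: 0.8 - 0.6 * exp(-0.3 * depth)."""
    return 0.8 - 0.6 * math.exp(-0.3 * depth)


def differential_map(q1, k1, q2, k2, lam: float) -> torch.Tensor:
    """Hypothetical sketch: subtract one softmax map from another.

    q1, k1, q2, k2: (batch, seq, head_dim). Returns (batch, seq, seq).
    """
    scale = 1.0 / math.sqrt(q1.shape[-1])
    a1 = (q1 @ k1.transpose(-2, -1) * scale).softmax(dim=-1)
    a2 = (q2 @ k2.transpose(-2, -1) * scale).softmax(dim=-1)
    # Weight that both maps put on the same keys cancels in the difference.
    return a1 - lam * a2
```

Each row of the result sums to `1 - lam` rather than 1, which is one reason a normalization step after the subtraction is useful.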
Multi-head Latent Attention
Keys and values are cached as one low rank latent per token instead of full per head K and V. Position rides on a small decoupled RoPE part kept out of the latent. This is the trick behind DeepSeek's tiny KV cache.
MultiHeadLatentAttention(dim, num_heads, kv_lora_rank=None, q_lora_rank=None)
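The caching idea can be sketched without the RoPE branch. `LatentKV` and its parameter names are hypothetical and only show the down projection that gets cached and the up projections that rebuild per head K and V. The decoupled RoPE part is left out.

```python
import torch
import torch.nn as nn


class LatentKV(nn.Module):
    """Hypothetical sketch of a low rank latent KV cache (no RoPE branch)."""

    def __init__(self, dim: int, num_heads: int, head_dim: int, kv_rank: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.down = nn.Linear(dim, kv_rank, bias=False)
        self.up_k = nn.Linear(kv_rank, num_heads * head_dim, bias=False)
        self.up_v = nn.Linear(kv_rank, num_heads * head_dim, bias=False)

    def compress(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, kv_rank): this latent is the only thing that is cached.
        return self.down(x)

    def expand(self, latent: torch.Tensor):
        # Rebuild full per head keys and values from the cached latent.
        b, n, _ = latent.shape
        k = self.up_k(latent).view(b, n, self.num_heads, self.head_dim)
        v = self.up_v(latent).view(b, n, self.num_heads, self.head_dim)
        return k, v
```

With 8 heads of size 64 and a rank 64 latent, the cache holds 64 numbers per token instead of 2 * 8 * 64 = 1024, a 16x reduction in this illustrative configuration.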
GQA
Query heads share key value heads in groups, sitting between full multi head attention and MQA. This is what most current open models use.
GroupedQueryAttention(dim, num_heads, num_kv_heads, causal=False)
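The sharing itself is only a repeat of the key value heads. A minimal sketch, assuming already projected tensors and no causal mask (`grouped_query_attention` is a hypothetical function, not the module above):

```python
import math
import torch


def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch.

    q: (batch, num_heads, seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim), num_heads divisible by num_kv_heads
    """
    num_heads, num_kv_heads = q.shape[1], k.shape[1]
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads
    # Query heads 0..group_size-1 read KV head 0, the next group reads KV head 1, etc.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return scores.softmax(dim=-1) @ v
```

Setting `num_kv_heads` equal to `num_heads` gives standard multi head attention, and setting it to 1 gives MQA.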
BigBird
A good first sparse attention to read. A query attends to a few global blocks, a local window of neighbor blocks, and a few random blocks. The union of those is a boolean mask and the rest is ordinary masked softmax, so the pattern is the only new idea.
BigBirdAttention(dim, num_heads, block_size=64, num_window_blocks=3,
num_random_blocks=3, num_global_blocks=2, causal=False)
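The mask construction can be sketched at block level and then expanded to tokens. This is a non-causal illustration with assumptions of its own: `bigbird_block_mask` and `expand_to_tokens` are hypothetical helpers, the global blocks are taken to be the first blocks of the sequence, and the random blocks are drawn uniformly, so they may overlap the window.

```python
import torch


def bigbird_block_mask(num_blocks: int, num_window_blocks: int = 3,
                       num_random_blocks: int = 3, num_global_blocks: int = 2,
                       seed: int = 0) -> torch.Tensor:
    """Hypothetical sketch: (num_blocks, num_blocks) bool mask, True means attend."""
    idx = torch.arange(num_blocks)
    # Local window: the query block and its neighbors on both sides.
    window = (idx[:, None] - idx[None, :]).abs() <= num_window_blocks // 2
    # Global blocks see everything and are seen by everything.
    is_global = idx < num_global_blocks
    global_ = is_global[:, None] | is_global[None, :]
    # A few random key blocks per query block.
    gen = torch.Generator().manual_seed(seed)
    random = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
    for i in range(num_blocks):
        picks = torch.randperm(num_blocks, generator=gen)[:num_random_blocks]
        random[i, picks] = True
    return window | global_ | random


def expand_to_tokens(block_mask: torch.Tensor, block_size: int) -> torch.Tensor:
    """Repeat every block entry block_size times along both axes."""
    nb = block_mask.shape[0]
    tiled = block_mask[:, None, :, None].expand(nb, block_size, nb, block_size)
    return tiled.reshape(nb * block_size, nb * block_size)
```

The token mask then goes into an ordinary masked softmax, for example `scores.masked_fill(~mask, float("-inf"))`.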
Slot Attention
Object centric attention. A small set of slots compete to explain the input features, with the softmax taken over the slots rather than the inputs, and the slots refined over a few iterations with a GRU. Unlike the others this maps a set of inputs to a set of slots, so the output is (batch, num_slots, dim).
SlotAttention(dim, num_slots, num_iters=3)
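A stripped-down sketch of the slot competition. `TinySlotAttention` is a hypothetical class that uses learned initial slots and omits the LayerNorm and residual MLP of the original paper, so treat it as a reading aid rather than a faithful implementation.

```python
import torch
import torch.nn as nn


class TinySlotAttention(nn.Module):
    """Hypothetical sketch: slots compete for inputs, then a GRU refines them."""

    def __init__(self, dim: int, num_slots: int, num_iters: int = 3, eps: float = 1e-8):
        super().__init__()
        self.num_slots, self.num_iters, self.eps = num_slots, num_iters, eps
        self.scale = dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_init.expand(b, -1, -1)
        for _ in range(self.num_iters):
            q = self.to_q(slots)
            logits = torch.einsum("bsd,bnd->bsn", q, k) * self.scale
            # Softmax over the slot axis: slots compete for each input.
            attn = logits.softmax(dim=1)
            # Normalize over inputs so each slot takes a weighted mean.
            weights = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)
            updates = torch.einsum("bsn,bnd->bsd", weights, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d))
            slots = slots.view(b, self.num_slots, d)
        return slots  # (batch, num_slots, dim)
```

The line that sets this apart from the other mechanisms is `softmax(dim=1)`, which normalizes across slots instead of across keys.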
MQA
All query heads read a single key value head, which shrinks the KV cache the most.
MultiQueryAttention(dim, num_heads, causal=False)
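The cache saving is simple arithmetic. The sketch below compares the three head layouts for one sequence. The model configuration (32 query heads, 32 layers, head size 128, 8192 tokens, 2 bytes per element) is a made-up example, not a claim about any particular model.

```python
def kv_cache_bytes(num_kv_heads: int, head_dim: int, seq_len: int,
                   num_layers: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for one tensor of keys and one of values.
    return 2 * num_kv_heads * head_dim * seq_len * num_layers * bytes_per_elem


# Hypothetical configuration, chosen only to make the ratios visible.
config = dict(head_dim=128, seq_len=8192, num_layers=32)
mha = kv_cache_bytes(num_kv_heads=32, **config)  # every query head has its own KV head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 4 query heads per KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size // 2**20} MiB")
```

In this example the cache shrinks from 4096 MiB with full multi head attention to 1024 MiB with GQA and 128 MiB with MQA.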
Standard
Plain multi head attention where every query head keeps its own key value head. The reference point the others compress or sparsify.
StandardAttention(dim, num_heads, causal=False, dropout=0.0)
Notes
The sparse modules are masked dense references. The selection logic is exact and easy to read, but the wall-clock speedups reported in the papers need fused gather kernels that are out of scope here. The same applies to the V4 compressed modules, where partial RoPE is left out for clarity. In other words, use these to understand a mechanism and to prototype, not as a drop-in fast kernel.
Contributing
Pull requests are welcome. If a lab or a paper has an attention variant worth
reading, this is a good home for a clean version of it. The bar is one file per
mechanism, a plain nn.Module with the same call shape as the rest, a short test
that checks shape and causality, and a pointer to the paper. Keep it readable
over clever.
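A shape and causality test could look roughly like the sketch below. The helper name and the stand-in `CausalMean` module are hypothetical and are not taken from the attnhut test suite. In a real test you would pass a causal attention module instead.

```python
import torch
import torch.nn as nn


def assert_shape_and_causality(attn: nn.Module, dim: int, seq: int = 16, split: int = 8) -> None:
    """Hypothetical check: output shape matches input, and the past ignores the future."""
    torch.manual_seed(0)
    attn.eval()
    x = torch.randn(2, seq, dim)
    # Change only the tokens from `split` onward.
    x_perturbed = x.clone()
    x_perturbed[:, split:] += 1.0
    with torch.no_grad():
        y, y_perturbed = attn(x), attn(x_perturbed)
    assert y.shape == x.shape
    # Outputs before `split` must not move if the module is causal.
    assert torch.allclose(y[:, :split], y_perturbed[:, :split], atol=1e-6)


class CausalMean(nn.Module):
    """Stand-in causal module: each position averages itself and earlier positions."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        counts = torch.arange(1, x.shape[1] + 1, dtype=x.dtype, device=x.device)
        return x.cumsum(dim=1) / counts.view(1, -1, 1)


assert_shape_and_causality(CausalMean(), dim=8)
```

The perturbation approach treats the module as a black box, so the same check works for any mechanism that keeps the (batch, seq, dim) call shape.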
Tests
uv run pytest
License
MIT.
Built by
Emmanuel G. Maminta (emmanuel.maminta@eee.upd.edu.ph, egmaminta@up.edu.ph), Ubiquitous Computing Laboratory, Artificial Intelligence Graduate Program, University of the Philippines, Diliman, Quezon City, Philippines.