Implementation of various Transformer Attention mechanisms proposed by frontier LLM labs.

attnhut

A collection of Transformer Attention mechanisms in PyTorch, all in one place.

To hut something is to house it or give it shelter, and attnhut houses attention mechanism implementations. Frontier labs and conference papers keep shipping new ways to do attention (sparse, latent, compressed, corrected), but the code usually lives buried inside a giant training repo or never gets released at all. attnhut collects clean, readable implementations of these so a research lab can import one and try it the same afternoon. The goal is to make it easy to study what the frontier is doing and to build on it together in the open.

Everything is a plain nn.Module. You build it with ints and call it on a (batch, seq, dim) tensor. No config objects, no framework, no ceremony.

import torch
from attnhut import GroupedQueryAttention

attn = GroupedQueryAttention(dim=512, num_heads=8, num_kv_heads=2)
x = torch.randn(2, 16, 512)   # (batch, seq, dim)
y = attn(x)                   # y has the same shape as x

These are reference implementations. They are written to be read and to be correct, not to win a kernel benchmark (see Notes). Each mechanism is one short file you can read in a sitting, so you can see what DeepSeek or MiniMax actually do and then take it from there.

Contents

Install
Mechanisms
  MiniMax Sparse Attention
  Heavily Compressed and Compressed Sparse Attention
  Causal-JEPA Attention
  DeepSeek Sparse Attention
  Delta Attention
  Differential Attention
  Multi-head Latent Attention
  GQA
  BigBird
  Slot Attention
  MQA
  Standard
Notes
Contributing
Tests

Install

pip install attnhut

Or with uv.

uv add attnhut

To work on attnhut itself, clone the repo and sync the dev environment.

git clone https://github.com/egmaminta/attnhut.git
cd attnhut
uv sync

Mechanisms

Module	Idea	Reference
`MiniMaxSparseAttention`	top k block selection on GQA	MiniMax, 2026
`CompressedSparseAttention`	light compression plus selection	DeepSeek-AI, 2026
`HeavilyCompressedAttention`	heavy KV compression	DeepSeek-AI, 2026
`CausalJEPAAttention`	object level masking with bidirectional attention	Nam et al., 2026
`DeepSeekSparseAttention`	lightning indexer top k tokens	DeepSeek-AI, 2025
`DeltaAttention`	correction for sliding window	Willette et al., 2025
`DifferentialAttention`	two softmax maps subtracted	Ye et al., 2025
`MultiHeadLatentAttention`	low rank latent KV cache	DeepSeek-AI, 2024
`GroupedQueryAttention`	key value heads shared in groups	Ainslie et al., 2023
`BigBirdAttention`	global plus window plus random blocks	Zaheer et al., 2020
`SlotAttention`	iterative slots that compete	Locatello et al., 2020
`MultiQueryAttention`	one shared key value head	Shazeer, 2019
`StandardAttention`	full multi head attention	Vaswani et al., 2017

MiniMax Sparse Attention

A cheap index branch scores key blocks with a single shared index key and one index query per GQA group, max pools the token scores into block scores, and keeps the top k blocks per group. The main attention then runs over the selected blocks. Selection is block level.

from attnhut import MiniMaxSparseAttention, msa_index_aux_loss

attn = MiniMaxSparseAttention(dim, num_heads, num_kv_groups, block_size=64, top_k=8)
out, aux = attn(x, return_aux=True)

Hard top k is not differentiable, so the index projections get no gradient from the main attention output. Add msa_index_aux_loss(aux["block_scores"], aux["attn_weights"], block_size, group_size) to the training loss to train the selector.
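The block selection step can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the attnhut implementation: select_blocks is a hypothetical helper, the index scores are taken as given, causality is ignored, and the key length is assumed to be a multiple of block_size.

```python
import torch

def select_blocks(token_scores: torch.Tensor, block_size: int, top_k: int) -> torch.Tensor:
    """token_scores: (batch, groups, q_len, k_len) index scores (assumed given).
    Returns a boolean mask of the same shape marking keys inside the kept blocks."""
    b, g, q, k = token_scores.shape
    assert k % block_size == 0, "sketch assumes k_len is a multiple of block_size"
    n_blocks = k // block_size
    # Max pool the token scores inside each key block into one score per block.
    block_scores = token_scores.reshape(b, g, q, n_blocks, block_size).amax(dim=-1)
    # Keep the top_k highest scoring blocks for every query position and group.
    top_idx = block_scores.topk(min(top_k, n_blocks), dim=-1).indices
    block_mask = torch.zeros_like(block_scores, dtype=torch.bool)
    block_mask.scatter_(-1, top_idx, True)
    # Expand the block level mask back to token resolution.
    return block_mask.repeat_interleave(block_size, dim=-1)

torch.manual_seed(0)
scores = torch.randn(1, 2, 4, 16)          # 16 keys = 4 blocks of 4
mask = select_blocks(scores, block_size=4, top_k=2)
```

Every block in mask is either fully kept or fully dropped, which is what block level selection means here.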

Heavily Compressed and Compressed Sparse Attention

The DeepSeek V4 hybrid pair. Both pool every few tokens into one KV entry with learned position biased softmax weights, then run shared key value MQA over the compressed entries plus a short uncompressed sliding window, with a learnable attention sink. HCA compresses hard and keeps everything. CSA compresses lightly with overlap and then selects the top k entries with a lightning indexer.

HeavilyCompressedAttention(dim, num_heads, compression_rate=16, window=64)
CompressedSparseAttention(dim, num_heads, compression_rate=4, top_k=16, window=64)
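The pooling step, where a run of tokens becomes one KV entry, might look roughly like the sketch below. SoftmaxPool is a hypothetical name, and the scoring layer and bias layout are assumptions made for illustration. Overlap, the sliding window, the attention sink, and the indexer are all left out.

```python
import torch
import torch.nn as nn

class SoftmaxPool(nn.Module):
    """Pools each run of `rate` tokens into one entry (hypothetical helper)."""

    def __init__(self, dim: int, rate: int):
        super().__init__()
        self.rate = rate
        self.score = nn.Linear(dim, 1)                    # content dependent logit per token
        self.pos_bias = nn.Parameter(torch.zeros(rate))   # learned bias per position in the run

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        assert n % self.rate == 0, "sketch assumes seq is a multiple of rate"
        chunks = x.reshape(b, n // self.rate, self.rate, d)
        logits = self.score(chunks).squeeze(-1) + self.pos_bias   # (b, n / rate, rate)
        weights = logits.softmax(dim=-1).unsqueeze(-1)            # sums to 1 inside each run
        return (weights * chunks).sum(dim=2)                      # (b, n / rate, d)

pool = SoftmaxPool(dim=32, rate=4)
x = torch.randn(2, 16, 32)
compressed = pool(x)    # 16 tokens become 4 compressed entries
```

Because the weights are a softmax, each compressed entry is a convex combination of the tokens it replaces.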

Causal-JEPA Attention

A world model predictor that learns how objects interact by hiding objects and asking attention to put them back. Object slots are laid out as a grid of time by objects and flattened into one sequence. A random set of objects is hidden across the history and the whole future is hidden, then a bidirectional transformer rebuilds every hidden slot from the visible ones. Each hidden slot starts as a learned mask token plus a temporal embedding plus a linear projection of that object at the first frame, which keeps its identity. Nothing encodes object order, so the predictor does not care how the slots are arranged.

Input is (batch, history_frames, num_slots, dim) and the forward pass returns the full (batch, total_frames, num_slots, dim) grid along with the hidden object indices. Call predict to roll out the future from a fully visible history.

from attnhut import CausalJEPAAttention, cjepa_masked_loss

attn = CausalJEPAAttention(dim, num_heads, num_slots, history_frames, pred_frames=1)
pred, masked = attn(slots)
loss = cjepa_masked_loss(pred, target, masked, history_frames)
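To make the masking concrete, here is a rough sketch of how the input sequence could be assembled before the transformer runs. All names and sizes are illustrative assumptions, the hidden objects are fixed rather than sampled, and the details may differ from what CausalJEPAAttention does internally.

```python
import torch
import torch.nn as nn

batch, history, pred, num_slots, dim = 2, 3, 1, 4, 16
total = history + pred

mask_token = nn.Parameter(torch.zeros(dim))   # learned mask token
time_embed = nn.Embedding(total, dim)         # temporal embedding, one per frame
identity_proj = nn.Linear(dim, dim)           # projects the object's first-frame slot

slots = torch.randn(batch, history, num_slots, dim)
hidden_objects = torch.tensor([1, 3])         # fixed here; sampled at random in training

# Hidden cells on the (time, object) grid: the chosen objects, plus every future frame.
hidden = torch.zeros(total, num_slots, dtype=torch.bool)
hidden[:, hidden_objects] = True
hidden[history:, :] = True

# Query for a hidden cell: mask token + temporal embedding + projected first-frame slot.
t = torch.arange(total)
query = (mask_token
         + time_embed(t)[None, :, None, :]
         + identity_proj(slots[:, 0])[:, None, :, :])    # (batch, total, num_slots, dim)

# Visible cells keep their real slot, hidden cells get the query.
grid = torch.zeros(batch, total, num_slots, dim)
grid[:, :history] = slots
tokens = torch.where(hidden[None, :, :, None], query, grid)
sequence = tokens.flatten(1, 2)                          # (batch, total * num_slots, dim)
```

Nothing in the query depends on the slot index itself, which matches the point that object order is not encoded.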

DeepSeek Sparse Attention

A small lightning indexer scores every query key pair with gated ReLU instead of softmax, then each query keeps only its top k keys. Selection is token level, which is the difference from MiniMax block selection.

from attnhut import DeepSeekSparseAttention, dsa_index_aux_loss

attn = DeepSeekSparseAttention(dim, num_heads, top_k=64)
out, aux = attn(x, return_aux=True)

Train the indexer with dsa_index_aux_loss(aux["index_scores"], dense_attn_weights), a KL term that pulls the indexer scores toward the dense attention distribution during warmup.
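A minimal sketch of the indexer scoring and token level selection follows. The function name, the tensor layout, and the single shared indexer key per token are assumptions for illustration, not the attnhut API.

```python
import torch
import torch.nn.functional as F

def lightning_index_mask(q_idx, k_idx, gate, top_k):
    """q_idx: (batch, seq, heads, d) indexer queries
    k_idx: (batch, seq, d) one indexer key per token (assumed shared across heads)
    gate:  (batch, seq, heads) per query head weights
    Returns index scores (batch, seq, seq) and a boolean keep mask of the same shape."""
    # ReLU of each head's query key dot product, then a gated sum over heads.
    dots = torch.einsum("bqhd,bkd->bqhk", q_idx, k_idx)
    scores = (gate.unsqueeze(-1) * F.relu(dots)).sum(dim=2)
    # Causal constraint: a query may only select keys at or before its own position.
    seq = scores.shape[-1]
    causal = torch.ones(seq, seq, dtype=torch.bool).tril()
    masked = scores.masked_fill(~causal, float("-inf"))
    top_idx = masked.topk(min(top_k, seq), dim=-1).indices
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep.scatter_(-1, top_idx, True)
    # Early queries have fewer than top_k valid keys, so drop any future picks.
    return scores, keep & causal

torch.manual_seed(0)
b, n, h, d = 2, 8, 4, 16
scores, keep = lightning_index_mask(
    torch.randn(b, n, h, d), torch.randn(b, n, d), torch.randn(b, n, h), top_k=3
)
```

The keep mask would then be applied to the main attention scores before the softmax.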

Delta Attention

Sliding window attention shifts the output away from full attention because each row renormalizes over a different key set. Delta Attention runs full attention on every stride-th query, takes the gap between the dense and sparse outputs at those anchors, and adds it back to every row in the anchor's block. It is training free, so you can bolt it onto a model you already have.

DeltaAttention(dim, num_heads, window=256, sink=4, stride=16)
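The correction can be sketched for a single head as below. This is an illustration under assumptions: attend and delta_corrected are hypothetical helpers, the sink tokens are omitted, and each anchor's delta is applied to the stride rows that start at that anchor.

```python
import torch

def attend(q, k, v, mask):
    """Masked softmax attention; mask is True where a query may see a key."""
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return scores.masked_fill(~mask, float("-inf")).softmax(dim=-1) @ v

def delta_corrected(q, k, v, window, stride):
    n = q.shape[-2]
    pos = torch.arange(n)
    causal = pos[:, None] >= pos[None, :]
    local = causal & (pos[:, None] - pos[None, :] < window)
    sparse = attend(q, k, v, local)                  # sliding window output for every row
    anchors = pos[::stride]                          # every stride-th query
    dense = attend(q[..., anchors, :], k, v, causal[anchors])
    delta = dense - sparse[..., anchors, :]          # dense minus sparse at the anchors
    # Spread each anchor's delta over the rows of its block.
    return sparse + delta.repeat_interleave(stride, dim=-2)[..., :n, :]

torch.manual_seed(0)
q, k, v = (torch.randn(1, 16, 8) for _ in range(3))
out = delta_corrected(q, k, v, window=4, stride=4)
```

At the anchor rows the corrected output equals full attention exactly. The rows in between get an approximation, at the cost of one dense row per stride.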

Differential Attention

Each head computes two softmax attention maps and subtracts one from the other, scaled by a learned lambda. The common noise in the two maps cancels, so the head puts less weight on irrelevant context. A per head RMSNorm keeps the magnitude in line. lambda_init grows with depth, so pass the layer index.

DifferentialAttention(dim, num_heads, depth=0, causal=True)
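A small sketch of the subtraction and the depth schedule. The lambda_init formula is the one given in the Differential Transformer paper; whether attnhut uses exactly this schedule is an assumption. Here lam is a fixed scalar, while in the paper it is built from learned vectors plus lambda_init.

```python
import math
import torch

def lambda_init_fn(depth: int) -> float:
    # Schedule from the Differential Transformer paper, depth is a 0-based layer index.
    return 0.8 - 0.6 * math.exp(-0.3 * depth)

def differential_map(q1, k1, q2, k2, lam):
    """Two softmax attention maps, the second subtracted from the first scaled by lam."""
    scale = q1.shape[-1] ** -0.5
    a1 = (q1 @ k1.transpose(-1, -2) * scale).softmax(dim=-1)
    a2 = (q2 @ k2.transpose(-1, -2) * scale).softmax(dim=-1)
    return a1 - lam * a2

torch.manual_seed(0)
q1, k1, q2, k2 = (torch.randn(1, 6, 8) for _ in range(4))
lam = lambda_init_fn(depth=0)
diff = differential_map(q1, k1, q2, k2, lam)
```

Each row of diff sums to 1 - lam rather than 1, and entries can be negative, which is one reason a normalization step follows in the full mechanism.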

Multi-head Latent Attention

Keys and values are cached as one low rank latent per token instead of full per head K and V. Position rides on a small decoupled RoPE part kept out of the latent. This is the trick behind DeepSeek's tiny KV cache.

MultiHeadLatentAttention(dim, num_heads, kv_lora_rank=None, q_lora_rank=None)
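The low rank idea can be sketched as a down projection followed by per head up projections. The layer names and sizes below are illustrative assumptions, and the decoupled RoPE part is left out.

```python
import torch
import torch.nn as nn

dim, num_heads, head_dim, kv_rank = 64, 4, 16, 8   # illustrative sizes

down_kv = nn.Linear(dim, kv_rank, bias=False)                 # compress each token to a latent
up_k = nn.Linear(kv_rank, num_heads * head_dim, bias=False)   # expand latent to per head keys
up_v = nn.Linear(kv_rank, num_heads * head_dim, bias=False)   # expand latent to per head values

x = torch.randn(2, 10, dim)
latent = down_kv(x)                                  # (2, 10, kv_rank), the only thing cached
k = up_k(latent).view(2, 10, num_heads, head_dim)    # rebuilt on the fly
v = up_v(latent).view(2, 10, num_heads, head_dim)

full_cache = 2 * num_heads * head_dim    # values per token for full K and V
latent_cache = kv_rank                   # values per token for the latent alone
```

With these sizes the cache holds 8 values per token instead of 128. A real implementation also stores the small RoPE key, so the saving is slightly smaller.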

GQA

Query heads share key value heads in groups, sitting between full multi head attention and MQA. This is what most current open models use.

GroupedQueryAttention(dim, num_heads, num_kv_heads, causal=False)
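The head sharing can be written out directly. This sketch skips the projections and masking and only shows how a small number of KV heads serves a larger number of query heads. The tensor layout is an assumption.

```python
import torch

batch, seq, num_heads, num_kv_heads, head_dim = 2, 5, 8, 2, 16
group = num_heads // num_kv_heads      # query heads per KV head

q = torch.randn(batch, num_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Repeat each KV head so that `group` consecutive query heads read the same keys and values.
k_rep = k.repeat_interleave(group, dim=1)
v_rep = v.repeat_interleave(group, dim=1)

weights = (q @ k_rep.transpose(-1, -2) / head_dim ** 0.5).softmax(dim=-1)
out = weights @ v_rep                  # (batch, num_heads, seq, head_dim)
```

Setting num_kv_heads equal to num_heads recovers standard multi head attention, and setting it to 1 recovers MQA.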

BigBird

A good first sparse attention to read. A query attends to a few global blocks, a local window of neighbor blocks, and a few random blocks. The union of those is a boolean mask and the rest is ordinary masked softmax, so the pattern is the only new idea.

BigBirdAttention(dim, num_heads, block_size=64, num_window_blocks=3,
                 num_random_blocks=3, num_global_blocks=2, causal=False)
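The pattern can be built as a boolean mask at block resolution. The helper below is a hypothetical sketch of the non causal case: it treats the first few blocks as global, centers the window on the query block, and samples the random blocks with a seeded generator. The actual module may choose these differently.

```python
import torch

def bigbird_block_mask(num_blocks, num_window_blocks=3, num_random_blocks=3,
                       num_global_blocks=2, seed=0):
    """Boolean (num_blocks, num_blocks) mask; True means query block i sees key block j."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(num_blocks)
    # Local window: neighbor blocks within half the window on each side.
    mask = (idx[:, None] - idx[None, :]).abs() <= num_window_blocks // 2
    # Global blocks attend everywhere and are attended to by every block.
    mask[:num_global_blocks, :] = True
    mask[:, :num_global_blocks] = True
    # Random blocks: a few extra key blocks per query block.
    for i in range(num_blocks):
        picks = torch.randperm(num_blocks, generator=gen)[:num_random_blocks]
        mask[i, picks] = True
    return mask

block_mask = bigbird_block_mask(num_blocks=8)
# Expand to token resolution for a block size of 4.
token_mask = block_mask.repeat_interleave(4, dim=0).repeat_interleave(4, dim=1)
```

The token level mask then goes into an ordinary masked softmax, as described above.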

Slot Attention

Object centric attention. A small set of slots compete to explain the input features, with the softmax taken over the slots rather than the inputs, and the slots refined over a few iterations with a GRU. Unlike the others this maps a set of inputs to a set of slots, so the output is (batch, num_slots, dim).

SlotAttention(dim, num_slots, num_iters=3)
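The competition comes from the axis the softmax runs over. The sketch below shows one attention step without the query, key, and value projections or the GRU update. The sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
batch, num_inputs, num_slots, dim = 2, 10, 4, 16
inputs = torch.randn(batch, num_inputs, dim)
slots = torch.randn(batch, num_slots, dim)

logits = slots @ inputs.transpose(-1, -2) / dim ** 0.5    # (batch, num_slots, num_inputs)
# Softmax over the slot axis: for each input, the slots compete to claim it.
attn = logits.softmax(dim=1)
# Renormalize per slot so each update is a weighted mean of the inputs.
weights = attn / attn.sum(dim=-1, keepdim=True)
updates = weights @ inputs                                # (batch, num_slots, dim)
```

In the full mechanism, updates is fed to a GRU together with the previous slots, and the step repeats num_iters times.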

MQA

All query heads read a single key value head, which shrinks the KV cache the most.

MultiQueryAttention(dim, num_heads, causal=False)
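The cache saving is simple arithmetic. The model shape below is made up for illustration and is not taken from any specific model. kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, bytes_per_value=2):
    # Two tensors (K and V) per layer, one head_dim vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative shape: 32 layers, 32 query heads of size 128, 4096 tokens, 16-bit values.
shape = dict(head_dim=128, seq_len=4096, num_layers=32)
mha = kv_cache_bytes(num_kv_heads=32, **shape)   # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)    # groups of 4 query heads
mqa = kv_cache_bytes(num_kv_heads=1, **shape)    # a single shared KV head

print(mha // 2**20, gqa // 2**20, mqa // 2**20)  # sizes in MiB
```

For this shape the per sequence cache drops from 2048 MiB with full multi head attention to 512 MiB with 8 KV heads and 64 MiB with MQA.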

Standard

Plain multi head attention where every query head keeps its own key value head. The reference point the others compress or sparsify.

StandardAttention(dim, num_heads, causal=False, dropout=0.0)

Notes

The sparse modules are masked dense references. The selection logic is exact and easy to read, but the wall clock speedups reported in the papers need fused gather kernels that are out of scope here. The same applies to the V4 compressed modules, where partial RoPE is left out for clarity. In other words, use these to understand a mechanism and to prototype, not as a drop-in replacement for a fast kernel.

Contributing

Pull requests are welcome. If a lab or a paper has an attention variant worth reading, this is a good home for a clean version of it. The bar is one file per mechanism, a plain nn.Module with the same call shape as the rest, a short test that checks shape and causality, and a pointer to the paper. Keep it readable over clever.
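A shape and causality check for a new mechanism could look something like this. It is a sketch, not the repo's test suite: check_shape_and_causality is a hypothetical helper, and TinyCausalAttention is a stand-in module so the example runs without attnhut.

```python
import torch

def check_shape_and_causality(attn, dim, seq=12, batch=2):
    """attn maps (batch, seq, dim) to (batch, seq, dim) and is expected to be causal."""
    attn.eval()
    x = torch.randn(batch, seq, dim)
    with torch.no_grad():
        y = attn(x)
        assert y.shape == x.shape
        # Changing the last token must not change any earlier output.
        x2 = x.clone()
        x2[:, -1] += 1.0
        y2 = attn(x2)
    assert torch.allclose(y[:, :-1], y2[:, :-1], atol=1e-5)

class TinyCausalAttention(torch.nn.Module):   # stand-in, not an attnhut class
    def __init__(self, dim):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        n = x.shape[1]
        mask = torch.ones(n, n, dtype=torch.bool).tril()
        scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
        return scores.masked_fill(~mask, float("-inf")).softmax(dim=-1) @ v

check_shape_and_causality(TinyCausalAttention(32), dim=32)
```

The same check should fail for a module built with causal=False, which makes it a quick guard against a mask that is silently dropped.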

Tests

uv run pytest

License

MIT.

Built by

Emmanuel G. Maminta (emmanuel.maminta@eee.upd.edu.ph, egmaminta@up.edu.ph), Ubiquitous Computing Laboratory, Artificial Intelligence Graduate Program, University of the Philippines, Diliman, Quezon City, Philippines.

How to cite

If attnhut helped your research or project, please cite it.

@software{maminta2026attnhut,
  author  = {Maminta, Emmanuel G.},
  title   = {attnhut: A collection of Transformer Attention mechanisms in PyTorch},
  year    = {2026},
  url     = {https://github.com/egmaminta/attnhut},
  version = {0.4.1},
}
