Attention Residuals (AttnRes) kernels
Project description
Flash Attention Residuals
4x faster inference/training than a naive attention-residuals implementation under torch.compile
20% reduction in training memory (without activation checkpointing)*
*Benchmarked on H100. Dependent on problem size and setup.
Reference: https://arxiv.org/abs/2603.15031 (Kimi Team, MoonshotAI, 2026)
Credits:
Thanks to Mohamed Osman (https://github.com/spaghettiSystems) and Cartesia (https://github.com/cartesia-ai) for advising on and supporting the development of this project.
Install
pip install flash-attn-res
Usage
This package contains Triton kernels, triton_op wrappers compatible with torch.compile, and an experimental high-performance Block AttnRes autograd implementation.
See the src and benchmarks folders; a minimal usage sketch follows.
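The exact Python entry point isn't documented on this page, so the import path and call signature below are assumptions for illustration; check the src folder for the real API.

```python
import torch
from flash_attn_res import flash_attn_res  # hypothetical entry point

B, H, T, D = 2, 8, 1024, 64
q = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
res = torch.randn_like(q)  # residual stream fused into the attention output

out = flash_attn_res(q, k, v, res)  # hypothetical signature

# The triton_op wrappers are torch.compile-compatible, so the same call
# should also work inside a compiled function.
compiled = torch.compile(lambda q, k, v, r: flash_attn_res(q, k, v, r))
out = compiled(q, k, v, res)
```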
Roadmap:
- Better autotuning defaults
- Better benchmarks
- More robust autograd implementation
- Precision tuning
- Mixed FP16/BF16 with stored quantization scales
- Stochastic rounding
- Implementations in CuTe, CUDA, and other DSLs
Development Notes:
- Normalizing in phase 1 keeps outputs bounded (each is a convex combination of values), so bf16 error doesn't scale with softmax flatness. Phase 2 computes in fp32, and its reduction algebra matches split-KV Flash Attention; see the merge sketch after this list.
- Certain dimensions, especially NUM_QUERIES_PER_BLOCK, are small, so a semi-elementwise (B, T) kernel using tl.static_range beats tl.dot.
- The kernel is memory bound, and the semi-elementwise formulation enables kernel fusion.
- NUM_SOURCE_BLOCKS and NUM_QUERIES_PER_BLOCK should be autotuning keys (something torch.compile can't express), which enables faster kernels; see the kernel skeleton below.
- NUM_QUERIES_PER_BLOCK is small, so loads should use eviction_policy "evict_last".
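The split-KV reduction algebra mentioned in the first note can be written out concretely. The sketch below is an fp32 reference for merging two normalized partial outputs; the names (o, m, l) follow the usual Flash Attention conventions and are illustrative, not the kernel's actual symbols.

```python
import torch

def merge_splits(o1, m1, l1, o2, m2, l2):
    """Merge two normalized partial attention outputs (split-KV reduction).

    Each split i carries a normalized output o_i (a convex combination of
    its values, shape [T, D]), the row max m_i of its logits (shape [T]),
    and its softmax denominator l_i = sum(exp(logits - m_i)) (shape [T]).
    All math in fp32, matching the phase-2 reduction.
    """
    m = torch.maximum(m1, m2)          # global row max
    a1 = l1 * torch.exp(m1 - m)        # denominators rescaled to the new max
    a2 = l2 * torch.exp(m2 - m)
    l = a1 + a2                        # global denominator
    # Convex recombination of the already-normalized partial outputs.
    o = (a1[:, None] * o1 + a2[:, None] * o2) / l[:, None]
    return o, m, l
```

With more than two splits, this merge is applied as a running fold over the split outputs, exactly as in split-KV Flash Attention.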
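The remaining notes can be illustrated with a Triton skeleton. The body below is a placeholder reduction over source blocks, not the actual AttnRes kernel; only the autotuning-key and eviction-policy mechanics are the point.

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"NUM_SOURCE_BLOCKS": s, "NUM_QUERIES_PER_BLOCK": q})
        for s in (2, 4, 8)
        for q in (1, 2, 4)
    ],
    key=["T"],  # retune per problem size, which torch.compile can't express
)
@triton.jit
def attn_res_phase2(src_ptr, out_ptr, T,
                    NUM_SOURCE_BLOCKS: tl.constexpr,
                    NUM_QUERIES_PER_BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * NUM_QUERIES_PER_BLOCK + tl.arange(0, NUM_QUERIES_PER_BLOCK)
    mask = offs < T
    acc = tl.zeros((NUM_QUERIES_PER_BLOCK,), dtype=tl.float32)
    # NUM_QUERIES_PER_BLOCK is small: an unrolled tl.static_range loop of
    # loads and adds beats tl.dot, and the memory-bound body fuses easily.
    for i in tl.static_range(NUM_SOURCE_BLOCKS):
        # Few queries per program, heavy reuse across iterations: keep
        # lines resident with eviction_policy="evict_last".
        x = tl.load(src_ptr + i * T + offs, mask=mask,
                    eviction_policy="evict_last")
        acc += x.to(tl.float32)
    tl.store(out_ptr + offs, acc, mask=mask)

# Launch sketch: the grid adapts to the autotuned block width.
# grid = lambda meta: (triton.cdiv(T, meta["NUM_QUERIES_PER_BLOCK"]),)
# attn_res_phase2[grid](src, out, T)
```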
Download files
File details
Details for the file flash_attn_res-0.1.10.tar.gz.
File metadata
- Download URL: flash_attn_res-0.1.10.tar.gz
- Upload date:
- Size: 1.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bbc614c046062f77f2b75bd3e3ba42495742012cf989a4e10d5d49c030135f38` |
| MD5 | `abf2c2e86791adc38376a513fbbff645` |
| BLAKE2b-256 | `691e489e219a6dc31749547a15ddadecd6dd2fb2bcc25e2c608b7420270982b0` |
File details
Details for the file flash_attn_res-0.1.10-py2.py3-none-any.whl.
File metadata
- Download URL: flash_attn_res-0.1.10-py2.py3-none-any.whl
- Upload date:
- Size: 17.7 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `73b25274838fb1a903aa60f38e95554737f704e69b2ccb2743ff8819334cbd49` |
| MD5 | `432d829deae322dbbcf7f36678c85a2c` |
| BLAKE2b-256 | `802506ec39fc891ee4f4719c6c971b0f3b5b3412b90147051af38dc977e6348c` |