Attention Residuals (AttnRes) kernels

Project description

Flash Attention Residuals

1.4x faster inference and training than a torch.compile implementation of the paper's two-phase batched attention with online softmax

20% reduction in training memory (without activation checkpointing)*

*Benchmarked on H100. Dependent on problem size and setup.

Reference: https://arxiv.org/abs/2603.15031 (Kimi Team, MoonshotAI, 2026)

Credits:

Thanks to Mohamed Osman (https://github.com/spaghettiSystems) and Cartesia (https://github.com/cartesia-ai) for advising on and supporting the development of this project.

Install

pip install flash-attn-res

Usage

This package contains Triton kernels, triton_op wrappers compatible with torch.compile, and an experimental high-performance Block AttnRes autograd implementation. See the src and examples folders.
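
A minimal usage sketch is shown below. The entry point attn_res and its signature are assumptions for illustration only, not the package's confirmed API; consult the examples folder for real usage.

    import torch
    from flash_attn_res import attn_res  # hypothetical entry point

    B, H, T, D = 2, 8, 1024, 64
    q, k, v = (torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))

    out = attn_res(q, k, v)

    # The triton_op wrappers are designed to compose with torch.compile:
    out = torch.compile(attn_res)(q, k, v)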

Roadmap:

  • More robust autograd implementation
  • Precision tuning
  • Mixed FP16/BF16 storage with a stored quantization scale
  • Stochastic rounding
  • Implementations in CuTe, CUDA, and other DSLs

Development Notes:

  • Normalizing in phase 1 keeps outputs bounded (each partial output is a convex combination of values), so bf16 error doesn't scale with softmax flatness. Phase 2 computes in fp32, and the reduction algebra matches split-KV Flash Attention (see the sketch after this list).
  • Certain dimensions, especially NUM_QUERIES_PER_BLOCK, are small, so a semi-elementwise (B, T) kernel using tl.static_range is faster than tl.dot.
  • The kernel is memory-bound, and the semi-elementwise formulation enables kernel fusion.
  • NUM_SOURCE_BLOCKS and NUM_QUERIES_PER_BLOCK are autotuning keys, which torch.compile does not permit; this yields faster kernels.
  • Because NUM_QUERIES_PER_BLOCK is small, loads use eviction_policy="evict_last" (see the Triton fragment below).
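
The split-KV combine referenced in the first note can be written out directly. Below is a minimal fp32 reference in plain PyTorch for merging two normalized softmax partials; it is an illustrative sketch of the standard split-KV Flash Attention reduction, and the variable names are assumptions rather than the kernel's own.

    import torch

    def merge_partials(o1, m1, l1, o2, m2, l2):
        # o_i: normalized partial attention output for split i (a convex
        #      combination of values, so it stays bounded)
        # m_i: running max of the logits seen in split i
        # l_i: softmax denominator sum(exp(logit - m_i)) for split i
        m = torch.maximum(m1, m2)      # merged running max
        a1 = torch.exp(m1 - m) * l1    # denominators rescaled to the new max
        a2 = torch.exp(m2 - m) * l2
        l = a1 + a2
        # Re-weight each normalized partial by its corrected denominator,
        # then renormalize -- the split-KV Flash Attention combine.
        o = (a1.unsqueeze(-1) * o1 + a2.unsqueeze(-1) * o2) / l.unsqueeze(-1)
        return o, m, l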

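A minimal Triton fragment illustrating the tl.static_range plus eviction_policy pattern from the last two notes. This is a toy reduction, not the package's actual kernel; the memory layout and kernel name are assumptions.

    import triton
    import triton.language as tl

    @triton.jit
    def semi_elementwise_sum(x_ptr, out_ptr, T,
                             NUM_QUERIES_PER_BLOCK: tl.constexpr,
                             BLOCK_T: tl.constexpr):
        offs_t = tl.program_id(0) * BLOCK_T + tl.arange(0, BLOCK_T)
        mask = offs_t < T
        acc = tl.zeros((BLOCK_T,), dtype=tl.float32)
        # NUM_QUERIES_PER_BLOCK is small and a compile-time constant, so an
        # unrolled scalar loop beats tl.dot; "evict_last" keeps the small,
        # heavily reused lines resident in cache.
        for q in tl.static_range(NUM_QUERIES_PER_BLOCK):
            x = tl.load(x_ptr + q * T + offs_t, mask=mask,
                        eviction_policy="evict_last")
            acc += x.to(tl.float32)
        tl.store(out_ptr + offs_t, acc, mask=mask)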

Download files

Download the file for your platform.

Source Distribution

flash_attn_res-0.1.8.tar.gz (1.1 MB)

Uploaded Source

Built Distribution

flash_attn_res-0.1.8-py2.py3-none-any.whl (15.8 kB)

Uploaded: Python 2, Python 3

File details

Details for the file flash_attn_res-0.1.8.tar.gz.

File metadata

  • Download URL: flash_attn_res-0.1.8.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for flash_attn_res-0.1.8.tar.gz
SHA256: a4f4d452baf4e2e525e478522f3bc2b0c4081784b68cabcef98ae76e5b246664
MD5: d04760e7725b9d05b3083bf20ffed980
BLAKE2b-256: 9c987d662e5aabf41755171f846e9da9b91b2296d8b1ee8144cb92386ffc2536

File details

Details for the file flash_attn_res-0.1.8-py2.py3-none-any.whl.

File hashes

Hashes for flash_attn_res-0.1.8-py2.py3-none-any.whl
SHA256: dbc99ba0b2d03b6c6ee2899cf84c13ea3c5d3d1ecc9dfdf81db74a847bdb4d3e
MD5: 7d1d8d97dd16642ad97fb82c1d1b5029
BLAKE2b-256: 060ac86174e1d7a7ff13bd7ff0b0c0f2dc5a0bc1579ae583d2cf8162c9b6043f
