
Attention Residuals (AttnRes) kernels


Flash Attention Residuals

1.4x faster inference and training than an optimized torch.compile implementation of the paper's two-phase batched attention with online softmax

20% reduction in training memory (without activation checkpointing)*

*Benchmarked on an H100; results depend on problem size and setup.

Credits:

Thanks to Mohamed Osman (https://github.com/spaghettiSystems) and Cartesia for advising on and supporting the development of this kernel.

Install

pip install flash-attn-res
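
Below is a minimal usage sketch. The attn_res entry point, its signature, and the (batch, heads, seq_len, head_dim) layout are illustrative assumptions, not the package's confirmed API; check the project source for the actual interface.

import torch

# Hypothetical entry point and layout; not the package's confirmed API.
from flash_attn_res import attn_res

q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn_res(q, k, v)  # assumed (batch, heads, seq_len, head_dim) inputs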

Roadmap:

  • Proper backward-pass evaluation
  • Implement in CuTe and CUDA
  • Tune precision
  • Mixed FP16/BF16 with stored quantization scales
  • Stochastic rounding
  • Make into a Python package

Insights:

  • Normalizing in phase 1 keeps the outputs bounded (each is a convex combination of the values), so bf16 error doesn't scale with softmax flatness. Phase 2 computes in fp32, and its reduction algebra matches split-KV Flash Attention; a PyTorch sketch of this reduction follows the list.
  • Certain dimensions, especially NUM_QUERIES_PER_BLOCK, are small, so a semi-elementwise (B, T) kernel using tl.static_range is faster than a tl.dot formulation (see the Triton sketch after this list).
  • The kernel is memory-bound, and the semi-elementwise formulation allows for kernel fusion.
  • NUM_SOURCE_BLOCKS and NUM_QUERIES_PER_BLOCK should be autotuning keys (something torch.compile doesn't offer), which allows for faster kernels.
  • Because NUM_QUERIES_PER_BLOCK is small, loads should use eviction_policy="evict_last".
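
A rough PyTorch reference for the first insight's numerics: phase 1 emits, per source block, a normalized partial output (a convex combination of that block's values, so it stays bounded in low precision) plus the block's max and sum-exp; phase 2 combines the partials in fp32 with the same reduction split-KV Flash Attention uses. This is a sketch of the algebra under assumed shapes, not the shipped Triton kernel.

import torch

def two_phase_attention(q, k, v, block=128):
    # Reference numerics only; the real implementation is a fused Triton kernel.
    scale = q.shape[-1] ** -0.5
    partials, maxes, sums = [], [], []
    # Phase 1: one normalized partial plus softmax stats per source block.
    for s in range(0, k.shape[-2], block):
        kb = k[..., s:s + block, :]
        vb = v[..., s:s + block, :]
        scores = (q @ kb.transpose(-2, -1)).float() * scale
        m = scores.amax(dim=-1, keepdim=True)       # block max
        p = torch.exp(scores - m)
        l = p.sum(dim=-1, keepdim=True)             # block sum-exp
        # Normalizing here makes the partial a convex combination of the
        # block's values, so it stays bounded however flat the softmax is.
        partials.append((p / l) @ vb.float())
        maxes.append(m)
        sums.append(l)
    # Phase 2: fp32 combine with the split-KV Flash Attention algebra.
    m_all = torch.stack(maxes)
    g = m_all.amax(dim=0)                           # global max
    w = torch.exp(m_all - g) * torch.stack(sums)    # per-block weights
    w = w / w.sum(dim=0)
    return (w * torch.stack(partials)).sum(dim=0).to(q.dtype)

Up to floating-point error this matches a plain softmax(q @ k.T * scale) @ v, which is what "the reduction algebra matches split-KV Flash Attention" refers to.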
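
And a stripped-down Triton skeleton for the remaining points (semi-elementwise layout over flattened (B, T) rows, tl.static_range instead of tl.dot, block shapes as autotuning keys, evict_last loads), shown here on a hypothetical phase-2 combine. The names, the flat layout, and the lack of fusion are simplifications, not the shipped kernel.

import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({}, num_warps=w) for w in (1, 2, 4)],
    # The block shapes themselves are autotuning keys, which
    # torch.compile doesn't offer; each problem size gets its own variant.
    key=["NUM_SOURCE_BLOCKS", "NUM_QUERIES_PER_BLOCK"],
)
@triton.jit
def phase2_combine(
    w_ptr,     # (rows, NUM_SOURCE_BLOCKS) fp32 combine weights
    part_ptr,  # (rows, NUM_SOURCE_BLOCKS, HEAD_DIM) phase-1 partials
    out_ptr,   # (rows, HEAD_DIM) fp32 output
    NUM_SOURCE_BLOCKS: tl.constexpr,
    NUM_QUERIES_PER_BLOCK: tl.constexpr,
    HEAD_DIM: tl.constexpr,
):
    # Semi-elementwise: each program owns NUM_QUERIES_PER_BLOCK flattened
    # (B, T) rows and the full head dim (assumes rows divide evenly).
    rows = tl.program_id(0) * NUM_QUERIES_PER_BLOCK + tl.arange(0, NUM_QUERIES_PER_BLOCK)
    d = tl.arange(0, HEAD_DIM)
    acc = tl.zeros((NUM_QUERIES_PER_BLOCK, HEAD_DIM), dtype=tl.float32)
    # NUM_SOURCE_BLOCKS is small and compile-time constant, so an unrolled
    # static_range loop beats a tl.dot formulation here.
    for s in tl.static_range(NUM_SOURCE_BLOCKS):
        # Small query blocks revisit these cache lines; keep them resident.
        w = tl.load(w_ptr + rows * NUM_SOURCE_BLOCKS + s,
                    eviction_policy="evict_last")
        p = tl.load(part_ptr + (rows[:, None] * NUM_SOURCE_BLOCKS + s) * HEAD_DIM + d[None, :],
                    eviction_policy="evict_last")
        acc += w[:, None] * p.to(tl.float32)        # fp32 phase-2 reduction
    tl.store(out_ptr + rows[:, None] * HEAD_DIM + d[None, :], acc)

Launched on a 1-D grid of rows // NUM_QUERIES_PER_BLOCK programs; because the kernel is memory-bound, this semi-elementwise shape also makes it easy to fuse with neighboring elementwise work.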

