
FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.


🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA(Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy that achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or build ffpa-attn from source with the commands below
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton + CuTeDSL backends)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn w/ CUDA backend (forward only)
ENABLE_FFPA_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .
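
After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch that relies only on the Python standard library; `ffpa_status` is a hypothetical helper name, and the distribution name `ffpa-attn` and module name `ffpa_attn` are taken from the commands and imports in this README.

```python
import importlib.metadata
import importlib.util


def ffpa_status():
    """Report whether ffpa-attn is importable and which version is installed.

    Hypothetical helper for illustration; it is not part of ffpa-attn.
    """
    # find_spec locates the top-level module without importing it.
    installed = importlib.util.find_spec("ffpa_attn") is not None
    try:
        version = importlib.metadata.version("ffpa-attn")
    except importlib.metadata.PackageNotFoundError:
        version = None
    return {"installed": installed, "version": version}


status = ffpa_status()
print(status)
```

This only checks the Python side; it does not verify that the GPU falls in the supported sm_80 to sm_120 range.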

Then accelerate attention for large headdim with a single line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA
>>> # does not support automatically falls back to SDPA (D <= 256, etc.).
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
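
The assignment above replaces SDPA for the rest of the process. If the patch should apply to only part of a program, one option is to scope it and restore the original afterwards. The sketch below shows that pattern with a small context manager; `swap_attr` is a hypothetical helper, and `F` and `fake_ffpa_attn_func` are stand-ins so the example runs without torch or ffpa-attn installed. In real code you would pass `torch.nn.functional` and `ffpa_attn_func` instead.

```python
import contextlib
import types


@contextlib.contextmanager
def swap_attr(obj, name, replacement):
    """Temporarily replace obj.<name>, restoring the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield original
    finally:
        # Restore even if the body raised.
        setattr(obj, name, original)


# Stand-ins (assumptions) for torch.nn.functional and ffpa_attn_func.
F = types.SimpleNamespace(scaled_dot_product_attention=lambda q, k, v: "sdpa")
fake_ffpa_attn_func = lambda q, k, v: "ffpa"

with swap_attr(F, "scaled_dot_product_attention", fake_ffpa_attn_func):
    inside = F.scaled_dot_product_attention(None, None, None)
outside = F.scaled_dot_product_attention(None, None, None)
```

The standard library's `unittest.mock.patch.object` offers the same behavior if you prefer not to write the helper yourself.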

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
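
The idea behind splitting along the head dimension can be illustrated on the CPU: $QK^\top$ is a sum of partial products over 16-wide slices of $D$, so only a $B_r \times 16$ slice of each operand has to be staged at a time. The sketch below is a plain-Python illustration of that identity, not the FFPA kernel; the function names are hypothetical, and the tile size $B_r = 64$ used in the element count is an assumption chosen for the example.

```python
import random


def qk_full(q, k):
    """Reference S = Q K^T using the whole head dimension at once."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def qk_split_d(q, k, chunk=16):
    """Accumulate S = Q K^T one `chunk`-wide slice of the head dim at a time."""
    d = len(q[0])
    s = [[0] * len(k) for _ in q]
    for start in range(0, d, chunk):
        # Only these (rows x chunk) slices need to be resident per step.
        q_slice = [row[start:start + chunk] for row in q]
        k_slice = [row[start:start + chunk] for row in k]
        for i, qi in enumerate(q_slice):
            for j, kj in enumerate(k_slice):
                s[i][j] += sum(a * b for a, b in zip(qi, kj))
    return s


def staged_elems(br, width):
    """Elements staged for one operand tile of shape (br, width)."""
    return br * width


random.seed(0)
d = 320  # a head dimension above 256
q = [[random.randint(-3, 3) for _ in range(d)] for _ in range(4)]
k = [[random.randint(-3, 3) for _ in range(d)] for _ in range(4)]

full = qk_full(q, k)
split = qk_split_d(q, k)

# Assumed Br = 64: staging a full-width tile grows with D, a 16-wide slice does not.
full_tile = staged_elems(64, 512)
split_tile = staged_elems(64, 16)
```

Integer inputs are used so that the two results match exactly; with floating-point data the accumulation order differs and the results agree only up to rounding.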

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill with large headdim, and may not be faster than SDPA for 😈 short sequences (N<512) or small headdim (D<=256).
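
The thresholds in the note can be written down as a simple rule of thumb for deciding when FFPA is worth benchmarking. This is a sketch only: `worth_trying_ffpa` is a hypothetical helper that encodes the two limits stated above, and it does not reflect the library's internal fallback logic.

```python
def worth_trying_ffpa(seq_len: int, head_dim: int) -> bool:
    """Rule of thumb from the note: long sequences and large head dims.

    Hypothetical helper; the thresholds (N >= 512, D > 256) come from the
    note above, not from ffpa-attn's dispatch code.
    """
    return seq_len >= 512 and head_dim > 256


print(worth_trying_ffpa(4096, 512))  # long prefill, large headdim
print(worth_trying_ffpa(256, 512))   # short sequence
print(worth_trying_ffpa(4096, 128))  # small headdim
```

Measured speedups still depend on the GPU and backend, so treat the result as a hint about where to benchmark rather than a guarantee.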

🎉 Benchmark

Runnable benchmarks are provided under bench, including results with large headdims for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉).
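
To relate a measured kernel time to a TFLOPS figure, a common convention counts $4 \cdot B \cdot H \cdot N^2 \cdot D$ floating-point operations for a non-causal forward pass (two matrix multiplications, $QK^\top$ and $PV$, at $2 \cdot N^2 \cdot D$ each per head). The README does not state how its own numbers are computed, so the sketch below is an assumption based on that convention; the function names are hypothetical.

```python
def attention_fwd_flops(batch: int, heads: int, seq_len: int, head_dim: int) -> int:
    """FLOPs for a non-causal attention forward pass (assumed convention).

    Counts QK^T and PV as 2 * N * N * D multiply-adds each, per head.
    """
    return 4 * batch * heads * seq_len * seq_len * head_dim


def tflops(flops: int, seconds: float) -> float:
    """Convert an operation count and elapsed time into TFLOPS."""
    return flops / seconds / 1e12


flops = attention_fwd_flops(batch=1, heads=1, seq_len=1024, head_dim=512)
print(flops)
print(tflops(2_000_000_000_000, 1.0))
```

Softmax and masking work is ignored in this count, which is why it is best used for comparing runs with the same shapes rather than as an absolute hardware utilization number.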


🤖 Backends

FFPA supports multiple backends for the forward and backward passes: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is still at an early stage and has some constraints, but it can reach up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

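| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA | sm>=75 | | | All | | 1.0x🤗 | sm>=75 |
| CUDA | sm>=80 | | | 320~1024 | | 1.5x~3x🎉 | sm80~89,120 |
| Triton | sm>=80 | | | 320~1024 | | 1.5x~5x🎉 | sm>=80 |
| CuTeDSL | sm>=80 | | | 320~1024 | | 1.5x~2x🎉 | sm80~89,120 |
| CuTeDSL | sm90 | | | 320~512 | | 3x~6x🎉 | sm90 |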

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

To choose a backend for your own scenario, pass one of the backend configs (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())
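
The "Recommend" and "Headdim" columns of the backend table can be turned into a small lookup when a script has to run on several GPU generations. The following is a sketch under assumptions: `recommend_backends` is a hypothetical helper, it returns the config class names as strings rather than importing them, and the ordering (higher listed speedup first) is an illustrative choice, not guidance from the project.

```python
def recommend_backends(sm: int, head_dim: int) -> list[str]:
    """Candidate backend config names for a compute capability and head dim.

    Hypothetical helper encoding the backend table; not part of ffpa-attn.
    `sm` is the compute capability as an integer, e.g. 89 or 90.
    """
    if sm < 75:
        return []  # below every architecture listed in the table
    candidates = []
    if sm >= 80 and 320 <= head_dim <= 1024:
        if sm == 90 and head_dim <= 512:
            candidates.append("CuTeDSLBackend")  # sm90 row, headdim 320~512
        candidates.append("TritonBackend")  # recommended for sm>=80
        if 80 <= sm <= 89 or sm == 120:
            # CUDA is forward only, so skip it when gradients are needed.
            candidates += ["CUDABackend", "CuTeDSLBackend"]
    candidates.append("SDPABackend")  # baseline, always available for sm>=75
    return candidates


print(recommend_backends(90, 512))
print(recommend_backends(89, 1024))
```

The returned names correspond to the config classes listed above; instantiate the chosen one and pass it through the `backend=` argument as in the preceding example.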

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
