FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA(Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy. It achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), and is 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or build ffpa-attn from source with the commands below
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton backend only)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn with CuTeDSL backend
pip3 install -e ".[cutedsl]" --no-build-isolation

Then accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA does
>>> # not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
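
The one-line patch above replaces SDPA for the whole process. If a narrower scope is preferred, one option is to apply the patch only inside a `with` block using the standard library's `unittest.mock.patch.object`. The sketch below is illustrative only: `F` is a stand-in namespace and `fake_ffpa_attn_func` is a hypothetical placeholder, so that it runs without torch or ffpa-attn installed. In real code these would be `torch.nn.functional` and `ffpa_attn_func`.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for torch.nn.functional (assumption: used only so the sketch is runnable).
F = SimpleNamespace(scaled_dot_product_attention=lambda q, k, v: "sdpa")


def fake_ffpa_attn_func(q, k, v):
    """Hypothetical placeholder for ffpa_attn_func."""
    return "ffpa"


def run_attention(q, k, v):
    # The attribute is looked up at call time, so the patch is visible here.
    return F.scaled_dot_product_attention(q, k, v)


# Inside the block the replacement is active; on exit the original is restored.
with mock.patch.object(F, "scaled_dot_product_attention", fake_ffpa_attn_func):
    inside = run_attention(None, None, None)
outside = run_attention(None, None, None)
```

Note that code which imported the function by name before the patch (`from torch.nn.functional import scaled_dot_product_attention`) holds its own reference and will not see either form of patch.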

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for the $QK^\top$ and $PV$ matrix multiplications, a strategy we call Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
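
The arithmetic behind splitting along the head dimension can be sketched in plain Python: $QK^\top$ is a sum of partial products over 16-wide slices of $D$, so only a rows × 16 slice of each operand has to be staged at a time. This is a toy model of the idea, not FFPA's kernel code; the function names and the list-of-lists layout are assumptions made for illustration.

```python
def qk_scores_split_d(q, k, chunk=16):
    """Compute S = Q @ K^T by accumulating partial products over D-slices."""
    n_q, n_k, d = len(q), len(k), len(q[0])
    scores = [[0] * n_k for _ in range(n_q)]
    for start in range(0, d, chunk):
        stop = min(start + chunk, d)
        # Only these (rows x chunk) slices are needed for this step.
        q_tile = [row[start:stop] for row in q]
        k_tile = [row[start:stop] for row in k]
        for i, q_row in enumerate(q_tile):
            for j, k_row in enumerate(k_tile):
                scores[i][j] += sum(a * b for a, b in zip(q_row, k_row))
    return scores


def qk_scores_full(q, k):
    """Reference: the same product taken over the whole head dimension."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k] for q_row in q]


def staged_elements(block_rows, chunk=16):
    """Elements staged per operand tile; the head dimension D does not appear."""
    return block_rows * chunk


# Small integer inputs keep the comparison exact.
D = 320
q = [[(i + c) % 7 - 3 for c in range(D)] for i in range(4)]
k = [[(2 * j + c) % 5 - 2 for c in range(D)] for j in range(4)]
assert qk_scores_split_d(q, k) == qk_scores_full(q, k)
```

Because `staged_elements` depends only on the block size and the slice width, the staged footprint in this model stays the same whether $D$ is 320 or 1024.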

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill and large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).
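
The guidance in the note can be written down as a small predicate. The helper below is hypothetical and not part of ffpa-attn; it only encodes the two thresholds stated above (N<512 and D<=256) and makes no claim about actual speedups on a given GPU.

```python
def ffpa_likely_beneficial(seq_len, head_dim, min_seq_len=512, min_head_dim=256):
    """Hypothetical check: True when neither caveat from the note applies.

    The note says FFPA may not beat SDPA when N < 512 or D <= 256, so this
    returns True only for N >= 512 and D > 256.
    """
    return seq_len >= min_seq_len and head_dim > min_head_dim


long_prefill = ffpa_likely_beneficial(seq_len=4096, head_dim=512)
short_prompt = ffpa_likely_beneficial(seq_len=256, head_dim=512)
small_headdim = ffpa_likely_beneficial(seq_len=4096, head_dim=256)
```

A check like this is only a starting point; measuring both paths on the target hardware remains the reliable way to decide.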

🎉 Benchmark

Runnable examples are provided under examples. Performance benchmarks with large headdim for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) are shown below:




🤖 Backends

FFPA supports multiple backends for the forward and backward passes: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is at an early stage and has some constraints (e.g., D=512 only), but it can achieve up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

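| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---|
| SDPA | Ampere+ | | | All | | 1.0x | Ampere+ |
| CUDA | Ampere+ | | | 320~1024 | | 1.5x~3x🎉 | Ampere, Ada |
| Triton | Ampere+ | | | 320~1024 | | 1.5x~3x🎉 | Ampere+ |
| CuTeDSL | Hopper | | | 512 | | 3x~6x🎉 | Hopper |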

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

To select a backend for your own scenario, pass one of the backend configs (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend for the D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())
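
One way to turn the backend table into a selection rule is sketched below. `suggest_backend` and its architecture strings are hypothetical and not part of ffpa-attn. It returns backend class names as plain strings so the sketch runs without the library, and it simplifies the table by preferring Triton over CUDA wherever both apply.

```python
def suggest_backend(head_dim, arch):
    """Hypothetical helper mapping (head_dim, arch) to a backend class name.

    Rules taken from the table above:
      - CuTeDSL: Hopper only, D = 512 only
      - Triton:  Ampere+, D in 320~1024
      - SDPA:    everything else (baseline)
    """
    if arch.lower() == "hopper" and head_dim == 512:
        return "CuTeDSLBackend"
    if 320 <= head_dim <= 1024:
        return "TritonBackend"
    return "SDPABackend"


h200_choice = suggest_backend(512, "Hopper")
ada_choice = suggest_backend(512, "Ada")
small_choice = suggest_backend(128, "Hopper")
```

The returned name would then be used to construct the matching config, as in the `backend=CuTeDSLBackend()` call above.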

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}

