Skip to main content

FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.

Project description

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA(Split-D): Yet another Faster Flash Prefill Attention with Split-D strategy, achieve O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.5~3x 🎉 faster than SDPA. 📚👇The Core features:

Self Attn GQA/MQA Cross Attn Causal/Mask Dropout Headdim Fwd/Bwd
✔️(Nq=Nkv) ✔️(Hq!=Hkv) ✔️(Nq!=Nkv) ✔️(attn_mask) ✔️(p>0) 320~1024 1.5~3x↑

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# Fisrt, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton + CuTeDSL backends)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn with CUDA backend support
ENABLE_FFPA_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .

Then, try to accelerate the attention for large headdim with just one-line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Every thing that FFPA
>>> # does not support will auto fallback to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090), achieves 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly design for prefill and large headdim, and may not be faster than SDPA for 😈 small sequence length (N<512) or small headdim (D<=256).

🎉 Benchmark

Runnable examples are provided under examples. The performance benchmarks for the NVIDIA L20 (Ada), NVIDIA Geforce RTX 5090 (Blackwell), NVIDIA H800 PCIE (Hopper), NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) with large headdim are shown below:




🤖 Backends

FFPA supports multiple backends for the forward and backward pass, including: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is currently in early stage and has some constraints (e.g., D <= 512), but it can achieve up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

Backend Arch Fwd Bwd Headdim Autotune Speedup Recommend
SDPA sm>=75 All 1.0x🤗 sm>=75
CUDA sm>=80 320~1024 1.5x~3x🎉 sm80~89,120
Triton sm>=80 320~1024 1.5x~3x🎉 sm>=80
CuTeDSL sm>=80 320~1024 1.5x~2x🎉 sm80~89,120
CuTeDSL sm90 320~512 3x~6x🎉 sm90

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

How to use different backends for your own scenario? Users can simply pass the Backend configs (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}

📖 References

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

ffpa_attn-0.1.14-cp314-cp314-manylinux_2_34_x86_64.whl (42.5 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.14-cp313-cp313-manylinux_2_34_x86_64.whl (42.5 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.14-cp312-cp312-manylinux_2_34_x86_64.whl (42.5 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.14-cp311-cp311-manylinux_2_34_x86_64.whl (42.5 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.14-cp310-cp310-manylinux_2_34_x86_64.whl (42.5 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

File details

Details for the file ffpa_attn-0.1.14-cp314-cp314-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.14-cp314-cp314-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 683ed3f6061dc1d629aad168c0fd8fd995cfbd9b2743a306f0b6931e3c7e940f
MD5 a3df1419d5a26c012ab9978b46398d9b
BLAKE2b-256 4333f42c2a24a18088133f6aa5c5c55b5ac5cefee5b4c60562a6c26c47e89909

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.14-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.14-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 053242a071f3ecabfabe3d28d2d86f11929dae54082371c6ca83d495db962d61
MD5 0a4ae53a78e8cb56953c74f22787a149
BLAKE2b-256 1a1f33191625f3042f0207a08239d83500a3637db25b3d4eae3471f7e02900ca

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.14-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.14-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 396f5411c1637450256ef12c58a8ba59f05500f55acf66cd767e00188c16e0a0
MD5 e7c5b26822b06ce55cd39b749268df83
BLAKE2b-256 3ea738058b9dfe2440c4143756fac05471e19ef1f2f85d73daebf3693a94fc5d

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.14-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.14-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 7886aa5ca163cbfde1ead65deed0b6835a493ad05cba8a1b8110de4ba548e00b
MD5 3dc3c078a2a2d960a3b2b1eb95cb0158
BLAKE2b-256 a769c2d8411c6de47d8b7e6f961c8a771794e5b96821b1ed89d73dcd4ef26e80

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.14-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.14-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 55d73df48ae3fbd5cc1022bb7c5c9b92653f1d7aad5df9b7eff1f913d213e746
MD5 d365f55ff74529cd902d95f8ea485f34
BLAKE2b-256 3f69a641c282f170e62342b7222de76f9baddd6277ac446d096bab9fe10b653c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page