
FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.


🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA(Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy, achieving O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.5~3x 🎉 faster than SDPA. 📚👇Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton backend only)
cd ffpa-attn && pip3 install -e . --no-build-isolation

Then, accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA
>>> # does not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
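
The one-line patch above replaces `F.scaled_dot_product_attention` for the whole process. If a narrower scope is preferred, one option is a small context manager that restores the original on exit. The following is a minimal sketch using only the standard library; `patched_attr` is a hypothetical helper (not part of ffpa-attn), and the stand-in functions exist only so the snippet runs without a GPU.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched_attr(obj, name, replacement):
    """Temporarily replace obj.<name> with `replacement`, restoring it on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield original
    finally:
        # Always restore the original attribute, even if the body raised.
        setattr(obj, name, original)


# Stand-ins (assumptions for illustration only). In real use, `fake_F` would be
# torch.nn.functional and `fake_ffpa` would be ffpa_attn.ffpa_attn_func.
def fake_sdpa(q, k, v):
    return "sdpa"


def fake_ffpa(q, k, v):
    return "ffpa"


fake_F = SimpleNamespace(scaled_dot_product_attention=fake_sdpa)

with patched_attr(fake_F, "scaled_dot_product_attention", fake_ffpa):
    inside = fake_F.scaled_dot_product_attention(None, None, None)
outside = fake_F.scaled_dot_product_attention(None, None, None)

print(inside, outside)
```

With the real modules, the equivalent call would be `with patched_attr(F, "scaled_dot_product_attention", ffpa_attn_func): ...`, so only code inside the `with` block is routed through FFPA.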

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
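
To make the Split-D idea concrete, here is a pure-Python sketch (not the FFPA kernel) of computing $S = QK^\top$ for one tile by accumulating over the head dimension in chunks of 16, so that only a $B_r \times 16$ slice of Q and K is needed at any time. The helper names, the toy tile sizes, the fp16 element size (2 bytes), and the single-buffer footprint model are all assumptions made for illustration.

```python
import random

MMA_K = 16  # chunk width along the head dimension (the "16" in B_r x 16)


def qk_scores_split_d(q_tile, k_tile, chunk=MMA_K):
    """Compute Q @ K^T while touching only `chunk` columns of Q and K at a time."""
    br, bc, d = len(q_tile), len(k_tile), len(q_tile[0])
    scores = [[0.0] * bc for _ in range(br)]  # accumulators (registers on a GPU)
    for d0 in range(0, d, chunk):
        # Only these B_r x chunk and B_c x chunk slices must be resident.
        q_slice = [row[d0:d0 + chunk] for row in q_tile]
        k_slice = [row[d0:d0 + chunk] for row in k_tile]
        for i in range(br):
            for j in range(bc):
                scores[i][j] += sum(a * b for a, b in zip(q_slice[i], k_slice[j]))
    return scores


def qk_scores_full(q_tile, k_tile):
    """Reference: full-width dot products over the whole head dimension."""
    return [[sum(a * b for a, b in zip(qr, kr)) for kr in k_tile] for qr in q_tile]


def tile_bytes(rows, cols, bytes_per_elem=2):
    """Footprint of one rows x cols tile, assuming fp16 and no multi-buffering."""
    return rows * cols * bytes_per_elem


rng = random.Random(0)
Br = Bc = 4   # toy tile size so the pure-Python loops stay fast
D = 512       # large head dimension
q = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(Br)]
k = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(Bc)]

split = qk_scores_split_d(q, k)
full = qk_scores_full(q, k)
max_diff = max(abs(a - b) for rs, rf in zip(split, full) for a, b in zip(rs, rf))
print("max |split - full| =", max_diff)

# Per-tile footprint for an illustrative B_r = 64:
print(tile_bytes(64, 512))    # full-width tile, D = 512  -> 65536 bytes
print(tile_bytes(64, 1024))   # full-width tile, D = 1024 -> 131072 bytes
print(tile_bytes(64, MMA_K))  # Split-D slice, any D      -> 2048 bytes
```

The full-width tile grows linearly with D, while the Split-D slice stays the same size, which is the sense in which the SRAM footprint is constant in the head dimension.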

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill with large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).

🎉 Benchmark

Runnable examples are provided under examples. Performance benchmarks with large headdim for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) are shown below:
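
For readers who want to relate a measured latency to a TFLOPS figure, the sketch below uses the common convention of counting the two matrix multiplications ($QK^\top$ and $PV$) at 2 FLOPs per multiply-add. This README does not state which counting convention its benchmarks use, so treat this as an assumption; the function names and the example latency are hypothetical.

```python
def attention_fwd_flops(batch, heads, seq_q, seq_kv, head_dim, causal=False):
    """Forward-pass FLOPs: QK^T and PV each cost 2 * Nq * Nkv * D per head."""
    flops = 4 * batch * heads * seq_q * seq_kv * head_dim
    # A causal mask skips roughly half of the score matrix.
    return flops // 2 if causal else flops


def tflops(flops, seconds):
    """Convert a FLOP count and a wall-clock time into TFLOPS."""
    return flops / seconds / 1e12


# Hypothetical workload and latency, for illustration only.
example_flops = attention_fwd_flops(batch=1, heads=32, seq_q=8192, seq_kv=8192, head_dim=512)
example_seconds = 0.0125  # made-up measurement, not a benchmark result
print(f"{example_flops} FLOPs -> {tflops(example_flops, example_seconds):.1f} TFLOPS")
```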




🤖 Backends

FFPA supports multiple backends for the forward and backward passes: CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is at an early stage and has some constraints (e.g., D=512 only), but it can achieve up to 427🎉 TFLOPS on H200! We will continue to optimize the CuTeDSL backend in the future.

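| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| CUDA | Ampere+ | | | 320~1024 | | 1.5x~3x🎉 | Ampere, Ada |
| Triton | Ampere+ | | | 320~1024 | | 1.5x~3x🎉 | Ampere+ |
| CuTeDSL | Hopper | | | 512 | | 3x~6x🎉 | Hopper |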
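
The constraints in the table can be read as a simple eligibility rule. The sketch below encodes them in plain Python. It is not the library's dispatch logic: `candidate_backends` is a hypothetical helper, the returned ordering is arbitrary, and CuTeDSL backward support is assumed from the backend description above rather than confirmed.

```python
# Architectures named in this README, oldest to newest ("Ampere+" = all of them).
KNOWN_ARCHS = ("ampere", "ada", "hopper", "blackwell")


def candidate_backends(arch, head_dim, need_backward):
    """Return backend names whose documented constraints match the request.

    An empty list means none match, in which case plain SDPA is the fallback.
    """
    arch = arch.lower()
    if arch not in KNOWN_ARCHS:
        return []
    candidates = []
    # CuTeDSL: Hopper only, D = 512 only (backward support is an assumption).
    if arch == "hopper" and head_dim == 512:
        candidates.append("cutedsl")
    if 320 <= head_dim <= 1024:
        # Triton: Ampere+, forward and backward.
        candidates.append("triton")
        # CUDA: Ampere+, forward only.
        if not need_backward:
            candidates.append("cuda")
    return candidates


print(candidate_backends("hopper", 512, need_backward=True))
print(candidate_backends("ada", 320, need_backward=False))
print(candidate_backends("ampere", 128, need_backward=False))
```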

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
