FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.8x~3x faster than SDPA EA.
🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑
📈L20 ~1.9x↑🎉 | 📈A30 ~1.8x↑🎉 | 📈3080 ~2.9x↑🎉 | 📈4090 ~2.1x↑🎉
FFPA (Split-D): Yet another Faster Flash Prefill Attention with the Split-D strategy, achieving O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.8x~3x 🎉 faster than SDPA. 👇Core features:
| Self Attn | GQA | Cross Attn | Causal | Headdim | Fwd (CUDA)↑ | Bwd (Triton)↑ |
|---|---|---|---|---|---|---|
| ✔️(Nq = Nkv) | ✔️ | ✔️(Nq != Nkv) | ✔️ | 320~1024 | 1.8x~3x↑🎉 | 1.5x~2.5x↑🎉 |
📖 Quick Start
First, install the prebuilt package from PyPI or build ffpa-attn from source:
# Required: PyTorch>=2.11.0, CUDA>=13.0, Ubuntu>=22.04
pip3 install -U ffpa-attn # (support: sm_{80,90,...,120})
# Or, build ffpa-attn from source, just follow the cmds:
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package and install it with pip
cd ffpa-attn && MAX_JOBS=32 python3 setup.py bdist_wheel
# Optional: build ffpa-attn with ccache for faster rebuilds
apt install ccache && bash tools/build_fast.sh bdist_wheel
# Optional: for editable whl, use `pip install -e .` instead.
pip3 install dist/ffpa_attn-*.whl # pip uninstall ffpa-attn -y
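After either install path, a quick check that the package is visible to the interpreter can save a confusing import error later. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of ffpa-attn, while the distribution name `ffpa-attn` and module name `ffpa_attn` are taken from the commands above and the import in the next snippet.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name, module_name):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration; not part of ffpa-attn.
    """
    # find_spec returns None when the top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the wheel is installed, otherwise None.
print(installed_version("ffpa-attn", "ffpa_attn"))
```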
Then, try to accelerate the attention for large headdim with just one line of code:
>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA attention. Everything that
>>> # FFPA does not support will automatically fall back to SDPA, for
>>> # example when SDPA is called with headdim <= 256 or > 1024,
>>> # attn_mask not None, or dropout_p > 0.0, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
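The fallback rule described in the comments above can be written down as a small predicate. This is only a sketch of the three conditions that are spelled out there (headdim range, `attn_mask`, `dropout_p`); the function names `ffpa_supported` and `choose_backend` are hypothetical, and the real dispatch inside `ffpa_attn_func` may check more conditions (the "etc." above).

```python
def ffpa_supported(headdim, attn_mask=None, dropout_p=0.0):
    """Hypothetical predicate mirroring the fallback conditions listed above."""
    if headdim <= 256 or headdim > 1024:
        return False  # outside the large-headdim range handled by FFPA
    if attn_mask is not None:
        return False  # explicit masks fall back to SDPA
    if dropout_p > 0.0:
        return False  # dropout falls back to SDPA
    return True


def choose_backend(headdim, attn_mask=None, dropout_p=0.0):
    """Return 'ffpa' or 'sdpa' for a given call signature (illustration only)."""
    return "ffpa" if ffpa_supported(headdim, attn_mask, dropout_p) else "sdpa"


for hd in (128, 320, 1024, 2048):
    print(hd, choose_backend(hd))
```

Because unsupported calls are routed back to SDPA, the one-line patch should be safe to apply globally in a model that mixes small and large head dimensions.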
For more advanced features, please refer to our online docs at 📘ffpa-attn.io.
📖 Split-D
We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
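The idea behind Split-D can be illustrated on the CPU: $QK^\top$ is accumulated over headdim slices of width 16, so each step only needs a $B_r \times 16$ slice of Q and of K regardless of $D$. The sketch below is plain Python and not the CUDA kernel; the function names, the block size of 64 used in the footprint comparison, and the random inputs are all assumptions made for illustration.

```python
import random

MMA_K = 16  # slice width along headdim, taken from the B_r x 16 tile in the text


def full_scores(Q, K):
    """Reference S = Q @ K^T using the whole headdim at once."""
    return [[sum(q * k for q, k in zip(qr, kr)) for kr in K] for qr in Q]


def split_d_scores(Q, K, chunk=MMA_K):
    """Accumulate S = Q @ K^T over headdim slices of width `chunk`.

    Each step reads only a (rows x chunk) slice of Q and of K, so the
    staging buffer size does not grow with the headdim.
    """
    d = len(Q[0])
    S = [[0.0] * len(K) for _ in Q]
    for start in range(0, d, chunk):
        q_slice = [row[start:start + chunk] for row in Q]
        k_slice = [row[start:start + chunk] for row in K]
        for i, qr in enumerate(q_slice):
            for j, kr in enumerate(k_slice):
                S[i][j] += sum(q * k for q, k in zip(qr, kr))
    return S


def tile_elems(block_rows, width):
    """Number of elements staged for one tile of shape block_rows x width."""
    return block_rows * width


random.seed(0)
Br, d = 4, 320  # tiny block for the demo; d is a large headdim
Q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(Br)]
K = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(Br)]

ref, got = full_scores(Q, K), split_d_scores(Q, K)
max_err = max(abs(a - b) for ra, rb in zip(ref, got) for a, b in zip(ra, rb))
print("max abs difference:", max_err)

# Staged elements per tile with an assumed block size of 64 rows.
for headdim in (320, 512, 1024):
    print(headdim, tile_elems(64, headdim), tile_elems(64, MMA_K))
```

The two score matrices agree up to floating-point rounding. The per-tile element count for the sliced version stays at $64 \times 16 = 1024$ for every headdim, which is the sense in which the SRAM cost is treated as $O(1)$. The same slicing applies to $PV$; the comparison ignores details such as multi-stage buffering that a real kernel would add.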
[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090), and achieves 1.8×~3×↑🎉 forward and 1.5×~2.5×↑🎉 backward speedup over SDPA.
🎉 Benchmark
Runnable examples are provided under examples. The performance benchmark for the 4090 with large headdim (D=320~1024) is shown below. Please refer to our bench for more details.
©️License
Apache License 2.0
©️Citations
@misc{ffpa-attn@2025,
title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
url={https://github.com/xlite-dev/ffpa-attn.git},
note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
author={DefTruth},
year={2025}
}