FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.8x~3x faster than SDPA EA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑

📈L20 ~1.9x↑🎉 | 📈A30 ~1.8x↑🎉 | 📈3080 ~2.9x↑🎉 | 📈4090 ~2.1x↑🎉

FFPA (Split-D): Yet another Faster Flash Prefill Attention. Its Split-D strategy achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256) and runs 1.8x~3x 🎉 faster than SDPA. 👇Core features:

| Self Attn | GQA | Cross Attn | Causal | Headdim | Fwd (CUDA)↑ | Bwd (Triton)↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq = Nkv) | ✔️ | ✔️(Nq != Nkv) | ✔️ | 320~1024 | 1.8x~3x↑🎉 | 1.5x~2.5x↑🎉 |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# Required: PyTorch>=2.11.0, CUDA>=13.0, Ubuntu>=22.04
pip3 install -U ffpa-attn # (support: sm_{80,90,...,120})
# Or, build ffpa-attn from source, just follow the cmds:
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package and install it with pip
cd ffpa-attn && MAX_JOBS=32 python3 setup.py bdist_wheel
# Optional: build ffpa-attn with ccache for faster rebuilds
apt install ccache && bash tools/build_fast.sh bdist_wheel
# Optional: for editable whl, use `pip install -e .` instead.
pip3 install dist/ffpa_attn-*.whl # pip uninstall ffpa-attn -y
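
After installing, it can be handy to confirm that the wheel actually landed in the active environment. The snippet below is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper written for this example and is not part of ffpa-attn.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "ffpa-attn") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import or exercise the CUDA kernels.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("ffpa-attn is not installed in this environment")
    else:
        print(f"ffpa-attn {version} is installed")
```

This only checks that pip registered the distribution; it says nothing about whether the installed build matches your GPU architecture.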

Then, accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA attention. Everything that
>>> # FFPA does not support automatically falls back to SDPA, for
>>> # example calls with headdim <= 256 or > 1024, attn_mask not
>>> # None, or dropout_p > 0.0, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
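
To make the fallback rule in the comments above concrete, here is a small pure-Python sketch of the conditions it lists. `would_take_ffpa_path` is a hypothetical helper, not an ffpa-attn API, and it covers only the three conditions named above (the "etc." implies the real dispatch checks more).

```python
from typing import Optional

# Bounds taken from the comment above: headdim <= 256 or > 1024 falls back.
MIN_HEADDIM_EXCLUSIVE = 256
MAX_HEADDIM_INCLUSIVE = 1024


def would_take_ffpa_path(
    headdim: int,
    attn_mask: Optional[object] = None,
    dropout_p: float = 0.0,
) -> bool:
    """Hypothetical predicate mirroring the documented fallback conditions.

    Returns True when none of the listed fallback triggers apply, and False
    when the call would be expected to fall back to SDPA.
    """
    if headdim <= MIN_HEADDIM_EXCLUSIVE or headdim > MAX_HEADDIM_INCLUSIVE:
        return False  # headdim outside the large-headdim range
    if attn_mask is not None:
        return False  # explicit attention masks fall back
    if dropout_p > 0.0:
        return False  # dropout falls back
    return True


if __name__ == "__main__":
    for d in (128, 256, 320, 1024, 2048):
        path = "FFPA" if would_take_ffpa_path(d) else "SDPA fallback"
        print(f"headdim={d}: {path}")
```

A predicate like this is mainly useful for logging or for asserting in tests that a given model configuration is expected to hit the fast path.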

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
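
The effect of fixing the tile width at 16 can be seen with some back-of-the-envelope arithmetic. The sketch below is a rough model, not the kernel's actual memory layout: it assumes fp16 elements (2 bytes), one resident tile each for Q, K and V, and it ignores multi-stage buffering, padding and any other shared-memory use. The baseline assumes a FlashAttention-style kernel that keeps full-headdim tiles in SRAM. Both function names are hypothetical.

```python
def full_headdim_tile_bytes(br: int, bc: int, headdim: int, elem_bytes: int = 2) -> int:
    """SRAM bytes if Q (br x d), K (bc x d) and V (bc x d) tiles span the full headdim."""
    return (br + 2 * bc) * headdim * elem_bytes


def split_d_tile_bytes(br: int, bc: int, mma_k: int = 16, elem_bytes: int = 2) -> int:
    """SRAM bytes if each of the Q, K and V tiles is only mma_k columns wide."""
    return (br + 2 * bc) * mma_k * elem_bytes


if __name__ == "__main__":
    br = bc = 64  # assumed tile size for illustration, with B_r = B_c
    for d in (256, 512, 1024):
        full = full_headdim_tile_bytes(br, bc, d)
        split = split_d_tile_bytes(br, bc)
        print(f"d={d}: full-headdim tiles {full // 1024} KiB, Split-D tiles {split // 1024} KiB")
```

Under these assumptions the full-headdim footprint grows linearly with $d$ (384 KiB at $d=1024$ for $B_r=B_c=64$), while the Split-D footprint stays at 6 KiB regardless of $d$, which is what the $O(1)$ claim expresses.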

FFPA enables headdim > 256, and outperforms standard SDPA by 1.8x~3x🎉.

[!NOTE] FFPA has been tested on the Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves 1.8x~3x↑🎉 forward and 1.5x~2.5x↑🎉 backward speedup over SDPA.

🎉 Benchmark

Runnable examples are provided under examples. The performance benchmark for the 4090 with large headdim (D=320~1024) is shown below. Please refer to our bench for more details.

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth},
  year={2025}
}
