FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.8x~3x faster than SDPA EA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑

📈L20 ~1.9x↑🎉 | 📈A30 ~1.8x↑🎉 | 📈3080 ~2.9x↑🎉 | 📈4090 ~2.1x↑🎉

FFPA (Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy. It achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256) and runs 1.8~3x 🎉 faster than SDPA. 👇Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal Attn | Headdim | Forward↑ | Backward↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(causal) | 320~1024 | 1.8~3x↑🎉 | 1.5~2.5x↑🎉 |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton backend only)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: build the wheel with both the Triton and CUDA backends.
# Both variables are set on the pip3 command itself so that the build sees them.
ENABLE_FFPA_FWD_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .

Then, accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA attention. Anything that FFPA
>>> # does not support automatically falls back to SDPA, for example
>>> # when headdim <= 256 or headdim > 1024, attn_mask is not None,
>>> # dropout_p > 0.0, or N < 512.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
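
The fallback conditions listed in the comments above can be written out as a small predicate. This is only an illustrative sketch: `ffpa_would_apply` is a hypothetical helper, not part of the ffpa-attn API, and it encodes only the conditions named above (the real dispatch may check more, such as dtype or device).

```python
def ffpa_would_apply(headdim, seq_len, attn_mask=None, dropout_p=0.0):
    """Hypothetical helper: mirrors the fallback rules quoted in the README.

    Returns True when a call falls inside the range FFPA is described as
    handling, and False when the README says it falls back to SDPA.
    """
    if headdim <= 256 or headdim > 1024:
        return False  # headdim outside (256, 1024]
    if attn_mask is not None:
        return False  # explicit attention masks are listed as unsupported
    if dropout_p > 0.0:
        return False  # dropout is listed as unsupported
    if seq_len < 512:
        return False  # short sequences (N < 512) fall back
    return True


# A prefill-sized call with a large headdim is in range ...
print(ffpa_would_apply(headdim=512, seq_len=4096))
# ... while a standard headdim of 128 would be routed back to SDPA.
print(ffpa_would_apply(headdim=128, seq_len=4096))
```

A check like this can help when reading benchmark numbers: if a model's attention shapes never satisfy it, the patched function should behave like plain SDPA and no speedup is expected.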

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
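
The arithmetic idea behind splitting along the head dimension can be checked in a few lines of plain Python. The sketch below is not the FFPA kernel: the function names are made up for illustration, it covers only the $QK^\top$ product, and it ignores softmax, $PV$, and all GPU details. It shows that accumulating the scores over 16-wide slices of $d$ gives the same result as the full product, while each step only touches a `rows x 16` slice of Q and K, which is why the resident tile size does not grow with $d$.

```python
import random


def scores_full(Q, K):
    """Reference: S[i][j] is the dot product over the whole head dimension."""
    return [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
            for q_row in Q]


def scores_split_d(Q, K, chunk=16):
    """Accumulate S = Q K^T over `chunk`-wide slices of the head dimension."""
    d = len(Q[0])
    S = [[0.0] * len(K) for _ in Q]
    for start in range(0, d, chunk):
        stop = min(start + chunk, d)
        # Only these (rows x chunk) slices are needed for this step,
        # independent of how large d is.
        q_tile = [row[start:stop] for row in Q]
        k_tile = [row[start:stop] for row in K]
        for i, q_row in enumerate(q_tile):
            for j, k_row in enumerate(k_tile):
                S[i][j] += sum(q * k for q, k in zip(q_row, k_row))
    return S


random.seed(0)
d = 320  # a "large headdim" value from the supported range
Q = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(4)]
K = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(4)]

ref = scores_full(Q, K)
out = scores_split_d(Q, K)
max_err = max(abs(a - b) for r, o in zip(ref, out) for a, b in zip(r, o))
print(f"max abs difference: {max_err:.2e}")
```

The two results differ only by floating-point rounding, since the chunked version changes the order of summation.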

FFPA enables headdim > 256, and outperforms standard SDPA by 1.8~3x🎉.

[!NOTE] FFPA has been tested on the Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves 1.8~3×↑🎉 forward and 1.5~2.5×↑🎉 backward pass speedup over SDPA. Currently, FFPA is mainly designed for prefill (N>=512) and large headdim (D>256), and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).

🎉 Benchmark

Runnable examples are provided under examples. The performance benchmark for the 4090 with large headdim (D=320~1024) is shown below. Please refer to our bench for more details.

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth},
  year={2025}
}
