FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA (Split-D): yet another faster flash prefill attention. Its Split-D strategy achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), and runs 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or build ffpa-attn from source with the commands below
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton backend only)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn with CuTeDSL backend
pip3 install -e ".[cutedsl]" --no-build-isolation

Then, accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Anything that FFPA does
>>> # not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
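
Because the one-line patch above is global, it can be useful to scope it to a block of code and restore the original function afterwards. The following is a minimal sketch of that idea, not part of ffpa-attn: `patched_attr`, `fake_sdpa`, `fake_ffpa` and the `SimpleNamespace` stand-in for `torch.nn.functional` are hypothetical names chosen so the example runs without torch or a GPU.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched_attr(namespace, name, replacement):
    """Temporarily replace `namespace.<name>`, restoring it on exit."""
    original = getattr(namespace, name)
    setattr(namespace, name, replacement)
    try:
        yield original
    finally:
        # Always restore, even if the body raises.
        setattr(namespace, name, original)


# Hypothetical stand-ins for SDPA and ffpa_attn_func.
def fake_sdpa(q, k, v):
    return "sdpa"


def fake_ffpa(q, k, v):
    return "ffpa"


# Stand-in for torch.nn.functional (assumption: same attribute name).
F = SimpleNamespace(scaled_dot_product_attention=fake_sdpa)

with patched_attr(F, "scaled_dot_product_attention", fake_ffpa):
    inside = F.scaled_dot_product_attention(None, None, None)
outside = F.scaled_dot_product_attention(None, None, None)

print(inside, outside)  # ffpa sdpa
```

In real use, `F` would be `torch.nn.functional` and the replacement would be `ffpa_attn_func`. Note that code which imported the function by name (`from torch.nn.functional import scaled_dot_product_attention`) keeps its own reference and is not affected by either form of patching.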

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for the $QK^\top$ and $PV$ matrix multiplications, a strategy referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
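
The algebra behind Split-D can be illustrated with a small pure-Python sketch. It relies on two identities: $QK^\top = \sum_c Q_c K_c^\top$ over head-dim slices $c$, and each output slice satisfies $O_c = P V_c$. This is not the FFPA kernel: it omits online softmax, K/V block streaming and MMA instructions, and all function names plus the choice $B_r = 64$ are assumptions made for illustration.

```python
import math
import random

SLICE = 16  # head-dim slice width, matching the B_r x 16 tile in the text


def qk_split_d(q, k, width=SLICE):
    """S = Q @ K^T, accumulated one head-dim slice at a time."""
    n_q, n_k, d = len(q), len(k), len(q[0])
    s = [[0.0] * n_k for _ in range(n_q)]
    for d0 in range(0, d, width):  # only a [rows x width] slice is needed per step
        d1 = min(d0 + width, d)
        for i in range(n_q):
            for j in range(n_k):
                s[i][j] += sum(q[i][t] * k[j][t] for t in range(d0, d1))
    return s


def softmax_rows(s, scale):
    """Numerically stable row-wise softmax of scale * S."""
    p = []
    for row in s:
        scaled = [x * scale for x in row]
        m = max(scaled)
        e = [math.exp(x - m) for x in scaled]
        z = sum(e)
        p.append([x / z for x in e])
    return p


def pv_split_d(p, v, width=SLICE):
    """O = P @ V, produced one head-dim slice of the output at a time."""
    n_q, n_k, d = len(p), len(v), len(v[0])
    o = [[0.0] * d for _ in range(n_q)]
    for d0 in range(0, d, width):
        d1 = min(d0 + width, d)
        for i in range(n_q):
            for t in range(d0, d1):
                o[i][t] = sum(p[i][j] * v[j][t] for j in range(n_k))
    return o


def attention(q, k, v, width=SLICE):
    d = len(q[0])
    p = softmax_rows(qk_split_d(q, k, width), 1.0 / math.sqrt(d))
    return pv_split_d(p, v, width)


def tile_elems(b_r, width):
    """Elements in one [b_r x width] operand tile."""
    return b_r * width


random.seed(0)
N, D = 8, 320
q, k, v = ([[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)] for _ in range(3))

split = attention(q, k, v, width=SLICE)  # head dim processed in slices of 16
full = attention(q, k, v, width=D)       # head dim processed in one pass
max_err = max(abs(a - b) for ra, rb in zip(split, full) for a, b in zip(ra, rb))
print(f"max |split - full| = {max_err:.2e}")

# Per-operand tile size: grows with D for a full-width tile, fixed for Split-D.
for d in (320, 512, 1024):
    print(d, tile_elems(64, d), tile_elems(64, SLICE))
```

The two results agree up to floating-point rounding, while the per-operand tile stays at $64 \times 16 = 1024$ elements regardless of $D$ (under the assumed $B_r = 64$).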

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill and large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).

🎉 Benchmark

Runnable examples are provided under examples. Performance benchmarks with large headdim for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) are shown below:




🤖 Backends

FFPA supports multiple backends for the forward and backward passes: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is at an early stage and has some constraints (e.g., D <= 512), but it can achieve up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

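| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---|
| SDPA | Ampere+ | | | All | | 1.0x | Ampere+ |
| CUDA | Ampere+ | | | 320~1024 | | 1.5x~3x🎉 | Ampere, Ada |
| Triton | Ampere+ | | | 320~1024 | | 1.5x~3x🎉 | Ampere+ |
| CuTeDSL | Hopper | | | 320~512 | | 3x~6x🎉 | Hopper |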

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

To use a different backend for your own scenario, simply pass a backend config (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())
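
The backend table and the note on small shapes can be folded into a simple selection rule. The helper below is a hypothetical sketch, not an ffpa-attn API: `pick_backend_name`, its parameters and the architecture strings are assumptions, and it returns the name of a backend config class rather than constructing one so it runs without the package installed. It leaves out the forward-only CUDA backend for simplicity.

```python
def pick_backend_name(arch: str, head_dim: int, seq_len: int) -> str:
    """Return the name of a backend config class for the given shape.

    Thresholds follow the README: headdim 320~1024 is the supported range,
    CuTeDSL targets Hopper with D <= 512, and FFPA may not beat SDPA for
    N < 512 or D <= 256.
    """
    arch = arch.lower()
    # Outside the documented headdim range, or short sequences: use SDPA.
    if head_dim < 320 or head_dim > 1024 or seq_len < 512:
        return "SDPABackend"
    # CuTeDSL is listed for Hopper only and currently limited to D <= 512.
    if arch == "hopper" and head_dim <= 512:
        return "CuTeDSLBackend"
    # Triton covers Ampere+ for the full 320~1024 range (forward and backward).
    return "TritonBackend"


for case in [("hopper", 512, 4096), ("hopper", 1024, 4096), ("ada", 320, 4096), ("ada", 128, 4096)]:
    print(case, "->", pick_backend_name(*case))
```

With the package installed, the returned name would map to the corresponding class imported from `ffpa_attn` and passed as `backend=...`. Since FFPA already falls back to SDPA for unsupported cases, a rule like this is only needed when the choice should be explicit.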

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
