
FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.


🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA(Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy. It achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), and is 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton + CuTeDSL backends)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn w/ CUDA backend (forward only)
ENABLE_FFPA_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .

Then, accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA does
>>> # not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
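The assignment above replaces `F.scaled_dot_product_attention` for the whole process. If you would rather limit the patch to one region of code, a scoped variant is one option. The following is a minimal sketch and not part of ffpa-attn: `patched_attr` is a hypothetical helper, and `fake_sdpa`, `fake_ffpa`, and the `F` namespace are stand-ins so the snippet runs without torch or a GPU.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched_attr(owner, name, replacement):
    """Temporarily replace owner.<name> and restore the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        # Always restore, even if the body raised.
        setattr(owner, name, original)


# Stand-ins (assumptions for illustration only). In real use, `F` would be
# torch.nn.functional and `fake_ffpa` would be ffpa_attn.ffpa_attn_func.
def fake_sdpa(q, k, v):
    return "sdpa"


def fake_ffpa(q, k, v):
    return "ffpa"


F = SimpleNamespace(scaled_dot_product_attention=fake_sdpa)

with patched_attr(F, "scaled_dot_product_attention", fake_ffpa):
    inside = F.scaled_dot_product_attention(None, None, None)

outside = F.scaled_dot_product_attention(None, None, None)
print(inside, outside)
```

Inside the `with` block, calls go to the replacement; afterwards the original function is back in place.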

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for the $QK^\top$ and $PV$ matrix multiplications, a strategy we refer to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
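To get a feel for what a fixed $B_r \times 16$ tile means in bytes, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under stated assumptions, not a description of the actual kernels: it assumes $B_r = 64$, 2-byte (fp16/bf16) elements, and one staged tile each for Q, K and V, and it ignores multi-stage buffering and padding. The helper names are hypothetical.

```python
def tile_bytes(rows: int, cols: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of one rows x cols tile (2 bytes per element assumed)."""
    return rows * cols * bytes_per_elem


def smem_split_d(br: int, k_chunk: int = 16, n_tiles: int = 3) -> int:
    """Split-D style staging: each of Q, K, V holds a Br x 16 tile."""
    return n_tiles * tile_bytes(br, k_chunk)


def smem_full_d(br: int, d: int, n_tiles: int = 3) -> int:
    """Hypothetical comparison: each of Q, K, V holds a full Br x D tile."""
    return n_tiles * tile_bytes(br, d)


BR = 64  # assumed block size, for illustration only
for d in (320, 512, 1024):
    full_kib = smem_full_d(BR, d) / 1024
    split_kib = smem_split_d(BR) / 1024
    print(f"D={d:4d}: full-D tiles {full_kib:.0f} KiB, Split-D tiles {split_kib:.0f} KiB")
```

With these assumed values, the Split-D tiles take 6 KiB regardless of $D$, while full-width tiles would grow from 120 KiB at $D=320$ to 384 KiB at $D=1024$.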

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill and large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).
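If you dispatch manually rather than monkey-patching, the guidance in the note can be turned into a simple guard. This is only a sketch: `prefer_ffpa` is a hypothetical helper, and the thresholds are read directly from the note (N<512, D<=256) and the 320~1024 headdim range in the feature table. Benchmark on your own hardware before relying on it.

```python
def prefer_ffpa(seq_len: int, head_dim: int) -> bool:
    """Heuristic: True when the README suggests FFPA is likely to help.

    Thresholds are assumptions taken from the note above, not from the
    library's internal dispatch logic.
    """
    long_enough = seq_len >= 512          # small N may not benefit
    large_headdim = 256 < head_dim <= 1024  # FFPA targets large headdim
    return long_enough and large_headdim


print(prefer_ffpa(seq_len=4096, head_dim=512))
print(prefer_ffpa(seq_len=256, head_dim=512))
print(prefer_ffpa(seq_len=4096, head_dim=128))
```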

🎉 Benchmark

Runnable examples are provided under examples. Performance benchmarks for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) with large headdims can also be found there.
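When reading TFLOPS figures like these, it helps to know how a measured latency converts to throughput. The sketch below uses the common convention for non-causal attention forward passes, $4 \cdot B \cdot H \cdot N^2 \cdot D$ FLOPs (two matmuls, each $2 \cdot N^2 \cdot D$ per head). This is an assumption: the project's benchmark scripts may count FLOPs differently, and `attn_fwd_tflops` is a hypothetical helper.

```python
def attn_fwd_tflops(batch: int, heads: int, seq_len: int, head_dim: int,
                    latency_ms: float) -> float:
    """Convert a measured forward latency into TFLOPS (non-causal attention).

    FLOP count: QK^T and PV each cost 2 * N * N * D per head.
    """
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


# Example with made-up numbers: B=1, H=32, N=8192, D=512, 20 ms per forward.
print(f"{attn_fwd_tflops(1, 32, 8192, 512, 20.0):.1f} TFLOPS")
```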


🤖 Backends

FFPA supports multiple backends for the forward and backward pass, including: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is currently at an early stage and has some constraints, but it can achieve up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

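| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA | sm>=75 | | | All | | 1.0x🤗 | sm>=75 |
| CUDA | sm>=80 | | | 320~1024 | | 1.5x~3x🎉 | sm80~89,120 |
| Triton | sm>=80 | | | 320~1024 | | 1.5x~5x🎉 | sm>=80 |
| CuTeDSL | sm>=80 | | | 320~1024 | | 1.5x~2x🎉 | sm80~89,120 |
| CuTeDSL | sm90 | | | 320~512 | | 3x~6x🎉 | sm90 |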

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

How do you use different backends for your own scenario? Simply pass a backend config (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # q, k, v are assumed to be existing CUDA tensors in the layout SDPA expects.
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())
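If you want to pick a backend programmatically, one possible reading of the backend table is sketched below. It is a heuristic under stated assumptions, not the library's own selection logic: `pick_backend_name` is a hypothetical helper that returns plain strings (so it runs without ffpa-attn installed), and `sm` is the compute capability as an integer such as 89 or 90.

```python
def pick_backend_name(sm: int, head_dim: int) -> str:
    """Return a backend name suggested by the speedup table (heuristic).

    Assumed rules, read from the table:
      - outside headdim 320~1024, or below sm80: stay on SDPA
      - sm90 with headdim 320~512: CuTeDSL (largest listed speedup)
      - otherwise: Triton (recommended for sm>=80)
    """
    if sm < 80 or not (320 <= head_dim <= 1024):
        return "SDPA"
    if sm == 90 and head_dim <= 512:
        return "CuTeDSL"
    return "Triton"


# In real code, `sm` could be derived from torch.cuda.get_device_capability(),
# e.g. major * 10 + minor; hard-coded here so the sketch runs anywhere.
for sm, d in [(90, 512), (90, 1024), (89, 512), (75, 512)]:
    print(f"sm{sm}, D={d}: {pick_backend_name(sm, d)}")
```

The returned name would then be mapped to the matching config class (SDPABackend, TritonBackend or CuTeDSLBackend) before calling ffpa_attn_func.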

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
