FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA(Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy, achieving O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|---|---|---|---|---|---|---|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # (support: sm_{80,...,120})
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton + CuTeDSL backends)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn w/ CUDA backend (forward only)
ENABLE_FFPA_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .

Then, accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA does
>>> # not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
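The one-line patch above is global for the whole process. If the replacement should apply only around specific calls, one option is to scope it with a context manager that restores the original attribute on exit. The sketch below is a generic illustration, not part of ffpa-attn: `patched_attr` is a hypothetical helper, and the `fake_F` / `fake_ffpa` stand-ins exist only so the example runs without a GPU. In real use the owner would be `torch.nn.functional` and the replacement would be `ffpa_attn_func`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched_attr(owner, name, replacement):
    """Temporarily replace owner.<name>, restoring the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        # Restore the original even if the body raised.
        setattr(owner, name, original)


# Stand-ins (assumptions for illustration only): in real use `fake_F` would be
# torch.nn.functional and `fake_ffpa` would be ffpa_attn_func.
fake_F = SimpleNamespace(scaled_dot_product_attention=lambda q, k, v: "sdpa")
fake_ffpa = lambda q, k, v: "ffpa"

with patched_attr(fake_F, "scaled_dot_product_attention", fake_ffpa):
    inside = fake_F.scaled_dot_product_attention(None, None, None)
outside = fake_F.scaled_dot_product_attention(None, None, None)
```

Either form of patching only affects lookups made through the module attribute. Code that ran `from torch.nn.functional import scaled_dot_product_attention` before the patch keeps its own reference to the original function.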

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for $QK^\top$ and $PV$ matrix multiplication, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
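The idea behind slicing along the head dimension can be illustrated numerically: $QK^\top$ is a sum of partial products over headdim slices, so only a $B_r \times 16$ slice of each operand has to be staged at a time. The following is a plain-Python sketch of that identity, not the FFPA kernel. The names `scores_split_d` and `staged_elements`, the toy sizes, and the simple "one Q, one K, one V tile" count (which ignores multi-stage buffering) are all assumptions made for illustration.

```python
import random

MMA_K = 16  # slice width along headdim, matching the B_r x 16 tile in the text


def scores_full(q_rows, k_rows):
    """Reference S = Q K^T using whole headdim-length rows."""
    return [[sum(a * b for a, b in zip(q, k)) for k in k_rows] for q in q_rows]


def scores_split_d(q_rows, k_rows, chunk=MMA_K):
    """Accumulate S = Q K^T over headdim slices of width `chunk`."""
    d = len(q_rows[0])
    acc = [[0.0] * len(k_rows) for _ in q_rows]
    for start in range(0, d, chunk):
        # Only these rows x chunk slices of Q and K are needed at this step.
        q_slice = [row[start:start + chunk] for row in q_rows]
        k_slice = [row[start:start + chunk] for row in k_rows]
        for i, q in enumerate(q_slice):
            for j, k in enumerate(k_slice):
                acc[i][j] += sum(a * b for a, b in zip(q, k))
    return acc


def staged_elements(block_rows, width):
    """Elements held for one Q, one K and one V tile of block_rows x width."""
    return 3 * block_rows * width


random.seed(0)
Br, D = 4, 320  # toy sizes; real kernels use larger blocks
Q = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(Br)]
K = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(Br)]

full = scores_full(Q, K)
split = scores_split_d(Q, K)
max_err = max(abs(a - b) for ra, rb in zip(full, split) for a, b in zip(ra, rb))

# Staged footprint with 16-wide slices does not depend on headdim,
# whereas whole-row tiles grow linearly with it.
split_d_footprint = staged_elements(64, MMA_K)
full_row_footprint = staged_elements(64, 1024)
```

Under these assumptions the sliced footprint is the same for any headdim, which is the sense in which the text calls SRAM usage $O(1)$; the accumulators for the output are what grow with $d$ and live in registers.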

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill and large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).

🎉 Benchmark

Runnable examples are provided under examples. Performance benchmarks with large headdims for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) can be found at examples.
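For readers who want to relate a measured kernel time to a TFLOPS figure, the usual convention for attention forward passes counts the two matrix multiplications ($QK^\top$ and $PV$) at 2 FLOPs per multiply-add. The sketch below applies that convention; it is an assumption that the project's benchmarks use the same counting, the function names are hypothetical, and the timing value is a made-up input rather than a measurement.

```python
def attention_forward_flops(batch, heads, seq_q, seq_kv, head_dim):
    """Matmul FLOPs of one forward pass: QK^T plus PV, 2 FLOPs per multiply-add."""
    qk = 2 * seq_q * seq_kv * head_dim
    pv = 2 * seq_q * seq_kv * head_dim
    return batch * heads * (qk + pv)


def achieved_tflops(flops, seconds):
    """Convert a FLOP count and an elapsed time into TFLOPS."""
    return flops / seconds / 1e12


flops = attention_forward_flops(batch=1, heads=32, seq_q=8192, seq_kv=8192, head_dim=512)
# Hypothetical elapsed time, used only to show the conversion.
tflops = achieved_tflops(flops, seconds=0.020)
```

With a causal mask roughly half of the score matrix is skipped, so benchmarks commonly halve the FLOP count in that case.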


🤖 Backends

FFPA supports multiple backends for the forward and backward pass, including: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is currently at an early stage and has some constraints, but it can achieve up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

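| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|---|---|---|---|---|---|---|---|
| SDPA | sm>=75 | | | All | | 1.0x🤗 | sm>=75 |
| CUDA | sm>=80 | | | 320~1024 | | 1.5x~3x🎉 | sm80~89,120 |
| Triton | sm>=80 | | | 320~1024 | | 1.5x~3x🎉 | sm>=80 |
| CuTeDSL | sm>=80 | | | 320~1024 | | 1.5x~2x🎉 | sm80~89,120 |
| CuTeDSL | sm90 | | | 320~512 | | 3x~6x🎉 | sm90 |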

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

How to use different backends for your own scenario? Simply pass one of the backend configs (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func through the backend argument, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())
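When the target GPU is not known in advance, the Recommend and Headdim columns of the table above can be turned into a small lookup. The sketch below is only a transcription of those two columns plus the "forward only" remark about the CUDA backend. `candidate_backends` is a hypothetical helper, not an ffpa-attn API, and `sm` is assumed to be the compute capability written as an integer (for example 89 or 90). It returns backend names as strings; mapping them to the backend config classes is left to the caller.

```python
# Rows transcribed from the Backends table:
# (name, test on the recommended arch, headdim range or None for "All").
BACKEND_ROWS = [
    ("SDPA", lambda sm: sm >= 75, None),
    ("CUDA", lambda sm: 80 <= sm <= 89 or sm == 120, (320, 1024)),
    ("Triton", lambda sm: sm >= 80, (320, 1024)),
    ("CuTeDSL", lambda sm: 80 <= sm <= 89 or sm == 120, (320, 1024)),
    ("CuTeDSL", lambda sm: sm == 90, (320, 512)),
]

# The text describes the CUDA backend as forward only.
FORWARD_ONLY = {"CUDA"}


def candidate_backends(sm, head_dim, need_backward=False):
    """Names of backends whose table row matches this arch and headdim."""
    names = []
    for name, recommended, dims in BACKEND_ROWS:
        if not recommended(sm):
            continue
        if dims is not None and not (dims[0] <= head_dim <= dims[1]):
            continue
        if need_backward and name in FORWARD_ONLY:
            continue
        if name not in names:
            names.append(name)
    return names


hopper_512 = candidate_backends(sm=90, head_dim=512)
hopper_1024 = candidate_backends(sm=90, head_dim=1024)
ada_training = candidate_backends(sm=89, head_dim=512, need_backward=True)
```

The helper lists candidates rather than picking a winner, since the Speedup column gives ranges and the best choice for a given shape is best confirmed by benchmarking.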

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
