FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑


FFPA (Split-D): Yet another Faster Flash Prefill Attention. Its Split-D strategy achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), and runs 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

| Self Attn | GQA/MQA | Cross Attn | Causal/Mask | Dropout | Headdim | Fwd/Bwd |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ✔️(Nq=Nkv) | ✔️(Hq!=Hkv) | ✔️(Nq!=Nkv) | ✔️(attn_mask) | ✔️(p>0) | 320~1024 | 1.5~3x↑ |

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # CUDA 13.0+, PyTorch 2.11+
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton + CuTeDSL backends)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn w/ CUDA backend (forward only)
ENABLE_FFPA_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .

Then accelerate attention for large headdim with a single line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA
>>> # does not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one line of code
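If a process-wide monkey-patch is too broad, the same swap can be limited to a block of code with the standard library's `unittest.mock.patch.object`. The sketch below is only an illustration: `baseline_sdpa`, `fake_ffpa` and the `F` namespace are hypothetical stand-ins so that it runs without a GPU. In real use, `F` would be `torch.nn.functional` and the replacement would be `ffpa_attn_func`.

```python
from types import SimpleNamespace
from unittest import mock


# Hypothetical stand-ins (assumptions, not part of ffpa-attn):
# they only report which implementation was called.
def baseline_sdpa(q, k, v):
    return "sdpa"


def fake_ffpa(q, k, v):
    return "ffpa"


# Stand-in for torch.nn.functional.
F = SimpleNamespace(scaled_dot_product_attention=baseline_sdpa)

# Inside the block the attribute points at the replacement.
with mock.patch.object(F, "scaled_dot_product_attention", fake_ffpa):
    inside = F.scaled_dot_product_attention(None, None, None)

# On exit the original attribute is restored automatically.
outside = F.scaled_dot_product_attention(None, None, None)

print(inside, outside)
```

Either form of patching only affects lookups made through the module attribute. Code that imported `scaled_dot_product_attention` by name before the patch keeps its reference to the original function.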

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for the $QK^\top$ and $PV$ matrix multiplications, an approach we call Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
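To make the Split-D idea concrete, the sketch below accumulates $QK^\top$ over head-dimension slices of width 16 in pure Python and checks the result against a one-shot product. It is a numerical illustration under assumptions of ours, not the FFPA kernels: the helper names (`matmul_nt`, `split_d_scores`, `tile_elems`) are hypothetical, and the tile-size model simply counts one Q, K and V tile with $B_r=B_c$, ignoring multi-stage buffering.

```python
import random


def matmul_nt(a, b):
    """Return a @ b^T for row-major lists of lists."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def split_d_scores(q, k, chunk=16):
    """Accumulate Q @ K^T over head-dim slices of width `chunk`."""
    d = len(q[0])
    scores = [[0.0] * len(k) for _ in q]
    for start in range(0, d, chunk):
        # Only a (rows x chunk) slice of Q and K is needed at a time.
        q_slice = [row[start:start + chunk] for row in q]
        k_slice = [row[start:start + chunk] for row in k]
        partial = matmul_nt(q_slice, k_slice)
        for i, row in enumerate(partial):
            for j, val in enumerate(row):
                scores[i][j] += val
    return scores


def tile_elems(br, width):
    """Elements held by one Q, K and V tile of shape (br x width), Br = Bc."""
    return 3 * br * width


random.seed(0)
Br, D = 4, 512
q = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(Br)]
k = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(Br)]

full = matmul_nt(q, k)        # whole head dim at once
tiled = split_d_scores(q, k)  # 16 columns at a time
max_err = max(abs(x - y) for rf, rt in zip(full, tiled) for x, y in zip(rf, rt))
print(max_err)
```

Under this simplified model, `tile_elems(64, 16)` is 3,072 elements regardless of D, whereas tiles that span the full head dimension at D=512 would hold `tile_elems(64, 512)`, or 98,304 elements.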

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill with large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).

🎉 Benchmark

Runnable benchmarks are provided under bench. Performance results for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) with large headdims can be found at bench.


🤖 Backends

FFPA supports multiple backends for the forward and backward passes: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is currently at an early stage and has some constraints, but it can reach up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

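| Backend | Arch | Fwd | Bwd | Headdim | Autotune | Speedup | Recommend |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SDPA | sm>=75 | | | All | | 1.0x🤗 | sm>=75 |
| CUDA | sm>=80 | | | 320~1024 | | 1.5x~3x🎉 | sm80~89,120 |
| Triton | sm>=80 | | | 320~1024 | | 1.5x~5x🎉 | sm>=80 |
| CuTeDSL | sm>=80 | | | 320~1024 | | 1.5x~2x🎉 | sm80~89,120 |
| CuTeDSL | sm90 | | | 320~512 | | 3x~6x🎉 | sm90 |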

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

How do you choose a backend for your own scenario? Simply pass one of the backend configs (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())
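The Arch, Headdim and Recommend columns of the table above can be read as a simple lookup. The helper below is a hypothetical sketch of that reading, not part of ffpa-attn: `candidate_backends` is our own name, it returns backend class names as strings, and apart from listing CuTeDSL first on sm90 (described above as fastest on H200), the order of the returned list is not a performance ranking.

```python
def candidate_backends(sm: int, headdim: int) -> list[str]:
    """Map (compute capability, headdim) to backend names per the table.

    `sm` is the compute capability as an integer, e.g. 89 or 90.
    This is an assumed reading of the Recommend column, not library logic.
    """
    if sm < 75:
        raise ValueError("the table lists no backend below sm75")
    # Outside 320~1024, or below sm80, only the SDPA baseline is listed.
    if sm < 80 or not 320 <= headdim <= 1024:
        return ["SDPABackend"]
    if sm == 90:
        # The sm90 CuTeDSL row covers headdim 320~512 only.
        if headdim <= 512:
            return ["CuTeDSLBackend", "TritonBackend"]
        return ["TritonBackend"]
    picks = ["TritonBackend"]  # recommended for every sm>=80
    if 80 <= sm <= 89 or sm == 120:
        # The CUDA backend is forward only.
        picks += ["CUDABackend", "CuTeDSLBackend"]
    return picks


print(candidate_backends(90, 512))
print(candidate_backends(89, 1024))
```

In real code, each returned name would be mapped to the corresponding config class and passed as `backend=...` to `ffpa_attn_func`, as in the example above.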

Persistent Autotune

Generate device-specific tuned configs for production deployment (currently Triton only) to avoid the per-process autotune cost. The generated JSON is saved under the configs directory and loaded automatically when runtime autotune is disabled (the default). See the Triton Autotune docs for details.

python -m ffpa_attn.autotune --mode max --full-tasks --overwrite # 1 GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # Multi-GPU (`pip install ray`)
python -m ffpa_attn.autotune --mode max --full-tasks --num-gpus 8 --overwrite

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
