ffpa-attn

FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.5~3x faster than SDPA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑

FFPA (Split-D): Yet another Faster Flash Prefill Attention with a Split-D strategy that achieves O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.5~3x 🎉 faster than SDPA. 📚👇 Core features:

Self Attn	GQA/MQA	Cross Attn	Causal/Mask	Dropout	Headdim	Fwd/Bwd
✔️(`Nq=Nkv`)	✔️(`Hq!=Hkv`)	✔️(`Nq!=Nkv`)	✔️(`attn_mask`)	✔️(`p>0`)	320~1024	1.5~3x↑
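
To make the GQA/MQA column concrete: when `Hq != Hkv`, several query heads share one key/value head. The sketch below shows the conventional grouping rule, assuming `Hq` is a multiple of `Hkv`. The helper name `kv_head_for` is illustrative and not part of ffpa-attn, and how FFPA maps heads internally is not stated here.

```python
def kv_head_for(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the KV head index shared by a query head (conventional GQA grouping)."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# GQA: 8 query heads share 2 KV heads, so each KV head serves a group of 4.
gqa_map = [kv_head_for(h, 8, 2) for h in range(8)]

# MQA is the special case of a single KV head shared by every query head.
mqa_map = [kv_head_for(h, 8, 1) for h in range(8)]
```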

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# First, install the prebuilt package from PyPI
pip3 install -U ffpa-attn # CUDA 13.0+, PyTorch 2.11+
# Or, build ffpa-attn from source, just follow the cmds
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package (Triton + CuTeDSL backends)
cd ffpa-attn && pip3 install -e . --no-build-isolation
# Optional: install ffpa-attn w/ CUDA backend (forward only)
ENABLE_FFPA_CUDA_IMPL=1 MAX_JOBS=32 pip3 install -e .

Then accelerate attention for large headdim with just one line of code:

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA. Everything that FFPA does
>>> # not support automatically falls back to SDPA: D <= 256, etc.
>>> F.scaled_dot_product_attention = ffpa_attn_func # one-line code
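
The assignment above patches SDPA for the whole process. If you would rather limit the patch to one region of code, a scoped variant is one option. The sketch below is a generic Python pattern, not an ffpa-attn API: `patched_attr` is a hypothetical helper, and `fake_F` and `fake_ffpa_attn_func` are stand-ins so the example runs without torch or ffpa-attn installed. In real code you would pass `torch.nn.functional` and `ffpa_attn_func` instead.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched_attr(module, name, replacement):
    """Temporarily replace module.<name>, restoring the original on exit."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield original
    finally:
        setattr(module, name, original)


# Stand-ins (assumptions for illustration only).
fake_F = SimpleNamespace(scaled_dot_product_attention=lambda q, k, v: "sdpa")
fake_ffpa_attn_func = lambda q, k, v: "ffpa"

with patched_attr(fake_F, "scaled_dot_product_attention", fake_ffpa_attn_func):
    inside = fake_F.scaled_dot_product_attention(None, None, None)

# After the block exits, the original function is back in place.
outside = fake_F.scaled_dot_product_attention(None, None, None)
```

Note that code which imported `scaled_dot_product_attention` by name before the patch keeps its own reference to the original function, so either form of patching only affects lookups made through the module attribute.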

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We extend FlashAttention to support large headdim ($D>256$) via fine-grained tiling at the MMA level for the $QK^\top$ and $PV$ matrix multiplications, referred to as Split-D. This design keeps SRAM usage fixed at $B_r \times 16$ (with $B_r=B_c$) for Q, K and V, yielding constant SRAM complexity $O(B_r \times 16) \approx O(1)$ and register complexity $O(d/4)$.
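
The idea can be illustrated on the CPU. The sketch below is a plain-Python illustration rather than the FFPA kernel: it computes $S = QK^\top$ by accumulating over head-dim slices of width 16, so the staged tile is always $B_r \times 16$ no matter how large the head dimension is. The function names and the tiny $B_r = B_c = 4$ tile are assumptions chosen to keep the example small.

```python
import random

CHUNK = 16  # slice width along the head dimension, from the B_r x 16 tile above


def qk_scores_split_d(Q, K, chunk=CHUNK):
    """Compute S = Q @ K^T by accumulating over head-dim slices of width `chunk`."""
    br, bc, d = len(Q), len(K), len(Q[0])
    S = [[0.0] * bc for _ in range(br)]
    max_staged = 0  # largest Q tile (in elements) held at any one time
    for start in range(0, d, chunk):
        # Only these slices need to be staged at once (the "SRAM" tiles).
        q_tile = [row[start:start + chunk] for row in Q]
        k_tile = [row[start:start + chunk] for row in K]
        max_staged = max(max_staged, len(q_tile) * len(q_tile[0]))
        for i in range(br):
            for j in range(bc):
                S[i][j] += sum(a * b for a, b in zip(q_tile[i], k_tile[j]))
    return S, max_staged


def qk_scores_full(Q, K):
    """Reference: full dot products over the whole head dimension."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]


random.seed(0)
br = bc = 4
results = {}
for d in (320, 1024):
    Q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(br)]
    K = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(bc)]
    S_split, staged = qk_scores_split_d(Q, K)
    S_full = qk_scores_full(Q, K)
    max_err = max(
        abs(a - b) for ra, rb in zip(S_split, S_full) for a, b in zip(ra, rb)
    )
    results[d] = (staged, max_err)
```

For both `d = 320` and `d = 1024`, the staged Q tile holds `4 * 16 = 64` elements and the split result matches the full product up to floating-point rounding.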

FFPA enables headdim > 256, and outperforms standard SDPA by 1.5~3x🎉.

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090) and achieves a 1.5~3×↑🎉 speedup over SDPA. FFPA is mainly designed for prefill with large headdim, and may not be faster than SDPA for 😈 small sequence lengths (N<512) or small headdim (D<=256).
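
If you want to encode that rule of thumb in your own dispatch logic, a minimal sketch follows. The helper `likely_benefits_from_ffpa` is hypothetical, not part of ffpa-attn. It only restates the thresholds from the note, so benchmark on your own hardware before relying on it.

```python
def likely_benefits_from_ffpa(seq_len: int, head_dim: int) -> bool:
    """Rule of thumb from the note: prefill-sized N and large head dimension."""
    return seq_len >= 512 and head_dim > 256


long_prefill_large_d = likely_benefits_from_ffpa(seq_len=8192, head_dim=512)
short_sequence = likely_benefits_from_ffpa(seq_len=256, head_dim=512)
small_headdim = likely_benefits_from_ffpa(seq_len=8192, head_dim=256)
```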

🎉 Benchmark

Runnable benchmarks are provided under bench, together with performance results for large headdims on the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉).
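
To relate a measured latency to a TFLOPS figure, the sketch below uses the FLOP count commonly used for non-causal attention forward: two matmuls ($QK^\top$ and $PV$), each costing $2 \cdot N_q \cdot N_{kv} \cdot D$ FLOPs per head. Whether the scripts under bench use exactly this convention is an assumption, and the 20 ms timing is an illustrative input, not a measurement.

```python
def attention_fwd_tflops(batch, heads, seq_q, seq_kv, head_dim, seconds):
    """Achieved TFLOPS for a non-causal attention forward pass."""
    # Q @ K^T and P @ V each cost 2 * Nq * Nkv * D FLOPs per (batch, head).
    flops = 4 * batch * heads * seq_q * seq_kv * head_dim
    return flops / seconds / 1e12


# Illustrative numbers only: B=1, H=32, N=8192, D=512, 20 ms per forward pass.
example = attention_fwd_tflops(
    batch=1, heads=32, seq_q=8192, seq_kv=8192, head_dim=512, seconds=0.02
)
```

With these inputs the FLOP count is `2**42`, which works out to roughly 219.9 TFLOPS.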

🤖 Backends

FFPA supports multiple backends for the forward and backward passes: SDPA (baseline), CUDA (forward only), Triton, and CuTeDSL. The CuTeDSL backend is still at an early stage and has some constraints, but it can reach up to 427🎉 TFLOPS on H200! Stay tuned for future updates.

Backend	Arch	Fwd	Bwd	Headdim	Autotune	Speedup	Recommend
SDPA	sm>=75	✔	✔	All	❌	1.0x🤗	sm>=75
CUDA	sm>=80	✔	❌	320~1024	❌	1.5x~3x🎉	sm80~89,120
Triton	sm>=80	✔	✔	320~1024	✔	1.5x~5x🎉	sm>=80
CuTeDSL	sm>=80	✔	✔	320~1024	❌	1.5x~2x🎉	sm80~89,120
CuTeDSL	sm90	✔	✔	320~512	❌	3x~6x🎉	sm90
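
The table can be read as a filter over architecture, backward support, and head dimension. The sketch below transcribes the Arch, Bwd, and Headdim columns literally. `candidate_backends` and the `ROWS` layout are hypothetical helpers for illustration, not an ffpa-attn API. The Recommend column is not encoded, and treating the sm90 row as an extra path alongside the generic CuTeDSL row is an assumption.

```python
ROWS = [
    # (name, arch predicate, has backward, (min_headdim, max_headdim) or None for all)
    ("SDPA", lambda sm: sm >= 75, True, None),
    ("CUDA", lambda sm: sm >= 80, False, (320, 1024)),
    ("Triton", lambda sm: sm >= 80, True, (320, 1024)),
    ("CuTeDSL", lambda sm: sm >= 80, True, (320, 1024)),
    ("CuTeDSL (sm90 path)", lambda sm: sm == 90, True, (320, 512)),
]


def candidate_backends(sm, head_dim, need_backward=False):
    """List table rows whose arch, backward, and headdim constraints all match."""
    names = []
    for name, arch_ok, has_bwd, dims in ROWS:
        if not arch_ok(sm):
            continue
        if need_backward and not has_bwd:
            continue
        if dims is not None and not (dims[0] <= head_dim <= dims[1]):
            continue
        names.append(name)
    return names


# Ada (sm89), D=512, training: the forward-only CUDA backend drops out.
ada_training = candidate_backends(sm=89, head_dim=512, need_backward=True)

# Hopper (sm90), D=1024, inference: the sm90-specific path stops at D=512.
hopper_large_d = candidate_backends(sm=90, head_dim=1024)
```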

Special thanks to Butterfingrz for contributing to the CuTeDSL backend! Awesome work!🎉

To choose a backend for your own scenario, pass one of the backend configs (SDPABackend, CUDABackend, TritonBackend or CuTeDSLBackend) to ffpa_attn_func, for example:

>>> from ffpa_attn import ffpa_attn_func, CuTeDSLBackend
>>> # CuTeDSL backend, D=512 scenario, fastest on H200!🎉
>>> o = ffpa_attn_func(q, k, v, backend=CuTeDSLBackend())

Persistent Autotune

Generate device-specific tuned configs for production deployment (currently Triton only) to avoid the per-process autotune cost. The generated JSON is saved under the configs directory and loaded automatically when runtime autotune is disabled (the default). See the Triton Autotune docs for details.

python -m ffpa_attn.autotune --mode max --full-tasks --overwrite # 1 GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # Multi-GPU (`pip install ray`)
python -m ffpa_attn.autotune --mode max --full-tasks --num-gpus 8 --overwrite
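
The general pattern behind persistent autotune is "tune once, save to JSON, load at startup". The sketch below shows that pattern only. The per-device file naming, the config keys (`BLOCK_M`, `BLOCK_N`, `num_warps`), and the helper functions are all hypothetical and do not describe the actual file layout used by ffpa-attn. See the project docs for that.

```python
import json
import tempfile
from pathlib import Path


def _config_path(config_dir, device_name):
    # Hypothetical naming scheme: one JSON file per device.
    return Path(config_dir) / f"{device_name.replace(' ', '_')}.json"


def save_tuned_config(config_dir, device_name, config):
    """Persist a tuned config so later processes can skip autotuning."""
    path = _config_path(config_dir, device_name)
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


def load_tuned_config(config_dir, device_name, default):
    """Load a persisted config, or fall back to a default if none exists."""
    path = _config_path(config_dir, device_name)
    if path.exists():
        return json.loads(path.read_text())
    return default


with tempfile.TemporaryDirectory() as tmp:
    save_tuned_config(
        tmp, "NVIDIA L20", {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 8}
    )
    tuned = load_tuned_config(tmp, "NVIDIA L20", default={})
    missing = load_tuned_config(tmp, "NVIDIA H200", default={"BLOCK_M": 64})
```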

©️License

Apache License 2.0

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth, Butterfingrz},
  year={2025}
}
