Skip to main content

FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.8x~3x faster than SDPA EA.

Project description

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑

📈L20 ~1.9x↑🎉 | 📈A30 ~1.8x↑🎉 | 📈3080 ~2.9x↑🎉 | 📈4090 ~2.1x↑🎉

FFPA(Split-D): Yet another Faster Flash Prefill Attention with Split-D strategy, achieve O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.8x~3x 🎉 faster than SDPA. Currently, FFPA supports self-attention, cross-attention, grouped/multi-query attention, causal attention with large headdim (D=320~1024). While the standard FlashAttention-2 only support headdim <= 256.

Self Attention Cross/Decode Attention GQA/MQA Attention Causal Attention Headdim
✔️(Nq = Nkv) ✔️(Nq != Nkv) ✔️(Nh_q % Nh_kv == 0) ✔️(causal mask) 32~1024

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090), achieves 1.8×~3×↑🎉 forward (CUDA) and 1.5×~2.5×↑🎉 backward (Triton w/ autotune) speedup over SDPA for headdim > 256.

📖 Quick Start

First, install the prebuilt package from PyPI or build ffpa-attn from source:

# Required: PyTorch>=2.11.0, CUDA>=13.0, Ubuntu>=22.04
pip3 install -U ffpa-attn # (support: sm_{80,90,...,120})
# Or, build ffpa-attn from source, just follow the cmds:
git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package and install it with pip
cd ffpa-attn && MAX_JOBS=32 python3 setup.py bdist_wheel
# Optional: build ffpa-attn with ccache for faster rebuilds
apt install ccache && bash tools/build_fast.sh bdist_wheel
# Optional: for editable whl, use `pip install -e .` instead.
pip3 install dist/ffpa_attn-*.whl # pip uninstall ffpa-attn -y

Then, try to accelerate your attention computations with just ♥️one line♥️ of code ~

>>> import torch.nn.functional as F
>>> from ffpa_attn import ffpa_attn_func
>>> # Monkey-patch SDPA to point to FFPA attention. Every thing that
>>> # FFPA does not support will automatically fallback to SDPA. For
>>> # example, if the user calls SDPA with headdim <= 256, attn_mask
>>> # not None, and dropout_p > 0.0, it will fallback to the SDPA.
>>> F.scaled_dot_product_attention = ffpa_attn_func

For more advanced features, please refer to our online docs at 📘ffpa-attn.io.

📖 Split-D

We have extended FlashAttention for large headdim (D > 256) by implementing Fine-grained Tiling at the MMA level (GEMM style) for the Q@K^T and P@V matmul (namely, Split-D). This approach results in a constant SRAM usage of Br * 16 or Bc * 16 (Br = Bc) for Q, K, and V, leading to an overall SRAM complexity of O(Br * 16) ≈ O(1) and a register complexity of O(d/4). Consequently, this method allows us to extend headdim > 256 and achieve faster performance compared to SDPA with or without MMA Accumulation F32 (1.8x~3x 🎉 faster than SDPA EA).

We have named this new attention tiling technique FFPA: Faster Flash Prefill Attention. FFPA does not introduce any additional VRAM requirement, so the HBM memory complexity remains the same as FlashAttention.

By leveraging this approach, we can achieve better performance than SDPA EA for very large headdim (D > 256, FA-2 not supported). Approximate SRAM and register complexity analysis for FFPA is as follows: (d=headdim, C,Br,Bc=Constant, Br=Bc, let O(C)≈O(1)) 👇

📚Complexity Analysis 📚FFPA Attention (Split-D) 📚FlashAttention-2
SRAM O(2xBrx16)≈O(1) ≈O(3xBrxd), d↑
Register ≈O(d/4), d↑ ≈O(d/2), d↑
HBM ≈FA2≈O(Nd), O ≈O(Nd), O
Extra HBM ≈FA2≈O(N), m,l ≈O(N), m,l

🎉 Benchmark

Runnable examples are provided under examples. The performance benchmark for the NVIDIA RTX 4090 with large headdim (D=320~1024) is shown below, where FFPA achieves up to 2.1x 🎉 faster than SDPA. For more comprehensive benchmarks, please refer to our benchmark.

🤔 Why not TMA?

FFPA ships an experimental SM90 TMA path (enable_tma=True) that replaces the K/V cp.async global-to-shared transfer with cp.async.bulk.tensor.2d. After tuning (K SWIZZLE_128B, 64-col TMA box) it reaches parity with the cp.async baseline, but does not beat it.

FFPA's Split-D dataflow is a TMA anti-pattern. TMA wins when single thread instruction can amortise its dispatch cost over a large box, but split-D gives it narrow Bc x kMmaAtomK slices. It would require a major redesign (super-tiled K/V on TMA + warp-specialized WGMMA), rather than a drop-in K/V replacement.

©️License

Apache License 2.0

🎉Contribute

How to contribute? Wecome to star⭐️ this repo to support me👆🏻 ~

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth},
  year={2025}
}

📖 References

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

ffpa_attn-0.1.4-cp314-cp314-manylinux_2_34_x86_64.whl (55.8 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.4-cp313-cp313-manylinux_2_34_x86_64.whl (55.8 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.4-cp312-cp312-manylinux_2_34_x86_64.whl (55.8 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.4-cp311-cp311-manylinux_2_34_x86_64.whl (55.8 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

ffpa_attn-0.1.4-cp310-cp310-manylinux_2_34_x86_64.whl (55.8 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

File details

Details for the file ffpa_attn-0.1.4-cp314-cp314-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.4-cp314-cp314-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 08bd4fade61b22427bce48b8e6e055f9bdb76d744a90d61c79af54aa9ca01aca
MD5 a42f31dc67035d5a398930c341e9d458
BLAKE2b-256 5fe77580b9a99bbc075febe63af76cf197b0a36a0e540e23c8613aeab529b929

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.4-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.4-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 3776573ed33dc4dcbbc0b792aeb63b06f106570b88cf0f006e0f4a572b1b05bd
MD5 38bf8b9b3b889d232324b1b9183bcceb
BLAKE2b-256 8348531654ad9aae26261d687667e6d7b7f0a82703081d359d04c50869dda480

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.4-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.4-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 07bc04515fea7d86e427d738040f99ebd84e98d9e2c3df05059d59dd7da00e59
MD5 bba2d3331f739ca404d845cca97eabed
BLAKE2b-256 d4fa1a9a2685d6b914c8303ab99e5d62c88b7c21c0138a34fb2539458b6d73fb

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.4-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.4-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 77643b291b3b17522f493c5862ee7578eac66babe50194011518bd3d5304f19a
MD5 11997af85237ff09f8a70e057708e519
BLAKE2b-256 147fd8ef9e646c07e0d0da14c01705c46404c2047e646bd432ebc48f1cd84924

See more details on using hashes here.

File details

Details for the file ffpa_attn-0.1.4-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for ffpa_attn-0.1.4-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 21b68a0ec2f778c0498314ff6d23da377bbaedc5e2324edfc942c5f2f18e9fea
MD5 3e6c4752bdc77c880baee2cfed30f292
BLAKE2b-256 53c9a2848caaa40e50d435aebc7b29e99305b092781955384c431612ea5f371c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page