ffpa-attn

FFPA: Yet another Faster Flash Prefill Attention for large headdim, 1.8x~3x faster than SDPA EA.

🤖FFPA: Yet another Faster Flash Prefill Attention
with O(1)⚡️GPU SRAM complexity for large headdim🐑

📚FFPA(Split-D) Blog | 📈L20 ~1.9x↑🎉 | 📈A30 ~1.8x↑🎉 | 📈3080 ~2.9x↑🎉 | 📈4090 ~2.1x↑🎉

FFPA(Split-D): Yet another Faster Flash Prefill Attention that uses a Split-D strategy to achieve O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), running 1.8x~3x 🎉 faster than SDPA. FFPA currently supports self-attention, cross-attention, grouped/multi-query attention, and causal attention with large headdim (D=320~1024), whereas standard FlashAttention-2 only supports headdim <= 256.

| Self Attention | Cross/Decode Attention | GQA/MQA Attention | Causal Attention | Headdim |
|:---:|:---:|:---:|:---:|:---:|
| ✔️(`Nq = Nkv`) | ✔️(`Nq != Nkv`) | ✔️(`Nh_q % Nh_kv == 0`) | ✔️(`causal mask`) | 32~1024 |

[!NOTE] FFPA has been tested on Ampere, Ada, Hopper, and Blackwell architectures (e.g., A30, L20, 4090, H200, 5090). For Hopper and Blackwell, it still delivers a 1.5×–2.3×↑🎉 speedup over SDPA for headdim > 256.

📖 Quick Start

First, install the prebuilt package from PyPI (required: PyTorch>=2.11.0, CUDA>=13.0):

pip3 install -U ffpa-attn # (support: sm_80, sm_89, sm_90, sm_100, sm_120)

Or, you can build ffpa-attn from source (recommended: PyTorch>=2.11.0, CUDA>=13.0):

git clone https://github.com/xlite-dev/ffpa-attn.git
# Then, build the wheel package and install it with pip
cd ffpa-attn && MAX_JOBS=32 python3 setup.py bdist_wheel
# Optional: build ffpa-attn with ccache for faster rebuilds
apt install ccache && bash tools/build_fast.sh bdist_wheel
# Optional: for editable whl, use `pip install -e .` instead.
pip3 install dist/ffpa_attn-*.whl # pip uninstall ffpa-attn -y

[!NOTE] FFPA supports three variants beyond plain self-attention. Cross-attention: the query seqlen Nq may differ from the key/value seqlen Nkv. GQA / MQA attention: Q has Nh_q heads and K/V have Nh_kv heads (requires Nh_q % Nh_kv == 0; group size = Nh_q / Nh_kv). Causal attention: pass causal=True; queries are aligned to the KV tail, i.e. Q row r attends to k <= r + (Nkv - Nq), which requires Nkv >= Nq. In all cases, K and V must share the same Nh_kv and Nkv.
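The constraints in this note can be checked before calling the kernel. The helper below is a minimal sketch in plain Python: `check_attention_shapes` is a hypothetical name, not part of the `ffpa_attn` API, and it additionally assumes that Q, K and V share the batch size and headdim, as they do in every example in this README.

```python
def check_attention_shapes(q_shape, k_shape, v_shape, causal=False):
    """Validate (B, H, N, D) shapes against the constraints in the note.

    Hypothetical helper for illustration only; not part of ffpa_attn.
    Returns the GQA group size Nh_q // Nh_kv (1 for plain multi-head attention).
    """
    b, nh_q, nq, d = q_shape
    # K and V must share the same Nh_kv and Nkv (same D is an assumption here).
    if tuple(k_shape) != tuple(v_shape):
        raise ValueError("K and V must have identical shapes")
    b_kv, nh_kv, nkv, d_kv = k_shape
    # Assumption: batch size and headdim match between Q and K/V.
    if (b_kv, d_kv) != (b, d):
        raise ValueError("Q and K/V must share batch size and headdim")
    if nh_q % nh_kv != 0:
        raise ValueError("GQA/MQA requires Nh_q % Nh_kv == 0")
    # Causal queries are aligned to the KV tail, which needs Nkv >= Nq.
    if causal and nkv < nq:
        raise ValueError("causal attention requires Nkv >= Nq")
    return nh_q // nh_kv


# GQA shapes used later in this README: 32 query heads, 8 KV heads.
group_size = check_attention_shapes(
    (1, 32, 1024, 512), (1, 8, 4096, 512), (1, 8, 4096, 512)
)
print(group_size)  # 4
```

Raising early with a clear message is a design choice for the sketch; how the library itself reports invalid shapes is not described here.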

Minimal usage example — Self-Attention (B=1, H=32, N=8192, D=512):

import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# D: 32, 64, ..., 320, ..., 1024 (FA-2 <= 256, FFPA supports up to 1024).
B, H, N, D = 1, 32, 8192, 512 # batch_size, num_heads, seq_len, head_dim
q = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")

# FFPA self attention; layout follows SDPA: (B, H, N, D).
out = ffpa_attn_func(q, k, v)  # -> torch.Tensor of shape (B, H, N, D)
print(out.shape, out.dtype)

ref = F.scaled_dot_product_attention(q, k, v)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")

Cross-Attention or Decoding-Attention example (short query, long KV cache; Nq != Nkv):

import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# Short-query / long-KV, e.g. incremental decoding or cross-attention:
# Q: [B, H, Nq, D], K/V: [B, H, Nkv, D]; Nq can differ from Nkv but Nk==Nv required.
B, H, D = 1, 8, 512
Nq, Nkv = 128, 8192
q = torch.randn(B, H, Nq,  D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")

out = ffpa_attn_func(q, k, v)  # -> (B, H, Nq, D) = (1, 8, 128, 512)
print(out.shape, out.dtype)

ref = F.scaled_dot_product_attention(q, k, v)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")

Grouped-Query / Multi-Query Attention example (Q has more heads than K/V):

import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# GQA: Q has Nh_q heads, K/V share Nh_kv heads; group_size = Nh_q / Nh_kv.
# Typical Llama-3-style 32/8 ratio; MQA is the Nh_kv==1 special case.
# FFPA targets large headdim so we use D=512 here (FA-2 tops out at D=256).
B, D, Nq, Nkv = 1, 512, 1024, 4096
Nh_q, Nh_kv = 32, 8  # group_size = 4
q = torch.randn(B, Nh_q,  Nq,  D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, Nh_kv, Nkv, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, Nh_kv, Nkv, D, dtype=torch.bfloat16, device="cuda")

out = ffpa_attn_func(q, k, v)  # -> (B, Nh_q, Nq, D) = (1, 32, 1024, 512)
print(out.shape, out.dtype)

# Reference: replicate K/V along head dim to match Q's head count.
group_size = Nh_q // Nh_kv
k_ref = k.repeat_interleave(group_size, dim=1)
v_ref = v.repeat_interleave(group_size, dim=1)
ref = F.scaled_dot_product_attention(q, k_ref, v_ref)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")
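The `repeat_interleave` reference above implies a specific head mapping: consecutive query heads share one KV head. The sketch below spells that mapping out in plain Python. `kv_head_for_q_head` is a hypothetical name used only for illustration, and the mapping is what the SDPA reference implies rather than a description of the kernel internals.

```python
def kv_head_for_q_head(q_head, nh_q, nh_kv):
    """Return the KV head that a given query head reads from.

    Mirrors k.repeat_interleave(group_size, dim=1): KV head i is repeated
    group_size times in a row, so query head h maps to h // group_size.
    """
    if nh_q % nh_kv != 0:
        raise ValueError("GQA/MQA requires Nh_q % Nh_kv == 0")
    group_size = nh_q // nh_kv
    return q_head // group_size


# 32 query heads over 8 KV heads: heads 0..3 -> KV 0, heads 4..7 -> KV 1, ...
mapping = [kv_head_for_q_head(h, 32, 8) for h in range(32)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]

# MQA is the Nh_kv == 1 special case: every query head reads KV head 0.
print({kv_head_for_q_head(h, 32, 1) for h in range(32)})  # {0}
```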

Causal Attention example (self-attention causal; also supports chunked / decoding prefill with Nkv > Nq):

import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# Causal self-attention: Q row r attends to k <= r (standard triangular mask).
# FFPA is tuned for large headdim, so we keep D=512 as in the self-attn example.
B, H, N, D = 1, 8, 4096, 512
q = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")

out = ffpa_attn_func(q, k, v, causal=True)
print(out.shape, out.dtype)

ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")

# Chunked / decoding prefill: Nq < Nkv, queries aligned to the KV tail
# so Q row r attends to k <= r + (Nkv - Nq). Requires Nkv >= Nq.
Nq, Nkv = 128, 8192
q = torch.randn(B, H, Nq,  D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
out = ffpa_attn_func(q, k, v, causal=True)
print(out.shape, out.dtype)  # (1, 8, 128, 512)
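The tail-aligned rule `k <= r + (Nkv - Nq)` is easier to see on a tiny case. The following plain-Python sketch builds the boolean mask that rule describes; `causal_tail_mask` is a hypothetical helper for illustration and is not part of `ffpa_attn`.

```python
def causal_tail_mask(nq, nkv):
    """Build the mask for causal attention with queries aligned to the KV tail.

    mask[r][k] is True when query row r may attend to key k,
    i.e. when k <= r + (nkv - nq). Requires nkv >= nq.
    """
    if nkv < nq:
        raise ValueError("causal attention requires Nkv >= Nq")
    offset = nkv - nq
    return [[k <= r + offset for k in range(nkv)] for r in range(nq)]


# Two new queries appended to a 5-token KV sequence.
for row in causal_tail_mask(2, 5):
    print("".join("x" if keep else "." for keep in row))
# xxxx.
# xxxxx
```

With `Nq == Nkv` the offset is zero and this reduces to the standard lower-triangular mask. In PyTorch, the same boolean mask can be built with `torch.ones(Nq, Nkv, dtype=torch.bool).tril(diagonal=Nkv - Nq)` and passed to SDPA as `attn_mask`, which is one way to obtain a reference for the `Nq < Nkv` case.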

A runnable end-to-end example (with self-attn, cross-attn, GQA and causal-attn) is provided under examples. The performance snapshot for the NVIDIA L20 with Headdim=512 is listed below:

| Case | dtype | Nq/Nkv | allclose | FFPA / SDPA | speedup |
|:---|:---:|:---:|:---:|:---:|:---:|
| self-attn | fp16 | 8192/8192 | ✅ | 46.7 / 74.7 ms | 1.60x |
| cross-attn | fp16 | 1024/8192 | ✅ | 6.32 / 9.94 ms | 1.57x |
| gqa | fp16 | 8192/8192 | ✅ | 46.4 / 74.8 ms | 1.61x |
| causal | fp16 | 8192/8192 | ✅ | 24.3 / 37.4 ms | 1.54x |
| non-aligned | fp16 | 8191/8191 | ✅ | 12.3 / 19.0 ms | 1.55x |
| self-attn | bf16 | 8192/8192 | ✅ | 46.5 / 74.7 ms | 1.61x |
| cross-attn | bf16 | 1024/8192 | ✅ | 6.29 / 9.95 ms | 1.58x |
| gqa | bf16 | 8192/8192 | ✅ | 46.2 / 74.7 ms | 1.62x |
| causal | bf16 | 8192/8192 | ✅ | 24.2 / 37.5 ms | 1.55x |
| non-aligned | bf16 | 8191/8191 | ✅ | 12.3 / 19.0 ms | 1.55x |

Env: NVIDIA L20, PyTorch 2.11, CUDA 13.0, Headdim=512 (FA-2 not supported)
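The speedup column is the SDPA latency divided by the FFPA latency. As a quick sanity check, the sketch below recomputes it for three of the fp16 rows; the dictionary keys are just labels chosen for this example.

```python
def speedup(ffpa_ms, sdpa_ms):
    """Speedup of FFPA over SDPA: how many times shorter the FFPA latency is."""
    return sdpa_ms / ffpa_ms


# (FFPA ms, SDPA ms) pairs copied from the fp16 rows of the table above.
rows = {
    "self-attn fp16": (46.7, 74.7),
    "cross-attn fp16": (6.32, 9.94),
    "causal fp16": (24.3, 37.4),
}
for name, (ffpa_ms, sdpa_ms) in rows.items():
    print(f"{name}: {speedup(ffpa_ms, sdpa_ms):.2f}x")
```

Because the table lists rounded timings, a ratio recomputed this way can differ from the listed speedup in the last digit.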

📖 Fine-grained Tiling at MMA level

We have extended FlashAttention to large headdim (D > 256) by implementing Fine-grained Tiling at the MMA level (GEMM style) for the Q@K^T and P@V matmuls. With this approach, Q, K, and V each need only a constant-size tile of Br * 16 or Bc * 16 (Br = Bc) in SRAM, which gives an overall SRAM complexity of O(2 * Br * 16) ≈ O(1) and a register complexity of O(d/4). As a result, FFPA can extend headdim beyond 256 and run faster than SDPA with or without MMA Accumulation F32 (1.8x~3x 🎉 faster than SDPA EA).
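The reason a headdim-wide tile never has to be resident at once is that Q@K^T can be accumulated over slices of the headdim. The plain-Python sketch below shows only that arithmetic identity, with a slice width of 16 taken from the Br * 16 figure above. It is not the kernel: it does not model MMA instructions, shared memory, or the online softmax, and the function names are hypothetical.

```python
def matmul_qkt(q, k):
    """Plain S = Q @ K^T for row-major lists: q is (Br, d), k is (Bc, d)."""
    return [[sum(qi[t] * kj[t] for t in range(len(qi))) for kj in k] for qi in q]


def matmul_qkt_split_d(q, k, chunk=16):
    """Same product, accumulated over headdim slices of width `chunk`.

    Only a (Br, chunk) slice of Q and a (Bc, chunk) slice of K are touched
    per step, independent of the full headdim d.
    """
    br, bc, d = len(q), len(k), len(q[0])
    scores = [[0.0] * bc for _ in range(br)]
    for d0 in range(0, d, chunk):
        q_tile = [row[d0:d0 + chunk] for row in q]  # (Br, chunk)
        k_tile = [row[d0:d0 + chunk] for row in k]  # (Bc, chunk)
        partial = matmul_qkt(q_tile, k_tile)
        for i in range(br):
            for j in range(bc):
                scores[i][j] += partial[i][j]
    return scores


# Small integer inputs keep the comparison exact.
Br, Bc, d = 4, 4, 64
q = [[(i + t) % 7 for t in range(d)] for i in range(Br)]
k = [[(2 * j + t) % 5 for t in range(d)] for j in range(Bc)]
print(matmul_qkt_split_d(q, k) == matmul_qkt(q, k))  # True
```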

We have named this new attention tiling technique FFPA: Faster Flash Prefill Attention. FFPA does not introduce any additional VRAM requirement, so the HBM memory complexity remains the same as FlashAttention.

By leveraging this approach, we can achieve better performance than SDPA EA for very large headdim (D > 256, FA-2 not supported). Approximate SRAM and register complexity analysis for FFPA is as follows: (d=headdim, C,Br,Bc=Constant, Br=Bc, let O(C)≈O(1)) 👇

| 📚Complexity Analysis | 📚FFPA Attention (Split-D) | 📚FlashAttention-2 |
|:---:|:---:|:---:|
| SRAM | O(2xBrx16)≈O(1) | ≈O(3xBrxd), d↑ |
| Register | ≈O(d/4), d↑ | ≈O(d/2), d↑ |
| HBM | ≈FA2≈O(Nd), O | ≈O(Nd), O |
| Extra HBM | ≈FA2≈O(N), m,l | ≈O(N), m,l |
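To make the SRAM row concrete, the sketch below evaluates both expressions for a few headdims. The tile size Br = 64 and the 2-byte element size (fp16/bf16) are assumptions chosen for illustration, and the estimate ignores multi-stage buffers and padding, so real kernels would use more than these figures.

```python
BYTES_PER_ELEM = 2  # fp16 / bf16
BR = 64             # assumed tile size, for illustration only


def sram_elements(br, d, mma_k=16):
    """Evaluate the SRAM terms from the table, in elements.

    FFPA (Split-D): 2 * Br * 16, independent of headdim d.
    FlashAttention-2 style: 3 * Br * d, growing linearly with d.
    """
    ffpa = 2 * br * mma_k
    fa2 = 3 * br * d
    return ffpa, fa2


for d in (256, 512, 1024):
    ffpa, fa2 = sram_elements(BR, d)
    ffpa_kib = ffpa * BYTES_PER_ELEM / 1024
    fa2_kib = fa2 * BYTES_PER_ELEM / 1024
    print(f"d={d}: FFPA ~{ffpa_kib:.0f} KiB, FA-2 style ~{fa2_kib:.0f} KiB")
# d=256: FFPA ~4 KiB, FA-2 style ~96 KiB
# d=512: FFPA ~4 KiB, FA-2 style ~192 KiB
# d=1024: FFPA ~4 KiB, FA-2 style ~384 KiB
```

Under these assumptions the Split-D term stays flat while the FlashAttention-2 style term doubles with each doubling of headdim.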

📚Implementation: FFPA is implemented in pure MMA PTX instructions and supports many features, such as Split-Q, SMEM Swizzle/Padding, QKV Multi-Stages(1~4), Tile MMAs/Warps, Mixed MMA F32/F16 Acc (Q@K^T MMA Acc F32 + P@V MMA Acc F16), Fully Shared QKV SMEM, Prefetch QKV g2s, Persist Q s2r/g2s, Fully QKV Fine-grained Tiling(GEMM style), Collective Store, etc.

| ✔️Tensor Cores | ✔️MMA(m16n8k16) | ✔️Tile Block(Br, Bc) | ✔️Tile MMA/Warp |
|:---:|:---:|:---:|:---:|
| ✔️Split Q(FA-2) | ✔️Pack LDST(128 bits) | ✔️SMEM Swizzle/Pad | ✔️Copy Async |
| ✔️Reg Double Buffers | ✔️QKV Multi-Stages(1~4) | ✔️Collective Store(Shfl) | ✔️Prefetch QKV g2s |
| ✔️QKV Fine-grained Tiling | ✔️Shared QKV SMEM | ✔️Mixed MMA Acc | ✔️Persist Q s2r/g2s |

🤔 Why not TMA?

FFPA ships an experimental SM>=SM90 TMA path (tma=True) that replaces the K/V cp.async global-to-shared transfer with cp.async.bulk.tensor.2d + mbarriers. After tuning (K SWIZZLE_128B, 64-col TMA box, decoupled Q/K stage cadence) it reaches parity with the cp.async baseline on D=512, but does not beat it.

The reason is structural: FFPA's split-D dataflow is a TMA anti-pattern. TMA wins when a single-thread instruction can amortise its descriptor + mbarrier + queue cost over a large box, but split-D only gives it narrow Bc x kMmaAtomK slices, while cp.async already moves the same bytes in parallel from all 256 threads in the CTA.
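A rough byte count shows how small these slices are. The sketch below is back-of-the-envelope arithmetic only, under stated assumptions: kMmaAtomK = 16 (from the m16n8k16 MMA shape), 2-byte elements, 16 bytes per thread per cp.async (its largest copy size, matching the 128-bit packed loads listed above), and illustrative Bc values. It does not measure anything and the helper names are hypothetical.

```python
import math

CP_ASYNC_BYTES_PER_THREAD = 16  # largest cp.async copy size (128 bits)
THREADS_PER_CTA = 256           # thread count quoted in the text above


def slice_bytes(bc, mma_atom_k=16, bytes_per_elem=2):
    """Bytes in one Bc x kMmaAtomK slice of K (or V)."""
    return bc * mma_atom_k * bytes_per_elem


def cp_async_rounds(num_bytes):
    """Number of CTA-wide cp.async issue rounds needed to move num_bytes."""
    per_round = CP_ASYNC_BYTES_PER_THREAD * THREADS_PER_CTA
    return math.ceil(num_bytes / per_round)


for bc in (64, 128):  # assumed tile sizes, for illustration only
    nbytes = slice_bytes(bc)
    print(f"Bc={bc}: {nbytes} bytes per slice, {cp_async_rounds(nbytes)} cp.async round(s)")
# Bc=64: 2048 bytes per slice, 1 cp.async round(s)
# Bc=128: 4096 bytes per slice, 1 cp.async round(s)
```

Under these assumptions a whole slice fits in a single round of per-thread copies, which leaves little fixed cost for a bulk TMA transfer to amortise.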

©️License

Apache License 2.0

🎉Contribute

How to contribute? Welcome to star⭐️ this repo to support me👆🏻 ~

©️Citations

@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth},
  year={2025}
}
