MPS Flash Attention

Flash Attention for PyTorch on Apple Silicon (M1/M2/M3/M4).

O(N) memory instead of O(N²), enabling 100K+ sequence lengths on unified memory.

Performance

Benchmarked on Apple Silicon (M1/M2/M3/M4):

Seq Length   vs PyTorch SDPA   Notes
1024         1.1-2.0x faster   Crossover point
2048         1.7-3.7x faster   Sweet spot
4096         2.0-3.9x faster   Peak performance
8192+        3-4x faster       SDPA often OOMs

Average speedup: 1.8x across all configurations.

Installation

pip install mps-flash-attn

Build from source

git clone --recursive https://github.com/mpsops/mps-flash-attention.git
cd mps-flash-attention

# Build Swift bridge
cd swift-bridge && swift build -c release && cd ..

# Install
pip install -e .

# Set bridge path
export MFA_BRIDGE_PATH=$PWD/swift-bridge/.build/release/libMFABridge.dylib
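
To verify the build, a quick smoke test (a minimal sketch; assumes a working MPS device):

import torch
from mps_flash_attn import flash_attention

# Tiny forward pass to confirm the extension and Swift bridge load.
q = torch.randn(1, 4, 256, 64, device='mps', dtype=torch.float16)
out = flash_attention(q, q, q)
print(out.shape)  # expected: torch.Size([1, 4, 256, 64])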

Usage

Basic Attention

import torch
from mps_flash_attn import flash_attention

# (B, H, N, D) format
q = torch.randn(2, 8, 4096, 64, device='mps', dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device='mps', dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device='mps', dtype=torch.float16)

out = flash_attention(q, k, v)
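
To sanity-check the output numerically, it can be compared against PyTorch's reference SDPA (a sketch; the FP16 tolerances are illustrative):

import torch.nn.functional as F

# Differences should be FP16 rounding noise, not algorithmic divergence.
ref = F.scaled_dot_product_attention(q, k, v)
torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)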

Causal Masking

out = flash_attention(q, k, v, is_causal=True)

Sliding Window (Mistral/Llama 3.2)

# Only attend to last 4096 tokens
out = flash_attention(q, k, v, is_causal=True, window_size=4096)
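
In reference terms, a sliding window is a causal band mask: query i attends only to keys j with i - window_size < j <= i. A small sketch of these assumed semantics (not the kernel itself):

import torch

# Visualize the allowed key positions for N=8, window=4.
N, W = 8, 4
i = torch.arange(N).unsqueeze(1)  # query positions
j = torch.arange(N).unsqueeze(0)  # key positions
band = (j <= i) & (j > i - W)     # True where attention is allowed
print(band.int())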

Quantized KV Cache (2-4x memory savings)

from mps_flash_attn import flash_attention_fp8, quantize_kv_fp8

# Quantize K/V to FP8
k_quant, k_scale = quantize_kv_fp8(k)
v_quant, v_scale = quantize_kv_fp8(v)

# Run attention with quantized KV
out = flash_attention_fp8(q, k_quant, v_quant, k_scale, v_scale)
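
The quoted savings follow from element width: FP8 stores one byte per element versus two for FP16 (2x), and NF4 packs two elements per byte (4x), ignoring the small per-tensor scales. A back-of-the-envelope check:

# KV cache footprint for B=1, H=8, N=100_000, D=64 (K and V together).
elems = 2 * 1 * 8 * 100_000 * 64
print(f"fp16: {elems * 2.0 / 2**20:.0f} MiB")  # 2 bytes/elem
print(f"fp8:  {elems * 1.0 / 2**20:.0f} MiB")  # 1 byte/elem   -> 2x smaller
print(f"nf4:  {elems * 0.5 / 2**20:.0f} MiB")  # 0.5 bytes/elem -> 4x smaller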

100K+ Long Sequences

from mps_flash_attn import flash_attention_chunked

# Process 100K tokens without OOM
q = torch.randn(1, 8, 100000, 64, device='mps', dtype=torch.float16)
k = torch.randn(1, 8, 100000, 64, device='mps', dtype=torch.float16)
v = torch.randn(1, 8, 100000, 64, device='mps', dtype=torch.float16)

out = flash_attention_chunked(q, k, v, chunk_size=8192)
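
Under the hood this relies on the same online-softmax recurrence as Flash Attention: K/V stream through in blocks while a running row max and normalizer are maintained, so the full N×N score matrix never materializes. A pure-PyTorch sketch of the idea (illustrative, not the Metal kernel):

import torch

def chunked_attention_ref(q, k, v, chunk=1024):
    # Stream K/V in blocks, keeping a running row max `m` and
    # normalizer `l` so softmax never needs the full score matrix.
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((*q.shape[:3], 1), float('-inf'), device=q.device, dtype=q.dtype)
    l = torch.zeros((*q.shape[:3], 1), device=q.device, dtype=q.dtype)
    for s in range(0, k.shape[2], chunk):
        kc, vc = k[:, :, s:s+chunk], v[:, :, s:s+chunk]
        scores = (q @ kc.transpose(-2, -1)) * scale
        m_new = torch.maximum(m, scores.amax(-1, keepdim=True))
        p = (scores - m_new).exp()
        corr = (m - m_new).exp()  # rescale previously accumulated stats
        l = l * corr + p.sum(-1, keepdim=True)
        out = out * corr + p @ vc
        m = m_new
    return out / l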

Drop-in SDPA Replacement

from mps_flash_attn import replace_sdpa

replace_sdpa()  # Patches F.scaled_dot_product_attention

# Now all PyTorch attention uses Flash Attention on MPS
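
Anything that routes through F.scaled_dot_product_attention, including standard torch.nn modules, picks up the flash kernel transparently after the patch (a minimal sketch):

import torch
import torch.nn.functional as F
from mps_flash_attn import replace_sdpa

replace_sdpa()
q = torch.randn(1, 8, 2048, 64, device='mps', dtype=torch.float16)
out = F.scaled_dot_product_attention(q, q, q, is_causal=True)  # now flash on MPS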

torch.compile() Support

from mps_flash_attn import register_custom_op

register_custom_op()

@torch.compile
def my_attention(q, k, v):
    return torch.ops.mfa.flash_attention(q, k, v, False, None, None)
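
Calling the compiled function then dispatches through the registered op, so torch.compile does not graph-break on the attention call (same (B, H, N, D) tensors as in the earlier examples):

out = my_attention(q, k, v)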

Training with BF16 Backward

out = flash_attention(q, k, v, bf16_backward=True)  # 2x faster backward
loss = out.sum()
loss.backward()

Benchmarking

# Quick benchmark
python -m mps_flash_attn.benchmark --suite quick

# Full suite with report
python -m mps_flash_attn.benchmark --suite full --output report.html

Or from Python:

from mps_flash_attn.benchmark import run_suite, compare_vs_sdpa

results = run_suite(seq_lengths=[1024, 2048, 4096])
compare_vs_sdpa()

Features

Feature             Status   Notes
Forward pass        ✓        FP16/BF16/FP32
Backward pass       ✓        Full gradient support
Causal masking      ✓        Native kernel support
Attention masks     ✓        Boolean masks
Sliding window      ✓        For local attention models
GQA/MQA             ✓        Grouped-query attention (sketch below)
Quantized KV        ✓        FP8, INT8, NF4
Chunked attention   ✓        100K+ tokens
torch.compile()     ✓        Custom op backend
Dropout             ✗        Not supported
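
For GQA/MQA, a sketch of the usual calling convention. It assumes (as in CUDA flash-attn) that KV heads are broadcast when the query head count is a multiple of the KV head count; check the package docs for the exact contract:

import torch
from mps_flash_attn import flash_attention

# 32 query heads sharing 8 KV heads (4-way grouping).
q = torch.randn(1, 32, 2048, 64, device='mps', dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device='mps', dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device='mps', dtype=torch.float16)
out = flash_attention(q, k, v, is_causal=True)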

Architecture

Python API (mps_flash_attn)
         │
    C++ Extension (mps_flash_attn.mm)
         │ dlopen
    Swift Bridge (MFABridge.swift)
         │
    Metal Flash Attention (kernel generation)
         │
    Metal GPU Shaders
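
The dlopen step means the Swift bridge must be resolvable at runtime. A quick way to confirm the dylib loads (path from the build step above; the bridge's exported symbols are not public API, so only loading is checked):

import ctypes, os

# Resolve the bridge the same way the C++ extension does.
bridge = ctypes.CDLL(os.environ['MFA_BRIDGE_PATH'])
print('bridge loaded')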

Requirements

  • macOS 14 (Sonoma) or newer
  • Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • PyTorch 2.0+

TODO / Future Optimizations

  • Batched kernel dispatch - Currently dispatches B×H separate kernels per attention call. Should use a 3D grid to handle all batch/head pairs in one dispatch (a major perf win for small sequences such as Swin Transformer windows)
  • Fused QKV projection + attention - Single kernel from input to output, avoid intermediate buffers
  • Pre-scaled bias option - Allow passing pre-scaled bias to avoid per-call scaling overhead
  • LoRA fusion - Fuse adapter weights into attention computation

License

MIT

