MPS Flash Attention

Flash Attention for PyTorch on Apple Silicon (M1/M2/M3/M4).

O(N) memory instead of O(N²), enabling 8K+ sequence lengths on unified memory.
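The O(N) figure comes from never materializing the full N x N score matrix. The sketch below illustrates that general idea (a streaming softmax over key/value blocks) for a single query row in plain Python. It is a toy reference under simplifying assumptions, not the Metal kernel this package ships, and the function names are made up for illustration.

```python
import math


def naive_attention(q, K, V):
    """Reference: compute all N scores for one query row, then normalize."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def streaming_attention(q, K, V, block=2):
    """Visit keys/values block by block, keeping only a running max,
    a running normalizer, and a running weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block):
        k_blk = K[start:start + block]
        v_blk = V[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (exp(-inf) == 0.0
        # on the first block, so nothing is carried over).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point rounding; the streaming version only ever holds one block of scores at a time.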

Features

Forward pass: 2-5x faster than PyTorch SDPA
Backward pass: Full gradient support for training
Causal masking: Native kernel support (only 5% overhead)
FP16/FP32: Native fp16 output (no conversion overhead)
Pre-compiled kernels: Zero-compilation cold start (~6ms)

Performance

Tested on M1 Max, N=2048, B=4, H=8, D=64:

Operation	MPS Flash Attn	PyTorch SDPA	Speedup
Forward	5.3ms	15ms	2.8x
Forward+Backward	55ms	108ms	2.0x
Memory	80MB	592MB	7.4x less
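To see where the memory gap comes from, the arithmetic below sizes the tensors for the benchmark shape. It is a back-of-the-envelope sketch assuming fp16 storage and a single materialized score matrix; it does not try to reproduce the measured 80MB and 592MB figures, which include allocations this estimate ignores.

```python
def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


B, H, N, D = 4, 8, 2048, 64   # benchmark shape from the table above
FP16 = 2                      # bytes per element (assumption: fp16 everywhere)
MIB = 1024 ** 2

score_matrix = tensor_bytes((B, H, N, N), FP16)   # grows with N squared
qkvo = 4 * tensor_bytes((B, H, N, D), FP16)       # Q, K, V, O grow with N

print(f"N x N scores: {score_matrix / MIB:.0f} MiB")  # 256 MiB
print(f"Q, K, V, O:   {qkvo / MIB:.0f} MiB")          # 32 MiB
```

Doubling N doubles the Q/K/V/O total but quadruples the score matrix, which is why avoiding that matrix matters more as sequences get longer.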

Installation

Prerequisites

macOS 14 (Sonoma) or later
Xcode Command Line Tools (xcode-select --install)
Python 3.10+ with PyTorch 2.0+

Build from source

# Clone with submodules
git clone --recursive https://github.com/user/mps-flash-attention.git
cd mps-flash-attention

# Build Swift bridge
cd swift-bridge
swift build -c release
cd ..

# Install Python package
pip install -e .

Set environment variable

export MFA_BRIDGE_PATH=/path/to/mps-flash-attention/swift-bridge/.build/release/libMFABridge.dylib
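A missing or mistyped path is an easy mistake here, so it can help to check the variable before importing the package. The helper below is a hypothetical convenience written for this README, not part of mps_flash_attn, and whether the package performs a similar check itself is not stated here.

```python
import os
from pathlib import Path


def resolve_bridge_path(environ=os.environ):
    """Return the dylib path from MFA_BRIDGE_PATH, or raise a clear error."""
    raw = environ.get("MFA_BRIDGE_PATH", "")
    if not raw:
        raise RuntimeError(
            "MFA_BRIDGE_PATH is not set; point it at libMFABridge.dylib"
        )
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"MFA_BRIDGE_PATH points to a missing file: {path}")
    return path
```

Calling `resolve_bridge_path()` at the top of a training script turns a confusing load failure into an immediate, readable error.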

Usage

Basic usage

import torch
from mps_flash_attn import flash_attention

# Standard attention layout: (B, H, N, D)
q = torch.randn(2, 8, 4096, 64, device='mps', dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device='mps', dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device='mps', dtype=torch.float16)

out = flash_attention(q, k, v)

Causal masking (for autoregressive models)

out = flash_attention(q, k, v, is_causal=True)

Training with gradients

q.requires_grad = True
k.requires_grad = True
v.requires_grad = True

out = flash_attention(q, k, v, is_causal=True)
loss = out.sum()
loss.backward()  # Computes dQ, dK, dV

Drop-in replacement for SDPA

from mps_flash_attn import replace_sdpa

# Monkey-patch F.scaled_dot_product_attention
replace_sdpa()

# Now all attention ops use Flash Attention on MPS
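For readers unfamiliar with the monkey-patching pattern, the sketch below shows the general shape: swap a function on a namespace for a dispatcher that uses a fast path when it applies and otherwise calls the original. This illustrates the technique only. How replace_sdpa is actually implemented, and whether it falls back, is not documented here; `patch_with_fallback` and the stand-in namespace `F` are hypothetical.

```python
from types import SimpleNamespace


def patch_with_fallback(namespace, name, fast_impl, can_use_fast):
    """Replace namespace.<name> with a dispatcher; return a function that undoes it."""
    original = getattr(namespace, name)

    def dispatcher(*args, **kwargs):
        if can_use_fast(*args, **kwargs):
            return fast_impl(*args, **kwargs)
        return original(*args, **kwargs)

    setattr(namespace, name, dispatcher)

    def restore():
        setattr(namespace, name, original)

    return restore


# Stand-in for torch.nn.functional so the sketch runs without PyTorch.
F = SimpleNamespace(scaled_dot_product_attention=lambda q, k, v: ("sdpa", q))

restore = patch_with_fallback(
    F,
    "scaled_dot_product_attention",
    fast_impl=lambda q, k, v: ("flash", q),
    can_use_fast=lambda q, k, v: q == "mps-tensor",  # toy device check
)
```

Keeping a `restore` handle makes it easy to undo the patch in tests or when comparing outputs against the stock implementation.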

Architecture

+----------------------------------------------------------+
|                    Python API                            |
|              mps_flash_attn/__init__.py                  |
|         (flash_attention, autograd Function)             |
+----------------------------+-----------------------------+
                             |
+----------------------------v-----------------------------+
|                 C++ Extension                            |
|            mps_flash_attn/csrc/mps_flash_attn.mm         |
|    (PyTorch bindings, MTLBuffer handling, offsets)       |
+----------------------------+-----------------------------+
                             | dlopen + dlsym
+----------------------------v-----------------------------+
|                 Swift Bridge                             |
|         swift-bridge/Sources/MFABridge/                  |
|   (MFABridge.swift, MetallibCache.swift)                 |
|   @_cdecl exports: mfa_init, mfa_create_kernel,          |
|                    mfa_forward, mfa_backward             |
+----------------------------+-----------------------------+
                             |
+----------------------------v-----------------------------+
|              Metal Flash Attention                       |
|    metal-flash-attention/Sources/FlashAttention/         |
|     (AttentionDescriptor, AttentionKernel, etc.)         |
|                                                          |
|   Generates Metal shader source at runtime,              |
|   compiles to .metallib, caches pipelines                |
+----------------------------------------------------------+
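The dlopen + dlsym step in the diagram can be explored from Python with ctypes, which uses the same mechanism. The sketch below checks whether a loaded library exports a set of symbols. In the package itself the loading is done by the C++ extension, so this is only a diagnostic aid, and `missing_symbols` is a hypothetical helper.

```python
import ctypes


def missing_symbols(library, names):
    """Return the names that cannot be resolved in an already-loaded library."""
    # ctypes resolves attributes lazily through dlsym and raises
    # AttributeError when a symbol is absent, which hasattr() turns into False.
    return [name for name in names if not hasattr(library, name)]


# ctypes.CDLL(None) is dlopen(NULL): a handle to the current process, used
# here only so the sketch runs on any POSIX system. For the real bridge,
# pass the dylib path instead, e.g. ctypes.CDLL(os.environ["MFA_BRIDGE_PATH"]).
process = ctypes.CDLL(None)
print(missing_symbols(process, ["strlen", "mfa_forward"]))
```

Run against the built libMFABridge.dylib with the four exported names from the diagram, an empty list would indicate that the bridge exposes everything the extension looks up.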

Project Structure

mps-flash-attention/
├── mps_flash_attn/              # Python package
│   ├── __init__.py              # Public API (flash_attention, replace_sdpa)
│   ├── csrc/
│   │   └── mps_flash_attn.mm    # PyTorch C++ extension
│   └── kernels/                 # Pre-compiled metallibs (optional)
│
├── swift-bridge/                # Swift -> C bridge
│   ├── Package.swift
│   └── Sources/MFABridge/
│       ├── MFABridge.swift      # C-callable API (@_cdecl)
│       └── MetallibCache.swift  # Disk caching for metallibs
│
├── metal-flash-attention/       # Upstream (git submodule)
│   └── Sources/FlashAttention/
│       └── Attention/
│           ├── AttentionDescriptor/  # Problem configuration
│           ├── AttentionKernel/      # Metal shader generation
│           └── ...
│
├── scripts/
│   └── build_metallibs.py       # Pre-compile kernels for distribution
│
└── setup.py                     # Python package setup

Changes from upstream metal-flash-attention

We made the following modifications to metal-flash-attention:

1. macOS 15+ compatibility (MTLLibraryCompiler.swift)

Apple restricted __asm in runtime-compiled Metal shaders on macOS 15. We added a fallback that uses xcrun metal CLI compilation when runtime compilation fails.
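The fallback itself lives in Swift, but its control flow is simple enough to sketch in Python: try the runtime compiler, and on failure shell out to the Metal command-line tools. The two xcrun invocations follow Apple's documented precompilation workflow; the function names, file layout, and exact flags used by this project's fallback are assumptions.

```python
from pathlib import Path


def metal_cli_commands(source, output_dir):
    """Build the two xcrun invocations that turn a .metal file into a .metallib."""
    source = Path(source)
    output_dir = Path(output_dir)
    air = output_dir / (source.stem + ".air")
    metallib = output_dir / (source.stem + ".metallib")
    compile_cmd = ["xcrun", "-sdk", "macosx", "metal", "-c", str(source), "-o", str(air)]
    link_cmd = ["xcrun", "-sdk", "macosx", "metallib", str(air), "-o", str(metallib)]
    return compile_cmd, link_cmd


def compile_with_fallback(runtime_compile, cli_compile, source):
    """Prefer runtime compilation; use the CLI path only if it raises."""
    try:
        return runtime_compile(source)
    except Exception:  # broad on purpose: any runtime failure triggers the fallback
        return cli_compile(source)
```

On a Mac with the command line tools installed, each command list could be handed to `subprocess.run(cmd, check=True)` in order.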

2. Causal masking support

Added causal flag to AttentionDescriptor and kernel generation:

AttentionDescriptor.swift: Added causal: Bool property
AttentionKernelDescriptor.swift: Added causal: Bool property
AttentionKernel.swift: Added causal field
AttentionKernel+Softmax.swift: Added maskCausal() function
AttentionKernel+Source.swift: Added causal masking to forward/backward loops
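As a rough picture of what causal masking means inside a blocked kernel, the sketch below masks one tile of attention scores and classifies tiles by their position relative to the diagonal. It is written in Python for readability and is not a port of maskCausal(); whether the Metal kernel skips fully masked tiles is an assumption, and both function names are hypothetical.

```python
import math


def mask_causal_tile(scores, row_start, col_start):
    """Set scores[i][j] to -inf where the key position lies after the query position."""
    return [
        [s if (col_start + j) <= (row_start + i) else -math.inf
         for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


def tile_kind(row_start, col_start, block):
    """Classify a block x block tile relative to the causal diagonal."""
    last_row = row_start + block - 1
    last_col = col_start + block - 1
    if col_start > last_row:
        return "masked"     # every key is in the future of every query
    if last_col <= row_start:
        return "visible"    # every key is at or before every query
    return "diagonal"       # mixed: needs element-wise masking
```

Only "diagonal" tiles need per-element work, which is one plausible reason the overhead of causal masking stays small.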

Next Steps

1. PR to upstream metal-flash-attention

The macOS 15 fix and causal masking should be contributed back:

cd metal-flash-attention
git checkout -b macos15-causal-support
# Commit changes to:
#   - Sources/FlashAttention/Utilities/MTLLibraryCompiler.swift (new file)
#   - Sources/FlashAttention/Attention/AttentionDescriptor/*.swift
#   - Sources/FlashAttention/Attention/AttentionKernel/*.swift
git push origin macos15-causal-support
# Open PR at https://github.com/philipturner/metal-flash-attention

2. Publish mps-flash-attention to PyPI

# Add pyproject.toml with proper metadata
# Build wheel with pre-compiled Swift bridge
python -m build
twine upload dist/*

3. Pre-compile kernels for zero cold start

python scripts/build_metallibs.py
# Copies metallibs to mps_flash_attn/kernels/
# These get shipped with the wheel

Current Status (Jan 2025)

Working:

Forward pass (fp16/fp32)
Backward pass (dQ, dK, dV gradients)
Causal masking
Metallib disk caching
Pipeline binary caching (MTLBinaryArchive)

Tested with:

train_frankenstein.py (video matting model) at 512x512 on MPS

Known limitations:

Sequence length must be divisible by block size (typically 64)
Head dimension: Works best with 32, 64, 96, or 128
No arbitrary attention masks (only causal or none)
No dropout
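The shape-related limitations above can be checked before calling the kernel. The helper below is a hypothetical pre-flight check, not part of the package. It assumes a block size of 64 (the text says "typically", so the real value may differ) and treats the head dimension list as a recommendation rather than a hard requirement.

```python
def check_flash_attention_shape(shape, block_size=64,
                                preferred_head_dims=(32, 64, 96, 128)):
    """Return a list of human-readable problems for a (B, H, N, D) shape."""
    if len(shape) != 4:
        return [f"expected (B, H, N, D), got {len(shape)} dimensions"]
    _, _, seq_len, head_dim = shape
    problems = []
    remainder = seq_len % block_size
    if remainder != 0:
        short_by = block_size - remainder
        problems.append(
            f"sequence length {seq_len} is not a multiple of {block_size} "
            f"(short by {short_by})"
        )
    if head_dim not in preferred_head_dims:
        problems.append(
            f"head dimension {head_dim} is outside the preferred set "
            f"{preferred_head_dims}"
        )
    return problems
```

For example, `check_flash_attention_shape(q.shape)` on the tensors from the basic usage example returns an empty list, since 4096 is a multiple of 64 and the head dimension is 64.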

Credits

metal-flash-attention by Philip Turner
Flash Attention paper by Tri Dao et al.

License

MIT

Download files

No source distribution files are available for this release.

Built Distribution

mps_flash_attn-0.1.0-cp314-cp314-macosx_15_0_arm64.whl (523.9 kB)
Upload date: Jan 29, 2026
Tags: CPython 3.14, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Algorithm	Hash digest
SHA256	`9a62b7743f6dfc4a53445abf20eb0b32c1944f5c8299c5d3318e3d04e08c6b2b`
MD5	`c3ee51fe62f8173df92513e82a7826f2`
BLAKE2b-256	`18bdf2685c5a178e2ea721aca2fb4d965775caa3d5ce2f745115a3d2e3ab8f3c`
