
Benchmark-backed Metal Flash Attention backends for MLX on Apple Silicon


mlx-mfa

mlx-mfa is a Metal Flash Attention + serving-oriented runtime layer for MLX on Apple Silicon. It provides high-performance attention kernels, runtime helpers, and cache abstractions for dense training/inference plus modern serving flows.

Current version: 2.20.0 — MFAEnvConfig refactor, V3 guard optimization, V5 per-D configs, comprehensive audit + docs update.

Foreword

MLX Metal Flash Attention - Why?

I've spent months on personal ports of Video Super Resolution (VSR) and Video Reconstruction (VR) models, and kept being frustrated by slow inference on my M1 Max MacBook Pro. Rather than buy a brand-new, very expensive M4 (and later M5) Max to mitigate this, I decided to at least try porting Flash Attention to the Mac in the hope of better results. Porting the VSR/VR models to MLX gave better results than MPS, which is why I ended up building this project.

The results are lower than I had hoped for, but at this point I'm still fairly satisfied with what my M1 Max MBP can do.

I'll only be doing reduced work on this project until June 2026, when I plan to upgrade from my M1 Max to an M5 Max MBP. I expect to obtain much better results then, thanks to the improvements Apple has been adding to its silicon.

v2.20.0 adds MFAEnvConfig (centralized env var caching), V3 dispatch guard optimization (B*H≥4, +35-67% for small-batch causal), V5 per-D block configs, comprehensive code audit with 14 tech debt fixes, and full documentation update. See CHANGELOG.md for full details per version.
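Centralized env-var caching of the kind MFAEnvConfig provides typically means each variable is parsed and memoized once, so hot dispatch paths never touch `os.environ`. A generic sketch of the pattern (the variable name below is hypothetical, not an actual mlx-mfa knob):

```python
import os
from functools import lru_cache

# Generic sketch of centralized env-var caching in the spirit of
# MFAEnvConfig: each (name, default) pair is parsed once and memoized,
# so repeated lookups on the dispatch path are dictionary hits, not
# environment reads. "EXAMPLE_MIN_BH" is a hypothetical knob.

@lru_cache(maxsize=None)
def env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

# e.g. a guard threshold like the B*H >= 4 V3 dispatch guard
min_bh = env_int("EXAMPLE_MIN_BH", 4)
print(min_bh)
```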

Thank you for your interest, and let me know if you've been able to improve on my work!

Current Repository Status

  • V2 dense is the main production path.
  • Strongest dense wins on M1 Max remain causal D=64/128 and tile-skip regimes (window/sparse).
  • D=256 is narrow benchmark-backed only (not broad promotion).
  • D=512 remains SDPA-default.
  • Native dense backward was benchmarked and not promoted.
  • Sage is a specialized decode backend (narrow, benchmark-gated use).
  • V3/V4/V5 remain experimental/hardware-dependent.
  • Serving/runtime capability surface is now substantially expanded:
    • paged KV + packed varlen query support
    • paged continuous batching/remap
    • explicit chunked prefill
    • runtime-managed prefix reuse
    • runtime speculative draft/verify flow
    • deeper splitfuse runtime integration
    • KV cache abstraction layer
    • minimal real hybrid/offload-capable cache behavior (local offload tier)
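The paged-KV and continuous-batching ideas above can be sketched in plain Python. This is an illustrative model of the bookkeeping only, not mlx-mfa's actual implementation; the `BlockTable` name and block size are hypothetical:

```python
# Illustrative paged-KV bookkeeping (hypothetical names; not the actual
# mlx-mfa implementation). Logical token positions in each sequence map
# to fixed-size physical blocks drawn from a shared pool, so sequences
# of different lengths share one pre-allocated KV buffer, and finished
# sequences return their blocks for continuous batching.

BLOCK_SIZE = 16  # tokens per physical KV block


class BlockTable:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block free list
        self.tables = {}                     # seq_id -> [physical block ids]
        self.lengths = {}                    # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple:
        """Reserve space for one new token; return (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:         # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Finished sequence: return its blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pool = BlockTable(num_blocks=8)
for _ in range(20):                          # 20 tokens span two 16-token blocks
    pool.append_token(0)
blocks_used = list(pool.tables[0])
pool.release(0)
print(len(blocks_used), len(pool.free))      # 2 8
```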

Limitations

  • Main validation hardware is Apple M1 Max.
  • Broad parity claims against CUDA FlashAttention ecosystems are not made.
  • Some advanced paths are intentionally narrow, bridge-based, or explicit-only.
  • Hybrid offload is currently a local offload milestone, not remote/distributed cache infrastructure.
  • Future major hardware-specific optimization work is deferred pending newer Apple hardware (M5+).

Best M1 Max Benchmark Highlights

Representative benchmark-backed outcomes (see RESULTS.md and docs/benchmarks/RESULTS.md for details):

| Area | Representative result (M1 Max) | Interpretation |
|---|---|---|
| Dense causal V2 | up to ~1.82x vs SDPA (D=64, N=8192) | Primary production win regime |
| Dense causal V2 | up to ~1.75x vs SDPA (D=128, N=16384) | Strong long-sequence causal performance |
| Sliding window | up to ~21x vs full SDPA | Tile-skip regime remains strongest |
| D=256 | narrow causal long-N wins (for example ~1.16x at N=16384, f16) | Keep narrow policy only |
| D=512 | decision pass found no broad wins | SDPA-default remains correct |

Serving/Runtime Capability Summary

| Capability | Maturity | Current status |
|---|---|---|
| Paged KV decode runtime | Fully usable | Explicit runtime/API usage; no broad auto-promotion |
| Paged + packed varlen queries | Production (fused kernel) | Single-dispatch fused kernel for all query/KV length combinations |
| Paged continuous batching remap | Fully usable | Explicit cache_batch_idx semantics + runtime helpers |
| Chunked prefill | Fully usable (scheduler-oriented) | Operational capability; not a throughput win on current matrix |
| Runtime prefix caching | Fully usable | Register/seed/reuse path integrated with runtime metadata |
| Runtime speculative decode | Fully usable (narrow) | speculative_step + verify integration; scheduler engine still future work |
| Splitfuse runtime integration | Narrow/conditional | Runtime path exists; performance remains shape-sensitive |
| Hybrid KV cache + local offload tier | Narrow/conditional milestone | Real hot/cold/offloaded behavior locally; remote offload future work |
| External cache adapter layer | Experimental groundwork | Concrete local backend provided; external backend integrations pending |
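As a scheduling capability, chunked prefill amounts to splitting a long prompt into bounded chunks and feeding them through the prefill path in order while the KV cache accumulates, so other work can be interleaved between chunks. A minimal, framework-free sketch of the chunking loop (the chunk size here is illustrative, not an mlx-mfa default):

```python
# Illustrative chunked-prefill span computation (not the mlx-mfa API).
# A long prompt is covered by fixed-size chunks; each chunk becomes one
# bounded prefill call, and decode steps for other sequences can be
# scheduled between chunks.

def chunk_spans(prompt_len: int, chunk: int):
    """Yield (start, end) token spans covering the prompt in order."""
    for start in range(0, prompt_len, chunk):
        yield start, min(start + chunk, prompt_len)

# A 4196-token prompt with a (hypothetical) 1024-token chunk size:
spans = list(chunk_spans(prompt_len=4196, chunk=1024))
print(spans[0], spans[-1])   # (0, 1024) (4096, 4196)
```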

Repository Guide

Production vs Narrow vs Experimental

| Status | Components |
|---|---|
| Production | V2 dense causal small-D path; window/sparse tile-skip; SDPA fallback policy |
| Narrow / conditional | D=256 causal long-N policy; Sage decode regimes; splitfuse/page-native runtime paths; hybrid local offload behavior |
| Experimental | V3/V4/V5 families; external/LMCache-like backend extensions beyond local adapter |

Recommended Usage

  1. Use backend="auto" for dense attention and let policy route between V2 and SDPA.
  2. Use create_decode_runtime(...) for serving flows instead of stitching helper calls manually.
  3. Treat paged/packed/chunked/prefix/speculative features as explicit runtime capabilities.
  4. Use Sage as a specialized decode backend only when your workload matches the benchmark-backed regime.
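The routing intent behind backend="auto" can be summarized as a small policy function. This is an illustrative restatement of the documented status (the N=16384 threshold for D=256 is inferred from the benchmark highlights, and the real dispatch inside mlx-mfa considers more signals than this):

```python
# Illustrative backend-selection policy restating the documented status:
# V2 for the causal D=64/128 production regime, a narrow D=256 causal
# long-N carve-out, and SDPA everywhere else (including D=512). The
# long-N threshold is an assumption taken from the benchmark table.

def pick_backend(D: int, N: int, causal: bool) -> str:
    if causal and D in (64, 128):
        return "v2"    # primary production win regime
    if causal and D == 256 and N >= 16384:
        return "v2"    # narrow benchmark-backed regime
    return "sdpa"      # default policy, e.g. D=512

print(pick_backend(128, 16384, causal=True))   # v2
print(pick_backend(512, 8192, causal=True))    # sdpa
```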

Installation

```shell
pip install mlx-mfa
```

Or, for a development checkout:

```shell
pip install -e .
```

Minimal Usage

```python
import mlx.core as mx
from mlx_mfa import flash_attention, create_decode_runtime

# Dense attention
q = mx.random.normal((1, 8, 1024, 128)).astype(mx.float16)
k = mx.random.normal((1, 8, 1024, 128)).astype(mx.float16)
v = mx.random.normal((1, 8, 1024, 128)).astype(mx.float16)
out = flash_attention(q, k, v, causal=True)

# Serving-oriented runtime
rt = create_decode_runtime(
    backend="auto",
    paged=False,
    quantized_kv=False,
    B=1,
    H_q=8,
    H_kv=8,
    D=128,
    max_seq_len=4096,
)
out_prefill = rt.prefill(q, k, v)
out_step = rt.step(
    mx.random.normal((1, 8, 1, 128)).astype(mx.float16),
    mx.random.normal((1, 8, 1, 128)).astype(mx.float16),
    mx.random.normal((1, 8, 1, 128)).astype(mx.float16),
)
```

License

MIT. See LICENSE.
