Run AI models too large for your Mac's memory — expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon


MLX-Flash

Run AI models too large for your Mac's memory — at near-full speed.

Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD.

How It Works

Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:

┌─────────────────────────────────┐
│    Your Mac's RAM (fast)        │
│  ┌──────────┐ ┌──────────────┐  │
│  │Hot Cache │ │Mixed Precis. │  │
│  │85%+ hits │ │4-bit / 2-bit │  │
│  └──────────┘ └──────────────┘  │
└──────────────┬──────────────────┘
               │ cache hit: 0.08ms
┌──────────────┴──────────────────┐
│    Smart Cache Layer            │
│  • LCP Eviction (layer-biased)  │
│  • Speculative Prefetch (97%)   │
│  • Memory Monitor               │
│  • Speculative Execution        │
└──────────────┬──────────────────┘
               │ cache miss: 0.6ms
┌──────────────┴──────────────────┐
│    Your SSD (big)               │
│  Full model weights: 200GB+     │
│  Entropy-coded: 65% smaller     │
└─────────────────────────────────┘
         │
         ▼
   MLX GPU Inference

Result: A 200GB AI model runs on a 48GB Mac, 2-3x faster than with naive SSD streaming.
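
To see why the hit rate matters so much, the two latencies in the diagram (0.08ms on a cache hit, 0.6ms on a miss) can be combined into an average fetch time. The following is a back-of-the-envelope sketch, not a benchmark: the function name `expected_fetch_ms` is made up for this example, and it assumes every access costs exactly one of those two latencies.

```python
def expected_fetch_ms(hit_rate: float, hit_ms: float = 0.08, miss_ms: float = 0.6) -> float:
    """Average time to fetch one expert, weighting hits and misses by probability.

    The default latencies are the figures from the diagram above; treating them
    as fixed per-access costs is a simplifying assumption.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Compare no cache, a half-effective cache, and the hit rates quoted in the diagram.
for rate in (0.0, 0.5, 0.85, 0.97):
    print(f"hit rate {rate:.0%}: {expected_fetch_ms(rate):.3f} ms per expert fetch")
```

Under these assumptions an 85% hit rate gives 0.85 × 0.08 + 0.15 × 0.6 = 0.158ms per fetch, compared with 0.6ms when every access goes to the SSD.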

Quick Start

pip install mlx-flash

# Interactive chat
mlx-flash-chat

# API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080

# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8

# See what models fit your hardware
mlx-flash-browse
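
For a sense of where a KV-memory saving of this size can come from, here is a rough estimate. It assumes group-wise quantization in which every group of 64 values stores an fp16 scale and an fp16 bias next to the 8-bit values; whether `--kv-bits 8` works exactly this way is not stated here, and the model shape below is hypothetical.

```python
def quantized_bits_per_value(bits: int = 8, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Effective bits per value once per-group metadata is amortized.

    overhead_bits=32 assumes one fp16 scale plus one fp16 bias per group.
    """
    return bits + overhead_bits / group_size


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int, bits_per_value: float) -> float:
    """Size of the KV cache in bytes; the factor 2 covers the K and the V tensor."""
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8


def kv_savings(bits: int = 8, group_size: int = 64, baseline_bits: int = 16) -> float:
    """Fraction of KV memory saved relative to an fp16 cache."""
    return 1.0 - quantized_bits_per_value(bits, group_size) / baseline_bits


# Hypothetical model shape, chosen only to make the numbers concrete.
shape = dict(layers=48, kv_heads=8, head_dim=128, tokens=32_768)
fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q8 = kv_cache_bytes(**shape, bits_per_value=quantized_bits_per_value())
print(f"fp16 KV cache:  {fp16 / 2**30:.2f} GiB")
print(f"8-bit KV cache: {q8 / 2**30:.2f} GiB")
print(f"saving:         {kv_savings():.1%}")
```

With these assumptions each value costs 8.5 bits instead of 16, a saving of about 47%, which is in the same range as the 45% quoted above.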

Performance

Technique	Gain	How It Works
LCP Smart Cache	2.80x	Keeps frequently-used model parts in RAM
+ Async Prefetch	2.93x	Loads next part from SSD while GPU computes
Mixed Precision	1.80x smaller	Rarely-used parts stored at lower quality
Skip Fallback	2.67x	Gracefully skip uncached parts instead of waiting
Speculative Execution	14-42% TPOT	Execute predicted experts before router confirms
Adaptive Top-K	10-30% compute	Skip low-confidence secondary experts
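
The table describes the smart cache only as keeping frequently used parts in RAM, and this README does not spell out the LCP policy itself. As a rough illustration of the general idea, and not the project's actual policy, the sketch below scores each cached expert by hit count, recency, and a layer-depth bias, then evicts the lowest score. The class name `ExpertCache`, the `layer_bias` parameter, and the scoring formula are all assumptions made for this example.

```python
class ExpertCache:
    """Toy expert cache: evicts the entry with the lowest priority score."""

    def __init__(self, capacity: int, layer_bias: float = 0.1):
        self.capacity = capacity
        self.layer_bias = layer_bias  # hypothetical: extra weight per layer of depth
        self.entries = {}             # (layer, expert) -> [weights, hit_count, last_step]
        self.step = 0
        self.hits = 0
        self.misses = 0

    def __contains__(self, key):
        return key in self.entries

    def _priority(self, key):
        layer, _expert = key
        _weights, count, last_used = self.entries[key]
        age = self.step - last_used
        # Frequently used, recently used, deeper-layer experts score higher.
        return count * (1.0 + self.layer_bias * layer) / (1.0 + age)

    def get(self, key, load_fn):
        """Return cached weights, or load them via load_fn (the slow SSD path)."""
        self.step += 1
        entry = self.entries.get(key)
        if entry is not None:
            self.hits += 1
            entry[1] += 1
            entry[2] = self.step
            return entry[0]
        self.misses += 1
        if self.capacity <= 0:
            return load_fn(key)  # nothing can be cached
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self._priority)
            del self.entries[victim]
        weights = load_fn(key)
        self.entries[key] = [weights, 1, self.step]
        return weights


# Tiny demo: expert (0, 0) is hot, so (0, 1) is evicted when (0, 2) arrives.
cache = ExpertCache(capacity=2)
fake_load = lambda key: f"weights{key}"
for key in [(0, 0), (0, 0), (0, 0), (0, 1), (0, 2)]:
    cache.get(key, fake_load)
print("cached:", sorted(cache.entries), "hits:", cache.hits, "misses:", cache.misses)
```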

Real Hardware (M3 Max 36GB)

Memory pressure recovery:
  Without optimization:    43.5 tok/s
  With mixed precision:   104.5 tok/s  → 2.4x faster

Cache warm-up:
  Token  0:  83.3ms (cold start)
  Token  8:   5.7ms (warming up)
  Token 24:   0.5ms (full speed) → ~167x faster than token 0 (83.3 / 0.5)

What's Inside

35 Python modules + Rust sidecar implementing 15+ research techniques:

Category	Modules
Expert Streaming	GPU lookup tables, speculative execution, skip-fallback, adaptive top-k
Prediction (97%+)	Residual-stream predictor, shadow MLP, cross-layer 3-hop prefetch
Cache Management	Layer-biased LCP, Belady-optimal eviction, vertical splitting, expert merging
Compression	Entropy coding (Huffman uint4), mixed precision (4-bit/2-bit)
Memory	Real-time pressure monitoring, wired memory optimization, `mx.clear_cache()`
Serving	OpenAI-compatible API, KV cache 8-bit quantization, SSE streaming
Rust Sidecar	axum HTTP/SSE, mach2 memory (0.1ms), DashMap LCP, Unix socket bridge
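
Of the techniques listed, adaptive top-k is the simplest to show in a few lines: when the router strongly prefers one expert, the weaker secondary experts can be skipped. The sketch below is one plausible formulation under stated assumptions. The function name `adaptive_top_k` and the `min_ratio` threshold are invented for illustration and are not taken from the project or from the LExI paper.

```python
import math


def adaptive_top_k(router_logits, k=2, min_ratio=0.25):
    """Pick up to k experts, dropping secondary ones the router is unsure about.

    An expert is kept only if its routing probability is at least
    min_ratio times the probability of the top expert (an assumed rule).
    Returns a list of (expert_index, renormalized_weight) pairs.
    """
    if not router_logits:
        raise ValueError("router_logits must not be empty")
    # Numerically stable softmax over the router logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_prob = probs[ranked[0]]
    kept = [i for i in ranked if probs[i] >= min_ratio * top_prob]

    # Renormalize so the kept experts' weights sum to 1.
    norm = sum(probs[i] for i in kept)
    return [(i, probs[i] / norm) for i in kept]


print(adaptive_top_k([5.0, 1.0, 0.0, -1.0]))  # confident router: one expert kept
print(adaptive_top_k([2.0, 1.9, 0.0, 0.0]))   # close call: both experts kept
```

Every skipped expert is one less weight fetch and one less matrix multiply for that token, which is where the compute savings would come from.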

Integration

Works with any OpenAI-compatible tool:

# Start server
mlx-flash --port 8080 --preload

# Point any tool at it
# LM Studio: Settings → Server → http://localhost:8080/v1
# Cursor: Settings → Models → OpenAI Compatible → http://localhost:8080/v1
# Claude Code: OPENAI_API_BASE=http://localhost:8080/v1
# continue.dev: apiBase: http://localhost:8080/v1

# Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
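
Since the server is described as OpenAI-compatible with SSE streaming, the OpenAI SDK's standard `stream=True` option should let you print tokens as they arrive. This is a sketch under that assumption and has not been checked against this server; the helper names `stream_reply` and `main` are made up for the example.

```python
def stream_reply(client, prompt, model="local"):
    """Yield text fragments from a streamed chat completion as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask the server for incremental chunks
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


def main():
    # Imported here so the helper above can be reused without the SDK installed.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    for piece in stream_reply(client, "Hello!"):
        print(piece, end="", flush=True)
    print()
```

Call `main()` with the server from the previous step running on port 8080.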

Expert Streaming (for large MoE models)

from mlx_flash_compress.expert_streaming import (
    enable_expert_streaming, enable_skip_fallback
)

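# `model` is an MoE model that has already been loaded (for example with mlx_lm.load)
# Cache 64 experts per layer (50% capacity for a model with 128 experts per layer) + adaptive skipping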
streaming = enable_expert_streaming(model, capacity_per_layer=64)
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
streaming.warmup()
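
One way to choose `capacity_per_layer` is to divide the RAM you are willing to spend evenly across layers. The helper below is a hypothetical sizing aid, not part of the package, and every number in the example call is a placeholder rather than a measurement of a real model.

```python
def experts_that_fit(ram_budget_gb: float, num_layers: int,
                     expert_size_mb: float, experts_per_layer: int) -> int:
    """Estimate how many experts per layer fit in a given RAM budget.

    Assumes the budget is split evenly across layers and that every expert
    has the same size; both are simplifications.
    """
    if num_layers <= 0 or expert_size_mb <= 0:
        raise ValueError("num_layers and expert_size_mb must be positive")
    per_layer_mb = ram_budget_gb * 1024 / num_layers
    fit = int(per_layer_mb // expert_size_mb)
    # Never report more experts than the layer actually has, or fewer than zero.
    return max(0, min(experts_per_layer, fit))


# Placeholder numbers: 20 GB budget, 48 MoE layers, 6 MB per expert, 128 experts per layer.
capacity = experts_that_fit(20, 48, 6, 128)
print(f"capacity_per_layer = {capacity}")
```

The result can then be passed as `capacity_per_layer`, leaving headroom for the KV cache and the rest of the system.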

Research Techniques Implemented

From 15+ papers (2024-2026):

Technique	Paper	Status
Expert streaming (GPU lookup)	HOBBIT arXiv:2411.01433	Implemented
Residual-stream predictor (97%+)	Speculating Experts arXiv:2603.19289	Implemented
Speculative execution (14-42% TPOT)	MoE-SpAc arXiv:2603.09983	Implemented
Belady-optimal eviction	MoE-SpeQ arXiv:2511.14102	Implemented
Cross-layer 3-hop prefetch	FATE arXiv:2502.12224	Implemented
Layer-depth cache bias	FATE arXiv:2502.12224	Implemented
Vertical expert splitting (2x coverage)	MoEpic paper	Implemented
Expert merging (15-30% fewer params)	DEK/EEP arXiv:2509.19781	Implemented
Entropy coding (65% compression)	EntroLLM arXiv:2505.02380	Implemented
Adaptive top-k (10-30% compute savings)	LExI arXiv:2509.02753	Implemented
Mixed precision per-expert	HOBBIT arXiv:2411.01433	Implemented
KV cache 8-bit quantization	mlx-moe / mlx-lm	Implemented
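
Belady-optimal eviction, listed above, is worth a concrete example because it sets the bar the other policies are measured against: with the full future access sequence known, evicting the item whose next use is farthest away minimizes misses. The sketch below implements the textbook rule on a recorded trace and compares it with LRU. How the project obtains or approximates future accesses is not covered here, and the function names are chosen for this example.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Count misses when evicting the item whose next use is farthest in the future."""
    # next_use[i] = position of the next access to trace[i], or infinity if none.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # item -> position of its next use
    misses = 0
    for i, item in enumerate(trace):
        if item in cache:
            cache[item] = next_use[i]
            continue
        misses += 1
        if capacity <= 0:
            continue
        if len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # farthest next use
            del cache[victim]
        cache[item] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Count misses under least-recently-used eviction, for comparison."""
    cache = OrderedDict()
    misses = 0
    for item in trace:
        if item in cache:
            cache.move_to_end(item)
            continue
        misses += 1
        if capacity <= 0:
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used item
        cache[item] = True
    return misses


# A classic page-replacement trace, read here as expert IDs.
trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print("Belady misses:", belady_misses(trace, 3))
print("LRU misses:   ", lru_misses(trace, 3))
```

On this trace with room for three experts, the Belady rule misses 7 times and LRU misses 10 times. Because the rule needs the future, it is only usable where upcoming expert accesses are known or predicted.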

Requirements

macOS with Apple Silicon (M1/M2/M3/M4/M5)
Python 3.10+
16GB+ RAM (more = better caching = faster)

Project Stats

15,000+ lines of code (Python + Rust)
224 tests (192 Python + 32 Rust)
35 Python modules + Rust sidecar
15+ research papers implemented

License

MIT

Release history

0.7.1: May 11, 2026
0.7.0: May 10, 2026
0.6.1: Apr 6, 2026
0.6.0: Apr 6, 2026
0.5.1: Apr 1, 2026
0.5.0: Apr 1, 2026 (this version)
0.4.0: Apr 1, 2026
0.3.0: Apr 1, 2026
0.2.1: Apr 1, 2026
0.2.0: Apr 1, 2026

