Run AI models too large for your Mac's memory — expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon
MLX-Flash
Run AI models too large for your Mac's memory — at near-full speed.
Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD.
How It Works
Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:
┌─────────────────────────────────┐
│      Your Mac's RAM (fast)      │
│ ┌──────────┐ ┌──────────────┐   │
│ │Hot Cache │ │Mixed Precis. │   │
│ │85%+ hits │ │4-bit / 2-bit │   │
│ └──────────┘ └──────────────┘   │
└──────────────┬──────────────────┘
               │ cache hit: 0.08ms
┌──────────────┴──────────────────┐
│        Smart Cache Layer        │
│ • LCP Eviction (layer-biased)   │
│ • Speculative Prefetch (97%)    │
│ • Memory Monitor                │
│ • Speculative Execution         │
└──────────────┬──────────────────┘
               │ cache miss: 0.6ms
┌──────────────┴──────────────────┐
│         Your SSD (big)          │
│ Full model weights: 200GB+      │
│ Entropy-coded: 65% smaller      │
└─────────────────────────────────┘
               │
               ▼
       MLX GPU Inference
Result: A 200GB AI model runs on your 48GB Mac, 2-3x faster than with naive SSD streaming.
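To see why the hit rate matters so much, the two latencies in the diagram can be combined into an average cost per lookup. This is illustrative arithmetic only, not MLX-Flash code: `expected_latency_ms` is a hypothetical helper, and it assumes each lookup is either a RAM hit (0.08 ms) or an SSD miss (0.6 ms) with no overlap between them.

```python
def expected_latency_ms(hit_rate: float, hit_ms: float = 0.08, miss_ms: float = 0.6) -> float:
    """Average cost of one lookup for a given cache hit rate (hypothetical helper)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Weighted average of the hit path and the miss path.
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


for rate in (0.0, 0.5, 0.85, 0.97):
    print(f"hit rate {rate:.0%}: {expected_latency_ms(rate):.3f} ms per lookup")
```

Under these assumptions an 85% hit rate averages about 0.158 ms per lookup, against 0.6 ms when every lookup goes to the SSD. This covers lookup cost only, so it is not a prediction of end-to-end tokens per second.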
Quick Start
pip install mlx-flash
# Interactive chat
mlx-flash-chat
# API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080
# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8
# See what models fit your hardware
mlx-flash-browse
Performance
| Technique | Gain | How It Works |
|---|---|---|
| LCP Smart Cache | 2.80x speedup | Keeps frequently-used model parts in RAM |
| + Async Prefetch | 2.93x speedup | Loads next part from SSD while GPU computes |
| Mixed Precision | 1.80x smaller | Rarely-used parts stored at lower quality |
| Skip Fallback | 2.67x speedup | Gracefully skip uncached parts instead of waiting |
| Speculative Execution | 14-42% TPOT improvement | Execute predicted experts before router confirms |
| Adaptive Top-K | 10-30% compute savings | Skip low-confidence secondary experts |
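The "Async Prefetch" row describes overlapping SSD reads with GPU compute. A minimal sketch of that pattern using only the Python standard library might look like the following. The names `load_from_ssd`, `compute`, and `run_with_prefetch` are placeholders invented for this example, not MLX-Flash APIs, and the real loader and scheduler may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def load_from_ssd(layer: int) -> str:
    # Placeholder for a blocking SSD read of one layer's weights.
    return f"weights[{layer}]"


def compute(layer: int, weights: str) -> str:
    # Placeholder for the GPU work that consumes the loaded weights.
    return f"out[{layer}] using {weights}"


def run_with_prefetch(num_layers: int) -> list[str]:
    outputs: list[str] = []
    if num_layers <= 0:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_from_ssd, 0)  # start the first read
        for layer in range(num_layers):
            weights = pending.result()  # wait for this layer's weights
            if layer + 1 < num_layers:
                # Start the next read before computing, so the two overlap.
                pending = pool.submit(load_from_ssd, layer + 1)
            outputs.append(compute(layer, weights))
    return outputs


print(run_with_prefetch(3))
```

The key design point is that the next read is submitted before the current layer is computed, so by the time the loop comes back around the weights are often already in memory.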
Real Hardware (M3 Max 36GB)
Memory pressure recovery:
Without optimization: 43.5 tok/s
With mixed precision: 104.5 tok/s → 2.4x faster
Cache warm-up:
Token 0: 83.3ms (cold start)
Token 8: 5.7ms (warming up)
Token 24: 0.5ms (full speed) → 41x speedup
What's Inside
35 Python modules + Rust sidecar implementing 15+ research techniques:
| Category | Modules |
|---|---|
| Expert Streaming | GPU lookup tables, speculative execution, skip-fallback, adaptive top-k |
| Prediction (97%+) | Residual-stream predictor, shadow MLP, cross-layer 3-hop prefetch |
| Cache Management | Layer-biased LCP, Belady-optimal eviction, vertical splitting, expert merging |
| Compression | Entropy coding (Huffman uint4), mixed precision (4-bit/2-bit) |
| Memory | Real-time pressure monitoring, wired memory optimization, mx.clear_cache() |
| Serving | OpenAI-compatible API, KV cache 8-bit quantization, SSE streaming |
| Rust Sidecar | axum HTTP/SSE, mach2 memory (0.1ms), DashMap LCP, Unix socket bridge |
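The Compression row mentions Huffman coding over 4-bit values. As a rough illustration of why that can shrink quantized weights, the sketch below builds Huffman code lengths for a skewed stream of uint4 symbols and compares the result with fixed 4-bit storage. It is a textbook Huffman construction, not the project's encoder, and the symbol distribution is made up for the example.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if not counts:
        raise ValueError("symbols must not be empty")
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth}); the unique
    # tie_breaker keeps Python from ever comparing the dicts.
    heap = [(weight, i, {sym: 0}) for i, (sym, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(symbols):
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Made-up uint4 stream where most values cluster around the middle of the range.
weights = [8] * 50 + [7] * 20 + [9] * 20 + [6] * 5 + [10] * 5
fixed = 4 * len(weights)
coded = encoded_bits(weights)
print(f"fixed 4-bit: {fixed} bits, Huffman: {coded} bits ({1 - coded / fixed:.1%} smaller)")
```

For this toy distribution the stream drops from 400 bits to 190 bits. The saving depends entirely on how skewed the real weight values are, and the sketch ignores the cost of storing the code table.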
Integration
Works with any OpenAI-compatible tool:
# Start server
mlx-flash --port 8080 --preload
# Point any tool at it
# LM Studio: Settings → Server → http://localhost:8080/v1
# Cursor: Settings → Models → OpenAI Compatible → http://localhost:8080/v1
# Claude Code: OPENAI_API_BASE=http://localhost:8080/v1
# continue.dev: apiBase: http://localhost:8080/v1
# Python SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Expert Streaming (for large MoE models)
from mlx_flash_compress.expert_streaming import (
    enable_expert_streaming,
    enable_skip_fallback,
)
# `model` is an MoE model that has already been loaded
# Enable streaming with 50% capacity + adaptive skipping
streaming = enable_expert_streaming(model, capacity_per_layer=64)
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
streaming.warmup()
Research Techniques Implemented
From 15+ papers (2024-2026):
| Technique | Paper | Status |
|---|---|---|
| Expert streaming (GPU lookup) | HOBBIT arXiv:2411.01433 | Implemented |
| Residual-stream predictor (97%+) | Speculating Experts arXiv:2603.19289 | Implemented |
| Speculative execution (14-42% TPOT) | MoE-SpAc arXiv:2603.09983 | Implemented |
| Belady-optimal eviction | MoE-SpeQ arXiv:2511.14102 | Implemented |
| Cross-layer 3-hop prefetch | FATE arXiv:2502.12224 | Implemented |
| Layer-depth cache bias | FATE arXiv:2502.12224 | Implemented |
| Vertical expert splitting (2x coverage) | MoEpic paper | Implemented |
| Expert merging (15-30% fewer params) | DEK/EEP arXiv:2509.19781 | Implemented |
| Entropy coding (65% compression) | EntroLLM arXiv:2505.02380 | Implemented |
| Adaptive top-k (10-30% compute savings) | LExI arXiv:2509.02753 | Implemented |
| Mixed precision per-expert | HOBBIT arXiv:2411.01433 | Implemented |
| KV cache 8-bit quantization | mlx-moe / mlx-lm | Implemented |
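For readers unfamiliar with Belady-optimal eviction, the idea is to evict the cached item whose next use lies farthest in the future. The sketch below compares it with plain LRU on a short, made-up expert access trace. It is a generic simulation, not the project's implementation: a true Belady policy needs the full future trace, so a live system would have to work from predicted accesses, and how MLX-Flash does that is not shown here.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for a least-recently-used cache."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()
    misses = 0
    for expert in trace:
        if expert in cache:
            cache.move_to_end(expert)  # mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used
        cache[expert] = True
    return misses


def belady_misses(trace, capacity):
    """Count misses when evicting the item reused farthest in the future."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = set()
    misses = 0
    for i, expert in enumerate(trace):
        if expert in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(e):
                # Position of the next access to e, or infinity if never used again.
                try:
                    return trace.index(e, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(expert)
    return misses


trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]  # made-up expert IDs
print("LRU misses:   ", lru_misses(trace, capacity=3))
print("Belady misses:", belady_misses(trace, capacity=3))
```

On this trace LRU misses 6 times and the Belady policy 5 times. Belady's count is a lower bound for any eviction policy at the same capacity, which makes it a useful yardstick for practical caches.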
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4/M5)
- Python 3.10+
- 16GB+ RAM (more = better caching = faster)
Project Stats
- 15,000+ lines of code (Python + Rust)
- 224 tests (192 Python + 32 Rust)
- 35 Python modules + Rust sidecar
- 15+ research papers implemented
License
MIT
File details
Details for the file mlx_flash-0.6.1.tar.gz.
File metadata
- Download URL: mlx_flash-0.6.1.tar.gz
- Upload date:
- Size: 170.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 357f4383c5d41843ca3a8523ff0788944381f5cb824e7dd7b8879a04c7460bf9 |
| MD5 | 19659518ad00202b4240fd9b870b42ca |
| BLAKE2b-256 | 3110b6f1913ff1872dcf03d291b5199308f6613659acf84ead3c663eca38f31b |
File details
Details for the file mlx_flash-0.6.1-py3-none-any.whl.
File metadata
- Download URL: mlx_flash-0.6.1-py3-none-any.whl
- Upload date:
- Size: 165.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23ee1774122ac9e3302799a9c53a67e49c6bfa0688314a919f8dff850a28a542 |
| MD5 | c75cfde28a461501eee8ad3bd0de5dc1 |
| BLAKE2b-256 | 8cacc8192563cf5c583bbb6284b889be905d742c082ecef00e55367f577ba9cf |