Run AI models too large for your Mac's memory — expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon

MLX-Flash

Run AI models too large for your Mac's memory — at near-full speed.


Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD.

How It Works

Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:

┌─────────────────────────────────┐
│    Your Mac's RAM (fast)        │
│  ┌──────────┐ ┌──────────────┐  │
│  │Hot Cache  │ │Mixed Precis. │  │
│  │85%+ hits  │ │4-bit / 2-bit │  │
│  └──────────┘ └──────────────┘  │
└──────────────┬──────────────────┘
               │ cache hit: 0.08ms
┌──────────────┴──────────────────┐
│    Smart Cache Layer            │
│  • LCP Eviction (layer-biased) │
│  • Speculative Prefetch (97%)  │
│  • Memory Monitor              │
│  • Speculative Execution       │
└──────────────┬──────────────────┘
               │ cache miss: 0.6ms
┌──────────────┴──────────────────┐
│    Your SSD (big)               │
│  Full model weights — 200GB+   │
│  Entropy-coded — 65% smaller   │
└─────────────────────────────────┘
         │
         ▼
   MLX GPU Inference

Result: A 200GB AI model runs on your 48GB Mac, 2-3x faster than naive SSD streaming.
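As a back-of-envelope check on the diagram, the average cost of fetching one expert can be estimated from the hit rate and the two latencies. The sketch below assumes the 85% hit rate, the 0.08ms hit, and the 0.6ms miss all apply per expert access. The function name is illustrative and not part of MLX-Flash.

```python
def expected_fetch_ms(hit_rate: float, hit_ms: float = 0.08, miss_ms: float = 0.6) -> float:
    """Average time to obtain one expert's weights for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


cold = expected_fetch_ms(0.0)    # every access goes to the SSD
warm = expected_fetch_ms(0.85)   # hit rate quoted in the diagram
print(f"cold: {cold:.3f} ms, warm: {warm:.3f} ms, ratio: {cold / warm:.1f}x")
```

Under these assumptions the warm figure works out to about 0.158ms per expert. This covers weight fetching only; end-to-end speed also depends on compute time, so the ratio need not match the 2-3x quoted above.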

Quick Start

pip install mlx-flash
# Interactive chat
mlx-flash-chat

# API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080

# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8

# See what models fit your hardware
mlx-flash-browse

Performance

| Technique | Gain | How It Works |
|---|---|---|
| LCP Smart Cache | 2.80x | Keeps frequently-used model parts in RAM |
| + Async Prefetch | 2.93x | Loads next part from SSD while GPU computes |
| Mixed Precision | 1.80x smaller | Rarely-used parts stored at lower quality |
| Skip Fallback | 2.67x | Gracefully skip uncached parts instead of waiting |
| Speculative Execution | 14-42% TPOT | Execute predicted experts before router confirms |
| Adaptive Top-K | 10-30% compute | Skip low-confidence secondary experts |
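The async prefetch row describes overlapping SSD reads with GPU work. A minimal sketch of that idea using only the standard library follows. `load_from_ssd` and `compute` are hypothetical stand-ins that sleep instead of touching a disk or GPU, and this is not how MLX-Flash itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_from_ssd(layer: int) -> list[float]:
    """Stand-in for an SSD read; a real loader would read expert weights from disk."""
    time.sleep(0.01)
    return [float(layer)] * 4


def compute(weights: list[float]) -> float:
    """Stand-in for the GPU work done with one layer's weights."""
    time.sleep(0.01)
    return sum(weights)


def run_with_prefetch(num_layers: int) -> list[float]:
    """Overlap the load of layer i+1 with the compute of layer i."""
    outputs: list[float] = []
    if num_layers <= 0:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_from_ssd, 0)
        for layer in range(num_layers):
            weights = pending.result()            # wait for the current layer
            if layer + 1 < num_layers:
                # Start the next load before computing on the current weights.
                pending = pool.submit(load_from_ssd, layer + 1)
            outputs.append(compute(weights))      # runs while the next load is in flight
    return outputs


results = run_with_prefetch(4)
print(results)
```

With a single background worker, each load after the first runs at the same time as the previous layer's compute, so the loop is bounded by the slower of the two rather than by their sum.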

Real Hardware (M3 Max 36GB)

Memory pressure recovery:
  Without optimization:    43.5 tok/s
  With mixed precision:   104.5 tok/s  → 2.4x faster

Cache warm-up:
  Token  0:  83.3ms (cold start)
  Token  8:   5.7ms (warming up)
  Token 24:   0.5ms (full speed) → 41x speedup
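The shape of the warm-up curve can be reproduced with a toy model: early tokens pay for cache misses, later tokens mostly hit. The sketch below is a deterministic simulation with a plain LRU cache, not a benchmark. It borrows the 0.08ms hit and 0.6ms miss figures from the diagram, and the access pattern, cache size, and function name are all made up for illustration.

```python
from collections import OrderedDict

HIT_MS, MISS_MS = 0.08, 0.6   # per-expert latencies borrowed from the diagram


def simulate_warmup(tokens=8, hot_experts=16, per_token=8, capacity=16):
    """Return the simulated weight-fetch time (ms) of each token with an LRU cache."""
    cache = OrderedDict()
    timeline = []
    for t in range(tokens):
        cost = 0.0
        for j in range(per_token):
            expert = (3 * t + j) % hot_experts   # deterministic, overlapping working set
            if expert in cache:
                cache.move_to_end(expert)        # mark as most recently used
                cost += HIT_MS
            else:
                cache[expert] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)    # evict the least recently used entry
                cost += MISS_MS
        timeline.append(round(cost, 2))
    return timeline


timeline = simulate_warmup()
print(timeline)
```

With these defaults the first token pays for eight misses (4.8ms), and from the fifth token onward every access hits (0.64ms). The absolute numbers are properties of the toy model and are unrelated to the M3 Max measurements above.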

What's Inside

35 Python modules + Rust sidecar implementing 15+ research techniques:

| Category | Modules |
|---|---|
| Expert Streaming | GPU lookup tables, speculative execution, skip-fallback, adaptive top-k |
| Prediction (97%+) | Residual-stream predictor, shadow MLP, cross-layer 3-hop prefetch |
| Cache Management | Layer-biased LCP, Belady-optimal eviction, vertical splitting, expert merging |
| Compression | Entropy coding (Huffman uint4), mixed precision (4-bit/2-bit) |
| Memory | Real-time pressure monitoring, wired memory optimization, mx.clear_cache() |
| Serving | OpenAI-compatible API, KV cache 8-bit quantization, SSE streaming |
| Rust Sidecar | axum HTTP/SSE, mach2 memory (0.1ms), DashMap LCP, Unix socket bridge |
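To see why Huffman coding can shrink 4-bit weights, consider the generic textbook algorithm applied to a toy stream of uint4 values. The sketch below is not the project's encoder. The weight distribution is invented, on the assumption that quantized values cluster around a few codes, and `huffman_code_lengths` is a hypothetical helper.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # a single symbol still needs 1 bit
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far})
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)         # two least frequent subtrees
        n2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy 4-bit weight stream: most values sit on or next to the code 8.
weights = [8] * 50 + [7] * 20 + [9] * 20 + [6] * 5 + [10] * 5
lengths = huffman_code_lengths(weights)
coded_bits = sum(lengths[w] for w in weights)
raw_bits = 4 * len(weights)
print(f"raw: {raw_bits} bits, coded: {coded_bits} bits")
```

For this invented distribution the stream drops from 400 bits to 190 bits. Real savings depend entirely on how skewed the actual weight histogram is, and a real encoder must also store the code table.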

Integration

Works with any OpenAI-compatible tool:

# Start server
mlx-flash --port 8080 --preload

# Point any tool at it
# LM Studio: Settings → Server → http://localhost:8080/v1
# Cursor: Settings → Models → OpenAI Compatible → http://localhost:8080/v1
# Claude Code: OPENAI_API_BASE=http://localhost:8080/v1
# continue.dev: apiBase: http://localhost:8080/v1
# Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
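Since the server lists SSE streaming, a client may want to consume tokens as they arrive. The sketch below parses individual `data:` lines, assuming the stream follows the OpenAI chat-completions chunk format (a `choices[0].delta.content` field and a final `[DONE]` sentinel). This README does not confirm that format, and `parse_sse_line` is a hypothetical helper, not an MLX-Flash function.

```python
import json


def parse_sse_line(line: str):
    """Return the text delta from one `data:` line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None                      # blank keep-alives, comments, other fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None                      # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


# Sample lines in the assumed OpenAI-style chunk format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
text = "".join(piece for piece in map(parse_sse_line, sample) if piece)
print(text)
```

In practice the OpenAI SDK handles this parsing when `stream=True` is passed to `chat.completions.create`; the helper only shows what travels over the wire.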

Expert Streaming (for large MoE models)

from mlx_flash_compress.expert_streaming import (
    enable_expert_streaming, enable_skip_fallback
)

# `model` must already be loaded (for example with mlx_lm.load)
# Enable streaming with 50% capacity + adaptive skipping
streaming = enable_expert_streaming(model, capacity_per_layer=64)
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
streaming.warmup()
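The idea behind skip fallback can be illustrated without the library: when a routed expert is not in RAM, drop it and renormalize the router weights of the experts that are. The sketch below is one plausible reading of the technique, not the behavior of `enable_skip_fallback`. The function, its arguments, and the fallback rule for an empty result are assumptions, and the meaning of `adaptive_skip_threshold` is not modeled.

```python
def skip_fallback(routed, cached):
    """Drop experts that are not cached and renormalize the remaining router weights.

    routed: list of (expert_id, router_weight) pairs chosen for one token
    cached: set of expert ids currently resident in RAM
    """
    kept = [(e, w) for e, w in routed if e in cached]
    if not kept:
        # Nothing usable is cached: keep only the top-weighted expert and
        # accept the SSD wait rather than return an empty mixture.
        best = max(routed, key=lambda pair: pair[1])
        return [(best[0], 1.0)]
    total = sum(w for _, w in kept)
    return [(e, w / total) for e, w in kept]


routed = [(3, 0.5), (17, 0.3), (42, 0.2)]
mixture = skip_fallback(routed, cached={3, 42})
print(mixture)
```

Here expert 17 is skipped and the remaining weights are rescaled to sum to 1. The trade-off is output quality for latency, which is presumably why the library exposes a threshold to control it.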

Research Techniques Implemented

From 15+ papers (2024-2026):

| Technique | Paper | Status |
|---|---|---|
| Expert streaming (GPU lookup) | HOBBIT arXiv:2411.01433 | Implemented |
| Residual-stream predictor (97%+) | Speculating Experts arXiv:2603.19289 | Implemented |
| Speculative execution (14-42% TPOT) | MoE-SpAc arXiv:2603.09983 | Implemented |
| Belady-optimal eviction | MoE-SpeQ arXiv:2511.14102 | Implemented |
| Cross-layer 3-hop prefetch | FATE arXiv:2502.12224 | Implemented |
| Layer-depth cache bias | FATE arXiv:2502.12224 | Implemented |
| Vertical expert splitting (2x coverage) | MoEpic paper | Implemented |
| Expert merging (15-30% fewer params) | DEK/EEP arXiv:2509.19781 | Implemented |
| Entropy coding (65% compression) | EntroLLM arXiv:2505.02380 | Implemented |
| Adaptive top-k (10-30% compute savings) | LExI arXiv:2509.02753 | Implemented |
| Mixed precision per-expert | HOBBIT arXiv:2411.01433 | Implemented |
| KV cache 8-bit quantization | mlx-moe / mlx-lm | Implemented |
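Belady's rule, referenced in the table, evicts the cached entry whose next use lies farthest in the future. The sketch below implements the classic offline version over a known access trace. It is a textbook illustration and says nothing about how MLX-Flash or the cited paper obtain future accesses; a live system would have to substitute predictions. The trace and function name are made up.

```python
def belady_misses(accesses, capacity):
    """Count cache misses under Belady's rule: evict the entry reused farthest in the future."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache, misses = set(), 0
    for i, item in enumerate(accesses):
        if item in cache:
            continue                      # hit: nothing to do
        misses += 1
        if len(cache) >= capacity:
            def next_use(entry):
                # Position of the next access to `entry`, or infinity if never reused.
                for j in range(i + 1, len(accesses)):
                    if accesses[j] == entry:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses


trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
print(belady_misses(trace, capacity=3))
```

On this trace the rule incurs 5 misses with room for 3 entries. Because it uses knowledge of the future, this count is a lower bound that practical policies such as LRU can only approach.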

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4/M5)
  • Python 3.10+
  • 16GB+ RAM (more = better caching = faster)

Project Stats

  • 15,000+ lines of code (Python + Rust)
  • 224 tests (192 Python + 32 Rust)
  • 35 Python modules + Rust sidecar
  • 15+ research papers implemented

License

MIT
