Run AI models too large for your Mac's memory — expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon

MLX-Flash

Run AI models too large for your Mac's memory — at near-full speed.

Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD.

How It Works

Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:

┌─────────────────────────────────┐
│    Your Mac's RAM (fast)        │
│  ┌──────────┐ ┌──────────────┐  │
│  │Hot Cache  │ │Mixed Precis. │  │
│  │85%+ hits  │ │4-bit / 2-bit │  │
│  └──────────┘ └──────────────┘  │
└──────────────┬──────────────────┘
               │ cache hit: 0.08ms
┌──────────────┴──────────────────┐
│    Smart Cache Layer            │
│  • LCP Eviction (layer-biased) │
│  • Speculative Prefetch (97%)  │
│  • Memory Monitor              │
│  • Speculative Execution       │
└──────────────┬──────────────────┘
               │ cache miss: 0.6ms
┌──────────────┴──────────────────┐
│    Your SSD (big)               │
│  Full model weights — 200GB+   │
│  Entropy-coded — 65% smaller   │
└─────────────────────────────────┘
         │
         ▼
   MLX GPU Inference

Result: A 200GB AI model runs on your 48GB Mac 2-3x faster than naive SSD streaming.
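
To see why the hit rate matters so much, here is a rough back-of-the-envelope sketch that combines the two latencies from the diagram (0.08ms per cache hit, 0.6ms per cache miss) into an average cost per lookup. It is a plain weighted average and assumes lookups are independent and that no load is hidden behind compute; the function name is made up for this example.

```python
def expected_access_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average cost of one expert lookup for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


HIT_MS = 0.08   # cache hit latency from the diagram
MISS_MS = 0.6   # cache miss latency from the diagram

for rate in (0.0, 0.5, 0.85, 0.97):
    avg = expected_access_ms(rate, HIT_MS, MISS_MS)
    print(f"hit rate {rate:4.0%}: {avg:.3f} ms per lookup")
```

At an 85% hit rate this works out to roughly 0.16ms per lookup, compared with 0.6ms when every lookup goes to the SSD.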

Quick Start

pip install mlx-flash

# Interactive chat
mlx-flash-chat

# API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080

# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8

# See what models fit your hardware
mlx-flash-browse

Performance

Technique              Gain            How It Works
LCP Smart Cache        2.80x           Keeps frequently-used model parts in RAM
+ Async Prefetch       2.93x           Loads next part from SSD while GPU computes
Mixed Precision        1.80x smaller   Rarely-used parts stored at lower quality
Skip Fallback          2.67x           Gracefully skip uncached parts instead of waiting
Speculative Execution  14-42% TPOT     Execute predicted experts before router confirms
Adaptive Top-K         10-30% compute  Skip low-confidence secondary experts
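
The Async Prefetch row boils down to overlapping the SSD read for the next layer with the compute for the current one. Here is a minimal sketch of that idea using a background thread. The `load_from_ssd` and `compute` functions are stand-ins that just sleep; they are not MLX-Flash APIs, and the real scheduling is likely more involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_from_ssd(layer: int) -> str:
    """Stand-in for reading one layer's experts from disk."""
    time.sleep(0.02)
    return f"weights[{layer}]"


def compute(layer: int, weights: str) -> str:
    """Stand-in for the GPU work on one layer."""
    time.sleep(0.02)
    return f"out[{layer}] using {weights}"


def run_with_prefetch(num_layers: int) -> list[str]:
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_from_ssd, 0)  # the first load cannot be hidden
        for layer in range(num_layers):
            weights = pending.result()  # wait for this layer's weights
            if layer + 1 < num_layers:
                # Start the next load now so it runs during compute below.
                pending = pool.submit(load_from_ssd, layer + 1)
            outputs.append(compute(layer, weights))
    return outputs


start = time.perf_counter()
results = run_with_prefetch(8)
print(f"{len(results)} layers in {time.perf_counter() - start:.2f}s")
```

With equal load and compute times, only the very first load is paid for in full; every later load is hidden behind the previous layer's compute.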

Real Hardware (M3 Max 36GB)

Memory pressure recovery:
  Without optimization:    43.5 tok/s
  With mixed precision:   104.5 tok/s  → 2.4x faster

Cache warm-up:
  Token  0:  83.3ms (cold start)
  Token  8:   5.7ms (warming up)
  Token 24:   0.5ms (full speed) → 41x speedup

What's Inside

35 Python modules + Rust sidecar implementing 15+ research techniques:

Category           Modules
Expert Streaming   GPU lookup tables, speculative execution, skip-fallback, adaptive top-k
Prediction (97%+)  Residual-stream predictor, shadow MLP, cross-layer 3-hop prefetch
Cache Management   Layer-biased LCP, Belady-optimal eviction, vertical splitting, expert merging
Compression        Entropy coding (Huffman uint4), mixed precision (4-bit/2-bit)
Memory             Real-time pressure monitoring, wired memory optimization, mx.clear_cache()
Serving            OpenAI-compatible API, KV cache 8-bit quantization, SSE streaming
Rust Sidecar       axum HTTP/SSE, mach2 memory (0.1ms), DashMap LCP, Unix socket bridge
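
For readers new to Belady-optimal eviction: the rule is to evict whichever cached entry will be needed farthest in the future. It requires knowing the future access order, so a real system has to approximate it with a predictor. The sketch below runs the textbook rule on a known trace of expert ids and compares it with plain LRU. It illustrates the general algorithm only and says nothing about how MLX-Flash obtains its predictions; both function names are made up for this example.

```python
from collections import OrderedDict


def belady_misses(trace: list[int], capacity: int) -> int:
    """Count misses when evicting the entry whose next use is farthest away."""
    cache: set[int] = set()
    misses = 0
    for i, expert in enumerate(trace):
        if expert in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(e: int) -> int:
                # Index of the next access to e, or len(trace) if never reused.
                for j in range(i + 1, len(trace)):
                    if trace[j] == e:
                        return j
                return len(trace)

            cache.remove(max(cache, key=next_use))
        cache.add(expert)
    return misses


def lru_misses(trace: list[int], capacity: int) -> int:
    """Count misses for a least-recently-used cache of the same size."""
    cache: OrderedDict[int, bool] = OrderedDict()
    misses = 0
    for expert in trace:
        if expert in cache:
            cache.move_to_end(expert)
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used entry
        cache[expert] = True
    return misses


trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print("Belady misses:", belady_misses(trace, capacity=3))
print("LRU misses:   ", lru_misses(trace, capacity=3))
```

On this small trace the oracle rule takes 5 misses and LRU takes 6. No eviction policy can do better than Belady on a given trace, which makes it a useful yardstick.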

Integration

Works with any OpenAI-compatible tool:

# Start server
mlx-flash --port 8080 --preload

# Point any tool at it
# LM Studio: Settings → Server → http://localhost:8080/v1
# Cursor: Settings → Models → OpenAI Compatible → http://localhost:8080/v1
# Claude Code: OPENAI_API_BASE=http://localhost:8080/v1
# continue.dev: apiBase: http://localhost:8080/v1

# Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
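
Since the server lists SSE streaming among its features, the same client should also work with `stream=True`, which is a standard option in the OpenAI Python SDK. The helper below is a sketch built on that assumption; `stream_reply` is a made-up name, and it takes the client as an argument so it can be tried without a running server.

```python
from typing import Iterator


def stream_reply(client, prompt: str, model: str = "local") -> Iterator[str]:
    """Yield pieces of the reply as the server sends them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        piece = chunk.choices[0].delta.content
        if piece:  # the delta content can be None or empty
            yield piece
```

With the `client` from the example above, `for piece in stream_reply(client, "Hello!"): print(piece, end="", flush=True)` prints the reply token by token.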

Expert Streaming (for large MoE models)

from mlx_flash_compress.expert_streaming import (
    enable_expert_streaming, enable_skip_fallback
)

# `model` is an MoE model that has already been loaded.
# Enable streaming with 50% capacity + adaptive skipping
streaming = enable_expert_streaming(model, capacity_per_layer=64)
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
streaming.warmup()
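
To give a feel for what skipping an uncached expert can mean, here is a conceptual sketch: drop the router's picks that are not in the cache and renormalize the remaining weights so they still sum to 1. This is one plausible reading of skip fallback, not the library's actual logic. The function is hypothetical, and it does not model what `adaptive_skip_threshold` controls.

```python
def skip_fallback(
    selected: dict[int, float], cached: set[int]
) -> tuple[dict[int, float], list[int]]:
    """Return (experts to run with weights, experts that must be loaded first).

    `selected` maps expert id -> router weight for one token.
    """
    available = {e: w for e, w in selected.items() if e in cached}
    if not available:
        # Nothing usable is cached: wait for the top-weighted expert
        # rather than skipping every pick.
        best = max(selected, key=selected.get)
        return {best: 1.0}, [best]
    total = sum(available.values())
    return {e: w / total for e, w in available.items()}, []


picks = {7: 0.6, 12: 0.3, 40: 0.1}   # router output for one token
weights, to_load = skip_fallback(picks, cached={7, 40})
print(weights, to_load)
```

Here expert 12 is skipped, and experts 7 and 40 keep their 6:1 weight ratio after renormalization.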

Research Techniques Implemented

From 15+ papers (2024-2026):

Technique                                Paper                                 Status
Expert streaming (GPU lookup)            HOBBIT arXiv:2411.01433               Implemented
Residual-stream predictor (97%+)         Speculating Experts arXiv:2603.19289  Implemented
Speculative execution (14-42% TPOT)      MoE-SpAc arXiv:2603.09983             Implemented
Belady-optimal eviction                  MoE-SpeQ arXiv:2511.14102             Implemented
Cross-layer 3-hop prefetch               FATE arXiv:2502.12224                 Implemented
Layer-depth cache bias                   FATE arXiv:2502.12224                 Implemented
Vertical expert splitting (2x coverage)  MoEpic paper                          Implemented
Expert merging (15-30% fewer params)     DEK/EEP arXiv:2509.19781              Implemented
Entropy coding (65% compression)         EntroLLM arXiv:2505.02380             Implemented
Adaptive top-k (10-30% compute savings)  LExI arXiv:2509.02753                 Implemented
Mixed precision per-expert               HOBBIT arXiv:2411.01433               Implemented
KV cache 8-bit quantization              mlx-moe / mlx-lm                      Implemented
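
As a small illustration of the entropy-coding row: 4-bit quantized weights tend to cluster around a few values, so a Huffman code can store the common values in fewer than 4 bits. The sketch below builds standard Huffman code lengths for a toy, made-up distribution of uint4 values and reports the coded size relative to raw 4-bit storage. It is not the EntroLLM or MLX-Flash encoder, and real ratios depend on how skewed the actual weights are.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols: list[int]) -> dict[int, int]:
    """Return the Huffman code length in bits for each distinct symbol."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def compressed_fraction(symbols: list[int], bits_per_symbol: int = 4) -> float:
    """Huffman-coded size divided by the fixed-width size."""
    lengths = huffman_code_lengths(symbols)
    counts = Counter(symbols)
    coded_bits = sum(counts[s] * lengths[s] for s in counts)
    return coded_bits / (len(symbols) * bits_per_symbol)


# Toy uint4 values, heavily concentrated on the value 8.
values = [8] * 70 + [7] * 10 + [9] * 10 + [6] * 5 + [10] * 5
print(f"coded size: {compressed_fraction(values):.0%} of raw 4-bit storage")
```

For this toy distribution the coded size is 40% of the raw size. The code table itself is not counted, which is negligible for large tensors.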

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4/M5)
  • Python 3.10+
  • 16GB+ RAM (more = better caching = faster)
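
If you want to check a machine against this list before installing, a small script along these lines should do. It uses only the standard library; the function name is made up, and the RAM check only runs if you pass a value in, since detecting memory portably is outside this sketch.

```python
import platform
import sys
from typing import Optional


def unmet_requirements(
    system: str, machine: str, python: tuple, ram_gb: Optional[float] = None
) -> list[str]:
    """Return the requirements that are not met (an empty list means OK)."""
    problems = []
    if system != "Darwin" or machine != "arm64":
        problems.append("macOS on Apple Silicon is required")
    if tuple(python) < (3, 10):
        problems.append("Python 3.10+ is required")
    if ram_gb is not None and ram_gb < 16:
        problems.append("16GB+ RAM is required")
    return problems


problems = unmet_requirements(
    platform.system(), platform.machine(), sys.version_info[:2]
)
print("OK" if not problems else "\n".join(problems))
```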

Project Stats

  • 15,000+ lines of code (Python + Rust)
  • 224 tests (192 Python + 32 Rust)
  • 35 Python modules + Rust sidecar
  • 15+ research papers implemented

License

MIT

Download files

Source Distribution: mlx_flash-0.6.0.tar.gz (168.4 kB)
Built Distribution: mlx_flash-0.6.0-py3-none-any.whl (164.9 kB, Python 3)

Both files were uploaded via twine/6.2.0 CPython/3.13.12, without Trusted Publishing.

Hashes for mlx_flash-0.6.0.tar.gz

Algorithm    Hash digest
SHA256       c2d7623ed3f5715c740bb26174bce857b47dc967efb9e518b5d1bbe49176f5a4
MD5          c97dd91b1bf57bea43195e13cc8a3337
BLAKE2b-256  81c4e7d87bd353c59ac79c90b7cfd74dc1cf5fc6067d6c8c52e66e6a041f0f3d

Hashes for mlx_flash-0.6.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       9ff982fd731ed31949fdffa96da41e66d33134e7b02837c6bdbcc5ff9fee7b3e
MD5          de09835acc7189154b09831e5a17db2e
BLAKE2b-256  7b5f6d3f22d144acdab262901d53d4e3799f40ef21b9c47689a83d27a343329f
