Skip to main content

CLI control plane for local LLM infrastructure on Apple Silicon

Project description

mlx-stack

CLI control plane for local LLM inference infrastructure on Apple Silicon.

CI PyPI Python 3.13+ License: MIT Platform: macOS Apple Silicon


Table of Contents

Architecture

                        ┌──────────────────────────────────────────────────┐
                        │                  mlx-stack CLI                   │
                        │  hardware detection · recommendation · lifecycle │
                        └──────────────┬───────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              │                        │                        │
              ▼                        ▼                        ▼
   ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
   │   vllm-mlx :8000  │  │   vllm-mlx :8001  │  │   vllm-mlx :8002  │
   │  ── standard ──   │  │    ── fast ──      │  │  ── longctx ──    │
   │  Qwen 3.5 14B     │  │  Qwen 3.5 3B      │  │  DeepSeek R1 8B   │
   └────────┬──────────┘  └────────┬──────────┘  └────────┬──────────┘
            │                      │                       │
            └──────────────────────┼───────────────────────┘
                                   │
                        ┌──────────▼──────────┐
                        │  LiteLLM Proxy :4000│
                        │  routing · fallback  │
                        │  load balancing      │
                        └──────────┬──────────┘
                                   │
                        ┌──────────▼──────────┐
                        │  OpenAI-compatible   │
                        │  /v1 endpoint        │
                        │                      │
                        │  ← Your app / agent  │
                        └─────────────────────┘

mlx-stack orchestrates vllm-mlx model servers and a LiteLLM API gateway to serve large language models locally on Apple Silicon Macs. Each tier runs a dedicated model optimized for a specific workload — quality, speed, or long-context — and LiteLLM routes requests through a single OpenAI-compatible endpoint with automatic fallback.

Feature Highlights

  • Hardware-Aware Recommendations — Detects your Apple Silicon chip (M1–M5, Pro/Max/Ultra), measures memory bandwidth, and recommends an optimal model stack tuned to your exact hardware.
  • Tiered Model Serving — Assigns models to standard, fast, and longctx tiers so agents and apps can target the right balance of quality and speed per request.
  • 24/7 Unattended Operation — Built-in watchdog with auto-restart, flap detection, exponential backoff, and macOS LaunchAgent integration for always-on inference on headless Mac Minis.
  • One-Command Setupmlx-stack init --accept-defaults profiles your hardware, picks models, generates configs, and gets you from zero to a running OpenAI-compatible endpoint in minutes.
  • 15-Model Curated Catalog — Ships with benchmark data for Qwen 3.5, Gemma 3, DeepSeek R1, Nemotron, and Llama 3.3 families — with quality scores, tool-calling metadata, and per-hardware performance data.

Installation

The recommended way to install mlx-stack is with uv:

uv tool install mlx-stack

This installs mlx-stack globally as an isolated tool — no need to manage virtual environments.

Alternatively, you can use pipx:

pipx install mlx-stack

Or try it without installing:

uvx mlx-stack profile

Note: uvx runs in an ephemeral environment, which works great for one-off commands. For the watchdog and LaunchAgent features (mlx-stack watch, mlx-stack install), use uv tool install so the binary has a stable path.

Quick Start

# 1. Detect your hardware
mlx-stack profile

# 2. Generate stack configuration
mlx-stack init --accept-defaults

# 3. Download required models
mlx-stack pull qwen3.5-8b

# 4. Start all services
mlx-stack up

# 5. Verify
mlx-stack status

The OpenAI-compatible API is now available at http://localhost:4000/v1.

# Stop everything when done
mlx-stack down

CLI Reference

Setup & Configuration

Command Description
mlx-stack profile Detect Apple Silicon hardware and save profile to ~/.mlx-stack/profile.json
mlx-stack config set <key> <value> Set a configuration value
mlx-stack config get <key> Get a configuration value
mlx-stack config list List all configuration values with defaults and sources
mlx-stack config reset --yes Reset all configuration to defaults

Model Management

mlx-stack recommend — Recommend an optimal model stack based on your hardware profile.

Option Description
--budget <value> Memory budget override (e.g., 30gb). Defaults to 40% of unified memory
--intent <balanced|agent-fleet> Optimization strategy
--show-all Show all budget-fitting models ranked by score

mlx-stack models — List locally downloaded models with disk size, quantization, and active stack status.

Option Description
--catalog Show all catalog models with hardware-specific benchmark data
--family <name> Filter by model family (e.g., qwen3.5)
--tag <name> Filter by tag (e.g., agent-ready)
--tool-calling Filter to tool-calling-capable models only

mlx-stack pull <model> — Download a model from the catalog.

Option Description
--quant <int4|int8|bf16> Quantization level (default: int4)
--bench Run a quick benchmark after download
--force Re-download even if the model already exists

mlx-stack init — Generate stack definition and LiteLLM proxy configuration.

Option Description
--accept-defaults Use defaults without prompting
--intent <balanced|agent-fleet> Optimization strategy
--add <model> Add a model to the stack (repeatable)
--remove <tier> Remove a tier from the stack (repeatable)
--force Overwrite existing stack configuration

Stack Lifecycle

mlx-stack up — Start all services: one vllm-mlx process per tier plus the LiteLLM proxy.

Option Description
--dry-run Show exact commands without starting anything
--tier <name> Start only the specified tier

mlx-stack down — Stop all managed services (SIGTERM → 10s grace → SIGKILL).

Option Description
--tier <name> Stop only the specified tier

mlx-stack status — Show health and status of all services (healthy, degraded, down, crashed, stopped).

Option Description
--json Output in JSON format

Diagnostics

mlx-stack bench <target> — Benchmark a running tier or catalog model. Runs 3 iterations and compares against catalog thresholds (PASS/WARN/FAIL).

Option Description
--save Persist results for use by recommend and init scoring

Ops & Reliability

mlx-stack logs [service] — View and manage service logs. Without arguments, lists all log files.

Option Description
--follow / -f Follow log output in real-time
--tail <N> Show last N lines (default: 50)
--service <name> Filter to a specific service
--rotate Rotate eligible log files
--all Show archived and current logs chronologically

mlx-stack watch — Health monitor with auto-restart, flap detection, and log rotation.

Option Description
--interval <seconds> Polling interval (default: 30)
--max-restarts <N> Restarts before marking as flapping (default: 5)
--restart-delay <seconds> Base restart delay with exponential backoff (default: 5)
--daemon Run in background as a daemon

mlx-stack install — Install the watchdog as a macOS LaunchAgent.

Option Description
--status Show current LaunchAgent status

mlx-stack uninstall — Remove the watchdog LaunchAgent. Running services are not affected.

Configuration

Configuration is stored in ~/.mlx-stack/config.yaml. Available keys:

Key Default Description
openrouter-key (not set) OpenRouter API key for cloud fallback
default-quant int4 Default quantization level (int4, int8, bf16)
memory-budget-pct 40 Percentage of unified memory to budget for models (1–100)
litellm-port 4000 LiteLLM proxy port
model-dir ~/.mlx-stack/models Model storage directory
auto-health-check true Run health checks automatically on startup
log-max-size-mb 50 Maximum log file size in MB before rotation
log-max-files 3 Number of rotated log files to retain

24/7 Operation

mlx-stack is designed to run unattended on always-on hardware like a Mac Mini.

Quick setup

mlx-stack init --accept-defaults
mlx-stack install

This installs a macOS LaunchAgent that starts the watchdog on login. The watchdog:

  • Monitors service health every 30 seconds
  • Auto-restarts crashed processes with exponential backoff
  • Detects flapping services and stops restart loops
  • Rotates logs automatically to prevent unbounded disk usage

Manual monitoring

mlx-stack watch                  # Foreground with Rich status table
mlx-stack watch --interval 60   # Less frequent polling
mlx-stack watch --daemon         # Background without LaunchAgent

Log management

mlx-stack logs                   # List all log files
mlx-stack logs fast              # Last 50 lines of fast tier
mlx-stack logs fast --follow     # Stream in real-time
mlx-stack logs --rotate          # Rotate all eligible logs now

Removing the agent

mlx-stack uninstall

This stops the watchdog and removes the LaunchAgent plist. Running services are not affected.

Model Catalog

The built-in catalog includes 15 models across 5 families:

Family Models Parameters
Qwen 3.5 6 variants 0.8B, 3B, 8B, 14B, 32B, 72B
Gemma 3 3 variants 4B, 12B, 27B
DeepSeek R1 2 variants 8B, 32B
Nemotron 2 variants 8B, 49B
Qwen 3 / Llama 3.3 2 variants 8B each

Each entry includes benchmark data for common Apple Silicon configurations, quality scores, and capability metadata (tool calling, thinking/reasoning, vision).

Architecture Details

mlx-stack manages a tiered local inference stack with three layers:

Model Servers (vllm-mlx)

One vllm-mlx instance per tier, each serving a single model on a dedicated port:

  • standard (port 8000) — Highest-quality model that fits your memory budget. Optimized for accuracy-sensitive tasks.
  • fast (port 8001) — Fastest model for latency-sensitive workloads like autocomplete and quick tool calls.
  • longctx (port 8002) — Architecturally diverse model (e.g., Mamba2 hybrid) for extended context windows.

Each server runs with continuous batching, paged KV cache, and automatic tool-call parsing enabled.

API Gateway (LiteLLM)

LiteLLM acts as the unified entry point on port 4000, providing:

  • OpenAI-compatible /v1 API — Drop-in replacement for api.openai.com in any client or agent framework.
  • Tier-based routing — Requests target specific tiers by model name, or fall through a configurable chain.
  • Automatic fallback — If the primary tier is unavailable, requests cascade to the next healthy tier.

Cloud Fallback (Optional)

With an OpenRouter API key configured, a premium cloud tier is available as a last-resort fallback, giving you access to frontier models when local capacity is insufficient.

Recommendation Engine

The recommendation engine scores all catalog models against your hardware profile:

  1. Hardware profiling — Detects chip variant, GPU cores, unified memory, and memory bandwidth.
  2. Memory budgeting — Filters models to those fitting within your configured memory budget (default: 40% of unified memory).
  3. Composite scoring — Weights speed, quality, tool-calling capability, and memory efficiency based on your chosen intent (balanced or agent-fleet).
  4. Tier assignment — Assigns top-scoring models to standard, fast, and longctx tiers.
  5. Local calibration — Saved benchmark data from mlx-stack bench --save overrides catalog estimates for precise scoring.

Process Management

  • PID tracking — Each service writes its PID to ~/.mlx-stack/pids/ for reliable lifecycle management.
  • Lockfile — Prevents concurrent up/down operations via fcntl.flock.
  • Health checks — HTTP polling with exponential backoff and 120-second timeout per service.
  • 5-state model — Services are reported as healthy, degraded, down, crashed, or stopped.
  • Graceful shutdown — SIGTERM with 10-second grace period, escalating to SIGKILL.

Development

See DEVELOPING.md for the full developer guide, including project architecture, testing strategy, and how to add new models or commands.

# Install dev dependencies
uv sync

# Run tests
uv run pytest

# Type checking
uv run python -m pyright

# Linting
uv run ruff check src/ tests/

Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, suggesting features, and submitting pull requests.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mlx_stack-0.2.0.tar.gz (371.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mlx_stack-0.2.0-py3-none-any.whl (125.8 kB view details)

Uploaded Python 3

File details

Details for the file mlx_stack-0.2.0.tar.gz.

File metadata

  • Download URL: mlx_stack-0.2.0.tar.gz
  • Upload date:
  • Size: 371.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mlx_stack-0.2.0.tar.gz
Algorithm Hash digest
SHA256 c3a11dbd6edaf0d19de0591740585d098de75ee247f1014988559dd0b116974d
MD5 29c6c64617167b24aa5405a97eb94fa4
BLAKE2b-256 67473f006c8a20e5b6be187b899e5ed277cf36f64ccaa9368ea4c582e1a046f2

See more details on using hashes here.

Provenance

The following attestation bundles were made for mlx_stack-0.2.0.tar.gz:

Publisher: publish.yml on weklund/mlx-stack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mlx_stack-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: mlx_stack-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 125.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mlx_stack-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0216939161d3820122a1aafc34cb1393c434d673b6146464ee431e3d97637190
MD5 ddc81ad3b79ac6e16341fdd6416d6d6f
BLAKE2b-256 f5f019d5a064d54511322af484203301a86cb51938f994221b7e500797da4a17

See more details on using hashes here.

Provenance

The following attestation bundles were made for mlx_stack-0.2.0-py3-none-any.whl:

Publisher: publish.yml on weklund/mlx-stack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page