Web-based MLX model manager for Apple Silicon Macs

MLX Manager

Python 3.11-3.12 · FastAPI · Svelte 5 · TypeScript

Run and serve local LLMs on your Mac with one command. MLX Manager provides a web UI for managing MLX-optimized models on Apple Silicon, with an embedded high-performance inference server that exposes both OpenAI- and Anthropic-compatible APIs.

Why MLX Manager?

Running local LLMs typically requires juggling multiple tools, config files, and terminal commands. Tools like Ollama and LM Studio make model management easier, but they rely on llama.cpp — a cross-platform C++ runtime that treats Apple Silicon as one target among many. MLX Manager takes a different approach: it includes a purpose-built inference server that calls the MLX framework directly, so every operation runs natively on Metal GPU without translation layers, format conversions, or cross-platform abstractions.

  • One-click model downloads - fetch MLX models from HuggingFace (mlx-community, lmstudio-community, and more)
  • Smart model discovery - filter by architecture (Llama, Qwen, Mistral), quantization (4-bit, 8-bit), and capabilities (multimodal, tool use)
  • Purpose-built inference server - direct MLX framework integration with OpenAI and Anthropic API compatibility, not a wrapper around llama.cpp
  • Multi-model, multi-type - load text, vision, embeddings, and audio models simultaneously with LRU eviction, rather than one model at a time
  • Two-phase model lifecycle - models are probed at load time to discover capabilities (tool calling, thinking, streaming), then served without per-request capability checks
  • Visual server management - start, stop, and monitor models with real-time CPU/memory metrics
  • Rich chat interface - test models with image/video/text attachments, thinking model support, and MCP tool integration
  • Cloud routing - seamlessly route requests to OpenAI/Anthropic APIs when local models can't handle them
  • User authentication - secure multi-user access with JWT auth and admin controls
  • Background service - models auto-start on login via macOS launchd
  • Menubar app - quick access from your Mac's status bar

Quick Start

Install

# Homebrew (recommended)
brew tap tumma72/mlx-manager https://github.com/tumma72/mlx-manager
brew install mlx-manager

# Or via pip
pip install mlx-manager

Run

mlx-manager serve

Open http://localhost:10242 and you're ready to:

  1. Register - Create your account (first user becomes admin)
  2. Browse - Search HuggingFace for MLX-optimized models
  3. Filter - Find models by architecture, quantization, or capabilities
  4. Download - One-click download with progress tracking
  5. Configure - Create a server profile with custom settings
  6. Run - Start serving and chat with your model

Use as an API

Once a model is loaded, use it with any OpenAI or Anthropic client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10242/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

import anthropic

# The Anthropic SDK appends /v1/messages to base_url itself, so /v1 is omitted here.
client = anthropic.Anthropic(base_url="http://localhost:10242", api_key="not-needed")
message = client.messages.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)

Embedded Inference Server

MLX Manager includes a fully self-contained inference server mounted at /v1. No external inference backends needed — the server calls mlx-lm, mlx-vlm, mlx-embeddings, and mlx-audio directly, keeping everything on the Metal GPU without process boundaries or IPC overhead.

Why build our own instead of wrapping Ollama or llama.cpp? Those projects target every platform and GPU vendor, which means abstraction layers, GGUF format conversions, and lowest-common-denominator threading. MLX Manager's server is Apple Silicon-only by design: it uses a persistent Metal GPU thread with a job queue for thread affinity, list-buffer string assembly to avoid O(n^2) concatenation in streaming, and a two-phase probe-then-serve lifecycle that eliminates capability checks from the hot path. The result is an inference server that behaves like a native macOS application — because it is one.
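
The "persistent thread with a job queue" idea can be pictured with a minimal, standard-library sketch. This is not MLX Manager's actual code: the class name SingleThreadExecutor and the "gpu-worker" thread name are hypothetical, and the jobs here only report which thread they ran on, where a real job would call into MLX.

```python
import queue
import threading
from concurrent.futures import Future


class SingleThreadExecutor:
    """Runs every submitted job on one long-lived thread (hypothetical sketch)."""

    def __init__(self) -> None:
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, name="gpu-worker", daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            item = self._jobs.get()
            if item is None:  # sentinel: stop the worker
                break
            func, args, future = item
            try:
                future.set_result(func(*args))
            except Exception as exc:  # hand errors back to the caller
                future.set_exception(exc)

    def submit(self, func, *args) -> Future:
        """Queue a job and return a Future the caller can wait on."""
        future = Future()
        self._jobs.put((func, args, future))
        return future

    def shutdown(self) -> None:
        self._jobs.put(None)
        self._thread.join()


executor = SingleThreadExecutor()
# Submit several jobs; each one returns the identifier of the thread it ran on.
futures = [executor.submit(threading.get_ident) for _ in range(5)]
thread_ids = {f.result() for f in futures}
executor.shutdown()
```

Because all jobs pass through one queue and one thread, thread_ids ends up with a single entry regardless of which thread called submit, which is the affinity property the paragraph above describes.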

Multi-Protocol API

Protocol   Endpoint                         Description
OpenAI     POST /v1/chat/completions        Chat completions (streaming + non-streaming)
OpenAI     POST /v1/completions             Legacy text completions
OpenAI     POST /v1/embeddings              Text embeddings
OpenAI     POST /v1/audio/speech            Text-to-speech
OpenAI     POST /v1/audio/transcriptions    Speech-to-text
Anthropic  POST /v1/messages                Anthropic Messages API
Both       GET /v1/models                   List available models
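
The endpoints can also be called without an SDK. The sketch below builds, but does not send, a chat completion request using only the standard library. It assumes the server accepts the standard OpenAI request body and that no authentication header is needed on a default local install; build_chat_request is a hypothetical helper, not part of MLX Manager.

```python
import json
import urllib.request

BASE_URL = "http://localhost:10242/v1"  # address used in the Quick Start


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("mlx-community/Llama-3.2-3B-Instruct-4bit", "Hello!")

# Sending requires a running server with the model loaded, for example:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```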

Key Capabilities

  • Unified adapter architecture - single ModelAdapter handles all model types with data-driven family configs (no subclass explosion)
  • 8 model families - Qwen, GLM-4, Llama, Gemma, Mistral/Devstral/Magistral, Liquid, Whisper, Kokoro
  • Smart model detection - auto-detects model type, family, and capabilities from config.json
  • Two-phase probe-then-serve - discovers tool-calling format, thinking delimiters, and streaming support at load time; hot path runs with zero capability checks
  • Persistent Metal GPU thread - dedicated thread with job queue ensures Metal thread affinity across all requests
  • Multi-model pool with LRU eviction - host text, vision, embeddings, and audio models simultaneously; auto-evict when memory pressure rises
  • Multi-protocol from one server - both OpenAI and Anthropic APIs from the same endpoint with bidirectional protocol translation
  • Continuous batching (experimental) with prefix caching and priority scheduling
  • Performance-optimized streaming - list-buffer string assembly (no O(n^2) concatenation), clean shutdown with drain timeout
  • Structured output - JSON mode with schema validation
  • Cloud routing - route to OpenAI/Anthropic cloud APIs when local models can't handle the request
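
To illustrate the LRU eviction mentioned in the list above, here is a minimal sketch built on OrderedDict. It is not the server's implementation: LRUModelPool and its loader callback are hypothetical, and it evicts by model count only, whereas the real pool is described as also reacting to memory pressure.

```python
from collections import OrderedDict
from typing import Callable, List


class LRUModelPool:
    """Keeps at most max_models loaded and evicts the least recently used one."""

    def __init__(self, loader: Callable[[str], object], max_models: int = 4) -> None:
        self._loader = loader
        self._max_models = max_models
        self._models = OrderedDict()  # model_id -> loaded model, oldest first
        self.evicted: List[str] = []  # recorded only for illustration

    def get(self, model_id: str) -> object:
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        while len(self._models) >= self._max_models:
            oldest, _ = self._models.popitem(last=False)  # drop the LRU entry
            self.evicted.append(oldest)
        model = self._loader(model_id)
        self._models[model_id] = model
        return model

    def loaded(self) -> List[str]:
        return list(self._models)


pool = LRUModelPool(loader=lambda model_id: f"<weights for {model_id}>", max_models=2)
pool.get("text-model")
pool.get("vision-model")
pool.get("text-model")       # touch: text-model becomes the most recent entry
pool.get("embedding-model")  # pool is full, so vision-model is evicted
```

The default of 4 mirrors the documented MLX_SERVER_MAX_MODELS default; the placeholder model names stand in for real HuggingFace model IDs.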

Observability

  • Audit logging - privacy-first request metadata logging with WebSocket live streaming
  • Prometheus metrics - request latency, throughput, model memory, pool cache hits/misses
  • LogFire integration - distributed tracing with Pydantic LogFire
  • RFC 7807 errors - structured error responses with request ID correlation
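
For readers unfamiliar with RFC 7807, the sketch below shows the general shape of a problem-details error body. The members type, title, status, and detail come from the RFC; the request_id extension member and the problem_response helper are assumptions made for illustration, so check docs/MLX_SERVER.md for the fields the server actually returns.

```python
import json
import uuid
from typing import Dict, Optional, Tuple


def problem_response(
    status: int, title: str, detail: str, request_id: Optional[str] = None
) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Return (headers, body) for an RFC 7807 problem-details error."""
    body = {
        "type": "about:blank",  # RFC 7807 default when no specific type URI exists
        "title": title,
        "status": status,
        "detail": detail,
        # Extension member for log correlation; the field name is an assumption.
        "request_id": request_id or uuid.uuid4().hex,
    }
    headers = {"Content-Type": "application/problem+json"}
    return headers, body


headers, body = problem_response(
    404, "Not Found", "Model 'example/model' is not loaded.", request_id="req-123"
)
print(json.dumps(body, indent=2))
```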

See docs/MLX_SERVER.md for the full configuration reference, security guide, metrics list, and API documentation.

Features

Model Discovery

Browse and filter models with rich metadata:

  • Architecture badges - Llama, Qwen, Mistral, Gemma, Phi, and more
  • Quantization info - 4-bit, 8-bit quantization levels
  • Capability detection - Multimodal (vision), tool use support
  • Toggle view - Switch between your downloaded models and HuggingFace search

User Management

Secure multi-user support:

  • JWT authentication - Secure token-based auth
  • Admin controls - Approve/disable users, manage permissions
  • First-user admin - Initial user automatically becomes administrator
  • Rate limiting - Per-IP request throttling with token bucket algorithm
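
The token bucket algorithm named above can be sketched in a few lines. This is a generic illustration, not MLX Manager's code: TokenBucket, PerIPLimiter, and the injectable clock are hypothetical, and the burst capacity is a free parameter here.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to capacity, refilled at rate_per_minute tokens per minute."""

    def __init__(self, rate_per_minute: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_minute / 60.0  # tokens per second
        self._capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PerIPLimiter:
    """Keeps one independent bucket per client IP address."""

    def __init__(self, rate_per_minute: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate, self._capacity, self._clock = rate_per_minute, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, ip: str) -> bool:
        if ip not in self._buckets:
            self._buckets[ip] = TokenBucket(self._rate, self._capacity, self._clock)
        return self._buckets[ip].allow()


fake_now = [0.0]  # controllable clock so the example is deterministic
limiter = PerIPLimiter(rate_per_minute=60, capacity=2, clock=lambda: fake_now[0])
burst = [limiter.allow("10.0.0.1") for _ in range(3)]  # third request exceeds the burst
other_ip = limiter.allow("10.0.0.2")                    # a different IP has its own bucket
fake_now[0] = 1.0                                       # one second later, one token is back
after_wait = limiter.allow("10.0.0.1")
```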

Server Monitoring

Real-time server metrics:

  • Memory usage and CPU/GPU utilization
  • Server uptime tracking
  • One-click start/stop/restart controls

Chat Interface

Rich conversation experience:

  • Multimodal support - Attach images, videos, and text files via drag-drop or button
  • Thinking models - Collapsible thinking panel for reasoning models (Qwen3, GLM-4, DeepSeek)
  • MCP tools - Built-in calculator and weather tools for testing tool-use models
  • System prompts - Configure default context per server profile

System Requirements

  • macOS 13+ with Apple Silicon (M1/M2/M3/M4)
  • Python 3.11 or 3.12
  • 8GB+ RAM (16GB+ recommended for larger models)

Commands

mlx-manager serve            # Start the web server
mlx-manager menubar          # Launch menubar app
mlx-manager install-service  # Auto-start on login
mlx-manager status           # Show running servers

Configuration

Environment variables (all optional):

Variable                        Default                        Description
MLX_MANAGER_DATABASE_PATH       ~/.mlx-manager/mlx-manager.db  Database location
MLX_MANAGER_DEFAULT_PORT_START  10240                          Starting port for servers
MLX_MANAGER_JWT_SECRET          Auto-generated                 JWT signing secret

MLX Server Configuration

The embedded MLX inference server accepts MLX_SERVER_* environment variables. All settings are opt-in with safe defaults, so no configuration is needed for local use.

Variable                          Default   Description
MLX_SERVER_ADMIN_TOKEN            none      Bearer token for /v1/admin/* endpoints
MLX_SERVER_RATE_LIMIT_RPM         0 (off)   Requests per minute per IP
MLX_SERVER_METRICS_ENABLED        false     Enable Prometheus metrics at /v1/admin/metrics
MLX_SERVER_MAX_MEMORY_GB          0 (auto)  Model pool memory limit (0 = 75% of device RAM)
MLX_SERVER_MAX_MODELS             4         Max models loaded simultaneously
MLX_SERVER_TIMEOUT_CHAT_SECONDS   900       Chat completions timeout
MLX_SERVER_DRAIN_TIMEOUT_SECONDS  30        Graceful shutdown drain timeout
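
The defaults in the table, and the "0 means 75% of device RAM" rule in particular, can be read as the following sketch. The variable names and defaults come from the table; ServerSettings, load_settings, the device_ram_gb parameter, and the accepted spellings of boolean values are assumptions, not the server's actual parsing code.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServerSettings:
    max_memory_gb: float
    max_models: int
    rate_limit_rpm: int
    metrics_enabled: bool


def load_settings(env: Mapping[str, str], device_ram_gb: float) -> ServerSettings:
    """Resolve MLX_SERVER_* values, falling back to the documented defaults."""
    configured = float(env.get("MLX_SERVER_MAX_MEMORY_GB", "0"))
    # 0 means "auto": use 75% of device RAM as the pool limit.
    max_memory = configured if configured > 0 else 0.75 * device_ram_gb
    return ServerSettings(
        max_memory_gb=max_memory,
        max_models=int(env.get("MLX_SERVER_MAX_MODELS", "4")),
        rate_limit_rpm=int(env.get("MLX_SERVER_RATE_LIMIT_RPM", "0")),  # 0 = off
        # Which strings count as "true" is an assumption of this sketch.
        metrics_enabled=env.get("MLX_SERVER_METRICS_ENABLED", "false").lower()
        in {"1", "true", "yes"},
    )


defaults = load_settings({}, device_ram_gb=16)
custom = load_settings(
    {"MLX_SERVER_MAX_MEMORY_GB": "8", "MLX_SERVER_MAX_MODELS": "2"},
    device_ram_gb=16,
)
# In a real process the first argument would be os.environ.
```

On a 16 GB machine with nothing set, this resolves to a 12 GB pool limit and up to 4 models.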

See docs/MLX_SERVER.md for the full configuration reference, security guide, metrics list, and API documentation.

Development

git clone https://github.com/tumma72/mlx-manager.git
cd mlx-manager
make install-dev  # Install dependencies
make dev          # Start dev servers
make test         # Run tests (4500+ tests)

License

MIT

Acknowledgments

Built on MLX, mlx-lm, mlx-vlm, mlx-embeddings, and mlx-audio.
