Web-based MLX model manager for Apple Silicon Macs

MLX Manager

Built with FastAPI, Svelte 5, and TypeScript. Requires Python 3.11 or 3.12.

Run and serve local LLMs on your Mac with one command. MLX Manager provides a web UI for managing MLX-optimized models on Apple Silicon, with an embedded high-performance inference server exposing both OpenAI and Anthropic-compatible APIs.

Why MLX Manager?

Running local LLMs typically requires juggling multiple tools, config files, and terminal commands. Tools like Ollama and LM Studio make model management easier, but they rely on llama.cpp — a cross-platform C++ runtime that treats Apple Silicon as one target among many. MLX Manager takes a different approach: it includes a purpose-built inference server that calls the MLX framework directly, so every operation runs natively on Metal GPU without translation layers, format conversions, or cross-platform abstractions.

  • One-click downloads of MLX models from HuggingFace (mlx-community, lmstudio-community, and more)
  • Smart model discovery - filter by architecture (Llama, Qwen, Mistral), quantization (4-bit, 8-bit), and capabilities (multimodal, tool use)
  • Purpose-built inference server - direct MLX framework integration with OpenAI and Anthropic API compatibility, not a wrapper around llama.cpp
  • Multi-model, multi-type - load text, vision, embeddings, and audio models simultaneously with LRU eviction, rather than one model at a time
  • Two-phase model lifecycle - models are probed at load time to discover capabilities (tool calling, thinking, streaming), then served with zero runtime overhead
  • Visual server management - start, stop, and monitor models with real-time CPU/memory metrics
  • Rich chat interface - test models with image/video/text attachments, thinking model support, and MCP tool integration
  • Cloud routing - seamlessly route requests to OpenAI/Anthropic APIs when local models can't handle them
  • User authentication - secure multi-user access with JWT auth and admin controls
  • Background service - models auto-start on login via macOS launchd
  • Menubar app - quick access from your Mac's status bar

Quick Start

Install

# Homebrew (recommended)
brew tap tumma72/mlx-manager https://github.com/tumma72/mlx-manager
brew install mlx-manager

# Or via pip
pip install mlx-manager

Run

mlx-manager serve

Open http://localhost:10242 and you're ready to:

  1. Register - Create your account (first user becomes admin)
  2. Browse - Search HuggingFace for MLX-optimized models
  3. Filter - Find models by architecture, quantization, or capabilities
  4. Download - One-click download with progress tracking
  5. Configure - Create a server profile with custom settings
  6. Run - Start serving and chat with your model

Use as an API

Once a model is loaded, use it with any OpenAI or Anthropic client:

# OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10242/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# Anthropic client
import anthropic

# The Anthropic SDK appends /v1/messages to base_url itself, so omit the /v1 suffix here.
client = anthropic.Anthropic(base_url="http://localhost:10242", api_key="not-needed")
message = client.messages.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)

Embedded Inference Server

MLX Manager includes a fully self-contained inference server mounted at /v1. No external inference backends needed — the server calls mlx-lm, mlx-vlm, mlx-embeddings, and mlx-audio directly, keeping everything on the Metal GPU without process boundaries or IPC overhead.

Why build our own instead of wrapping Ollama or llama.cpp? Those projects target every platform and GPU vendor, which means abstraction layers, GGUF format conversions, and lowest-common-denominator threading. MLX Manager's server is Apple Silicon-only by design: it uses a persistent Metal GPU thread with a job queue for thread affinity, list-buffer string assembly to avoid O(n^2) concatenation in streaming, and a two-phase probe-then-serve lifecycle that eliminates capability checks from the hot path. The result is an inference server that behaves like a native macOS application — because it is one.
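
The persistent-thread idea can be illustrated with a small sketch. This is not MLX Manager's actual implementation: the class name `SingleThreadExecutor` and its methods are hypothetical, and plain Python callables stand in for GPU work. It only shows the general pattern of funnelling every job through one long-lived thread via a queue, so that all work shares the same thread.

```python
import queue
import threading
from concurrent.futures import Future


class SingleThreadExecutor:
    """Hypothetical sketch: run every submitted job on one long-lived thread."""

    def __init__(self) -> None:
        self._jobs: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # The worker loop: take jobs in FIFO order until the sentinel arrives.
        while True:
            job = self._jobs.get()
            if job is None:
                break
            fn, args, future = job
            try:
                future.set_result(fn(*args))
            except Exception as exc:  # hand the error back to the caller
                future.set_exception(exc)

    def submit(self, fn, *args) -> Future:
        """Queue a job and return a Future that resolves on the worker thread."""
        future: Future = Future()
        self._jobs.put((fn, args, future))
        return future

    def shutdown(self) -> None:
        """Stop the worker after it has drained the jobs queued before this call."""
        self._jobs.put(None)
        self._thread.join()


executor = SingleThreadExecutor()
# Both jobs report the same thread ID, and it differs from the caller's.
first = executor.submit(threading.get_ident).result()
second = executor.submit(threading.get_ident).result()
executor.shutdown()
```

Callers on any thread (for example, request handlers) only touch the queue and the returned `Future`, which is what keeps the actual work pinned to a single thread.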

Multi-Protocol API

| Protocol | Endpoint | Description |
|---|---|---|
| OpenAI | POST /v1/chat/completions | Chat completions (streaming + non-streaming) |
| OpenAI | POST /v1/completions | Legacy text completions |
| OpenAI | POST /v1/embeddings | Text embeddings |
| OpenAI | POST /v1/audio/speech | Text-to-speech |
| OpenAI | POST /v1/audio/transcriptions | Speech-to-text |
| Anthropic | POST /v1/messages | Anthropic Messages API |
| Both | GET /v1/models | List available models |
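
As a rough illustration of how these endpoints can be addressed without any SDK, the sketch below builds (but does not send) requests using only the Python standard library. The helper `build_request` and the placeholder model ID are assumptions made for this example; the host and port come from the Quick Start, and the embeddings payload follows the OpenAI request shape (`model` plus `input`).

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:10242/v1"  # host and port from the Quick Start


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a request for one of the endpoints above without sending it."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        # No body: a plain GET, as used by /models.
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


list_models = build_request("/models")
embed = build_request(
    "/embeddings",
    # Placeholder model ID: substitute an embeddings model you have downloaded.
    {"model": "your-embeddings-model", "input": "Hello!"},
)

# To send a request once the server is running:
# with urllib.request.urlopen(list_models) as resp:
#     print(json.load(resp))
```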

Key Capabilities

  • Unified adapter architecture - single ModelAdapter handles all model types with data-driven family configs (no subclass explosion)
  • 8 model families - Qwen, GLM-4, Llama, Gemma, Mistral/Devstral/Magistral, Liquid, Whisper, Kokoro
  • Smart model detection - auto-detects model type, family, and capabilities from config.json
  • Two-phase probe-then-serve - discovers tool-calling format, thinking delimiters, and streaming support at load time; hot path runs with zero capability checks
  • Persistent Metal GPU thread - dedicated thread with job queue ensures Metal thread affinity across all requests
  • Multi-model pool with LRU eviction - host text, vision, embeddings, and audio models simultaneously; auto-evict when memory pressure rises
  • Multi-protocol from one server - both OpenAI and Anthropic APIs from the same endpoint with bidirectional protocol translation
  • Continuous batching (experimental) with prefix caching and priority scheduling
  • Performance-optimized streaming - list-buffer string assembly (no O(n^2) concatenation), clean shutdown with drain timeout
  • Structured output - JSON mode with schema validation
  • Cloud routing - route to OpenAI/Anthropic cloud APIs when local models can't handle the request
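
To make the LRU eviction idea concrete, here is a minimal sketch of a model pool. It is an illustration under simplifying assumptions, not the project's code: the `ModelPool` class and `fake_loader` are hypothetical, and eviction is triggered by a simple model-count limit (in the spirit of `MLX_SERVER_MAX_MODELS`) rather than by the memory-pressure tracking described above.

```python
from collections import OrderedDict
from typing import Callable


class ModelPool:
    """Hypothetical sketch: keep at most max_models loaded, evicting the least recently used."""

    def __init__(self, max_models: int) -> None:
        self.max_models = max_models
        self._models: OrderedDict = OrderedDict()  # oldest entry first
        self.evicted: list = []  # record of evictions, kept for demonstration

    def get(self, model_id: str, loader: Callable[[str], object]) -> object:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: make room by dropping the least recently used models.
        while len(self._models) >= self.max_models:
            old_id, _ = self._models.popitem(last=False)
            self.evicted.append(old_id)
        self._models[model_id] = loader(model_id)
        return self._models[model_id]

    def loaded(self) -> list:
        """Model IDs currently in the pool, least recently used first."""
        return list(self._models)


def fake_loader(model_id: str) -> str:
    # Stand-in for loading real weights.
    return f"<weights for {model_id}>"


pool = ModelPool(max_models=2)
pool.get("text-model", fake_loader)
pool.get("vision-model", fake_loader)
pool.get("text-model", fake_loader)   # touch: text-model becomes most recent
pool.get("audio-model", fake_loader)  # pool is full, so vision-model is evicted
```

A memory-aware pool would replace the count check with a comparison of estimated model sizes against a memory budget, but the ordering logic stays the same.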

Observability

  • Audit logging - privacy-first request metadata logging with WebSocket live streaming
  • Prometheus metrics - request latency, throughput, model memory, pool cache hits/misses
  • LogFire integration - distributed tracing with Pydantic LogFire
  • RFC 7807 errors - structured error responses with request ID correlation
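
For readers unfamiliar with RFC 7807, the sketch below builds a "problem details" body. The `type`, `title`, `status`, and `detail` members are defined by the RFC; the `request_id` extension member and the `problem_detail` helper are assumptions for illustration, since the exact field names the server uses are not stated here.

```python
import json
from http import HTTPStatus


def problem_detail(status: int, detail: str, request_id: str) -> dict:
    """Build an RFC 7807 problem-details object (served as application/problem+json)."""
    return {
        # "about:blank" is the RFC's default type; the title is then the HTTP status phrase.
        "type": "about:blank",
        "title": HTTPStatus(status).phrase,
        "status": status,
        "detail": detail,
        # Extension member (hypothetical name) for correlating with server logs.
        "request_id": request_id,
    }


problem = problem_detail(404, "Model 'example-model' is not loaded.", "req-123")
body = json.dumps(problem)
```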

See docs/MLX_SERVER.md for the full configuration reference, security guide, metrics list, and API documentation.

Features

Model Discovery

Browse and filter models with rich metadata:

  • Architecture badges - Llama, Qwen, Mistral, Gemma, Phi, and more
  • Quantization info - 4-bit, 8-bit quantization levels
  • Capability detection - Multimodal (vision), tool use support
  • Toggle view - Switch between your downloaded models and HuggingFace search

User Management

Secure multi-user support:

  • JWT authentication - Secure token-based auth
  • Admin controls - Approve/disable users, manage permissions
  • First-user admin - Initial user automatically becomes administrator
  • Rate limiting - Per-IP request throttling with token bucket algorithm
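
The token bucket algorithm mentioned above can be sketched in a few lines. This is a generic illustration, not MLX Manager's implementation: the `TokenBucket` class, the `allow_request` helper, and the chosen rate and capacity are all hypothetical. Timestamps are passed in explicitly so the behavior is deterministic.

```python
import time
from typing import Optional


class TokenBucket:
    """Hypothetical sketch: refill at a steady rate; each request spends one token."""

    def __init__(self, rate_per_minute: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = capacity              # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict = {}  # one bucket per client IP


def allow_request(ip: str, now: float) -> bool:
    if ip not in buckets:
        buckets[ip] = TokenBucket(rate_per_minute=60, capacity=2, now=now)
    return buckets[ip].allow(now)


results = [
    allow_request("10.0.0.1", 0.0),  # allowed: burst token 1
    allow_request("10.0.0.1", 0.0),  # allowed: burst token 2
    allow_request("10.0.0.1", 0.0),  # rejected: bucket is empty
    allow_request("10.0.0.1", 1.0),  # allowed: one token refilled after 1 s
    allow_request("10.0.0.2", 0.0),  # allowed: a different IP has its own bucket
]
```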

Server Monitoring

Real-time server metrics:

  • Memory usage and CPU/GPU utilization
  • Server uptime tracking
  • One-click start/stop/restart controls

Chat Interface

Rich conversation experience:

  • Multimodal support - Attach images, videos, and text files via drag-drop or button
  • Thinking models - Collapsible thinking panel for reasoning models (Qwen3, GLM-4, DeepSeek)
  • MCP tools - Built-in calculator and weather tools for testing tool-use models
  • System prompts - Configure default context per server profile

System Requirements

  • macOS 13+ with Apple Silicon (M1/M2/M3/M4)
  • Python 3.11 or 3.12
  • 8GB+ RAM (16GB+ recommended for larger models)

Commands

mlx-manager serve            # Start the web server
mlx-manager menubar          # Launch menubar app
mlx-manager install-service  # Auto-start on login
mlx-manager status           # Show running servers

Configuration

Environment variables (all optional):

| Variable | Default | Description |
|---|---|---|
| MLX_MANAGER_DATABASE_PATH | ~/.mlx-manager/mlx-manager.db | Database location |
| MLX_MANAGER_DEFAULT_PORT_START | 10240 | Starting port for servers |
| MLX_MANAGER_JWT_SECRET | Auto-generated | JWT signing secret |

MLX Server Configuration

The embedded MLX inference server accepts MLX_SERVER_* environment variables. All settings are opt-in with safe defaults, so local use needs no configuration.

| Variable | Default | Description |
|---|---|---|
| MLX_SERVER_ADMIN_TOKEN | none | Bearer token for /v1/admin/* endpoints |
| MLX_SERVER_RATE_LIMIT_RPM | 0 (off) | Requests per minute per IP |
| MLX_SERVER_METRICS_ENABLED | false | Enable Prometheus metrics at /v1/admin/metrics |
| MLX_SERVER_MAX_MEMORY_GB | 0 (auto) | Model pool memory limit (0 = 75% of device RAM) |
| MLX_SERVER_MAX_MODELS | 4 | Max models loaded simultaneously |
| MLX_SERVER_TIMEOUT_CHAT_SECONDS | 900 | Chat completions timeout |
| MLX_SERVER_DRAIN_TIMEOUT_SECONDS | 30 | Graceful shutdown drain timeout |

See docs/MLX_SERVER.md for the full configuration reference, security guide, metrics list, and API documentation.
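
The defaults in the table can be read as follows. This sketch is only an interpretation of the documented values, not the server's real settings loader: `ServerSettings`, `load_settings`, and the accepted boolean spellings are assumptions. It shows how `MLX_SERVER_MAX_MEMORY_GB=0` would resolve to 75% of device RAM.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServerSettings:
    max_memory_gb: float
    max_models: int
    rate_limit_rpm: int
    metrics_enabled: bool


def load_settings(env: Mapping[str, str], device_ram_gb: float) -> ServerSettings:
    """Resolve a few MLX_SERVER_* variables using the documented defaults."""
    max_memory = float(env.get("MLX_SERVER_MAX_MEMORY_GB", "0"))
    if max_memory == 0:
        # 0 means "auto": the table documents this as 75% of device RAM.
        max_memory = 0.75 * device_ram_gb
    return ServerSettings(
        max_memory_gb=max_memory,
        max_models=int(env.get("MLX_SERVER_MAX_MODELS", "4")),
        rate_limit_rpm=int(env.get("MLX_SERVER_RATE_LIMIT_RPM", "0")),  # 0 = off
        # Assumed boolean parsing; the real server may accept other spellings.
        metrics_enabled=env.get("MLX_SERVER_METRICS_ENABLED", "false").lower()
        in ("1", "true", "yes"),
    )


# In real use, pass os.environ as env and the machine's actual RAM size.
defaults = load_settings({}, device_ram_gb=16)
custom = load_settings(
    {"MLX_SERVER_MAX_MEMORY_GB": "24", "MLX_SERVER_METRICS_ENABLED": "true"},
    device_ram_gb=32,
)
```

With no variables set on a 16 GB machine, the pool limit resolves to 12 GB and at most four models are loaded.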

Development

git clone https://github.com/tumma72/mlx-manager.git
cd mlx-manager
make install-dev  # Install dependencies
make dev          # Start dev servers
make test         # Run tests (4500+ tests)

License

MIT

Acknowledgments

Built on MLX, mlx-lm, mlx-vlm, mlx-embeddings, and mlx-audio.
