
ppmlx

CLI for running LLMs on Apple Silicon via MLX — OpenAI-compatible API on port 6767.



Install

Requires: macOS on Apple Silicon (M1/M2/M3/M4), Python 3.11+

uv (recommended)

uv tool install ppmlx

pipx

pipx install ppmlx

curl | sh (one-liner)

curl -fsSL https://raw.githubusercontent.com/PingCompany/ppmlx/main/scripts/install.sh | sh

From source

git clone https://github.com/PingCompany/ppmlx
cd ppmlx
uv tool install .

Homebrew

Homebrew tap coming soon. For now, use uv tool install ppmlx.


Quick Start

# 1. Download a model
ppmlx pull llama3

# 2. Interactive chat REPL
ppmlx run llama3

# 3. Start OpenAI-compatible API server on :6767
ppmlx serve
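
Before wiring up a client, you can sanity-check that the server is up. The Architecture section below lists a /health endpoint; here is a minimal check using only the Python standard library, assuming the endpoint returns a 2xx status when the server is ready:

import urllib.request

# Query the /health endpoint listed in the Architecture section.
# Assumption: a ready server answers with a 2xx status code.
with urllib.request.urlopen("http://localhost:6767/health", timeout=5) as resp:
    print("server status:", resp.status)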

OpenAI SDK Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
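
For a single complete response instead of a stream, omit stream=True and read the message content directly; this mirrors the "stream": false curl example below:

# Non-streaming variant, reusing the client from above.
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)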

curl Example

# List available models
curl http://localhost:6767/v1/models

# Chat completion
curl http://localhost:6767/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is Apple Silicon?"}],
    "stream": false
  }'

# Embeddings
curl http://localhost:6767/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed", "input": "Hello world"}'

Model Aliases

Llama Family

Alias         HuggingFace Repo
llama3        mlx-community/Meta-Llama-3-8B-Instruct-4bit
llama3-70b    mlx-community/Meta-Llama-3-70B-Instruct-4bit
llama3.2      mlx-community/Llama-3.2-3B-Instruct-4bit
llama3.1      mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

Mistral / Mixtral Family

Alias           HuggingFace Repo
mistral         mlx-community/Mistral-7B-Instruct-v0.3-4bit
mixtral         mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit
mistral-nemo    mlx-community/Mistral-Nemo-Instruct-2407-4bit

Qwen Family

Alias          HuggingFace Repo
qwen2.5        mlx-community/Qwen2.5-7B-Instruct-4bit
qwen2.5-14b    mlx-community/Qwen2.5-14B-Instruct-4bit
qwen2.5-72b    mlx-community/Qwen2.5-72B-Instruct-4bit

Phi / Gemma Family

Alias         HuggingFace Repo
phi4          mlx-community/phi-4-4bit
phi3.5        mlx-community/Phi-3.5-mini-instruct-4bit
gemma2        mlx-community/gemma-2-9b-it-4bit
gemma2-27b    mlx-community/gemma-2-27b-it-4bit

Code Models

Alias             HuggingFace Repo
codellama         mlx-community/CodeLlama-13b-Instruct-hf-4bit
deepseek-coder    mlx-community/deepseek-coder-6.7b-instruct-4bit

Embedding Models

Alias          HuggingFace Repo
nomic-embed    mlx-community/nomic-embed-text-v1.5
bge-small      mlx-community/bge-small-en-v1.5

RAM Requirements

Model Size      Min RAM    Recommended RAM    Notes
1-3B params     4 GB       8 GB               Llama 3.2 3B, Phi 3.5 mini
7-8B params     8 GB       16 GB              Llama 3 8B, Mistral 7B
13-14B params   16 GB      24 GB              CodeLlama 13B, Qwen 2.5 14B
27-34B params   24 GB      36 GB              Gemma 2 27B
70-72B params   48 GB      64 GB              Llama 3 70B, Qwen 2.5 72B

All values are for 4-bit quantized models. Unquantized models require 2-4x more RAM.
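
These numbers follow from simple arithmetic: at 4-bit quantization each parameter occupies half a byte, plus headroom for the KV cache, activations, and the OS. A back-of-envelope sketch (the 25% overhead factor is an illustrative assumption, not ppmlx's estimator):

# Rough RAM estimate for a quantized model.
# Assumption: weights = params * bits / 8, plus ~25% overhead for
# KV cache, activations, and runtime (illustrative, not ppmlx's formula).
def estimate_ram_gb(params_billion: float, bits: int = 4, overhead: float = 1.25) -> float:
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

print(round(estimate_ram_gb(8), 1))   # Llama 3 8B  -> ~5.0 GB, within the 8 GB minimum
print(round(estimate_ram_gb(70), 1))  # Llama 3 70B -> ~43.8 GB, within the 48 GB minimum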


CLI Commands

Command                   Description
ppmlx pull <model>        Download a model from HuggingFace Hub
ppmlx run <model>         Start interactive chat REPL
ppmlx serve               Start OpenAI-compatible API server on :6767
ppmlx list                List locally downloaded models
ppmlx rm <model>          Remove a downloaded model
ppmlx alias <n> <repo>    Add a custom model alias
ppmlx aliases             Show all model aliases (built-in + custom)
ppmlx ps                  Show currently loaded models and memory usage
ppmlx quantize            Convert and quantize a model to MLX format
ppmlx create              Create a custom model from a Modelfile
ppmlx logs                Query the request log database
ppmlx info <model>        Show detailed model information
ppmlx estimate <m>        Estimate RAM requirements before downloading
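
ppmlx logs reads the SQLite database configured in the [logging] section below. You can also open that database directly with Python's sqlite3 module; since the schema is not documented here, the sketch below just lists the tables rather than guessing column names:

import sqlite3
from pathlib import Path

# Open the request log DB (db_path from the [logging] config section).
db = sqlite3.connect(Path("~/.ppmlx/ppmlx.db").expanduser())

# The schema is not documented here, so inspect it before querying.
for (sql,) in db.execute("SELECT sql FROM sqlite_master WHERE type = 'table'"):
    print(sql)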

Modelfile Example

Create a Modelfile to define a custom model with a system prompt:

FROM llama3

SYSTEM """
You are a helpful coding assistant. You write clean, well-documented code
and explain your reasoning step by step.
"""

PARAMETER temperature 0.2
PARAMETER max_tokens 4096
PARAMETER top_p 0.9

Then build it:

ppmlx create coding-assistant -f Modelfile
ppmlx run coding-assistant
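
Internally, modelfile.py turns these directives into a base model alias, a system prompt, and default generation parameters. A minimal parser sketch that illustrates the format; this is not ppmlx's actual implementation, and it only covers the directives shown above:

# Illustrative Modelfile parser -- covers FROM, SYSTEM """...""", PARAMETER.
def parse_modelfile(text: str) -> dict:
    spec = {"from": None, "system": "", "parameters": {}}
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith("FROM "):
            spec["from"] = line[5:].strip()
        elif line.startswith("SYSTEM"):
            # Collect lines until the closing triple quote.
            block = []
            for inner in lines:
                if inner.strip() == '"""':
                    break
                block.append(inner)
            spec["system"] = "\n".join(block).strip()
        elif line.startswith("PARAMETER "):
            _, key, value = line.split(maxsplit=2)
            spec["parameters"][key] = float(value)
    return spec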

Configuration

ppmlx reads configuration from ~/.ppmlx/config.toml. All values are optional.

[server]
host = "127.0.0.1"      # Bind address (default: 127.0.0.1)
port = 6767             # Port (default: 6767)
cors = true             # Enable CORS (default: true)
cors_origins = ["*"]    # Allowed CORS origins

[models]
dir = "~/.ppmlx/models"         # Model storage directory
default_alias = "llama3"        # Default model for bare requests

[generation]
temperature = 0.7               # Default sampling temperature
max_tokens = 2048               # Default max output tokens
top_p = 0.9                     # Default top-p
repetition_penalty = 1.1        # Default repetition penalty

[logging]
db_path = "~/.ppmlx/ppmlx.db"   # SQLite log database
log_requests = true             # Log all requests
log_level = "info"              # Server log level
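
Because the file is plain TOML, Python 3.11+ can read it with the standard-library tomllib. An illustrative loader using the documented defaults (not ppmlx's actual config.py):

import tomllib
from pathlib import Path

# Illustrative loader; falls back to the defaults documented above.
def load_server_config() -> dict:
    path = Path("~/.ppmlx/config.toml").expanduser()
    data = tomllib.loads(path.read_text()) if path.exists() else {}
    server = data.get("server", {})
    return {
        "host": server.get("host", "127.0.0.1"),
        "port": server.get("port", 6767),
        "cors": server.get("cors", True),
    }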

Architecture

┌──────────────────────────────────────────────────────┐
│                      ppmlx CLI                       │
│                    (typer + rich)                    │
│   pull / run / serve / list / rm / quantize / ...    │
└──────────────────────────┬───────────────────────────┘
                           │
                ┌──────────▼──────────┐
                │   FastAPI Server    │
                │     port :6767      │
                │                     │
                │ /v1/chat/completions│
                │ /v1/completions     │
                │ /v1/embeddings      │
                │ /v1/models          │
                │ /health /metrics    │
                └──────┬──────┬───────┘
                       │      │
             ┌─────────▼──┐ ┌─▼──────────────┐
             │ LLM Engine │ │  Embed Engine  │
             │  (mlx-lm)  │ │(mlx-embeddings)│
             └─────────┬──┘ └─┬──────────────┘
                       │      │
             ┌─────────▼──────▼───────────────┐
             │       MLX / Metal GPU          │
             │  Apple Silicon Unified Memory  │
             └───────────────┬────────────────┘
                             │
             ┌───────────────▼────────────────┐
             │      SQLite Request Log        │
             │      ~/.ppmlx/ppmlx.db         │
             └────────────────────────────────┘
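
The LLM engine box wraps mlx-lm, which exposes high-level load() and generate() helpers. Below is a stripped-down sketch of how the server might delegate a chat request to the engine; ppmlx's real internals may differ, and mlx-lm's generate() signature has shifted between releases, so treat this as a shape rather than a contract:

from fastapi import FastAPI
from mlx_lm import load, generate  # mlx-lm's high-level helpers

app = FastAPI()
# load() returns the model and a tokenizer wrapping the HF tokenizer.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

@app.post("/v1/chat/completions")
def chat(body: dict):
    # Render messages with the model's chat template, then generate.
    prompt = tokenizer.apply_chat_template(
        body["messages"], add_generation_prompt=True, tokenize=False
    )
    text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}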

Uninstall

uv

uv tool uninstall ppmlx

pipx

pipx uninstall ppmlx

Manual cleanup (all methods)

# Remove downloaded models and config
rm -rf ~/.ppmlx

Contributing

  1. Fork the repository on GitHub.
  2. Create a feature branch: git checkout -b feat/my-feature
  3. Install dev dependencies: uv sync --python 3.11
  4. Run tests: uv run pytest tests/ -v
  5. Submit a pull request.

Development Setup

git clone https://github.com/PingCompany/ppmlx
cd ppmlx
uv sync --python 3.11
uv run ppmlx --version
uv run pytest tests/ -v

Project Structure

ppmlx/
  __init__.py       # version
  cli.py            # Typer CLI entry point
  server.py         # FastAPI application
  engine.py         # MLX LLM inference engine
  engine_embed.py   # MLX embedding engine
  engine_vlm.py     # MLX vision-language engine
  models.py         # model registry, aliases, download
  config.py         # configuration loading
  db.py             # SQLite request logging
  memory.py         # RAM estimation utilities
  modelfile.py      # Modelfile parser
  quantize.py       # MLX quantization helpers
tests/
  conftest.py       # MLX stubs for CI
  test_cli.py       # CLI tests
scripts/
  install.sh        # One-liner installer
homebrew/
  Formula/ppmlx.rb  # Homebrew formula
.github/workflows/
  tests.yml         # CI tests
  release.yml       # PyPI release
  homebrew-update.yml

License

MIT — see LICENSE.

