
CLI for running LLMs on Apple Silicon via MLX


ppmlx

Run LLMs on your Mac. OpenAI-compatible API powered by Apple Silicon.


Install

uv tool install ppmlx

Requires macOS on Apple Silicon (M1+) and Python 3.11+

Privacy note: ppmlx never sends prompts, responses, file contents, paths, or tokens anywhere. Optional anonymous usage analytics can be disabled with ppmlx config --no-analytics.

Get Started

ppmlx pull qwen3.5:9b      # download a model
ppmlx run qwen3.5:9b       # chat in the terminal
ppmlx serve                 # start API server on :6767

curl | sh (one-liner)

curl -fsSL https://raw.githubusercontent.com/the-focus-company/ppmlx/main/scripts/install.sh | sh

From source

git clone https://github.com/the-focus-company/ppmlx
cd ppmlx
uv tool install .

Homebrew

Homebrew tap coming soon. For now, use uv tool install ppmlx.


Quick Start

# 1. Download a model
ppmlx pull llama3

# 2. Interactive chat REPL
ppmlx run llama3

# 3. Start OpenAI-compatible API server on :6767
ppmlx serve

Benchmarks

Measured on a MacBook Pro M4 Pro (48 GB unified memory, macOS 15.x). Each scenario was run 3 times with temperature=0 and max_tokens=8192; values below are averages.

GLM-4.7-Flash (4-bit, ~5 GB)

Scenario                              Metric  ppmlx     Ollama    Delta
Simple (short prompt, short answer)   tok/s   63.1      40.5      +56%
                                      TTFT    374 ms    832 ms    -55%
Complex (short prompt, long answer)   tok/s   55.6      38.8      +43%
                                      TTFT    496 ms    412 ms    +20%
Long context (~4K-token prompt)       tok/s   42.1      27.5      +53%
                                      TTFT    6,792 ms  8,401 ms  -19%

Qwen 3.5 9B (4-bit, ~6 GB)

Scenario       Metric  ppmlx     Ollama     Delta
Simple         tok/s   48.2      22.7       +112%
               TTFT    537 ms    324 ms     +66%
Complex        tok/s   47.2      23.0       +106%
               TTFT    567 ms    455 ms     +25%
Long context   tok/s   43.2      23.7       +82%
               TTFT    9,212 ms  11,461 ms  -20%

tok/s = tokens per second (higher is better). TTFT = time to first token (lower is better). Delta is relative to Ollama.

Methodology. Streaming chat completions over the OpenAI-compatible API; TTFT measured from request start to first SSE content chunk. See scripts/bench_common.sh and the per-model scripts in scripts/ for the full, reproducible setup.
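As a sanity check, the Delta column can be reproduced from the raw numbers in the tables. This is a small illustrative sketch (not part of the benchmark scripts) computing the relative difference of ppmlx versus Ollama:

```python
# Sketch: reproduce the Delta column from the raw benchmark numbers.
# Positive means ppmlx's value is higher, which is better for tok/s
# and worse for TTFT.

def delta_pct(ppmlx: float, ollama: float) -> int:
    """Relative difference of ppmlx vs Ollama, rounded to a whole percent."""
    return round((ppmlx / ollama - 1) * 100)

# GLM-4.7-Flash, Simple scenario from the table above
print(delta_pct(63.1, 40.5))  # tok/s delta: +56
print(delta_pct(374, 832))    # TTFT delta: -55
```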

That's it. Any OpenAI-compatible tool works out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
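Streaming works the same way: with stream=True the server sends Server-Sent Events, and each data line carries a JSON chunk whose choices[0].delta.content holds the next piece of text (this is what the TTFT benchmark above times to first content chunk). A sketch of parsing such a stream, using illustrative sample lines rather than output captured from ppmlx:

```python
import json

# Illustrative SSE data lines in the OpenAI-compatible streaming format;
# a real stream would come from the /v1/chat/completions endpoint.
sample_lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]

def collect_content(lines):
    """Concatenate the delta.content fields from SSE data lines."""
    out = []
    for line in lines:
        payload = line.removeprefix("data: ")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            out.append(delta["content"])
    return "".join(out)

print(collect_content(sample_lines))  # Hello!
```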

Commands

Command Description Key Options
ppmlx launch Interactive launcher (pick action + model) -m model, --host, --port, --flush
ppmlx serve Start API server on :6767 -m model, --embed-model, -i, --no-cors
ppmlx run <model> Interactive chat REPL -s system, -t temp, --max-tokens
ppmlx pull [model] Download model (multiselect if no arg) --token
ppmlx list Show downloaded models -a all (incl. registry), --path
ppmlx rm <model> Remove a model -f skip confirmation
ppmlx ps Show loaded models & memory
ppmlx quantize <model> Convert & quantize HF model to MLX -b bits, --group-size, -o output
ppmlx config View/set configuration --hf-token

Connect Your Tools

Point any OpenAI-compatible client at http://localhost:6767/v1 with any API key:

  • Cursor — Settings > AI > OpenAI-compatible
  • Continue — in config.json, set provider to openai and apiBase to the URL above
  • LangChain / LlamaIndex — set base_url and api_key="local"

Config

Optional. ~/.ppmlx/config.toml:

[server]
host = "127.0.0.1"
port = 6767

[defaults]
temperature = 0.7
max_tokens = 2048

[analytics]
enabled = true
provider = "posthog"
respect_do_not_track = true

Anonymous Usage Analytics

ppmlx includes privacy-preserving, anonymous product analytics, disabled by default. On first interactive run, the beta onboarding asks whether you want to enable it.

What is sent:

  • command and API event names such as serve_started, model_pulled, api_chat_completions
  • app version, Python minor version, OS family, CPU architecture
  • a random anonymous install id, used only to count returning beta installs
  • coarse booleans/counters such as stream=true, tools=true, batch_size=4

What is never sent:

  • prompts, responses, tool arguments, file contents, file paths
  • HuggingFace tokens, API keys, repo IDs, request bodies

When events are sent:

  • when a CLI command starts
  • when OpenAI-compatible API endpoints are hit

Why:

  • understand which workflows matter most during beta
  • prioritize compatibility work across commands and API surfaces
  • measure adoption without collecting user content

Opt out:

ppmlx config --no-analytics

or:

[analytics]
enabled = false

By default, opted-in beta analytics are sent to the maintainer-operated PostHog project. To use your own PostHog sink instead, configure:

export PPMLX_ANALYTICS_HOST="https://analytics.example.com"
export PPMLX_ANALYTICS_PROJECT_API_KEY="your-posthog-project-api-key"

If you prefer, you can also set the same values in ~/.ppmlx/config.toml.
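One plausible way these settings compose — environment variable over config-file value over built-in default — can be sketched as follows. Whether ppmlx resolves them in exactly this order, and what its built-in default host is, are assumptions for illustration:

```python
import os

# Hypothetical default; the real maintainer-operated PostHog host is
# not documented here.
DEFAULT_ANALYTICS_HOST = "https://example-posthog-host.invalid"

def resolve_analytics_host(config: dict) -> str:
    """Env var wins over config file, which wins over the default
    (assumed precedence, not confirmed ppmlx behavior)."""
    return (
        os.environ.get("PPMLX_ANALYTICS_HOST")
        or config.get("analytics", {}).get("host")
        or DEFAULT_ANALYTICS_HOST
    )

os.environ["PPMLX_ANALYTICS_HOST"] = "https://analytics.example.com"
print(resolve_analytics_host({}))  # https://analytics.example.com
```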

API Documentation

When the server is running, interactive API docs are available at:

Requirements

  • macOS on Apple Silicon (M1 or later)
  • Python 3.11+
  • At least 8 GB unified memory (16 GB+ recommended for larger models)

ppmlx vs Ollama

ppmlx Ollama
Runtime MLX (Apple-native) llama.cpp (cross-platform)
Platform macOS Apple Silicon only macOS, Linux, Windows
GPU backend Metal (unified memory) Metal / CUDA / ROCm
API OpenAI-compatible Ollama + OpenAI-compatible
Language Python Go + C++
Quantization MLX format GGUF format

Choose ppmlx if you want maximum Apple Silicon performance with a pure-Python, MLX-native stack. Choose Ollama if you need cross-platform support or GGUF models.

License

MIT
