
ppmlx

Run LLMs on your Mac. OpenAI-compatible API powered by Apple Silicon.


Install

uv tool install ppmlx

Requires macOS on Apple Silicon (M1+) and Python 3.11+

Privacy note: ppmlx never sends prompts, responses, file contents, paths, or tokens anywhere. Optional anonymous usage analytics can be disabled with ppmlx config --no-analytics.

Get Started

ppmlx pull qwen3.5:9b      # download a model
ppmlx run qwen3.5:9b       # chat in the terminal
ppmlx serve                 # start API server on :6767

curl | sh (one-liner)

curl -fsSL https://raw.githubusercontent.com/the-focus-company/ppmlx/main/scripts/install.sh | sh

From source

git clone https://github.com/the-focus-company/ppmlx
cd ppmlx
uv tool install .

Homebrew

Homebrew tap coming soon. For now, use uv tool install ppmlx.


Quick Start

# 1. Download a model
ppmlx pull llama3

# 2. Interactive chat REPL
ppmlx run llama3

# 3. Start OpenAI-compatible API server on :6767
ppmlx serve

Benchmarks

Measured on a MacBook Pro M4 Pro (48 GB unified memory, macOS 15.x). Each scenario was run 3 times with temperature=0 and max_tokens=8192; values below are averages.

GLM-4.7-Flash (4-bit, ~5 GB)

Scenario                              Metric   ppmlx      Ollama     Delta
Simple (short prompt, short answer)   tok/s    63.1       40.5        +56%
                                      TTFT     374 ms     832 ms      -55%
Complex (short prompt, long answer)   tok/s    55.6       38.8        +43%
                                      TTFT     496 ms     412 ms      +20%
Long context (~4K-token prompt)       tok/s    42.1       27.5        +53%
                                      TTFT     6,792 ms   8,401 ms    -19%

Qwen 3.5 9B (4-bit, ~6 GB)

Scenario       Metric   ppmlx      Ollama      Delta
Simple         tok/s    48.2       22.7        +112%
               TTFT     537 ms     324 ms       +66%
Complex        tok/s    47.2       23.0        +106%
               TTFT     567 ms     455 ms       +25%
Long context   tok/s    43.2       23.7         +82%
               TTFT     9,212 ms   11,461 ms    -20%

tok/s = tokens per second (higher is better). TTFT = time to first token (lower is better). Delta is ppmlx relative to Ollama, so a positive delta is better for tok/s and a negative delta is better for TTFT.

Methodology. Streaming chat completions over the OpenAI-compatible API; TTFT measured from request start to first SSE content chunk. See scripts/bench_common.sh and the per-model scripts in scripts/ for the full, reproducible setup.
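
As an illustration of that methodology, here is a minimal sketch of a TTFT/throughput measurement using the openai Python client against a local server. The model name and the chunk-based token count are assumptions for illustration, not the project's actual benchmark code (see scripts/ for that).

import time
from openai import OpenAI

# Assumes a ppmlx server on the default port; the model name is illustrative.
client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")

start = time.perf_counter()
first_token_at = None
pieces = []

stream = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Explain unified memory briefly."}],
    temperature=0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first SSE content chunk
        pieces.append(chunk.choices[0].delta.content)
end = time.perf_counter()

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
# Chunk count only approximates token count; a real benchmark should
# tokenize the output to compute exact tok/s.
print(f"~tok/s: {len(pieces) / (end - first_token_at):.1f}")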

Once the server is running, any OpenAI-compatible tool works out of the box:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Commands

Command                  Description                                   Key Options
ppmlx launch             Interactive launcher (pick action + model)   -m model, --host, --port, --flush
ppmlx serve              Start API server on :6767                    -m model, --embed-model, -i, --no-cors
ppmlx run <model>        Interactive chat REPL                        -s system, -t temp, --max-tokens
ppmlx pull [model]       Download model (multiselect if no arg)      --token
ppmlx list               Show downloaded models                       -a all (incl. registry), --path
ppmlx rm <model>         Remove a model                               -f skip confirmation
ppmlx ps                 Show loaded models & memory
ppmlx quantize <model>   Convert & quantize HF model to MLX           -b bits, --group-size, -o output
ppmlx config             View/set configuration                       --hf-token
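
Since ppmlx serve accepts an --embed-model flag, the server presumably exposes the standard OpenAI embeddings endpoint as well. A hedged sketch (the model name below is a placeholder, not a known ppmlx registry entry):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")

# "my-embed-model" is hypothetical; use whatever you passed to --embed-model.
resp = client.embeddings.create(
    model="my-embed-model",
    input=["unified memory", "Metal backend"],
)
print(len(resp.data), "vectors of dim", len(resp.data[0].embedding))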

Connect Your Tools

Point any OpenAI-compatible client at http://localhost:6767/v1 with any API key:

  • Cursor — Settings > AI > OpenAI-compatible
  • Continue — in config.json, set "provider": "openai" and "apiBase" to the URL above
  • LangChain / LlamaIndex — set base_url and api_key="local" (see the sketch below)
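
For example, with LangChain (a minimal sketch; assumes the langchain-openai package is installed and a model is already pulled):

from langchain_openai import ChatOpenAI

# The server accepts any API key, so "local" works as a dummy value.
llm = ChatOpenAI(
    base_url="http://localhost:6767/v1",
    api_key="local",
    model="qwen3.5:9b",
)
print(llm.invoke("Say hello in five words.").content)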

Config

Optional. ~/.ppmlx/config.toml:

[server]
host = "127.0.0.1"
port = 6767

[defaults]
temperature = 0.7
max_tokens = 2048

[analytics]
enabled = true
provider = "posthog"
respect_do_not_track = true
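
Since ppmlx requires Python 3.11+, the standard-library tomllib can parse this file if you want to reuse the same settings in your own scripts; a small sketch (not part of ppmlx itself):

import tomllib
from pathlib import Path

# Read the same file ppmlx uses; keys mirror the example above.
with (Path.home() / ".ppmlx" / "config.toml").open("rb") as f:
    cfg = tomllib.load(f)

print(cfg["server"]["port"])           # 6767
print(cfg["defaults"]["temperature"])  # 0.7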

Anonymous Usage Analytics

ppmlx supports privacy-preserving anonymous product analytics, disabled by default. On first interactive run, the beta onboarding asks whether you want to help by enabling it.

What is sent:

  • command and API event names such as serve_started, model_pulled, api_chat_completions
  • app version, Python minor version, OS family, CPU architecture
  • a random anonymous install id, used only to count returning beta installs
  • coarse booleans/counters such as stream=true, tools=true, batch_size=4

What is never sent:

  • prompts, responses, tool arguments, file contents, file paths
  • HuggingFace tokens, API keys, repo IDs, model prompts, request bodies

When events are sent:

  • when a CLI command starts
  • when OpenAI-compatible API endpoints are hit

Why:

  • understand which workflows matter most during beta
  • prioritize compatibility work across commands and API surfaces
  • measure adoption without collecting user content

Opt out:

ppmlx config --no-analytics

or:

[analytics]
enabled = false

By default, opted-in beta analytics are sent to the maintainer-operated PostHog project. To use your own PostHog sink instead, configure:

export PPMLX_ANALYTICS_HOST="https://analytics.example.com"
export PPMLX_ANALYTICS_PROJECT_API_KEY="your-posthog-project-api-key"

If you prefer, you can also set the same values in ~/.ppmlx/config.toml.
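
The exact TOML keys for these two values aren't documented here; a plausible sketch, assuming the keys mirror the environment variable names:

[analytics]
# Hypothetical key names, assumed to mirror PPMLX_ANALYTICS_HOST and
# PPMLX_ANALYTICS_PROJECT_API_KEY; check ppmlx config for the real ones.
host = "https://analytics.example.com"
project_api_key = "your-posthog-project-api-key"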

API Documentation

When the server is running, it hosts interactive API docs locally.

Requirements

  • macOS on Apple Silicon (M1 or later)
  • Python 3.11+
  • At least 8 GB unified memory (16 GB+ recommended for larger models)

ppmlx vs Ollama

                ppmlx                         Ollama
Runtime         MLX (Apple-native)            llama.cpp (cross-platform)
Platform        macOS (Apple Silicon only)    macOS, Linux, Windows
GPU backend     Metal (unified memory)        Metal / CUDA / ROCm
API             OpenAI-compatible             Ollama + OpenAI-compatible
Language        Python                        Go + C++
Quantization    MLX format                    GGUF format

Choose ppmlx if you want maximum Apple Silicon performance with a pure-Python, MLX-native stack. Choose Ollama if you need cross-platform support or GGUF models.

License

MIT
