CLI for running LLMs on Apple Silicon via MLX
ppmlx
Run LLMs on your Mac. OpenAI-compatible API powered by Apple Silicon.
Install
uv tool install ppmlx
Requires macOS on Apple Silicon (M1+) and Python 3.11+
Privacy note:
ppmlx never sends prompts, responses, file contents, paths, or tokens anywhere. Optional anonymous usage analytics can be disabled with ppmlx config --no-analytics.
Get Started
ppmlx pull qwen3.5:9b # download a model
ppmlx run qwen3.5:9b # chat in the terminal
ppmlx serve # start API server on :6767
curl | sh (one-liner)
curl -fsSL https://raw.githubusercontent.com/the-focus-company/ppmlx/main/scripts/install.sh | sh
From source
git clone https://github.com/the-focus-company/ppmlx
cd ppmlx
uv tool install .
Homebrew
Homebrew tap coming soon. For now, use uv tool install ppmlx.
Quick Start
# 1. Download a model
ppmlx pull llama3
# 2. Interactive chat REPL
ppmlx run llama3
# 3. Start OpenAI-compatible API server on :6767
ppmlx serve
Benchmarks
Measured on a MacBook Pro M4 Pro (48 GB unified memory, macOS 15.x). Each scenario was run 3 times with temperature=0 and max_tokens=8192; values below are averages.
GLM-4.7-Flash (4-bit, ~5 GB)
| Scenario | Metric | ppmlx | Ollama | Delta |
|---|---|---|---|---|
| Simple (short prompt, short answer) | tok/s | 63.1 | 40.5 | +56% |
| | TTFT | 374 ms | 832 ms | -55% |
| Complex (short prompt, long answer) | tok/s | 55.6 | 38.8 | +43% |
| | TTFT | 496 ms | 412 ms | +20% |
| Long context (~4 K token prompt) | tok/s | 42.1 | 27.5 | +53% |
| | TTFT | 6,792 ms | 8,401 ms | -19% |
Qwen 3.5 9B (4-bit, ~6 GB)
| Scenario | Metric | ppmlx | Ollama | Delta |
|---|---|---|---|---|
| Simple | tok/s | 48.2 | 22.7 | +112% |
| | TTFT | 537 ms | 324 ms | +66% |
| Complex | tok/s | 47.2 | 23.0 | +106% |
| | TTFT | 567 ms | 455 ms | +25% |
| Long context | tok/s | 43.2 | 23.7 | +82% |
| | TTFT | 9,212 ms | 11,461 ms | -20% |
tok/s = tokens per second (higher is better). TTFT = time to first token (lower is better). Delta is relative to Ollama.
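As a worked check of the Delta column, the relative difference can be recomputed from the table values. The helper name `delta_pct` is only an illustration and is not part of ppmlx. Because the published averages are rounded, a figure recomputed this way may differ from the table by a point.

```python
def delta_pct(ppmlx_value: float, ollama_value: float) -> int:
    """Relative difference of ppmlx vs. Ollama, in whole percent."""
    return round((ppmlx_value / ollama_value - 1) * 100)


# GLM-4.7-Flash, "Simple" scenario, values copied from the table above.
print(delta_pct(63.1, 40.5))  # tok/s: a positive delta means ppmlx is faster
print(delta_pct(374, 832))    # TTFT: a negative delta means ppmlx responds sooner
```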
Methodology. Streaming chat completions over the OpenAI-compatible API; TTFT measured from request start to first SSE content chunk. See scripts/bench_common.sh and the per-model scripts in scripts/ for the full, reproducible setup.
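The measurement described above can be approximated in a few lines of Python. This is a minimal sketch, not the code in scripts/: the function names `measure_stream` and `stream_pieces` are made up for illustration, and it counts streamed content chunks rather than real tokens, which only approximates tok/s if each chunk carries about one token.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    pieces: Iterable[Optional[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a stream of content pieces.

    TTFT is the delay from this call (request start) to the first
    non-empty piece. The rate counts pieces per second after the first one.
    """
    start = clock()
    first = None
    count = 0
    for piece in pieces:
        if not piece:  # skip role-only or empty chunks
            continue
        now = clock()
        if first is None:
            first = now
        count += 1
        last = now
    if first is None:
        return {"ttft_s": None, "chunks": 0, "chunks_per_s": None}
    elapsed = last - first
    rate = (count - 1) / elapsed if elapsed > 0 else None
    return {"ttft_s": first - start, "chunks": count, "chunks_per_s": rate}


def stream_pieces(prompt: str, model: str = "qwen3.5:9b"):
    """Yield content deltas from a running `ppmlx serve` (needs `openai`)."""
    from openai import OpenAI  # imported lazily so the timer has no dependency

    client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=8192,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices:
            yield chunk.choices[0].delta.content
```

With the server running, `measure_stream(stream_pieces("Hello!"))` returns the timings as a dictionary. Because `stream_pieces` is a generator, the request is only issued once timing has started, so client setup is included in the TTFT figure.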
Once ppmlx serve is running, any OpenAI-compatible tool works out of the box:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Commands
| Command | Description | Key Options |
|---|---|---|
| ppmlx launch | Interactive launcher (pick action + model) | -m model, --host, --port, --flush |
| ppmlx serve | Start API server on :6767 | -m model, --embed-model, -i, --no-cors |
| ppmlx run <model> | Interactive chat REPL | -s system, -t temp, --max-tokens |
| ppmlx pull [model] | Download model (multiselect if no arg) | --token |
| ppmlx list | Show downloaded models | -a all (incl. registry), --path |
| ppmlx rm <model> | Remove a model | -f skip confirmation |
| ppmlx ps | Show loaded models & memory | |
| ppmlx quantize <model> | Convert & quantize HF model to MLX | -b bits, --group-size, -o output |
| ppmlx graph | Open a local read-only web view of the temporal memory graph | --project, --session, --query, --json |
| ppmlx memory status/search/list/handoff/compact-stats | Inspect the experimental local temporal memory graph | --json, --status, --scope, --session |
| ppmlx memory-eval | Run the anti-garbage memory eval suite | --json, --dataset, --predictions |
| ppmlx compact-eval | Run long-session rolling-context compaction evals | --json, --output |
| ppmlx answer-quality-eval | Score compact-answer quality across recall, wrong facts, actionability, grounding, and A/B equivalence | --json, --dataset, --template |
| ppmlx answer-quality-replay | Run real Pi/Claude session quality eval through a live local ppmlx server | --model, --source, --base-url |
| ppmlx quality-bench | Split a real long session into 80% prefix / 20% holdout probes and compare local answers to recorded answers | --split, --max-probes, --model |
| ppmlx trace export / ppmlx compact-replay | Export and replay local traces through compact mode | --project, --session, --expect |
| ppmlx config | View/set configuration | --hf-token |
Connect Your Tools
Point any OpenAI-compatible client at http://localhost:6767/v1 with any API key:
- Cursor: Settings > AI > OpenAI-compatible
- Continue: in config.json, set provider to openai and apiBase to the URL above
- LangChain / LlamaIndex: set base_url and api_key="local"
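For tools without an OpenAI SDK, the same endpoint can be reached over plain HTTP. The sketch below uses only the Python standard library and assumes the server follows the usual OpenAI chat completions request and response shape; `build_chat_request` and `chat` are illustrative names, not part of ppmlx.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "local") -> urllib.request.Request:
    """Build a POST to the OpenAI-style chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:6767/v1", "qwen3.5:9b", "Hello!")` should return the reply text.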
Config
Optional. ~/.ppmlx/config.toml:
[server]
host = "127.0.0.1"
port = 6767
[defaults]
temperature = 0.7
max_tokens = 2048
[analytics]
enabled = true
provider = "posthog"
respect_do_not_track = true
Experimental local memory
Shadow-mode memory capture stores request/response events and high-precision memory candidates locally in ~/.ppmlx/memory.db. It does not inject memory into prompts yet.
[memory]
mode = "shadow" # off | shadow | compact | inject
# compact mode keeps a rolling prompt tail and renders scoped graph context
rolling_tokens = 10000
hot_tail_tokens = 6500
session_context_tokens = 2000
compact_threshold_tokens = 12000
max_context_items = 40
# optional graph-memory extraction worker pool
extractor = "llm" # off | heuristic | llm
extraction_model = "gemma-4-e2b"
extraction_workers = 1
extraction_max_tokens = 512
extraction_timeout_seconds = 60
Modes:
- shadow: store events/candidates only; prompts are unchanged.
- compact: before inference, replace long histories with system context from the graph + a hot tail.
- inject: reserved for compact + broader memory retrieval.
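To make the compact idea concrete, here is a rough sketch of replacing a long history with a system context message plus a recent tail. It is not the ppmlx implementation: the function names are hypothetical, token counts use a crude whitespace split instead of a real tokenizer, and `graph_context` stands in for whatever text the memory graph would render.

```python
def rough_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact_history(messages, graph_context, threshold_tokens=12000,
                    hot_tail_tokens=6500):
    """Return messages unchanged if short, else graph context + recent tail."""
    total = sum(rough_tokens(m["content"]) for m in messages)
    if total <= threshold_tokens:
        return list(messages)

    tail, used = [], 0
    for message in reversed(messages):  # walk back from the newest turn
        cost = rough_tokens(message["content"])
        if tail and used + cost > hot_tail_tokens:
            break  # budget exhausted; the newest turn is always kept
        tail.append(message)
        used += cost
    tail.reverse()
    return [{"role": "system", "content": graph_context}] + tail
```

The default arguments reuse the `compact_threshold_tokens` and `hot_tail_tokens` values from the sample config; whether ppmlx applies them in exactly this way is an assumption.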
Current graph-engine status: the extraction worker pool exists, while durable extraction_jobs and extracted atoms storage are being introduced and should be treated as experimental.
Compact observability is recorded locally in memory.db and, if analytics are enabled, sent as privacy-safe aggregate metrics to PostHog. It never sends prompts, responses, tool output, model repo IDs, project IDs, or session IDs.
Tool/MCP outputs are distilled through a plugin-style distiller interface. The built-in generic JSON distiller extracts small evidence-backed atoms such as candidates, prices, availability, specs, source URLs, and rejected items, while raw JSON stays local in the event log.
CLI:
ppmlx memory status
ppmlx memory search "concise answers"
ppmlx memory list --status active
ppmlx memory handoff --project tv-shopping --session tv-session-001
ppmlx memory compact-stats --since 24
ppmlx graph --project tv-shopping --session tv-session-001
ppmlx trace export --project tv-shopping --session tv-session-001 --output trace.json
ppmlx compact-replay trace.json --expect "budget = 5000 PLN"
ppmlx memory-eval
ppmlx compact-eval
ppmlx answer-quality-eval
ppmlx answer-quality-replay ~/.pi/agent/sessions/.../session.jsonl \
--model mlx-community/Qwopus3.5-4B-v3-4bit \
--base-url http://127.0.0.1:6767/v1
ppmlx quality-bench ~/.pi/agent/sessions/.../session.jsonl \
--split 0.8 --max-probes 5 \
--model mlx-community/Qwopus3.5-4B-v3-4bit
ppmlx graph serves a local read-only graph view. The current UI loads AntV G6 from a CDN, so the browser needs network access for graph rendering even though the memory data itself stays local.
answer-quality-replay requires a running local ppmlx server. It generates a compact answer and a local reference answer, selects question-relevant required facts, filters embedded examples/fixtures, and reports recall, wrong facts, actionability, grounding, and A/B equivalence.
quality-bench is the stronger quality benchmark: it splits a real transcript by episodes into prefix and held-out suffix, feeds only the compacted prefix plus held-out user turn to the local model, and scores the response against the recorded next assistant answer.
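The prefix/holdout split can be pictured with a small sketch. It assumes an episode is a (user turn, recorded assistant answer) pair, which is a simplification made for illustration; `split_episodes` is a hypothetical helper, and compaction and scoring are left out.

```python
def split_episodes(episodes, split=0.8, max_probes=5):
    """Split (user, assistant) episodes into a prefix and held-out probes."""
    cut = int(len(episodes) * split)
    prefix = episodes[:cut]               # history the model is allowed to see
    probes = episodes[cut:][:max_probes]  # held-out turns used as questions
    return prefix, probes


episodes = [(f"question {i}", f"answer {i}") for i in range(10)]
prefix, probes = split_episodes(episodes, split=0.8, max_probes=5)
for user_turn, recorded_answer in probes:
    # In the real benchmark, the compacted prefix plus `user_turn` goes to the
    # local model and its reply is scored against `recorded_answer`.
    print(user_turn, "->", recorded_answer)
```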
trace export is local-only and may include prompts, responses, and tool outputs. Keep exported traces private unless you intentionally want to share them.
Anonymous Usage Analytics
ppmlx supports privacy-preserving anonymous product analytics, disabled by default. On first interactive run, the beta onboarding asks whether you want to help by enabling it.
What is sent:
- command and API event names such as serve_started, model_pulled, api_chat_completions
- app version, Python minor version, OS family, CPU architecture
- a random anonymous install id, used only to count returning beta installs
- coarse booleans/counters such as stream=true, tools=true, batch_size=4
What is never sent:
- prompts, responses, tool arguments, file contents, file paths
- HuggingFace tokens, API keys, repo IDs, model prompts, request bodies
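A common way to enforce a split like the one above is an allowlist: only known-safe property names are copied into an event, so anything else is dropped by construction. The sketch below illustrates that pattern only; the key names and the `build_event` helper are hypothetical and are not taken from the ppmlx source.

```python
# Hypothetical property names modeled on the "What is sent" list above.
ALLOWED_KEYS = {
    "app_version", "python_minor", "os_family", "cpu_arch",
    "stream", "tools", "batch_size",
}


def build_event(name: str, install_id: str, properties: dict) -> dict:
    """Keep only allowlisted properties; everything else is discarded."""
    safe = {key: value for key, value in properties.items() if key in ALLOWED_KEYS}
    return {"event": name, "install_id": install_id, "properties": safe}


event = build_event(
    "api_chat_completions",
    "random-install-id",
    {"stream": True, "batch_size": 4, "prompt": "never leaves the machine"},
)
print(event["properties"])  # the "prompt" key has been filtered out
```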
When events are sent:
- when a CLI command starts
- when OpenAI-compatible API endpoints are hit
Why:
- understand which workflows matter most during beta
- prioritize compatibility work across commands and API surfaces
- measure adoption without collecting user content
Opt out:
ppmlx config --no-analytics
or:
[analytics]
enabled = false
By default, opted-in beta analytics are sent to the maintainer-operated PostHog project. To use your own PostHog sink instead, configure:
export PPMLX_ANALYTICS_HOST="https://analytics.example.com"
export PPMLX_ANALYTICS_PROJECT_API_KEY="your-posthog-project-api-key"
If you prefer, you can also set the same values in ~/.ppmlx/config.toml.
API Documentation
When the server is running, interactive API docs are available at:
- Swagger UI: http://localhost:6767/docs
- ReDoc: http://localhost:6767/redoc
Requirements
- macOS on Apple Silicon (M1 or later)
- Python 3.11+
- At least 8 GB unified memory (16 GB+ recommended for larger models)
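For a rough feel of how model size maps to memory, weight size can be estimated from parameter count and bit width. This is back-of-the-envelope arithmetic under stated assumptions: group-wise quantization with a group size of 64 and 32 bits of per-group overhead (a 16-bit scale and a 16-bit bias). Real model files can be larger because some layers stay unquantized, and the KV cache needs additional memory at run time.

```python
def quantized_weight_gb(params: float, bits: int = 4, group_size: int = 64,
                        overhead_bits_per_group: int = 32) -> float:
    """Approximate weight size in GB (10**9 bytes) for group-wise quantization."""
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return params * bits_per_weight / 8 / 1e9


# A 9-billion-parameter model at 4 bits works out to about 4.5 bits per weight.
print(f"{quantized_weight_gb(9e9):.2f} GB")
```

The estimate lands somewhat below the ~6 GB quoted for Qwen 3.5 9B in the benchmarks, which is consistent with the extra costs mentioned above.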
ppmlx vs Ollama
| | ppmlx | Ollama |
|---|---|---|
| Runtime | MLX (Apple-native) | llama.cpp (cross-platform) |
| Platform | macOS Apple Silicon only | macOS, Linux, Windows |
| GPU backend | Metal (unified memory) | Metal / CUDA / ROCm |
| API | OpenAI-compatible | Ollama + OpenAI-compatible |
| Language | Python | Go + C++ |
| Quantization | MLX format | GGUF format |
Choose ppmlx if you want maximum Apple Silicon performance with a pure-Python, MLX-native stack. Choose Ollama if you need cross-platform support or GGUF models.
License
MIT