# ppmlx
CLI for running LLMs on Apple Silicon via MLX — OpenAI-compatible API on port 6767.
## Install

Requires: macOS on Apple Silicon (M1/M2/M3/M4), Python 3.11+
### uv (recommended)

```bash
uv tool install ppmlx
```
### pipx

```bash
pipx install ppmlx
```
### curl | sh (one-liner)

```bash
curl -fsSL https://raw.githubusercontent.com/PingCompany/ppmlx/main/scripts/install.sh | sh
```
### From source

```bash
git clone https://github.com/PingCompany/ppmlx
cd ppmlx
uv tool install .
```
### Homebrew

Homebrew tap coming soon. For now, use `uv tool install ppmlx`.
## Quick Start

```bash
# 1. Download a model
ppmlx pull llama3

# 2. Interactive chat REPL
ppmlx run llama3

# 3. Start the OpenAI-compatible API server on :6767
ppmlx serve
```
## OpenAI SDK Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
## curl Example

```bash
# List available models
curl http://localhost:6767/v1/models

# Chat completion
curl http://localhost:6767/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is Apple Silicon?"}],
    "stream": false
  }'

# Embeddings
curl http://localhost:6767/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed", "input": "Hello world"}'
```
## Model Aliases

### Llama Family

| Alias | HuggingFace Repo |
|---|---|
| `llama3` | `mlx-community/Meta-Llama-3-8B-Instruct-4bit` |
| `llama3-70b` | `mlx-community/Meta-Llama-3-70B-Instruct-4bit` |
| `llama3.2` | `mlx-community/Llama-3.2-3B-Instruct-4bit` |
| `llama3.1` | `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` |

### Mistral / Mixtral Family

| Alias | HuggingFace Repo |
|---|---|
| `mistral` | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` |
| `mixtral` | `mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit` |
| `mistral-nemo` | `mlx-community/Mistral-Nemo-Instruct-2407-4bit` |

### Qwen Family

| Alias | HuggingFace Repo |
|---|---|
| `qwen2.5` | `mlx-community/Qwen2.5-7B-Instruct-4bit` |
| `qwen2.5-14b` | `mlx-community/Qwen2.5-14B-Instruct-4bit` |
| `qwen2.5-72b` | `mlx-community/Qwen2.5-72B-Instruct-4bit` |

### Phi / Gemma Family

| Alias | HuggingFace Repo |
|---|---|
| `phi4` | `mlx-community/phi-4-4bit` |
| `phi3.5` | `mlx-community/Phi-3.5-mini-instruct-4bit` |
| `gemma2` | `mlx-community/gemma-2-9b-it-4bit` |
| `gemma2-27b` | `mlx-community/gemma-2-27b-it-4bit` |

### Code Models

| Alias | HuggingFace Repo |
|---|---|
| `codellama` | `mlx-community/CodeLlama-13b-Instruct-hf-4bit` |
| `deepseek-coder` | `mlx-community/deepseek-coder-6.7b-instruct-4bit` |

### Embedding Models

| Alias | HuggingFace Repo |
|---|---|
| `nomic-embed` | `mlx-community/nomic-embed-text-v1.5` |
| `bge-small` | `mlx-community/bge-small-en-v1.5` |
## RAM Requirements

| Model Size | Min RAM | Recommended RAM | Examples |
|---|---|---|---|
| 1–3B params | 4 GB | 8 GB | Llama 3.2 3B, Phi 3.5 mini |
| 7–8B params | 8 GB | 16 GB | Llama 3 8B, Mistral 7B |
| 13–14B params | 16 GB | 24 GB | CodeLlama 13B, Qwen 2.5 14B |
| 27–34B params | 24 GB | 36 GB | Gemma 2 27B |
| 70–72B params | 48 GB | 64 GB | Llama 3 70B, Qwen 2.5 72B |

All values are for 4-bit quantized models. Unquantized models require 2–4× more RAM.
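As a back-of-envelope check: the weights alone take roughly `params × bits / 8` bytes, plus overhead for the KV cache, activations, and runtime buffers. The sketch below is an illustrative approximation of model memory only — not the formula `ppmlx estimate` actually uses, and lower than the table's minimums, which also leave headroom for the OS:

```python
def approx_ram_gb(params_billion: float, bits: int = 4, overhead: float = 1.3) -> float:
    """Rough model-memory estimate: weight bytes times an overhead
    factor for KV cache, activations, and runtime buffers."""
    weight_gb = params_billion * bits / 8  # 1B params at 4 bits ~ 0.5 GB
    return weight_gb * overhead

for name, b in [("Llama 3.2 3B", 3), ("Llama 3 8B", 8), ("Qwen 2.5 72B", 72)]:
    print(f"{name}: ~{approx_ram_gb(b):.1f} GB at 4-bit")
```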
## CLI Commands

| Command | Description |
|---|---|
| `ppmlx pull <model>` | Download a model from the HuggingFace Hub |
| `ppmlx run <model>` | Start an interactive chat REPL |
| `ppmlx serve` | Start the OpenAI-compatible API server on :6767 |
| `ppmlx list` | List locally downloaded models |
| `ppmlx rm <model>` | Remove a downloaded model |
| `ppmlx alias <name> <repo>` | Add a custom model alias |
| `ppmlx aliases` | Show all model aliases (built-in + custom) |
| `ppmlx ps` | Show currently loaded models and memory usage |
| `ppmlx quantize` | Convert and quantize a model to MLX format |
| `ppmlx create` | Create a custom model from a Modelfile |
| `ppmlx logs` | Query the request log database |
| `ppmlx info <model>` | Show detailed model information |
| `ppmlx estimate <model>` | Estimate RAM requirements before downloading |
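`ppmlx logs` is the supported interface, but the log is plain SQLite (see `db_path` under Configuration below), so it can also be inspected directly. A sketch that discovers the schema rather than assuming it, since the table layout isn't documented here:

```python
import sqlite3
from pathlib import Path

# Default location; override with db_path from config.toml.
db = Path("~/.ppmlx/ppmlx.db").expanduser()
con = sqlite3.connect(db)

# List the tables ppmlx created before querying anything.
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
con.close()
```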
## Modelfile Example

Create a Modelfile to define a custom model with a system prompt:

```
FROM llama3

SYSTEM """
You are a helpful coding assistant. You write clean, well-documented code
and explain your reasoning step by step.
"""

PARAMETER temperature 0.2
PARAMETER max_tokens 4096
PARAMETER top_p 0.9
```

Then build it:

```bash
ppmlx create coding-assistant -f Modelfile
ppmlx run coding-assistant
```
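Assuming created models are served like pulled ones (this README doesn't state it explicitly), the new name should also be usable through the API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6767/v1", api_key="local")

resp = client.chat.completions.create(
    model="coding-assistant",  # the name given to `ppmlx create`
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```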
## Configuration

ppmlx reads configuration from `~/.ppmlx/config.toml`. All values are optional.

```toml
[server]
host = "127.0.0.1"       # Bind address (default: 127.0.0.1)
port = 6767              # Port (default: 6767)
cors = true              # Enable CORS (default: true)
cors_origins = ["*"]     # Allowed CORS origins

[models]
dir = "~/.ppmlx/models"    # Model storage directory
default_alias = "llama3"   # Default model for bare requests

[generation]
temperature = 0.7        # Default sampling temperature
max_tokens = 2048        # Default max output tokens
top_p = 0.9              # Default top-p
repetition_penalty = 1.1

[logging]
db_path = "~/.ppmlx/ppmlx.db"   # SQLite log database
log_requests = true             # Log all requests
log_level = "info"              # Server log level
```
## Architecture

```text
┌─────────────────────────────────────────────────────┐
│                      ppmlx CLI                      │
│                   (typer + rich)                    │
│   pull / run / serve / list / rm / quantize / ...   │
└───────────────────┬─────────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │   FastAPI Server    │
         │     port :6767      │
         │                     │
         │ /v1/chat/completions│
         │ /v1/completions     │
         │ /v1/embeddings      │
         │ /v1/models          │
         │ /health /metrics    │
         └──────┬──────┬───────┘
                │      │
      ┌─────────▼──┐ ┌─▼──────────────┐
      │  LLM Engine│ │  Embed Engine  │
      │  (mlx-lm)  │ │(mlx-embeddings)│
      └─────────┬──┘ └─┬──────────────┘
                │      │
      ┌─────────▼──────▼───────────────┐
      │        MLX / Metal GPU         │
      │  Apple Silicon Unified Memory  │
      └─────────┬──────────────────────┘
                │
      ┌─────────▼──────────────────────┐
      │      SQLite Request Log        │
      │       ~/.ppmlx/ppmlx.db        │
      └────────────────────────────────┘
```
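The diagram shows `/health` and `/metrics` alongside the `/v1` routes. Assuming `/health` answers a plain GET (its response format isn't documented here), a stdlib-only smoke test against a running server:

```python
import urllib.request

# /v1/models is documented in the curl examples; /health appears in the
# diagram only, so just print the status and the first bytes of each.
for path in ("/health", "/v1/models"):
    with urllib.request.urlopen(f"http://localhost:6767{path}") as r:
        print(path, r.status, r.read(200))
```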
## Uninstall

### uv

```bash
uv tool uninstall ppmlx
```

### pipx

```bash
pipx uninstall ppmlx
```

### Manual cleanup (all methods)

```bash
# Remove downloaded models and config
rm -rf ~/.ppmlx
```
## Contributing

- Fork the repository on GitHub.
- Create a feature branch: `git checkout -b feat/my-feature`
- Install dev dependencies: `uv sync --python 3.11`
- Run tests: `uv run pytest tests/ -v`
- Submit a pull request.
## Development Setup

```bash
git clone https://github.com/PingCompany/ppmlx
cd ppmlx
uv sync --python 3.11
uv run ppmlx --version
uv run pytest tests/ -v
```
Project Structure
ppmlx/
__init__.py # version
cli.py # Typer CLI entry point
server.py # FastAPI application
engine.py # MLX LLM inference engine
engine_embed.py # MLX embedding engine
engine_vlm.py # MLX vision-language engine
models.py # model registry, aliases, download
config.py # configuration loading
db.py # SQLite request logging
memory.py # RAM estimation utilities
modelfile.py # Modelfile parser
quantize.py # MLX quantization helpers
tests/
conftest.py # MLX stubs for CI
test_cli.py # CLI tests
scripts/
install.sh # One-liner installer
homebrew/
Formula/ppmlx.rb # Homebrew formula
.github/workflows/
tests.yml # CI tests
release.yml # PyPI release
homebrew-update.yml
## License

MIT — see LICENSE.