# vllmlx

Ollama-style daemon and CLI for vllm-mlx on Apple Silicon.
## Features

- 🚀 Always-on daemon - API available immediately after install, survives reboots
- 🎯 Simple CLI - `vllmlx pull`, `vllmlx run`, `vllmlx ls` - familiar Ollama-style commands
- 🔄 Hot-swap models - Switch models on-the-fly without restarting
- 💾 Smart memory - Auto-unloads models after idle timeout
- 🤖 OpenAI-compatible API - Works with existing tools at `localhost:8000`
## Quick Start

```bash
# Install with uv (recommended)
uv tool install vllmlx

# Pull a model
vllmlx pull qwen2-vl-7b-instruct-4bit

# Start the daemon (auto-starts on login after this)
vllmlx daemon start

# Chat interactively
vllmlx run qwen2-vl-7b-instruct-4bit

# Or use the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
## Requirements

- macOS 13+ (Apple Silicon)
- Python 3.11+
## Installation

### Using uv (Recommended)

```bash
# Install uv first if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install vllmlx
uv tool install vllmlx
```

### Alternative: Using pip

```bash
pip install vllmlx
```

### From Source

```bash
git clone https://github.com/l3wi/vllmlx
cd vllmlx
uv sync
git config core.hooksPath .githooks
uv run vllmlx --help
```
For detailed installation instructions, see docs/installation.md.
## Commands

| Command | Description |
|---|---|
| `vllmlx pull <model>` | Download a model |
| `vllmlx search [query]` | Search the packaged mlx-community model catalog |
| `vllmlx ls` | List downloaded models |
| `vllmlx rm <model>` | Remove a model |
| `vllmlx run <model>` | Interactive chat (auto-starts daemon if needed) |
| `vllmlx benchmark <model>` | Measure cold/warm start, memory, TTFT, and token rate |
| `vllmlx serve` | Run the server in the foreground |
| `vllmlx daemon start` | Start the background daemon |
| `vllmlx daemon stop` | Stop the daemon |
| `vllmlx daemon restart` | Restart the daemon |
| `vllmlx daemon status` | Check daemon status |
| `vllmlx daemon logs` | View daemon logs |
| `vllmlx config` | Show configuration |
| `vllmlx config set` | Set a configuration value |
| `vllmlx config get` | Get a configuration value |
For complete command reference, see docs/cli-reference.md.
## Available Models

vllmlx works with any MLX-compatible model from HuggingFace.

Built-in aliases are generated from the packaged mlx-community catalog at
`src/vllmlx/models/data/mlx_community_models.json`. Each catalog entry includes:

- alias
- HuggingFace repo id
- simple description
- model type (`text`, `vision`, `embedding`, `audio`)
- release date
- size in bytes (when available from Hub metadata)
- updated timestamp
`vllmlx search` and `vllmlx ls` use this packaged metadata locally, so discovery and cache inspection still work offline. Cached models also remain runnable offline; only new downloads require network access.
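For scripting around the catalog, here is a minimal Python sketch that reads the packaged file. The top-level layout and key names (`alias`, `type`, `repo_id`) are assumptions based on the field list above, not the documented schema; check the JSON file itself before relying on them.

```python
import json
from pathlib import Path

# Assumed layout: a JSON array of entries carrying the fields listed above.
# The key names here are illustrative guesses, not the documented schema.
catalog = json.loads(
    Path("src/vllmlx/models/data/mlx_community_models.json").read_text()
)

# Print the aliases of vision-capable models and their HuggingFace repos.
for entry in catalog:
    if entry.get("type") == "vision":
        print(entry.get("alias"), "->", entry.get("repo_id"))
```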
Regenerate the catalog with:

```bash
uv run python scripts/update_mlx_community_catalog.py
```

You can also use full HuggingFace paths:

```bash
vllmlx pull mlx-community/Some-Other-Model-4bit
```
## Configuration

Config file: `~/.vllmlx/config.toml`

```toml
[daemon]
port = 8000
host = "127.0.0.1"
idle_timeout = 600  # seconds
log_level = "info"
health_ttl_seconds = 1.0

[backend]
port = 8001  # internal worker port; must differ from daemon.port

[models]
default = "qwen2-vl-7b-instruct-4bit"

[aliases]
my-model = "mlx-community/Custom-Model-4bit"
```
Set values via CLI:

```bash
vllmlx config set daemon.idle_timeout 120
vllmlx config set models.default qwen2-vl-7b-instruct-4bit
vllmlx config set backend.port 8001
```
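To read the same settings from Python (for example, when scripting around the daemon), a minimal sketch using the standard-library `tomllib`, which is available on Python 3.11+ (the version vllmlx already requires):

```python
import tomllib
from pathlib import Path

# Read the daemon settings from the config file described above.
config_path = Path.home() / ".vllmlx" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)

daemon = config.get("daemon", {})
print("port:", daemon.get("port", 8000))
print("idle_timeout:", daemon.get("idle_timeout", 600), "seconds")
```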
## Optimization Profiles

vllmlx supports upstream vllm-mlx scheduler controls through `backend.*` config keys.

Balanced API (recommended):

```bash
vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 1
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.max_num_batched_tokens 8192
vllmlx config set backend.chunked_prefill_tokens 0
```

Single-user latency:

```bash
vllmlx config set backend.continuous_batching false
vllmlx config set daemon.max_loaded_models 1
vllmlx config set daemon.idle_timeout 600
```

Multi-user throughput:

```bash
vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 4
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.chunked_prefill_tokens 2048
vllmlx config set backend.prefill_step_size 2048
```
Tradeoffs:

- `backend.continuous_batching=true` improves throughput under concurrency but may add overhead for single-user workloads.
- Lower `backend.stream_interval` improves stream smoothness; higher values can improve throughput.
- `backend.chunked_prefill_tokens > 0` improves fairness under long prompts by preventing prefill starvation.
See docs/dependency-upgrade-validation.md for the benchmark matrix and gating criteria used when validating MLX dependency upgrades.
## API

vllmlx exposes an OpenAI-compatible API at `http://localhost:8000`:
### Chat Completions

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ],
    "stream": true
  }'
```
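Because the endpoint is OpenAI-compatible, existing OpenAI clients can be pointed at the daemon. A minimal sketch using the official `openai` Python package; the image path is illustrative, and the placeholder API key assumes the local daemon does not enforce authentication:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Build a data URI from a local image (path is illustrative).
with open("photo.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Stream a multimodal chat completion, mirroring the curl request above.
stream = client.chat.completions.create(
    model="qwen2-vl-7b-instruct-4bit",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```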
### List Models

```bash
curl http://localhost:8000/v1/models
```

### Health Check

```bash
curl http://localhost:8000/health
```

### Status

```bash
curl http://localhost:8000/v1/status
```
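For a quick scripted check of the two endpoints above, a small stdlib-only sketch; the response bodies are printed as-is, since their exact fields are not documented here:

```python
import urllib.request

# Print whatever the daemon returns from the health and status endpoints.
for url in ("http://localhost:8000/health", "http://localhost:8000/v1/status"):
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(url, "->", resp.status, resp.read().decode())
```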
## E2E Runner

Use the dedicated external runner for real-model parity checks:

```bash
uv run python scripts/run_e2e.py --mode smoke
```
Defaults:

- primary model: `mlx-community/Llama-3.2-1B-Instruct-4bit`
- secondary model: `mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit`
- download-only model: `mlx-community/AMD-Llama-135m-4bit`

Behavior:

- `smoke` runs `startup_serve`, `api_core`, `run_cli`, and `benchmark_smoke`
- `full` adds downloads, LRU reuse, and knob propagation checks
- `--allow-launchd` enables the explicit `startup_launchd` scenario
- logs, PTY transcripts, and the JSON report are written under `.artifacts/e2e/`

Prerequisites:

- the main e2e scenarios expect the primary and secondary models to already exist in the Hugging Face cache
- only the dedicated download scenario is allowed to fetch models by default
- the runner isolates vllmlx state under `VLLMLX_STATE_DIR` and uses an isolated launchd label/path so it does not reuse the normal `~/.vllmlx` daemon state
## Benchmark JSON

`vllmlx benchmark` supports machine-readable output:

```bash
vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0
```

When `--json` is set, stdout contains only the benchmark summary JSON.
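That makes the benchmark easy to consume from scripts. A minimal sketch that runs the command above and pretty-prints the summary; the summary's field names are not documented here, so nothing specific is assumed about them:

```python
import json
import subprocess

# Run the benchmark shown above and parse the JSON summary from stdout.
result = subprocess.run(
    ["vllmlx", "benchmark", "mlx-community/Llama-3.2-1B-Instruct-4bit",
     "--json", "-n", "1", "-t", "16", "--warmup", "0"],
    capture_output=True, text=True, check=True,
)
summary = json.loads(result.stdout)
print(json.dumps(summary, indent=2))
```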
## Troubleshooting

See docs/troubleshooting.md for common issues and solutions.

## License

MIT - see LICENSE for details.