
vllmlx

An Ollama-style daemon and CLI over vllm-mlx on Apple Silicon.

Features

  • 🚀 Always-on daemon - API available immediately after install, survives reboots
  • 🎯 Simple CLI - vllmlx pull, vllmlx run, vllmlx ls - familiar Ollama-style commands
  • 🔄 Hot-swap models - Switch models on-the-fly without restarting
  • 💾 Smart memory - Auto-unloads models after idle timeout
  • 🤖 OpenAI-compatible API - Works with existing tools at localhost:8000

Quick Start

# Install with uv (recommended)
uv tool install vllmlx

# Pull a model
vllmlx pull qwen2-vl-7b-instruct-4bit

# Start the daemon (auto-starts on login after this)
vllmlx daemon start

# Chat interactively
vllmlx run qwen2-vl-7b-instruct-4bit

# Or use the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Requirements

  • macOS 13+ (Apple Silicon)
  • Python 3.11+

Installation

Using uv (Recommended)

# Install uv first if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install vllmlx
uv tool install vllmlx

Alternative: Using pip

pip install vllmlx

From Source

git clone https://github.com/l3wi/vllmlx
cd vllmlx
uv sync
git config core.hooksPath .githooks
uv run vllmlx --help

For detailed installation instructions, see docs/installation.md.

Commands

Command                    Description
vllmlx pull <model>        Download a model
vllmlx search [query]      Search the packaged mlx-community model catalog
vllmlx ls                  List downloaded models
vllmlx rm <model>          Remove a model
vllmlx run <model>         Interactive chat (auto-starts the daemon if needed)
vllmlx benchmark <model>   Measure cold/warm start, memory, TTFT, and token rate
vllmlx serve               Run the server in the foreground
vllmlx daemon start        Start the background daemon
vllmlx daemon stop         Stop the daemon
vllmlx daemon restart      Restart the daemon
vllmlx daemon status       Check daemon status
vllmlx daemon logs         View daemon logs
vllmlx config              Show configuration
vllmlx config set          Set a configuration value
vllmlx config get          Get a configuration value

For complete command reference, see docs/cli-reference.md.

Available Models

vllmlx works with any MLX-compatible model from Hugging Face.

Built-in aliases are generated from the packaged mlx-community catalog at:

  • src/vllmlx/models/data/mlx_community_models.json

Each catalog entry includes:

  • alias
  • Hugging Face repo ID
  • short description
  • model type (text, vision, embedding, audio)
  • release date
  • size in bytes (when available from Hub metadata)
  • updated timestamp

vllmlx search and vllmlx ls use this packaged metadata locally, so discovery and cache inspection still work offline. Cached models also remain runnable offline; only new downloads require network access.
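Because the packaged catalog is plain JSON, it is easy to script against. As an illustration only (the exact key names in mlx_community_models.json are an assumption here, mirroring the fields listed above), filtering entries by model type might look like:

```python
import json

# Hypothetical catalog entries mirroring the fields listed above; the real
# file ships as src/vllmlx/models/data/mlx_community_models.json and its
# key names may differ.
catalog = json.loads("""
[
  {"alias": "qwen2-vl-7b-instruct-4bit",
   "repo_id": "mlx-community/Qwen2-VL-7B-Instruct-4bit",
   "description": "Vision-language model",
   "type": "vision",
   "size_bytes": 4600000000},
  {"alias": "llama-3.2-1b-instruct-4bit",
   "repo_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
   "description": "Small text model",
   "type": "text",
   "size_bytes": 700000000}
]
""")

# Keep only vision-capable entries, smallest first.
vision = sorted(
    (e for e in catalog if e["type"] == "vision"),
    key=lambda e: e.get("size_bytes", 0),
)
for entry in vision:
    print(entry["alias"], "->", entry["repo_id"])
```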

Regenerate the catalog with:

uv run python scripts/update_mlx_community_catalog.py

You can also pull models by their full Hugging Face path:

vllmlx pull mlx-community/Some-Other-Model-4bit

Configuration

Config file: ~/.vllmlx/config.toml

[daemon]
port = 8000
host = "127.0.0.1"
idle_timeout = 600  # seconds
log_level = "info"
health_ttl_seconds = 1.0

[backend]
port = 8001  # internal worker port; must differ from daemon.port

[models]
default = "qwen2-vl-7b-instruct-4bit"

[aliases]
my-model = "mlx-community/Custom-Model-4bit"

Set values via CLI:

vllmlx config set daemon.idle_timeout 120
vllmlx config set models.default qwen2-vl-7b-instruct-4bit
vllmlx config set backend.port 8001

Optimization Profiles

vllmlx supports upstream vllm-mlx scheduler controls through backend.* config keys.

Balanced API (recommended):

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 1
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.max_num_batched_tokens 8192
vllmlx config set backend.chunked_prefill_tokens 0

Single-user latency:

vllmlx config set backend.continuous_batching false
vllmlx config set daemon.max_loaded_models 1
vllmlx config set daemon.idle_timeout 600

Multi-user throughput:

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 4
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.chunked_prefill_tokens 2048
vllmlx config set backend.prefill_step_size 2048

Tradeoffs:

  • backend.continuous_batching=true improves throughput under concurrency but may add overhead for single-user workloads.
  • Lower backend.stream_interval improves stream smoothness; higher values can improve throughput.
  • backend.chunked_prefill_tokens > 0 improves fairness under long prompts by preventing prefill starvation.
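To see why stream_interval trades smoothness for overhead, here is a toy simulation (not vllmlx internals): a streaming server that flushes generated tokens to the client every stream_interval tokens makes fewer, larger writes as the interval grows.

```python
def flush_schedule(num_tokens: int, stream_interval: int) -> list[list[int]]:
    """Group token indices into the flushes a streaming server would emit."""
    flushes: list[list[int]] = []
    buffer: list[int] = []
    for tok in range(num_tokens):
        buffer.append(tok)
        if len(buffer) == stream_interval:
            flushes.append(buffer)
            buffer = []
    if buffer:  # trailing partial flush at end of generation
        flushes.append(buffer)
    return flushes

# stream_interval=1: one write per token (smoothest stream).
assert len(flush_schedule(10, 1)) == 10
# stream_interval=4: only 3 writes for the same 10 tokens (less overhead).
assert len(flush_schedule(10, 4)) == 3
```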

See docs/dependency-upgrade-validation.md for the benchmark matrix and gating criteria used when validating MLX dependency upgrades.

API

vllmlx exposes an OpenAI-compatible API at http://localhost:8000:

Chat Completions

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ],
    "stream": true
  }'
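The same multimodal request body can be assembled in Python. This sketch builds the JSON payload from the curl example (the image bytes are a placeholder; in practice read a real JPEG file):

```python
import base64
import json

# Placeholder bytes standing in for a real JPEG file's contents.
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg"
data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()

payload = {
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    "stream": True,
}

# This JSON string is the body you would POST to /v1/chat/completions.
body = json.dumps(payload)
```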

List Models

curl http://localhost:8000/v1/models

Health Check

curl http://localhost:8000/health

Status

curl http://localhost:8000/v1/status
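For tooling that should degrade gracefully when the daemon is stopped, a small probe of the /health endpoint can swallow connection errors instead of raising. A sketch using only the standard library (the "HTTP 200 means healthy" convention is an assumption, not documented behavior):

```python
import urllib.error
import urllib.request


def daemon_healthy(base_url: str = "http://localhost:8000",
                   timeout: float = 2.0) -> bool:
    """Return True if the daemon's /health endpoint answers with HTTP 200.

    Returns False when the daemon is stopped (connection refused) or the
    request times out, rather than raising.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


# With nothing listening, the probe reports False instead of raising.
print(daemon_healthy("http://127.0.0.1:1"))
```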

E2E Runner

Use the dedicated external runner for real-model parity checks:

uv run python scripts/run_e2e.py --mode smoke

Defaults:

  • primary model: mlx-community/Llama-3.2-1B-Instruct-4bit
  • secondary model: mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit
  • download-only model: mlx-community/AMD-Llama-135m-4bit

Behavior:

  • smoke runs startup_serve, api_core, run_cli, and benchmark_smoke
  • full adds downloads, LRU reuse, and knob propagation checks
  • --allow-launchd enables the explicit startup_launchd scenario
  • logs, PTY transcripts, and the JSON report are written under .artifacts/e2e/

Prerequisites:

  • main e2e scenarios expect the primary and secondary models to already exist in the Hugging Face cache
  • only the dedicated download scenario is allowed to fetch models by default
  • the runner isolates vllmlx state under VLLMLX_STATE_DIR and uses an isolated launchd label/path so it does not reuse the normal ~/.vllmlx daemon state

Benchmark JSON

vllmlx benchmark supports machine-readable output:

vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0

When --json is set, stdout contains only the benchmark summary JSON.

Troubleshooting

See docs/troubleshooting.md for common issues and solutions.

License

MIT - see LICENSE for details.
