vllmlx

Ollama-style daemon and CLI over vllm-mlx on Apple Silicon.

Features

  • 🚀 Always-on daemon - API available immediately after install, survives reboots
  • 🎯 Simple CLI - vllmlx pull, vllmlx run, vllmlx ls - familiar Ollama-style commands
  • 🔄 Hot-swap models - Switch models on-the-fly without restarting
  • 💾 Smart memory - Auto-unloads models after idle timeout
  • 🤖 OpenAI-compatible API - Works with existing tools at localhost:11434

Quick Start

# Install
pip install vllmlx
# Or with uv (recommended)
uv tool install vllmlx

# Pull a model
vllmlx pull qwen2-vl-7b

# Start the daemon (auto-starts on login after this)
vllmlx daemon start

# Chat interactively
vllmlx run qwen2-vl-7b

# Or use the API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Requirements

  • macOS 13+ (Apple Silicon)
  • Python 3.11+

Installation

Using uv (Recommended)

uv tool install vllmlx

Using pip

pip install vllmlx

From Source

git clone https://github.com/l3wi/vllmlx
cd vllmlx
pip install -e .

For detailed installation instructions, see docs/installation.md.

Commands

Command                    Description
vllmlx pull <model>        Download a model
vllmlx search [query]      Search the packaged mlx-community model catalog
vllmlx ls                  List downloaded models
vllmlx rm <model>          Remove a model
vllmlx run <model>         Interactive chat (auto-starts daemon if needed)
vllmlx benchmark <model>   Measure cold/warm start, memory, TTFT, and token rate
vllmlx serve               Run the server in the foreground
vllmlx daemon start        Start the background daemon
vllmlx daemon stop         Stop the daemon
vllmlx daemon restart      Restart the daemon
vllmlx daemon status       Check daemon status
vllmlx daemon logs         View daemon logs
vllmlx config              Show configuration
vllmlx config set          Set a configuration value
vllmlx config get          Get a configuration value

For complete command reference, see docs/cli-reference.md.

Available Models

vllmlx works with any MLX-compatible model from HuggingFace.

Built-in aliases are generated from the packaged mlx-community catalog at:

  • src/vllmlx/models/data/mlx_community_models.json

Each catalog entry includes:

  • alias
  • HuggingFace repo id
  • simple description
  • model type (text, vision, embedding, audio)
  • release date
  • size in bytes (when available from Hub metadata)
  • updated timestamp

vllmlx search and vllmlx ls use this packaged metadata locally, so discovery and cache inspection still work offline. Cached models also remain runnable offline; only new downloads require network access.
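The offline search described above amounts to a filter over the packaged catalog entries. The sketch below illustrates that idea against a hand-written sample; the exact key names in mlx_community_models.json are assumptions and may differ from the real schema:

```python
# Hypothetical catalog entries mirroring the fields listed above; the
# actual JSON key names in mlx_community_models.json may differ.
CATALOG = [
    {"alias": "qwen2-vl-7b",
     "repo": "mlx-community/Qwen2-VL-7B-Instruct-4bit",
     "description": "Vision-language chat model", "type": "vision"},
    {"alias": "llama-3.2-1b",
     "repo": "mlx-community/Llama-3.2-1B-Instruct-4bit",
     "description": "Small text chat model", "type": "text"},
]

def search_catalog(query, entries=CATALOG):
    """Case-insensitive substring match on alias and description."""
    q = query.lower()
    return [e for e in entries
            if q in e["alias"].lower() or q in e["description"].lower()]

print([e["alias"] for e in search_catalog("vision")])
```

Because everything needed is shipped with the package, this style of lookup requires no network access, which is why discovery keeps working offline.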

Regenerate the catalog with:

uv run python scripts/update_mlx_community_catalog.py

Legacy short aliases like qwen2-vl-7b, qwen3:8b, and qwen3-embedding:4b are retained for compatibility.

You can also use full HuggingFace paths:

vllmlx pull mlx-community/Some-Other-Model-4bit

Configuration

Config file: ~/.vllmlx/config.toml

[daemon]
port = 11434
host = "127.0.0.1"
idle_timeout = 600  # seconds
log_level = "info"
health_ttl_seconds = 1.0

[models]
default = "qwen2-vl-7b"

[aliases]
my-model = "mlx-community/Custom-Model-4bit"

Set values via CLI:

vllmlx config set daemon.idle_timeout 120
vllmlx config set models.default qwen2-vl-7b

Optimization Profiles

vllmlx supports upstream vllm-mlx scheduler controls through backend.* config keys.

Balanced API (recommended):

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 1
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.max_num_batched_tokens 8192
vllmlx config set backend.chunked_prefill_tokens 0

Single-user latency:

vllmlx config set backend.continuous_batching false
vllmlx config set daemon.max_loaded_models 1
vllmlx config set daemon.idle_timeout 600

Multi-user throughput:

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 4
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.chunked_prefill_tokens 2048
vllmlx config set backend.prefill_step_size 2048

Tradeoffs:

  • backend.continuous_batching=true improves throughput under concurrency but may add overhead for single-user workloads.
  • Lower backend.stream_interval improves stream smoothness; higher values can improve throughput.
  • backend.chunked_prefill_tokens > 0 improves fairness under long prompts by preventing prefill starvation.
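Each profile above is just a batch of vllmlx config set calls, so it can be scripted. A sketch (the profile names here are illustrative, not a vllmlx feature; the key/value pairs come straight from the profiles in this section):

```python
# Hypothetical profile table built from the commands listed above.
PROFILES = {
    "single-user": {
        "backend.continuous_batching": "false",
        "daemon.max_loaded_models": "1",
        "daemon.idle_timeout": "600",
    },
    "multi-user": {
        "backend.continuous_batching": "true",
        "backend.stream_interval": "4",
        "backend.max_num_seqs": "256",
        "backend.chunked_prefill_tokens": "2048",
        "backend.prefill_step_size": "2048",
    },
}

def profile_commands(name):
    """Return the CLI invocations that would apply the named profile."""
    return [["vllmlx", "config", "set", key, value]
            for key, value in PROFILES[name].items()]

for cmd in profile_commands("single-user"):
    print(" ".join(cmd))
    # To actually apply: subprocess.run(cmd, check=True)
```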

See docs/dependency-upgrade-validation.md for the benchmark matrix and gating criteria used when validating MLX dependency upgrades.

API

vllmlx exposes an OpenAI-compatible API at http://localhost:11434:

Chat Completions

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ],
    "stream": true
  }'
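The same request can be built from Python. Since the endpoint is OpenAI-compatible, any OpenAI-style client should work by pointing its base URL at http://localhost:11434/v1; the sketch below only constructs the multimodal payload (the helper name is ours, not part of vllmlx) so it stays runnable without a live daemon:

```python
import json

def build_chat_request(model, text, image_data_url=None, stream=True):
    """Build an OpenAI-style chat payload; attaches an image part when given."""
    content = [{"type": "text", "text": text}]
    if image_data_url:
        content.append({"type": "image_url",
                        "image_url": {"url": image_data_url}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}],
            "stream": stream}

payload = build_chat_request("qwen2-vl-7b", "What is in this image?",
                             image_data_url="data:image/jpeg;base64,...")
print(json.dumps(payload, indent=2))
# POST this body to http://localhost:11434/v1/chat/completions
```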

List Models

curl http://localhost:11434/v1/models

Health Check

curl http://localhost:11434/health

Status

curl http://localhost:11434/v1/status
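Scripts that start the daemon often want to wait until /health responds before sending requests. One way to sketch that wait loop, with the probe injected so the snippet runs without a live daemon (in real use the probe would be an HTTP GET against http://localhost:11434/health):

```python
import time

def wait_for_healthy(probe, timeout=30.0, interval=0.5):
    """Poll `probe` (returns True once /health responds OK) until timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# Fake probe that succeeds on the third call, standing in for an HTTP check.
calls = {"n": 0}
def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for_healthy(fake_probe, timeout=5.0, interval=0.01))  # True
```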

E2E Runner

Use the dedicated external runner for real-model parity checks:

uv run python scripts/run_e2e.py --mode smoke

Defaults:

  • primary model: mlx-community/Llama-3.2-1B-Instruct-4bit
  • secondary model: mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit
  • download-only model: mlx-community/AMD-Llama-135m-4bit

Behavior:

  • smoke runs startup_serve, api_core, run_cli, and benchmark_smoke
  • full adds downloads, LRU reuse, and knob propagation checks
  • --allow-launchd enables the explicit startup_launchd scenario
  • logs, PTY transcripts, and the JSON report are written under .artifacts/e2e/

Prerequisites:

  • main e2e scenarios expect the primary and secondary models to already exist in the Hugging Face cache
  • only the dedicated download scenario is allowed to fetch models by default
  • the runner isolates vllmlx state under VLLMLX_STATE_DIR and uses an isolated launchd label/path so it does not reuse the normal ~/.vllmlx daemon state

Benchmark JSON

vllmlx benchmark now supports machine-readable output:

vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0

When --json is set, stdout contains only the benchmark summary JSON.
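Because stdout carries only the summary JSON in --json mode, the output can be captured (e.g. with subprocess.run(..., capture_output=True)) and parsed directly. A sketch with a stand-in string; the field names below are illustrative assumptions, not the documented schema:

```python
import json

# Stand-in for the stdout of `vllmlx benchmark ... --json`.
# The keys shown here are hypothetical; check your actual summary JSON.
stdout = ('{"model": "mlx-community/Llama-3.2-1B-Instruct-4bit", '
          '"ttft_ms": 180.5, "tokens_per_sec": 62.3}')

summary = json.loads(stdout)
print(f"{summary['model']}: {summary['tokens_per_sec']} tok/s")
```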

Troubleshooting

See docs/troubleshooting.md for common issues and solutions.

License

MIT - see LICENSE for details.

Download files

Download the file for your platform.

Source Distribution

vllmlx-0.1.1.tar.gz (173.9 kB)

Built Distribution

vllmlx-0.1.1-py3-none-any.whl (197.9 kB)

File details

Details for the file vllmlx-0.1.1.tar.gz.

File metadata

  • Download URL: vllmlx-0.1.1.tar.gz
  • Size: 173.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vllmlx-0.1.1.tar.gz

  Algorithm    Hash digest
  SHA256       9f4cef2bd6bd8a2f7dc42178effa31f8b1101b31246c786008358e0c2df08d35
  MD5          703faadd28161a47ac21cfe12e66e775
  BLAKE2b-256  f951fd87b3c151cb9ba0ec0868c4fa01e971203b83dc19d7d93d6373c0c5e9ca


Provenance

The following attestation bundles were made for vllmlx-0.1.1.tar.gz:

Publisher: publish.yaml on l3wi/vllmlx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vllmlx-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: vllmlx-0.1.1-py3-none-any.whl
  • Size: 197.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vllmlx-0.1.1-py3-none-any.whl

  Algorithm    Hash digest
  SHA256       1ae045c488cca1c18178fbb8ae7b3132ddc00edbd6d69dcfa2622b13380d6ae4
  MD5          a4669383e35342bf9c9b17202ac489b6
  BLAKE2b-256  db682236d3cc2f4efffe8b2883ea4de4492110b8c4951fda76153076f63e9398


Provenance

The following attestation bundles were made for vllmlx-0.1.1-py3-none-any.whl:

Publisher: publish.yaml on l3wi/vllmlx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
