
vllmlx

An Ollama-style daemon and CLI for vllm-mlx on Apple Silicon.

Features

  • 🚀 Always-on daemon - API available immediately after install, survives reboots
  • 🎯 Simple CLI - vllmlx pull, vllmlx run, vllmlx ls - familiar Ollama-style commands
  • 🔄 Hot-swap models - Switch models on-the-fly without restarting
  • 💾 Smart memory - Auto-unloads models after idle timeout
  • 🤖 OpenAI-compatible API - Works with existing tools at localhost:8000

Quick Start

# Install with uv (recommended)
uv tool install vllmlx

# Pull a model
vllmlx pull qwen2-vl-7b-instruct-4bit

# Start the daemon (auto-starts on login after this)
vllmlx daemon start

# Chat interactively
vllmlx run qwen2-vl-7b-instruct-4bit

# Or use the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Requirements

  • macOS 13+ (Apple Silicon)
  • Python 3.11+

Installation

Using uv (Recommended)

# Install uv first if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install vllmlx
uv tool install vllmlx

Alternative: Using pip

pip install vllmlx

From Source

git clone https://github.com/lewi/vllmlx
cd vllmlx
uv sync
uv run vllmlx --help

For detailed installation instructions, see docs/installation.md.

Commands

Command                     Description
vllmlx pull <model>         Download a model
vllmlx search [query]       Search the packaged mlx-community model catalog
vllmlx ls                   List downloaded models
vllmlx rm <model>           Remove a model
vllmlx run <model>          Interactive chat (auto-starts daemon if needed)
vllmlx benchmark <model>    Measure cold/warm start, memory, TTFT, and token rate
vllmlx serve                Run server in foreground
vllmlx daemon start         Start background daemon
vllmlx daemon stop          Stop daemon
vllmlx daemon restart       Restart daemon
vllmlx daemon status        Check daemon status
vllmlx daemon logs          View daemon logs
vllmlx config               Show configuration
vllmlx config set           Set configuration value
vllmlx config get           Get configuration value

For complete command reference, see docs/cli-reference.md.

Available Models

vllmlx works with any MLX-compatible model from HuggingFace.

Built-in aliases are generated from the packaged mlx-community catalog at:

  • src/vllmlx/models/data/mlx_community_models.json

Each catalog entry includes:

  • alias
  • HuggingFace repo id
  • simple description
  • model type (text, vision, embedding, audio)
  • release date
  • size in bytes (when available from Hub metadata)
  • updated timestamp

vllmlx search and vllmlx ls use this packaged metadata locally, so discovery and cache inspection still work offline. Cached models also remain runnable offline; only new downloads require network access.
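
For example, you can browse the catalog and inspect the local cache without a network connection (a minimal sketch; the exact output columns depend on the installed version):

# Search the packaged catalog by keyword (works offline)
vllmlx search qwen

# Inspect what is already downloaded
vllmlx ls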

Regenerate the catalog with:

uv run python scripts/update_mlx_community_catalog.py

You can also use full HuggingFace paths:

vllmlx pull mlx-community/Some-Other-Model-4bit

Configuration

Config file: ~/.vllmlx/config.toml

[daemon]
port = 8000
host = "127.0.0.1"
idle_timeout = 600  # seconds
log_level = "info"
health_ttl_seconds = 1.0

[models]
default = "qwen2-vl-7b-instruct-4bit"

[aliases]
my-model = "mlx-community/Custom-Model-4bit"

Set values via CLI:

vllmlx config set daemon.idle_timeout 120
vllmlx config set models.default qwen2-vl-7b-instruct-4bit
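
Assuming entries under [aliases] are accepted anywhere a model name is, and that the aliases table can be written with dotted keys like the other sections (both are assumptions, not documented behavior), a custom alias could also be managed from the CLI:

# Hypothetical: define an alias via config set, then use it like any model name
vllmlx config set aliases.my-model mlx-community/Custom-Model-4bit
vllmlx pull my-model
vllmlx run my-model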

Optimization Profiles

vllmlx supports upstream vllm-mlx scheduler controls through backend.* config keys.

Balanced API (recommended):

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 1
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.max_num_batched_tokens 8192
vllmlx config set backend.chunked_prefill_tokens 0

Single-user latency:

vllmlx config set backend.continuous_batching false
vllmlx config set daemon.max_loaded_models 1
vllmlx config set daemon.idle_timeout 600

Multi-user throughput:

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 4
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.chunked_prefill_tokens 2048
vllmlx config set backend.prefill_step_size 2048

Tradeoffs:

  • backend.continuous_batching=true improves throughput under concurrency but may add overhead for single-user workloads.
  • Lower backend.stream_interval improves stream smoothness; higher values can improve throughput.
  • backend.chunked_prefill_tokens > 0 improves fairness under long prompts by preventing prefill starvation.
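
After applying a profile, you can read back the effective values with the config commands listed above (a quick sanity check, not a required step):

# Show the full configuration, or read individual backend keys
vllmlx config
vllmlx config get backend.continuous_batching
vllmlx config get backend.stream_interval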

See docs/dependency-upgrade-validation.md for the benchmark matrix and gating criteria used when validating MLX dependency upgrades.

API

vllmlx exposes an OpenAI-compatible API at http://localhost:8000:

Chat Completions

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ],
    "stream": true
  }'
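
The same endpoint also accepts plain string content, and streaming responses can be consumed directly with curl's unbuffered mode. A minimal text-only sketch (assuming the standard OpenAI-style streaming format):

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    "stream": true
  }'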

List Models

curl http://localhost:8000/v1/models

Health Check

curl http://localhost:8000/health

Status

curl http://localhost:8000/v1/status
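
In scripts, the health endpoint is handy for waiting until the daemon is ready before sending requests. A small sketch using only the documented routes:

# Start the daemon, then poll /health for up to 30 seconds
vllmlx daemon start
for i in $(seq 1 30); do
  curl -fsS http://localhost:8000/health >/dev/null && break
  sleep 1
done
curl http://localhost:8000/v1/status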

E2E Runner

Use the dedicated external runner for real-model parity checks:

uv run python scripts/run_e2e.py --mode smoke

Defaults:

  • primary model: mlx-community/Llama-3.2-1B-Instruct-4bit
  • secondary model: mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit
  • download-only model: mlx-community/AMD-Llama-135m-4bit

Behavior:

  • smoke runs startup_serve, api_core, run_cli, and benchmark_smoke
  • full adds downloads, LRU reuse, and knob propagation checks
  • --allow-launchd enables the explicit startup_launchd scenario
  • logs, PTY transcripts, and the JSON report are written under .artifacts/e2e/

Prerequisites:

  • main e2e scenarios expect the primary and secondary models to already exist in the Hugging Face cache
  • only the dedicated download scenario is allowed to fetch models by default
  • the runner isolates vllmlx state under VLLMLX_STATE_DIR and uses an isolated launchd label/path so it does not reuse the normal ~/.vllmlx daemon state
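
Given those prerequisites, a typical invocation looks like:

# Full scenario set (models must already be in the Hugging Face cache)
uv run python scripts/run_e2e.py --mode full

# Include the explicit launchd scenario
uv run python scripts/run_e2e.py --mode full --allow-launchd

# Reports, logs, and PTY transcripts land under .artifacts/e2e/
ls .artifacts/e2e/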

Benchmark JSON

vllmlx benchmark supports machine-readable output:

vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0

When --json is set, stdout contains only the benchmark summary JSON.
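
Because stdout carries only the summary JSON, it can be captured or piped straight into other tools (the field names depend on the summary schema, so this is just a capture-and-inspect sketch):

# Save the summary and pretty-print it with the standard library
vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0 > bench.json
python3 -m json.tool bench.json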

Troubleshooting

See docs/troubleshooting.md for common issues and solutions.

License

MIT - see LICENSE for details.
