
vllmlx

An Ollama-style daemon and CLI for vllm-mlx on Apple Silicon.

Features

  • 🚀 Always-on daemon - API available immediately after install, survives reboots
  • 🎯 Simple CLI - vllmlx pull, vllmlx run, vllmlx ls - familiar Ollama-style commands
  • 🔄 Hot-swap models - Switch models on-the-fly without restarting
  • 💾 Smart memory - Auto-unloads models after idle timeout
  • 🤖 OpenAI-compatible API - Works with existing tools at localhost:8000

Quick Start

# Install with uv (recommended)
uv tool install vllmlx

# Pull a model
vllmlx pull qwen2-vl-7b-instruct-4bit

# Start the daemon (auto-starts on login after this)
vllmlx daemon start

# Chat interactively
vllmlx run qwen2-vl-7b-instruct-4bit

# Or use the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Requirements

  • macOS 13+ (Apple Silicon)
  • Python 3.11+

Installation

Using uv (Recommended)

# Install uv first if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install vllmlx
uv tool install vllmlx

Alternative: Using pip

pip install vllmlx

From Source

git clone https://github.com/lewi/vllmlx
cd vllmlx
uv sync
uv run vllmlx --help

For detailed installation instructions, see docs/installation.md.

Commands

Command                     Description
vllmlx pull <model>         Download a model
vllmlx search [query]       Search the packaged mlx-community model catalog
vllmlx ls                   List downloaded models
vllmlx rm <model>           Remove a model
vllmlx run <model>          Interactive chat (auto-starts daemon if needed)
vllmlx benchmark <model>    Measure cold/warm start, memory, TTFT, and token rate
vllmlx serve                Run server in foreground
vllmlx daemon start         Start background daemon
vllmlx daemon stop          Stop daemon
vllmlx daemon restart       Restart daemon
vllmlx daemon status        Check daemon status
vllmlx daemon logs          View daemon logs
vllmlx config               Show configuration
vllmlx config set           Set configuration value
vllmlx config get           Get configuration value

For complete command reference, see docs/cli-reference.md.

Available Models

vllmlx works with any MLX-compatible model from HuggingFace.

Built-in aliases are generated from the packaged mlx-community catalog at:

  • src/vllmlx/models/data/mlx_community_models.json

Each catalog entry includes:

  • alias
  • HuggingFace repo id
  • simple description
  • model type (text, vision, embedding, audio)
  • release date
  • size in bytes (when available from Hub metadata)
  • updated timestamp

vllmlx search and vllmlx ls use this packaged metadata locally, so discovery and cache inspection still work offline. Cached models also remain runnable offline; only new downloads require network access.
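
For example, you can browse the catalog and inspect the local cache without a network connection (a minimal sketch; the exact output columns depend on the installed version):

# Search the packaged catalog by keyword (works offline)
vllmlx search qwen

# Inspect what is already downloaded
vllmlx ls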

Regenerate the catalog with:

uv run python scripts/update_mlx_community_catalog.py

You can also use full HuggingFace paths:

vllmlx pull mlx-community/Some-Other-Model-4bit

Configuration

Config file: ~/.vllmlx/config.toml

[daemon]
port = 8000
host = "127.0.0.1"
idle_timeout = 600  # seconds
log_level = "info"
health_ttl_seconds = 1.0

[models]
default = "qwen2-vl-7b-instruct-4bit"

[aliases]
my-model = "mlx-community/Custom-Model-4bit"

Set values via CLI:

vllmlx config set daemon.idle_timeout 120
vllmlx config set models.default qwen2-vl-7b-instruct-4bit
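
Assuming entries under [aliases] are accepted anywhere a model name is, and that the aliases table can be written with dotted keys like the other sections (both are assumptions, not documented behavior), a custom alias could also be managed from the CLI:

# Hypothetical: define an alias via config set, then use it like any model name
vllmlx config set aliases.my-model mlx-community/Custom-Model-4bit
vllmlx pull my-model
vllmlx run my-model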

Optimization Profiles

vllmlx supports upstream vllm-mlx scheduler controls through backend.* config keys.

Balanced API (recommended):

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 1
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.max_num_batched_tokens 8192
vllmlx config set backend.chunked_prefill_tokens 0

Single-user latency:

vllmlx config set backend.continuous_batching false
vllmlx config set daemon.max_loaded_models 1
vllmlx config set daemon.idle_timeout 600

Multi-user throughput:

vllmlx config set backend.continuous_batching true
vllmlx config set backend.stream_interval 4
vllmlx config set backend.max_num_seqs 256
vllmlx config set backend.chunked_prefill_tokens 2048
vllmlx config set backend.prefill_step_size 2048

Tradeoffs:

  • backend.continuous_batching=true improves throughput under concurrency but may add overhead for single-user workloads.
  • Lower backend.stream_interval improves stream smoothness; higher values can improve throughput.
  • backend.chunked_prefill_tokens > 0 improves fairness under long prompts by preventing prefill starvation.
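
After applying a profile, you can read back the effective values with the config commands listed above (a quick sanity check, not a required step):

# Show the full configuration, or read individual backend keys
vllmlx config
vllmlx config get backend.continuous_batching
vllmlx config get backend.stream_interval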

See docs/dependency-upgrade-validation.md for the benchmark matrix and gating criteria used when validating MLX dependency upgrades.

API

vllmlx exposes an OpenAI-compatible API at http://localhost:8000:

Chat Completions

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ],
    "stream": true
  }'
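
The same endpoint also accepts plain string content, and streaming responses can be consumed directly with curl's unbuffered mode. A minimal text-only sketch (assuming the standard OpenAI-style streaming format):

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl-7b-instruct-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    "stream": true
  }'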

List Models

curl http://localhost:8000/v1/models

Health Check

curl http://localhost:8000/health

Status

curl http://localhost:8000/v1/status
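
In scripts, the health endpoint is handy for waiting until the daemon is ready before sending requests. A small sketch using only the documented routes:

# Start the daemon, then poll /health for up to 30 seconds
vllmlx daemon start
for i in $(seq 1 30); do
  curl -fsS http://localhost:8000/health >/dev/null && break
  sleep 1
done
curl http://localhost:8000/v1/status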

E2E Runner

Use the dedicated external runner for real-model parity checks:

uv run python scripts/run_e2e.py --mode smoke

Defaults:

  • primary model: mlx-community/Llama-3.2-1B-Instruct-4bit
  • secondary model: mlx-community/TinyLlama-1.1B-Chat-v1.0-4bit
  • download-only model: mlx-community/AMD-Llama-135m-4bit

Behavior:

  • smoke runs startup_serve, api_core, run_cli, and benchmark_smoke
  • full adds downloads, LRU reuse, and knob propagation checks
  • --allow-launchd enables the explicit startup_launchd scenario
  • logs, PTY transcripts, and the JSON report are written under .artifacts/e2e/

Prerequisites:

  • main e2e scenarios expect the primary and secondary models to already exist in the Hugging Face cache
  • only the dedicated download scenario is allowed to fetch models by default
  • the runner isolates vllmlx state under VLLMLX_STATE_DIR and uses an isolated launchd label/path so it does not reuse the normal ~/.vllmlx daemon state
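
Given those prerequisites, a typical invocation looks like:

# Full scenario set (models must already be in the Hugging Face cache)
uv run python scripts/run_e2e.py --mode full

# Include the explicit launchd scenario
uv run python scripts/run_e2e.py --mode full --allow-launchd

# Reports, logs, and PTY transcripts land under .artifacts/e2e/
ls .artifacts/e2e/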

Benchmark JSON

vllmlx benchmark supports machine-readable output:

vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0

When --json is set, stdout contains only the benchmark summary JSON.
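
Because stdout carries only the summary JSON, it can be captured or piped straight into other tools (the field names depend on the summary schema, so this is just a capture-and-inspect sketch):

# Save the summary and pretty-print it with the standard library
vllmlx benchmark mlx-community/Llama-3.2-1B-Instruct-4bit --json -n 1 -t 16 --warmup 0 > bench.json
python3 -m json.tool bench.json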

Troubleshooting

See docs/troubleshooting.md for common issues and solutions.

License

MIT - see LICENSE for details.
