vLLM platform plugin for Moore Threads MUSA GPUs

vLLM MUSA Platform Plugin

A vLLM platform plugin that enables running vLLM on Moore Threads MUSA GPUs.

Overview

This plugin provides MUSA (Moore Threads Unified Software Architecture) support for vLLM through:

  • torchada: CUDA→MUSA compatibility layer for PyTorch
  • pymtml: Moore Threads Management Library for device queries
  • Triton patches: Compatibility fixes for MUSA's Triton compiler

Requirements

  • Python 3.9+
  • vLLM
  • Moore Threads GPU with MUSA toolkit installed
  • torchada (CUDA→MUSA compatibility)
  • mthreads-ml-py (pymtml - MTML bindings)

Installation

From Source (Development)

# Clone the repository
git clone https://github.com/vllm-project/vllm-musa.git
cd vllm-musa

# Install in development mode
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

From PyPI

pip install vllm-musa

Verification

After installation, verify that the plugin package imports cleanly:

python -c "from vllm_musa_platform import musa_platform_plugin; print('Plugin loaded successfully')"

Check if MTML (device management) is available:

python -c "from vllm_musa_platform import mtml; print(f'MTML available: {mtml.is_mtml_available()}')"
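
The import checks above confirm that the package loads, but not that vLLM can discover it. The following is a minimal sketch of a discovery check. It assumes the plugin registers itself under the `vllm.platform_plugins` entry-point group, which is the group vLLM uses for platform plugins. The helper name `list_platform_plugins` is hypothetical and not part of this package.

```python
from importlib.metadata import entry_points


def list_platform_plugins(group="vllm.platform_plugins"):
    """Return the sorted names of entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be selected by group.
        selected = eps.select(group=group)
    else:
        # Python 3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


if __name__ == "__main__":
    # The entry-point name this plugin registers is not stated in this README,
    # so print every platform plugin that is visible to the interpreter.
    print(list_platform_plugins())
```

An empty list suggests that the package is not installed in the active environment, or that it registers under a different group than assumed here.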

Usage

Once installed, the plugin is automatically detected by vLLM. Simply run vLLM as usual:

from vllm import LLM, SamplingParams

# vLLM will automatically use the MUSA platform
llm = LLM(model="your-model-path", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
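
To confirm which platform vLLM selected before loading a model, something along these lines may help. It is a sketch that assumes `vllm.platforms.current_platform` exposes `device_name` and `device_type` attributes; the values reported on a MUSA system are not stated in this README, so none are shown. `describe_platform` is a hypothetical helper name.

```python
def describe_platform():
    """Return (device_name, device_type) for vLLM's active platform.

    Returns None when vLLM cannot be imported in this environment.
    """
    try:
        from vllm.platforms import current_platform
    except ImportError:
        return None
    # getattr with a default keeps the sketch usable if an attribute is absent.
    return (
        getattr(current_platform, "device_name", None),
        getattr(current_platform, "device_type", None),
    )


if __name__ == "__main__":
    print(describe_platform())
```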

Environment Variables

  • MUSA_VISIBLE_DEVICES: Control which MUSA devices are visible (similar to CUDA_VISIBLE_DEVICES)
  • VLLM_WORKER_MULTIPROC_METHOD=spawn: Recommended for multi-process workers

Example

VLLM_WORKER_MULTIPROC_METHOD=spawn python -c "
from vllm import LLM, SamplingParams

llm = LLM(model='/path/to/model', trust_remote_code=True, enforce_eager=True)
outputs = llm.generate(['Hello!'], SamplingParams(max_tokens=20))
print(outputs[0].outputs[0].text)
"

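The same environment can be prepared from Python when launching a worker script as a subprocess. This is a sketch only: it assumes `MUSA_VISIBLE_DEVICES` accepts a comma-separated list of device indices in the same way as `CUDA_VISIBLE_DEVICES`, and the helper names `musa_env` and `run_script` are hypothetical.

```python
import os
import subprocess
import sys


def musa_env(devices, base=None):
    """Return a copy of the environment with the MUSA-related variables set."""
    env = dict(os.environ if base is None else base)
    # Assumption: comma-separated indices, as with CUDA_VISIBLE_DEVICES.
    env["MUSA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    # Recommended above for multi-process workers.
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    return env


def run_script(script_path, devices):
    """Run a Python script with the prepared environment; return its exit code."""
    completed = subprocess.run(
        [sys.executable, script_path], env=musa_env(devices), check=False
    )
    return completed.returncode
```

Setting the variables on a copy of the environment, rather than on `os.environ`, keeps the parent process unchanged.
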
Testing

Unit Tests

Run the test suite:

# Run all tests
pytest tests/ -v

# Run specific test files
pytest tests/test_mtml.py -v
pytest tests/test_musa.py -v
pytest tests/test_patches.py -v

# Run with coverage
pytest tests/ -v --cov=vllm_musa_platform --cov-report=term-missing

Supported vLLM Versions

This plugin supports multiple vLLM versions:

vLLM Version   PyTorch Version   Engine    Status
0.10.1.1       2.7.1             V0/V1     ✅ Supported
0.13.0         2.7.1             V1 only   ✅ Supported

Testing with Different vLLM Versions

vLLM 0.10.1.1 (with torch 2.7.1)

# Install the plugin (vLLM 0.10.1.1 is installed automatically as a dependency)
pip install -e .

# Start the server
vllm serve /path/to/model/

# In another terminal, test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model/", "prompt": "Hello!", "max_tokens": 50}'

vLLM 0.13.0 (with torch 2.7.1)

Important: Use --no-deps when upgrading vLLM to prevent torch from being replaced. The MUSA container includes a pre-configured torch 2.7.1 that must not be overwritten.

# Install the plugin (vLLM 0.10.1.1 is installed automatically as a dependency)
pip install -e .

# Upgrade to vLLM 0.13.0 without reinstalling dependencies
pip install vllm==0.13.0 --no-deps --upgrade

# Install additional dependencies required by vLLM 0.13.0
pip install 'depyf==0.20.0' 'llguidance>=1.3.0,<1.4.0' \
            'lm-format-enforcer==0.11.3' 'outlines_core==0.2.11' \
            'xgrammar==0.1.27' 'compressed-tensors==0.12.2' \
            'model-hosting-container-standards<1.0.0,>=0.1.9' \
            ijson anthropic mcp

# Start the server
vllm serve /path/to/model/

# In another terminal, test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model/", "prompt": "Hello!", "max_tokens": 50}'

# Test chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model/", "messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 50}'
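
The completions request above can also be issued from Python with the standard library. This sketch assumes the server is listening on `http://localhost:8000` as in the curl commands, and that the response follows the OpenAI-compatible shape with the generated text under `choices[0].text`. The function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=50):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=50, timeout=60):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-compatible response layout.
    return body["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("http://localhost:8000", "/path/to/model/", "Hello!"))
```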

Version-Specific Notes

vLLM 0.10.x

  • Supports both V0 and V1 engines
  • Set the VLLM_USE_V1=1 environment variable to enable the V1 engine
  • The vllm.worker.worker module exists for V0 engine support

vLLM 0.13.x

  • V1 is the default (and only) engine
  • The vllm.worker module was removed (V0 engine deprecated)
  • Requires additional dependencies: depyf, llguidance, lm-format-enforcer, outlines_core, xgrammar, compressed-tensors, model-hosting-container-standards, ijson, anthropic, mcp
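
After switching vLLM versions, a quick report of what is installed can catch a replaced torch or an unexpected engine layout. The sketch below reads installed package versions and checks whether the `vllm.worker.worker` module mentioned above is present. The helper names are hypothetical, and the check only reports what it finds; it does not decide whether a combination is supported.

```python
from importlib.metadata import PackageNotFoundError, version
from importlib.util import find_spec


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def has_module(name):
    """Return True if the dotted module `name` can be located."""
    try:
        # Note: find_spec imports parent packages of a dotted name.
        return find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def environment_report():
    """Collect the versions and module layout relevant to the notes above."""
    return {
        "vllm": installed_version("vllm"),
        "torch": installed_version("torch"),
        "v0_worker_present": has_module("vllm.worker.worker"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the reported torch version differs from the one shipped in the MUSA container, the `--no-deps` step was probably skipped.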

Docker Testing

For containerized testing with MUSA GPUs:

# Start a container with MUSA support
docker run -d --net host --privileged --pid=host --shm-size 500g \
  -v "$PWD":/ws -w /ws \
  -v /data/vllm:/home/dist \
  --name musa-test \
  sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120 \
  sleep infinity

# Enter the container
docker exec -it musa-test bash

# Inside the container, install and test
pip install -e /ws
vllm serve /home/dist/your-model/

Project Structure

vllm-musa/
├── pyproject.toml              # Project configuration
├── README.md                   # This file
├── vllm_musa_platform/         # Main package
│   ├── __init__.py             # Plugin entry point
│   ├── mtml.py                 # MTML wrapper (device management)
│   ├── musa.py                 # MUSA platform implementation
│   └── patches/                # Compatibility patches
│       ├── __init__.py         # Patch application logic
│       ├── README.md           # Patch documentation
│       ├── vllm__attention__ops__triton_unified_attention.patch.py
│       ├── vllm__v1__worker__gpu_worker.patch.py
│       └── vllm__worker__worker.patch.py
└── tests/                      # Test suite
    ├── conftest.py             # Pytest fixtures
    ├── test_mtml.py            # MTML wrapper tests
    ├── test_musa.py            # Platform tests
    └── test_patches.py         # Patch system tests

Patches

The plugin includes runtime patches for vLLM compatibility with MUSA's Triton compiler. See vllm_musa_platform/patches/README.md for details.
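
The patch file names in the project structure appear to encode the target vLLM module, with double underscores standing in for dots. That reading is an inference from the file names, not something this README states, and the function below is a hypothetical illustration rather than the plugin's actual patch loader.

```python
def patch_target_module(filename):
    """Map a patch file name to the dotted module name it appears to target.

    Assumption: '__' separates package components and the file ends in
    '.patch.py'. Single underscores inside a component are kept as-is.
    """
    suffix = ".patch.py"
    if not filename.endswith(suffix):
        raise ValueError(f"not a patch file: {filename!r}")
    return filename[: -len(suffix)].replace("__", ".")


if __name__ == "__main__":
    print(patch_target_module("vllm__v1__worker__gpu_worker.patch.py"))
    # prints: vllm.v1.worker.gpu_worker
```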

License

Apache-2.0

Contributing

Contributions are welcome! Please ensure all tests pass before submitting:

# Run tests
pytest tests/ -v

# Run linter (if ruff is installed)
ruff check .

# Run type checker (if mypy is installed)
mypy vllm_musa_platform/
