
Efficient LLM inference on Slurm clusters using vLLM.

Project description

Vector Inference: Easy inference on Slurm clusters

This repository provides an easy-to-use solution for running inference servers on Slurm-managed computing clusters using vLLM. All scripts in this repository run natively on the Vector Institute cluster environment. To adapt them to other environments, update the config files in the vec_inf/models folder and the environment variables in the model launching scripts in vec_inf accordingly.

Installation

If you are using the Vector cluster environment and don't need any customization to the inference server environment, run the following to install the package:

pip install vec-inf

Otherwise, we recommend using the provided Dockerfile to set up your own environment with the package.

Launch an inference server

We will use the Llama 3 model as an example. To launch an inference server for Llama 3 8B, run:

vec-inf launch llama-3

The command submits a Slurm job and prints its details, including the Slurm job ID you will need for the status and shutdown commands.

There is a default variant for every model family, specified in vec_inf/models/{MODEL_FAMILY_NAME}/README.md. You can switch to another variant with the --model-variant option; make sure to change the requested resources accordingly. More information about the available options can be found in the vec_inf/models folder. The inference server is compatible with the OpenAI Completion and ChatCompletion APIs.

You can check the inference server status by providing the Slurm job ID to the status command:

vec-inf status 13014393

The command reports the current state of the server and, once the server is ready, the base URL for sending requests.

There are 5 possible states:

  • PENDING: Job submitted to Slurm, but not executed yet.
  • LAUNCHING: Job is running but the server is not ready yet.
  • READY: Inference server running and ready to take requests.
  • FAILED: Inference server in an unhealthy state.
  • SHUTDOWN: Inference server is shutdown/cancelled.

Note that the base URL is only available when the model is in the READY state. Both the launch and status commands support --json-mode, which structures the output as a JSON string.
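
If you are scripting around the CLI, --json-mode makes it straightforward to poll until the server is ready. The sketch below is illustrative only: the wait_until_ready helper is ours, and the model_status key is an assumption, so check the actual --json-mode output for the real field names.

```python
import json
import subprocess
import time

def wait_until_ready(job_id: str, poll_seconds: int = 30) -> dict:
    """Poll `vec-inf status <job_id> --json-mode` until the server
    reaches READY or a terminal state, then return the parsed payload."""
    while True:
        result = subprocess.run(
            ["vec-inf", "status", job_id, "--json-mode"],
            capture_output=True, text=True, check=True,
        )
        status = json.loads(result.stdout)
        # NOTE: "model_status" is an assumed key name; inspect the real
        # --json-mode output to confirm how the state is reported.
        state = status.get("model_status")
        if state == "READY":
            return status
        if state in ("FAILED", "SHUTDOWN"):
            raise RuntimeError(f"Server entered terminal state: {state}")
        time.sleep(poll_seconds)

info = wait_until_ready("13014393")
print(info)
```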

Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:

vec-inf shutdown 13014393

> Shutting down model with Slurm Job ID: 13014393

Here is a more complicated example that launches a model variant across multiple nodes. Say we want to launch Mixtral 8x22B; run:

vec-inf launch mixtral --model-variant 8x22B-v0.1 --num-nodes 2 --num-gpus 4

And for launching a multimodal model, here is an example that launches LLaVa-NEXT Mistral 7B (the default variant):

vec-inf launch llava-v1.6 --is-vlm 

Send inference requests

Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the examples folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run python examples/inference/llm/completions.py, and you should expect to see an output like the following:

{"id":"cmpl-bdf43763adf242588af07af88b070b62","object":"text_completion","created":2983960,"model":"/model-weights/Llama-2-7b-hf","choices":[{"index":0,"text":"\nCanada is close to the actual continent of North America. Aside from the Arctic islands","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}

NOTE: For multimodal models, currently only the ChatCompletion endpoint is available, and only one image can be provided per prompt.
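
For illustration, a single-image ChatCompletion request could look like the following sketch, using the OpenAI chat message format that vLLM accepts for vision models. The base URL, model weights path, and image URL are all placeholders.

```python
from openai import OpenAI

# Placeholders: substitute the base URL from `vec-inf status` and the
# weights path of the multimodal model you launched.
client = OpenAI(base_url="http://<server-host>:8081/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/model-weights/llava-v1.6-mistral-7b-hf",  # assumed path
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Only one image per prompt is currently supported.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    max_tokens=50,
)
print(response.choices[0].message.content)
```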

SSH tunnel from your local device

If you want to run inference from your local device, you can open an SSH tunnel to your cluster environment like the following:

ssh -L 8081:172.17.8.29:8081 username@v.vectorinstitute.ai -N

The example provided above is for the Vector cluster; change the variables accordingly for your environment.
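
With the tunnel open, the server is reachable on localhost, so a request from your local machine looks the same as one sent on the cluster. A minimal sketch (the model weights path is a placeholder):

```python
from openai import OpenAI

# The tunnel above forwards local port 8081 to the inference server,
# so the base URL points at localhost. The model path is a placeholder.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="EMPTY")

completion = client.completions.create(
    model="/model-weights/Llama-2-7b-hf",
    prompt="Hello, my name is",
    max_tokens=10,
)
print(completion.choices[0].text)
```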

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vec_inf-0.3.0.tar.gz (16.4 kB)

Uploaded Source

Built Distribution

vec_inf-0.3.0-py3-none-any.whl (24.2 kB)

Uploaded Python 3

File details

Details for the file vec_inf-0.3.0.tar.gz.

File metadata

  • Download URL: vec_inf-0.3.0.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.0

File hashes

Hashes for vec_inf-0.3.0.tar.gz
  • SHA256: 0cb2276e81f7f281057d1ed961533a3588c4241dd4d86c7786a0f7d7efbe1c74
  • MD5: 4324490ab1c651d34109bea4d93c8f87
  • BLAKE2b-256: a796634031248ca51599d18b60b5ba40705dd5d94a87922754205c3acad50e59


File details

Details for the file vec_inf-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: vec_inf-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 24.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.0

File hashes

Hashes for vec_inf-0.3.0-py3-none-any.whl
  • SHA256: 61686b7b59189a4b0e8ca7b305ef19fb2e01b72bd4538c86e3ecc8f6515765a5
  • MD5: fc7f91668fb9ff5931661b42193d7fe7
  • BLAKE2b-256: 49f31a1e0f07b42341219955295604af2258196622514bee44f47e6f56262f40

