Efficient LLM inference on Slurm clusters using vLLM.
Project description
Vector Inference: Easy inference on Slurm clusters
This repository provides an easy-to-use solution to run inference servers on Slurm-managed computing clusters using vLLM. All scripts in this repository runs natively on the Vector Institute cluster environment. To adapt to other environments, update the config files in the vec_inf/models
folder and the environment variables in the model launching scripts in vec_inf
accordingly.
Installation
If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install package:
pip install vec-inf
Otherwise, we recommend using the provided Dockerfile
to set up your own environment with the package
Launch an inference server
We will use the Llama 3 model as example, to launch an inference server for Llama 3 8B, run:
vec-inf launch llama-3
You should see an output like the following:
There is a default variant for every model family, which is specified in vec_inf/models/{MODEL_FAMILY_NAME}/README.md
, you can switch to other variants with the --model-variant
option, and make sure to change the requested resource accordingly. More information about the available options can be found in the vec_inf/models
folder. The inference server is compatible with the OpenAI Completion
and ChatCompletion
API.
You can check the inference server status by providing the Slurm job ID to the status
command:
vec-inf status 13014393
You should see an output like the following:
There are 5 possible states:
- PENDING: Job submitted to Slurm, but not executed yet.
- LAUNCHING: Job is running but the server is not ready yet.
- READY: Inference server running and ready to take requests.
- FAILED: Inference server in an unhealthy state.
- SHUTDOWN: Inference server is shutdown/cancelled.
Note that the base URL is only available when model is in READY
state.
Both launch
and status
command supports --json-mode
, where the output information would be structured as a JSON string.
Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
vec-inf shutdown 13014393
> Shutting down model with Slurm Job ID: 13014393
Here is a more complicated example that launches a model variant using multiple nodes, say we want to launch Mixtral 8x22B, run
vec-inf launch mixtral --model-variant 8x22B-v0.1 --num-nodes 2 --num-gpus 4
And for launching a multimodal model, here is an example for launching LLaVa-NEXT Mistral 7B (default variant)
vec-inf launch llava-v1.6 --is-vlm
Send inference requests
Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in examples
folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run python examples/inference/llm/completions.py
, and you should expect to see an output like the following:
{"id":"cmpl-bdf43763adf242588af07af88b070b62","object":"text_completion","created":2983960,"model":"/model-weights/Llama-2-7b-hf","choices":[{"index":0,"text":"\nCanada is close to the actual continent of North America. Aside from the Arctic islands","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
NOTE: For multimodal models, currently only ChatCompletion
is available, and only one image can be provided for each prompt.
SSH tunnel from your local device
If you want to run inference from your local device, you can open a SSH tunnel to your cluster environment like the following:
ssh -L 8081:172.17.8.29:8081 username@v.vectorinstitute.ai -N
The example provided above is for the vector cluster, change the variables accordingly for your environment
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file vec_inf-0.3.0.tar.gz
.
File metadata
- Download URL: vec_inf-0.3.0.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0cb2276e81f7f281057d1ed961533a3588c4241dd4d86c7786a0f7d7efbe1c74 |
|
MD5 | 4324490ab1c651d34109bea4d93c8f87 |
|
BLAKE2b-256 | a796634031248ca51599d18b60b5ba40705dd5d94a87922754205c3acad50e59 |
File details
Details for the file vec_inf-0.3.0-py3-none-any.whl
.
File metadata
- Download URL: vec_inf-0.3.0-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 61686b7b59189a4b0e8ca7b305ef19fb2e01b72bd4538c86e3ecc8f6515765a5 |
|
MD5 | fc7f91668fb9ff5931661b42193d7fe7 |
|
BLAKE2b-256 | 49f31a1e0f07b42341219955295604af2258196622514bee44f47e6f56262f40 |