A high-throughput and memory-efficient inference and serving engine for LLMs
Neural Magic vLLM
About
vLLM is a fast and easy-to-use library for LLM inference and serving, to which Neural Magic regularly lands upstream improvements. This fork reflects our opinionated focus on the latest LLM optimizations, such as quantization and sparsity.
Installation
nm-vllm is a Python library that contains pre-compiled C++ and CUDA (12.1) binaries.
Install it using pip:
```bash
pip install nm-vllm
```
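As a quick sanity check, note that nm-vllm is imported under the `vllm` module name (as used in the Quickstart below). A minimal sketch, assuming the package exposes a `__version__` attribute as upstream vLLM does:

```python
# Verify the install: nm-vllm is imported as `vllm`.
import vllm

print(vllm.__version__)
```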
To use the weight-sparsity kernels, e.g. via sparsity="sparse_w16a16", install the extras:

```bash
pip install nm-vllm[sparsity]
```
You can also build and install nm-vllm from source (this will take ~10 minutes):

```bash
git clone https://github.com/neuralmagic/nm-vllm.git
cd nm-vllm
pip install -e .
```
Quickstart
Many sparse models are already available on our Hugging Face organization profiles, neuralmagic and nm-testing, where you can find a collection of SparseGPT models ready for inference.
Here is a smoke test using a small llama2-110M test model trained on storytelling:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/llama2.c-stories110M-pruned2.4",
    sparsity="sparse_w16a16",  # If left off, the model will be loaded as dense
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
Here is a more realistic example of running a 50% sparse OpenHermes 2.5 Mistral 7B model finetuned for instruction-following:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024,  # Cap the context length to reduce memory use
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
You can also use the same flow with an OpenAI-compatible model server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/OpenHermes-2.5-Mistral-7B-pruned50 \
    --sparsity sparse_w16a16
```
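Once the server is running, any OpenAI-compatible client can query it. Below is a minimal client sketch using the openai Python package; the port (8000, vLLM's default) and the placeholder API key are assumptions, not part of the command above:

```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    prompt="Hello my name is",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)
```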
Download files
Built Distributions
Hashes for nm_vllm-0.1.0-cp311-cp311-manylinux_2_17_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | d6485686808c5dc2c60a68438cf45fa735fbcdc7185ee8c44c52921990186b6f |
| MD5 | 0ca13ea5ee203a8833a19715222aee04 |
| BLAKE2b-256 | dd9fa8b3e254e3d463bcfa5f995d3db9119a1e00d3624d9951b1f4a9c47e5124 |

Hashes for nm_vllm-0.1.0-cp310-cp310-manylinux_2_17_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 115635e84d544d3a1c3477b0e3b75c90d5869e85e4d137aff5487a4e284017b3 |
| MD5 | d5785dfb71ce4bcadead0ff74f722eae |
| BLAKE2b-256 | 560a51c79f009668642020ca119b2551fbe0e02098142ff8d90c05bf530165f3 |

Hashes for nm_vllm-0.1.0-cp39-cp39-manylinux_2_17_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | c3bd36b2a18c1942c2aacacdc6c1e788356b4d74cb311b1666a5f7415044f075 |
| MD5 | 48cbe4e01394443718bb935f963f3e32 |
| BLAKE2b-256 | 116435045db33d8e7903b93daa06c910e9c29530aaef927eaeb83336b69ddad5 |

Hashes for nm_vllm-0.1.0-cp38-cp38-manylinux_2_17_x86_64.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 053456cbb01db3caadc730510ac471d73d7f87f65b18f1ec0f53d97a3040f7ab |
| MD5 | 4b46e8ee8bbea792e2d2000cb7eb2d62 |
| BLAKE2b-256 | 403f0409ebdcd069bcda667fa59d8251dfaca27070518523b5fc7f795688297f |