A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

Neural Magic vLLM

About

vLLM is a fast and easy-to-use library for LLM inference and serving, to which Neural Magic regularly contributes upstream improvements. This fork reflects our opinionated focus on the latest LLM optimizations, such as quantization and sparsity.

Installation

nm-vllm is a Python library that contains pre-compiled C++ and CUDA (12.1) binaries.

Install it using pip:

pip install nm-vllm
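To confirm the install, try importing the library (nm-vllm installs under the vllm module name, which the examples below also use):

python -c "from vllm import LLM, SamplingParams; print('nm-vllm import OK')"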

To use the weight-sparsity kernels, for example via sparsity="sparse_w16a16", install the extras:

pip install nm-vllm[sparsity]

You can also build and install nm-vllm from source (this will take ~10 minutes):

git clone https://github.com/neuralmagic/nm-vllm.git
cd nm-vllm
pip install -e .

Quickstart

Many sparse models are already available on our Hugging Face organization profiles, neuralmagic and nm-testing. There you can find a collection of SparseGPT models ready for inference.

Here is a smoke test using a small llama2 110M model trained on storytelling:

from vllm import LLM, SamplingParams

# Load a pruned 110M llama2 storytelling model with the sparse
# weight kernels enabled.
model = LLM(
    "nm-testing/llama2.c-stories110M-pruned2.4",
    sparsity="sparse_w16a16",   # If left off, model will be loaded as dense
)

# Greedy decoding (temperature=0), generating up to 100 new tokens.
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Here is a more realistic example: running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction following:

from vllm import LLM, SamplingParams

# max_model_len caps the context length to reduce memory usage.
model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",
    max_model_len=1024,
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

You can also use the same flow with an OpenAI-compatible model server:

python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/OpenHermes-2.5-Mistral-7B-pruned50 \
    --sparsity sparse_w16a16
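
Once the server is running, any OpenAI-compatible client can query it. Here is a minimal sketch assuming the server's defaults (listening on localhost:8000) and the openai Python client (v1 or later); the api_key value is a placeholder, since the server does not require one by default:

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    prompt="Hello my name is",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)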

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

nm_vllm-0.1.0-cp311-cp311-manylinux_2_17_x86_64.whl (59.0 MB) - CPython 3.11, manylinux: glibc 2.17+ x86-64

nm_vllm-0.1.0-cp310-cp310-manylinux_2_17_x86_64.whl (58.9 MB) - CPython 3.10, manylinux: glibc 2.17+ x86-64

nm_vllm-0.1.0-cp39-cp39-manylinux_2_17_x86_64.whl (58.9 MB) - CPython 3.9, manylinux: glibc 2.17+ x86-64

nm_vllm-0.1.0-cp38-cp38-manylinux_2_17_x86_64.whl (59.0 MB) - CPython 3.8, manylinux: glibc 2.17+ x86-64
