vLLM backend for kani
kani-ext-vllm
This Kani extension adds three engines for deploying LLMs on local hardware with vLLM.
vLLM is an LLM deployment platform optimized for GPU memory efficiency and throughput. Depending on your use case, this extension's engines can run vLLM in offline mode, manage a vLLM server for you, or connect to an existing vLLM server.
This package can be installed from PyPI:

```shell
$ pip install kani-ext-vllm
```
Alternatively, you can install it from the git source:

```shell
$ pip install git+https://github.com/zhudotexe/kani-ext-vllm.git@main
```
See https://docs.vllm.ai/en/latest/index.html for more information on vLLM.
Usage
This package provides three main methods of serving models with vLLM:
- Offline mode
- vLLM-Native API mode
- OpenAI-Compatible API mode
The modes are generally equivalent, but each offers slightly different options:
| Mode | Communication | Multiple Parallel Models? | Prompt Template/Parsing | Best For |
|---|---|---|---|---|
| Offline | Local | No | kani | Low-level control over the model |
| vLLM API | HTTP | Yes | kani | Running multiple different models in parallel |
| OpenAI API | HTTP | Yes | vLLM | Fast iteration and testing multiple models; multimodal models |
Offline Mode
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMEngine

engine = VLLMEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
ai = Kani(engine)
chat_in_terminal(ai)
```
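For programmatic use outside of a terminal chat, the engine plugs into kani's standard async API. A minimal sketch (the prompt is only an example; `chat_round_str` and `close` are core kani methods, not part of this extension):

```python
import asyncio

from kani import Kani
from kani.ext.vllm import VLLMEngine


async def main():
    engine = VLLMEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
    ai = Kani(engine)
    # run a single chat round and print the model's reply
    print(await ai.chat_round_str("Tell me about the kani framework."))
    # release the engine's resources when finished
    await engine.close()


asyncio.run(main())
```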
vLLM-Native API Mode
The API mode can be used to connect to an existing running vLLM server or to start a managed vLLM server.
Connecting to a Running Server
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMServerEngine

engine = VLLMServerEngine(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    vllm_host="127.0.0.1",
    vllm_port=8000,
    use_managed_server=False,
)
ai = Kani(engine)
chat_in_terminal(ai)
```
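This assumes a vLLM server is already listening at the given host and port; for example, one could be started separately with the standard `vllm serve` CLI (the model and flags below are illustrative):

```shell
$ vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 127.0.0.1 --port 8000
```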
Managed Server
[!NOTE] The vLLM server will be started on a random free port. It will not be exposed to the wider internet (i.e., it binds to localhost).
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMServerEngine

engine = VLLMServerEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
ai = Kani(engine)
chat_in_terminal(ai)
```
OpenAI-Compatible API Mode
Connecting to a Running Server
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMOpenAIEngine

engine = VLLMOpenAIEngine(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    vllm_host="127.0.0.1",
    vllm_port=8000,
    use_managed_server=False,
)
ai = Kani(engine)
chat_in_terminal(ai)
```
Managed Server
[!NOTE] The vLLM server will be started on a random free port. It will not be exposed to the wider internet (i.e., it binds to localhost).
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMOpenAIEngine

engine = VLLMOpenAIEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
ai = Kani(engine)
chat_in_terminal(ai)
```
Using Multiple GPUs
For multi-GPU support (often necessary for larger models), add `model_load_kwargs={"tensor_parallel_size": 4}` to the engine constructor, replacing `4` with the number of GPUs you have available.
[!NOTE] If you are loading a model in one of the API modes, use `vllm_args={"tensor_parallel_size": 4}` instead.
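As a minimal sketch of both forms (assuming 4 GPUs; the full examples below show more complete configurations):

```python
from kani.ext.vllm import VLLMEngine, VLLMServerEngine

# offline mode: pass tensor parallelism via model_load_kwargs
engine = VLLMEngine(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    model_load_kwargs={"tensor_parallel_size": 4},
)

# API modes: pass the same option via vllm_args instead
server_engine = VLLMServerEngine(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    vllm_args={"tensor_parallel_size": 4},
)
```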
Examples
Offline Mode
```python
from kani.ext.vllm import VLLMEngine
from vllm import SamplingParams

model = VLLMEngine(
    model_id="mistralai/Mistral-Small-Instruct-2409",
    model_load_kwargs={"tensor_parallel_size": 2, "tokenizer_mode": "auto"},
    sampling_params=SamplingParams(temperature=0, max_tokens=2048),
)
```
vLLM-Native API Mode
```python
from kani.ext.vllm import VLLMServerEngine

model = VLLMServerEngine(
    model_id="mistralai/Mistral-Small-Instruct-2409",
    vllm_args={"tensor_parallel_size": 2, "tokenizer_mode": "auto"},
    # note that these should not be wrapped in SamplingParams!
    temperature=0,
    max_tokens=2048,
)
```
See https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#completions-api_1 for a list of valid decoding parameters that can be specified in the engine constructor.
See https://docs.vllm.ai/en/stable/cli/serve/ for a list of valid arguments to vllm_args.
OpenAI-Compatible API Mode
```python
from kani.ext.vllm import VLLMOpenAIEngine

model = VLLMOpenAIEngine(
    model_id="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    vllm_args={"tensor_parallel_size": 2, "allowed_local_media_path": "/"},
    # note that these should not be wrapped in SamplingParams!
    temperature=0,
    max_tokens=2048,
)
```
See https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#chat-api_1 for a list of valid decoding parameters that can be specified in the engine constructor.
See https://docs.vllm.ai/en/stable/cli/serve/ for a list of valid arguments to vllm_args.
File details
Details for the file kani_ext_vllm-0.2.2.tar.gz.
File metadata
- Download URL: kani_ext_vllm-0.2.2.tar.gz
- Upload date:
- Size: 17.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0bf7f2827839876d63fca24a3afd80688e0c659c90ca730f16666046353ab951 |
| MD5 | 2c99b3df75b9be083958aa6e483eb2b9 |
| BLAKE2b-256 | 1afe8eaf893629f45d3b938c6370d8ab98c032530c1de0d51409f7ac081b0d4e |
Provenance
The following attestation bundles were made for kani_ext_vllm-0.2.2.tar.gz:
Publisher: pythonpublish.yml on zhudotexe/kani-ext-vllm

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kani_ext_vllm-0.2.2.tar.gz
- Subject digest: 0bf7f2827839876d63fca24a3afd80688e0c659c90ca730f16666046353ab951
- Sigstore transparency entry: 751896397
- Permalink: zhudotexe/kani-ext-vllm@1534f8d4110ef28d1773cafce7cecd9aa8d16e2d
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/zhudotexe
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pythonpublish.yml@1534f8d4110ef28d1773cafce7cecd9aa8d16e2d
- Trigger Event: release
File details
Details for the file kani_ext_vllm-0.2.2-py3-none-any.whl.
File metadata
- Download URL: kani_ext_vllm-0.2.2-py3-none-any.whl
- Upload date:
- Size: 14.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 658aea10351baa311e529bf163b042b2050f672190862abf7472a573dccd986c |
| MD5 | 3914fc3f5477012d0f49e160b49f9bcd |
| BLAKE2b-256 | 641c47eba18c94e2cbce883a28839aa43a93c9b89ca69a7ddfa1bf6efa9e30ff |
Provenance
The following attestation bundles were made for kani_ext_vllm-0.2.2-py3-none-any.whl:
Publisher: pythonpublish.yml on zhudotexe/kani-ext-vllm

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kani_ext_vllm-0.2.2-py3-none-any.whl
- Subject digest: 658aea10351baa311e529bf163b042b2050f672190862abf7472a573dccd986c
- Sigstore transparency entry: 751896410
- Permalink: zhudotexe/kani-ext-vllm@1534f8d4110ef28d1773cafce7cecd9aa8d16e2d
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/zhudotexe
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pythonpublish.yml@1534f8d4110ef28d1773cafce7cecd9aa8d16e2d
- Trigger Event: release