vLLM backend for kani
kani-ext-vllm
This package adds the VLLMEngine, a kani engine that serves models with vLLM.
This package is considered provisional and maintained on a best-effort basis.
You can install this package from PyPI:

```shell
$ pip install kani-ext-vllm
```

Alternatively, you can install it from the git source:

```shell
$ pip install git+https://github.com/zhudotexe/kani-ext-vllm.git@main
```
See https://docs.vllm.ai/en/latest/index.html for more information on vLLM.
Usage
This package provides two main methods of serving models with vLLM: offline mode (preferred) and API mode. These are generally equivalent and differ only in how kani communicates with the vLLM workers.
Generally, you should use offline mode unless you need to load multiple models in parallel.
Offline Mode
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMEngine

engine = VLLMEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
ai = Kani(engine)
chat_in_terminal(ai)
```
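For programmatic (non-terminal) use, the engine also works with kani's async chat methods. A minimal sketch; the query is a placeholder:

```python
import asyncio

from kani import Kani
from kani.ext.vllm import VLLMEngine

async def main():
    engine = VLLMEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
    ai = Kani(engine)
    # chat_round_str returns just the text of the model's reply
    resp = await ai.chat_round_str("Tell me about the kani library.")
    print(resp)
    # free the model's resources when you're done
    await engine.close()

asyncio.run(main())
```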
API Mode
[!IMPORTANT] Using offline mode is preferred unless you need to load multiple models in parallel.
[!NOTE] The vLLM server will be started on a random free port. It will not be exposed to the wider internet (i.e., it binds to localhost).
When loading a model in API mode, the model's context length cannot be read from the configuration, so you must pass `max_context_len` yourself.
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMServerEngine

engine = VLLMServerEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", max_context_len=128000)
ai = Kani(engine)
chat_in_terminal(ai)
```
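Because each VLLMServerEngine runs its own server, API mode lets you serve several models side by side. A rough sketch, assuming your GPUs have enough memory for both models at once (the model IDs and context lengths here are illustrative):

```python
from kani import Kani
from kani.ext.vllm import VLLMServerEngine

# each engine starts its own vLLM server on its own random localhost port
engine_a = VLLMServerEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", max_context_len=8192)
engine_b = VLLMServerEngine(model_id="mistralai/Mistral-Small-Instruct-2409", max_context_len=32000)

# independent kani instances, one per model
ai_a = Kani(engine_a)
ai_b = Kani(engine_b)
```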
Command R
[!NOTE] Command R only supports loading in offline mode.
Command R's Hugging Face implementation does not support the full 128k context length. Cohere recommends using vLLM, so here we are.
```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import CommandRVLLMEngine

engine = CommandRVLLMEngine(model_id="CohereForAI/c4ai-command-r-v01")
ai = Kani(engine)
chat_in_terminal(ai)
```
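Reaching that 128k context in practice likely requires multiple GPUs. Assuming CommandRVLLMEngine accepts the same `model_load_kwargs` as VLLMEngine (see the next section), a sketch:

```python
from kani import Kani, chat_in_terminal
from kani.ext.vllm import CommandRVLLMEngine

engine = CommandRVLLMEngine(
    model_id="CohereForAI/c4ai-command-r-v01",
    # shard the model across 4 GPUs; replace 4 with your GPU count
    model_load_kwargs={"tensor_parallel_size": 4},
)
chat_in_terminal(Kani(engine))
```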
Using Multiple GPUs
For multi-GPU support (probably needed), add `model_load_kwargs={"tensor_parallel_size": 4}`. Replace 4 with the number of GPUs you have available.

[!NOTE] If you are loading in API mode, use `vllm_args={"tensor_parallel_size": 4}` instead.
Examples
Offline Mode
```python
from kani.ext.vllm import VLLMEngine
from vllm import SamplingParams

model = VLLMEngine(
    model_id="mistralai/Mistral-Small-Instruct-2409",
    model_load_kwargs={"tensor_parallel_size": 2, "tokenizer_mode": "auto"},
    sampling_params=SamplingParams(temperature=0, max_tokens=2048),
)
```
API Mode
```python
from kani.ext.vllm import VLLMServerEngine

model = VLLMServerEngine(
    model_id="mistralai/Mistral-Small-Instruct-2409",
    max_context_len=32000,
    vllm_args={"tensor_parallel_size": 2, "tokenizer_mode": "auto"},
    # note that these should not be wrapped in SamplingParams!
    temperature=0,
    max_tokens=2048,
)
```
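Note the difference between the two modes: offline mode takes sampling options wrapped in vLLM's SamplingParams, while API mode takes them as plain keyword arguments. In both cases, the configured engine plugs straight into a Kani instance, e.g.:

```python
from kani import Kani, chat_in_terminal

chat_in_terminal(Kani(model))
```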