
vLLM backend for kani


kani-ext-vllm

This repository adds the VLLMEngine, a kani engine powered by vLLM.

This package is considered provisional and maintained on a best-effort basis.

You can install this package from PyPI:

$ pip install kani-ext-vllm

Alternatively, you can install it from the git source:

$ pip install git+https://github.com/zhudotexe/kani-ext-vllm.git@main

See https://docs.vllm.ai/en/latest/index.html for more information on vLLM.

Usage

This package provides three main methods of serving models with vLLM:

  • Offline mode
  • vLLM-Native API mode
  • OpenAI-Compatible API mode

These modes are generally equivalent, but each offers slightly different options:

| Mode       | Communication | Multiple Parallel Models? | Prompt Template/Parsing | Best For                                                       |
|------------|---------------|---------------------------|-------------------------|----------------------------------------------------------------|
| Offline    | Local         | No                        | kani                    | Low-level control over the model                               |
| vLLM API   | HTTP          | Yes                       | kani                    | Running multiple different models in parallel                  |
| OpenAI API | HTTP          | Yes                       | vLLM                    | Fast iteration and testing multiple models; multimodal models  |

Offline Mode

from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMEngine

engine = VLLMEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct")
ai = Kani(engine)
chat_in_terminal(ai)

vLLM-Native API Mode

[!NOTE] Using offline mode is preferred unless you need to load multiple models in parallel.

[!NOTE] The vLLM server will be started on a random free port. It will not be exposed to the wider internet (i.e., it binds to localhost).

When loading a model in API mode, the model's context length cannot be read from its configuration, so you must pass max_context_size explicitly.

from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMServerEngine

engine = VLLMServerEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", max_context_size=128000)
ai = Kani(engine)
chat_in_terminal(ai)
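
Because each VLLMServerEngine starts its own server, a single Python process can drive several different models in parallel. A minimal sketch (the model IDs and context sizes here are illustrative, and you need enough GPU memory for both models):

from kani import Kani
from kani.ext.vllm import VLLMServerEngine

# each engine spawns its own vLLM server on a random free localhost port
llama = VLLMServerEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", max_context_size=8192)
mistral = VLLMServerEngine(model_id="mistralai/Mistral-Small-Instruct-2409", max_context_size=32000)

# independent kani instances, one per model
ai_llama = Kani(llama)
ai_mistral = Kani(mistral)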

OpenAI-Compatible API Mode

[!NOTE] The vLLM server will be started on a random free port. It will not be exposed to the wider internet (i.e., it binds to localhost).

When loading a model in API mode, the model's context length cannot be read from its configuration, so you must pass max_context_size explicitly.

from kani import Kani, chat_in_terminal
from kani.ext.vllm import VLLMOpenAIEngine

engine = VLLMOpenAIEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", max_context_size=128000)
ai = Kani(engine)
chat_in_terminal(ai)
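
Whichever mode you use, the engine plugs into kani's normal async API; chat_in_terminal is just a convenience wrapper. A minimal programmatic sketch (the model and prompt are illustrative):

import asyncio

from kani import Kani
from kani.ext.vllm import VLLMOpenAIEngine

async def main():
    engine = VLLMOpenAIEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", max_context_size=128000)
    ai = Kani(engine)
    # chat_round_str sends one user message and returns the assistant's reply as a string
    resp = await ai.chat_round_str("Briefly introduce yourself.")
    print(resp)
    # shut down the underlying vLLM server/engine
    await engine.close()

asyncio.run(main())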

Using Multiple GPUs

For multi-GPU support (probably needed for larger models), add model_load_kwargs={"tensor_parallel_size": 4}, replacing 4 with the number of GPUs you have available.

[!NOTE] If you are loading in an API mode, use vllm_args={"tensor_parallel_size": 4} instead.
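
For concreteness, a minimal sketch contrasting the two placements, assuming 4 GPUs (the model ID and context size are illustrative):

from kani.ext.vllm import VLLMEngine, VLLMServerEngine

# offline mode: tensor parallelism goes in model_load_kwargs
engine = VLLMEngine(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    model_load_kwargs={"tensor_parallel_size": 4},
)

# API modes: the same setting goes in vllm_args instead
server_engine = VLLMServerEngine(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    max_context_size=8192,
    vllm_args={"tensor_parallel_size": 4},
)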

Examples

Offline Mode

from kani.ext.vllm import VLLMEngine
from vllm import SamplingParams

model = VLLMEngine(
    model_id="mistralai/Mistral-Small-Instruct-2409",
    model_load_kwargs={"tensor_parallel_size": 2, "tokenizer_mode": "auto"},
    sampling_params=SamplingParams(temperature=0, max_tokens=2048),
)

vLLM-Native API Mode

from kani.ext.vllm import VLLMServerEngine

model = VLLMServerEngine(
    model_id="mistralai/Mistral-Small-Instruct-2409",
    max_context_size=32000,
    vllm_args={"tensor_parallel_size": 2, "tokenizer_mode": "auto"},
    # note that these should not be wrapped in SamplingParams!
    temperature=0,
    max_tokens=2048,
)

See https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#completions-api_1 for a list of valid decoding parameters that can be specified in the engine constructor.

OpenAI-Compatible API Mode

from kani.ext.vllm import VLLMOpenAIEngine

model = VLLMOpenAIEngine(
    model_id="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    max_context_size=32768,
    vllm_args={"tensor_parallel_size": 2, "allowed_local_media_path": "/"},
    # note that these should not be wrapped in SamplingParams!
    temperature=0,
    max_tokens=2048,
)

See https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#chat-api_1 for a list of valid decoding parameters that can be specified in the engine constructor.
