Sinapsis vLLM
Sinapsis Templates for LLM text completion with vLLM
🐍 Installation • 🚀 Features • 📚 Usage example • 🌐 Webapps • 📙 Documentation • 🔍 License
The sinapsis-vllm module provides a suite of templates to run LLMs with vLLM, a high-throughput and memory-efficient inference engine for serving large language models.
🐍 Installation
Install using your package manager of choice. We encourage the use of uv.
Example with uv:
uv pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech
or with raw pip:
pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech
[!IMPORTANT] Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:
with uv:
uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
or with raw pip:
pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
🚀 Features
Templates Supported
- vLLMTextCompletion: Template for text completion using vLLM.
  Attributes
  - init_args (vLLMInitArgs, required): vLLM engine configuration arguments.
    - llm_model_name (str, required): The name or path of the LLM model to use (e.g., 'Qwen/Qwen3-1.7B').
    - tokenizer_mode (str, optional): The tokenizer mode. "auto" will use the fast tokenizer if available. Defaults to "auto".
    - trust_remote_code (bool, optional): Whether to allow custom code from the model repository. Defaults to False.
    - download_dir (str, optional): Directory to download and load the weights. Defaults to SINAPSIS_CACHE_DIR.
    - tensor_parallel_size (int, optional): Number of GPUs to use for distributed execution. Defaults to 1.
    - dtype (str, optional): Data type for model weights and activations (auto, half, float16, bfloat16, float, float32). Defaults to "auto".
    - quantization (str, optional): Method used to quantize the weights (awq, fp8, gptq, etc.). Defaults to None.
    - seed (int, optional): Random seed for reproducibility. Defaults to 0.
    - gpu_memory_utilization (float, optional): Fraction of GPU memory to be used for the model executor. Defaults to 0.9.
    - max_num_seqs (int, optional): Maximum number of sequences per iteration. Defaults to 256.
    - max_model_len (int, optional): Maximum sequence length for the model. Defaults to None.
    - cpu_offload_gb (float, optional): Amount of CPU memory (in GB) to offload weights to. Defaults to 0.
    - enforce_eager (bool, optional): Whether to enforce eager execution instead of CUDA graphs. Defaults to False.
    - disable_log_stats (bool, optional): Whether to disable logging of periodic runtime statistics. Defaults to False.
  - completion_args (vLLMCompletionArgs, required): Generation arguments to pass to the selected model.
    - temperature (float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.7.
    - top_p (float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 1.0.
    - top_k (int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to -1 (consider all tokens).
    - min_p (float, optional): Min-p sampling. Filters tokens whose probability, relative to the most likely token, is below this value. Defaults to 0.0.
    - max_tokens (int, optional): Maximum number of tokens to generate per output sequence. Defaults to 16.
    - min_tokens (int, optional): Minimum number of tokens to generate before EOS or stop tokens. Defaults to 0.
    - presence_penalty (float, optional): Penalizes new tokens based on whether they appear in the text so far. Defaults to 0.0.
    - frequency_penalty (float, optional): Penalizes new tokens based on their frequency in the text so far. Defaults to 0.0.
    - repetition_penalty (float, optional): Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values above 1.0 discourage repetition. Defaults to 1.0.
    - seed (int, optional): Random seed to use for the generation. Defaults to None.
    - stop (str | list[str], optional): List of strings that stop the generation when they are generated. Defaults to None.
    - ignore_eos (bool, optional): Whether to ignore the EOS token and continue generating. Defaults to False.
    - bad_words (list[str], optional): List of words that are not allowed to be generated. Defaults to None.
    - response_format (vLLMResponseFormat, optional): Constrains the model output to a specific format.
      - type (str, optional): The output format type ('text' or 'json_object'). Defaults to "text".
      - schema (SchemaDefinition, optional): Schema defining the expected JSON structure when type is 'json_object'.
        - properties (dict, optional): Mapping of field names to type strings or PropertyDefinition objects.
        - required (list[str], optional): List of required field names.
  - chat_history_key (str, optional): Key in the packet's generic_data to find the conversation history.
  - rag_context_key (str, optional): Key in the packet's generic_data to find RAG context to inject.
  - system_prompt (str | Path, optional): The system prompt (or path to one) to instruct the model.
  - pattern (str, optional): A regex pattern used to post-process the model's response.
  - keep_before (bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
  - structured_output_key (str, optional): Key used to store parsed JSON structured output in the packet's generic_data when response_format type is 'json_object'. Defaults to "structured_output".
- vLLMBatchTextCompletion: Template for batched text completion using vLLM's continuous batching engine. Processes multiple conversations in a single batch for improved throughput.
  Attributes
  Inherits all attributes from vLLMTextCompletion. Optimized for processing multiple text packets in parallel using vLLM's continuous batching.
- vLLMStreamingTextCompletion: Streaming version of vLLMTextCompletion for real-time response generation.
  Attributes
  Inherits all attributes from vLLMTextCompletion. The template yields response chunks as they are generated rather than waiting for the complete response.
- vLLMMultiModal: Template for multimodal (text + image) completion using vLLM. Supports vision-language models like Qwen-VL.
  Attributes
  - init_args (vLLMMultimodalInitArgs, required): vLLM multimodal engine arguments.
    - llm_model_name (str, required): The name or path of the VLM model to use (e.g., 'Qwen/Qwen2-VL-2B-Instruct-AWQ').
    - trust_remote_code (bool, optional): Whether to allow custom code from the model repository. Defaults to True.
    - limit_mm_per_prompt (dict, optional): Maximum number of multimodal items per prompt. Defaults to {"image": 1}.
    - All other attributes from vLLMInitArgs are also supported.
  - completion_args (vLLMCompletionArgs, required): Generation arguments to pass to the selected model. Same as vLLMTextCompletion.
  - chat_history_key (str, optional): Key in the packet's generic_data to find the conversation history.
  - rag_context_key (str, optional): Key in the packet's generic_data to find RAG context to inject.
  - system_prompt (str | Path, optional): The system prompt (or path to one) to instruct the model.
  - pattern (str, optional): A regex pattern used to post-process the model's response.
  - keep_before (bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
  - structured_output_key (str, optional): Key used to store parsed JSON structured output. Defaults to "structured_output".
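The pattern and keep_before attributes described above can be pictured with a short Python sketch. This is only an illustration of the documented behavior, not the sinapsis-vllm implementation: the helper name postprocess_response is hypothetical, and returning the text unchanged when the pattern does not match is an assumption.

```python
import re
from typing import Optional


def postprocess_response(text: str, pattern: Optional[str], keep_before: bool = True) -> str:
    """Hypothetical helper mirroring the documented pattern/keep_before semantics."""
    if not pattern:
        return text
    match = re.search(pattern, text)
    if match is None:
        # Assumption: a response without a match is left untouched.
        return text
    # Keep the text before the first match, or the text after it.
    return text[: match.start()] if keep_before else text[match.end():]


raw = "<think>The user greets me.</think>Hello! How can I help?"
answer = postprocess_response(raw, r"</think>", keep_before=False)
reasoning = postprocess_response(raw, r"</think>", keep_before=True)
```

Under these assumptions, keep_before set to False would be one way to drop a reasoning preamble and keep only the final answer.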
[!TIP] Use CLI command sinapsis info --all-template-names to show a list with all the available Template names installed with Sinapsis vLLM.
[!TIP] Use CLI command sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.
For example, for vLLMTextCompletion use sinapsis info --example-template-config vLLMTextCompletion to produce the following example config:
agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      tokenizer_mode: auto
      trust_remote_code: false
      download_dir: /path/to/.cache/sinapsis
      tensor_parallel_size: 1
      dtype: auto
      quantization: null
      seed: 0
      gpu_memory_utilization: 0.9
      max_num_seqs: 256
      max_model_len: null
      cpu_offload_gb: 0
      enforce_eager: false
      disable_log_stats: false
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      presence_penalty: 0.0
      frequency_penalty: 0.0
      repetition_penalty: 1.0
      min_p: 0.0
      seed: null
      stop: null
      ignore_eos: false
      max_tokens: 16
      min_tokens: 0
      bad_words: null
      response_format:
        type_: text
        schema_:
          properties: '`replace_me:dict[str, str | sinapsis_vllm.helpers.schemas.PropertyDefinition]`'
          required: '`replace_me:list[str]`'
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true
    structured_output_key: structured_output
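When response_format is set to 'json_object', a downstream consumer may want to double-check the parsed output against the properties and required fields it asked for. The following Python sketch shows one way to do that under stated assumptions: the function name check_structured_output is hypothetical, and the type strings ("string", "integer", "number", "boolean") are assumed to follow JSON Schema naming, which this README does not confirm.

```python
import json
from typing import Any

# Assumed mapping from JSON-Schema-style type strings to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_structured_output(
    raw: str, properties: dict[str, str], required: list[str]
) -> dict[str, Any]:
    """Hypothetical check: parse a JSON object and verify required fields and simple types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in required if name not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for name, type_name in properties.items():
        if name not in data:
            continue  # optional field that the model did not emit
        value, expected = data[name], TYPE_MAP[type_name]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and expected is not bool:
            raise ValueError(f"field {name!r} should be {type_name}")
        if not isinstance(value, expected):
            raise ValueError(f"field {name!r} should be {type_name}")
    return data


properties = {"name": "string", "age": "integer"}
required = ["name"]
result = check_structured_output('{"name": "Ada", "age": 36}', properties, required)
```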
📚 Usage example
The following agent passes text messages through TextPackets and retrieves responses from an LLM:
agent:
  name: chat_completion
  description: Chatbot agent using Qwen
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?
- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: Qwen/Qwen3-1.7B
      max_model_len: 4096
      dtype: auto
      seed: 42
      gpu_memory_utilization: 0.9
      cpu_offload_gb: 2
      max_num_seqs: 8
      disable_log_stats: true
    completion_args:
      max_tokens: 1024
      temperature: 0.7
      seed: 42
    system_prompt: 'You are a helpful AI assistant'
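The config above sets temperature and seed in completion_args. As a rough illustration of what these two settings mean, the Python sketch below applies temperature scaling to a toy set of logits and draws a token index with a seeded generator. It is a simplified model of the general technique, not vLLM's sampler; the function names and the toy logits are assumptions made for the example.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: list[float], temperature: float, seed: int) -> int:
    """Pick a token index; temperature 0.0 is treated as greedy (deterministic) decoding."""
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(toy_logits, 0.2)
soft = softmax_with_temperature(toy_logits, 0.7)
first_draw = sample_token(toy_logits, 0.7, seed=42)
second_draw = sample_token(toy_logits, 0.7, seed=42)
```

With the same seed and temperature the two draws agree, and the lower temperature concentrates more probability on the highest-scoring token.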
Multimodal Example
The following agent processes an image and generates a description using a vision-language model:
agent:
  name: multimodal_chatbot
  description: Agent with support for multimodal vLLM model for image-to-text
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: FolderImageDatasetCV2
  class_name: FolderImageDatasetCV2
  template_input: InputTemplate
  attributes:
    load_on_init: True
    data_dir: "artifacts"
    pattern: "test.png"
- template_name: TextInput
  class_name: TextInput
  template_input: FolderImageDatasetCV2
  attributes:
    text: "Describe what you see in the image."
- template_name: vLLMMultiModal
  class_name: vLLMMultiModal
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: "Qwen/Qwen2-VL-2B-Instruct-AWQ"
      max_model_len: 1024
      dtype: auto
      quantization: awq
      seed: 42
      gpu_memory_utilization: 0.95
      max_num_seqs: 1
      disable_log_stats: true
      enforce_eager: true
      limit_mm_per_prompt:
        image: 1
    completion_args:
      temperature: 0.7
      top_p: 0.8
      top_k: 20
      min_p: 0
      max_tokens: 1024
    system_prompt: "You are a helpful vision-language assistant."
[!NOTE] This example uses an AWQ quantized model for lower GPU memory requirements. For GPUs with limited memory, consider using quantized models (AWQ, GPTQ) or increasing cpu_offload_gb.
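To get a feel for why quantization helps, the back-of-the-envelope Python sketch below estimates the memory taken by model weights alone and compares it with the budget implied by gpu_memory_utilization. All numbers are assumptions for illustration: a model of roughly 2e9 parameters, an 8 GiB GPU, and 4-bit AWQ weights. The estimate ignores quantization scales, layers that stay unquantized, activations, and the KV cache, so real usage will be higher.

```python
# Approximate storage cost per parameter, in bytes, for a few weight formats.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "fp8": 1.0,
    "awq-4bit": 0.5,  # assumption: 4-bit weights, overhead ignored
}

GIB = 1024**3


def weight_memory_gib(num_params: float, weight_format: str) -> float:
    """Rough size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[weight_format] / GIB


def gpu_budget_gib(total_gpu_gib: float, gpu_memory_utilization: float) -> float:
    """Memory the engine may use, given the utilization fraction."""
    return total_gpu_gib * gpu_memory_utilization


num_params = 2e9  # assumed parameter count for a "2B" model
budget = gpu_budget_gib(total_gpu_gib=8.0, gpu_memory_utilization=0.95)
for fmt in ("float16", "awq-4bit"):
    size = weight_memory_gib(num_params, fmt)
    print(f"{fmt}: ~{size:.2f} GiB of weights, {budget - size:.2f} GiB left in the budget")
```

Whatever remains in the budget after the weights is what the engine has available for activations and the KV cache, which is why smaller weights leave room for longer sequences.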
🌐 Webapps
You can interact with vLLM models using the generic chatbot webapp. The webapp works with any config by setting the AGENT_CONFIG_PATH environment variable.
[!IMPORTANT] To run the app you first need to clone this repository:
git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots
[!NOTE] To enable external app sharing in Gradio, set the following environment variable:
export GRADIO_SHARE_APP=True
🐳 Docker
[!IMPORTANT] This Docker image depends on the sinapsis-nvidia:base image. Please refer to the official sinapsis instructions to Build with Docker.
- Build the sinapsis-chatbots image:
docker compose -f docker/compose.yaml build
- Start the vLLM chatbot container:
docker compose -f docker/compose_apps.yaml up sinapsis-vllm-chatbot -d
Or for the multimodal variant with image upload support:
docker compose -f docker/compose_apps.yaml up sinapsis-vllm-multimodal-chatbot -d
- Check the logs:
docker logs -f sinapsis-vllm-chatbot
- The logs will display the URL to access the webapp, e.g.:
Running on local URL: http://127.0.0.1:7860
NOTE: The URL may be different; check the log output for the correct address.
- To stop the app:
docker compose -f docker/compose_apps.yaml down
To use a different model, update the AGENT_CONFIG_PATH environment variable to point to the desired YAML file.
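Before launching the app, it can be useful to confirm that AGENT_CONFIG_PATH actually points to a file. The Python sketch below shows one way a launcher script might do that. It is a hypothetical helper, not part of sinapsis or the webapp, and the webapp's own handling of the variable may differ.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def resolve_agent_config(env: Mapping[str, str] = os.environ) -> Path:
    """Hypothetical pre-flight check for the AGENT_CONFIG_PATH variable."""
    raw = env.get("AGENT_CONFIG_PATH")
    if not raw:
        raise RuntimeError("AGENT_CONFIG_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"agent config not found: {path}")
    return path


if __name__ == "__main__":
    # Prints the resolved path, or fails early with a clear error.
    print(resolve_agent_config())
```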
💻 UV
To run the webapp using the uv package manager, follow these steps:
- Sync the virtual environment:
uv sync --frozen
- Install the wheel:
uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
- Run the chatbot webapp with vLLM config:
export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/vllm_text_completion.yaml
uv run webapps/llama_cpp_simple_chatbot.py
Or for multimodal (image upload support):
export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/vllm_multimodal.yaml
uv run webapps/llama_cpp_simple_chatbot.py
- The terminal will display the URL to access the webapp, e.g.:
Running on local URL: http://127.0.0.1:7860
NOTE: The URL may vary; check the terminal output for the correct address.
📙 Documentation
Documentation for this and other sinapsis packages is available on the sinapsis website.
Tutorials for different projects within sinapsis are available on the sinapsis tutorials page.
🔍 License
This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.
For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.