Sinapsis vLLM

Sinapsis Templates for LLM text completion with vLLM

🐍 Installation • 🚀 Features • 📚 Usage example • 🌐 Webapps • 📙 Documentation • 🔍 License

The sinapsis-vllm module provides a suite of templates to run LLMs with vLLM, a high-throughput and memory-efficient inference engine for serving large language models.

🐍 Installation

Install using your package manager of choice. We encourage the use of uv.

Example with uv:

  uv pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech

[!IMPORTANT] Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:

with uv:

  uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
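
After either install command, a quick way to confirm the package is visible to the active environment is to query its metadata. The snippet below is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of sinapsis-vllm.

```python
from __future__ import annotations

from importlib import metadata


def installed_version(dist_name: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "sinapsis-vllm" is the distribution name used in the install commands above.
    version = installed_version("sinapsis-vllm")
    print(f"sinapsis-vllm: {version or 'not installed'}")
```

If the script reports "not installed", check that the command ran in the same virtual environment that Python is using.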

🚀 Features

Templates Supported

  • vLLMTextCompletion: Template for text completion using vLLM.

    Attributes
    • init_args(vLLMInitArgs, required): vLLM engine configuration arguments.
      • llm_model_name(str, required): The name or path of the LLM model to use (e.g., 'Qwen/Qwen3-1.7B').
      • tokenizer_mode(str, optional): The tokenizer mode. "auto" will use the fast tokenizer if available. Defaults to "auto".
      • trust_remote_code(bool, optional): Whether to allow custom code from the model repository. Defaults to False.
      • download_dir(str, optional): Directory to download and load the weights. Defaults to SINAPSIS_CACHE_DIR.
      • tensor_parallel_size(int, optional): Number of GPUs to use for distributed execution. Defaults to 1.
      • dtype(str, optional): Data type for model weights and activations (auto, half, float16, bfloat16, float, float32). Defaults to "auto".
      • quantization(str, optional): Method used to quantize the weights (awq, fp8, gptq, etc.). Defaults to None.
      • seed(int, optional): Random seed for reproducibility. Defaults to 0.
      • gpu_memory_utilization(float, optional): Fraction of GPU memory to be used for the model executor. Defaults to 0.9.
      • max_num_seqs(int, optional): Maximum number of sequences per iteration. Defaults to 256.
      • max_model_len(int, optional): Maximum sequence length for the model. Defaults to None.
      • cpu_offload_gb(float, optional): Amount of CPU memory (in GB) to offload weights to. Defaults to 0.
      • enforce_eager(bool, optional): Whether to enforce eager execution instead of CUDA graphs. Defaults to False.
      • disable_log_stats(bool, optional): Whether to disable logging of periodic runtime statistics. Defaults to False.
    • completion_args(vLLMCompletionArgs, required): Generation arguments to pass to the selected model.
      • temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.7.
      • top_p(float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 1.0.
      • top_k(int, optional): Top-k sampling. Considers only the 'k' most probable tokens; -1 considers all tokens. Defaults to -1.
      • min_p(float, optional): Min-p sampling. Filters out tokens whose probability, relative to the most likely token, is below this value. Defaults to 0.0.
      • max_tokens(int, optional): Maximum number of tokens to generate per output sequence. Defaults to 16.
      • min_tokens(int, optional): Minimum number of tokens to generate before EOS or stop tokens. Defaults to 0.
      • presence_penalty(float, optional): Penalizes new tokens based on whether they already appear in the generated text so far. Defaults to 0.0.
      • frequency_penalty(float, optional): Penalizes new tokens based on how often they appear in the generated text so far. Defaults to 0.0.
      • repetition_penalty(float, optional): Penalizes new tokens based on whether they appear in the prompt or the generated text so far. Values above 1.0 discourage repetition. Defaults to 1.0.
      • seed(int, optional): Random seed to use for the generation. Defaults to None.
      • stop(str | list[str], optional): String or list of strings that stop the generation when they are generated. Defaults to None.
      • ignore_eos(bool, optional): Whether to ignore the EOS token and continue generating. Defaults to False.
      • bad_words(list[str], optional): List of words that are not allowed to be generated. Defaults to None.
      • response_format(vLLMResponseFormat, optional): Constrains the model output to a specific format.
        • type(str, optional): The output format type ('text' or 'json_object'). Defaults to "text".
        • schema(SchemaDefinition, optional): Schema defining the expected JSON structure when type is 'json_object'.
          • properties(dict, optional): Mapping of field names to type strings or PropertyDefinition objects.
          • required(list[str], optional): List of required field names.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(str, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    • structured_output_key(str, optional): Key used to store parsed JSON structured output in the packet's generic_data when response_format type is 'json_object'. Defaults to "structured_output".
  • vLLMBatchTextCompletion: Template for batched text completion using vLLM's continuous batching engine. Processes multiple conversations in a single batch for improved throughput.

    Attributes

    Inherits all attributes from vLLMTextCompletion. Optimized for processing multiple text packets in parallel using vLLM's continuous batching.

  • vLLMStreamingTextCompletion: Streaming version of vLLMTextCompletion for real-time response generation.

    Attributes

    Inherits all attributes from vLLMTextCompletion. The template yields response chunks as they are generated rather than waiting for the complete response.

  • vLLMMultiModal: Template for multimodal (text + image) completion using vLLM. Supports vision-language models like Qwen-VL.

    Attributes
    • init_args(vLLMMultimodalInitArgs, required): vLLM multimodal engine arguments.
      • llm_model_name(str, required): The name or path of the VLM model to use (e.g., 'Qwen/Qwen2-VL-2B-Instruct-AWQ').
      • trust_remote_code(bool, optional): Whether to allow custom code from the model repository. Defaults to True.
      • limit_mm_per_prompt(dict, optional): Maximum number of multimodal items per prompt. Defaults to {"image": 1}.
      • All other attributes from vLLMInitArgs are also supported.
    • completion_args(vLLMCompletionArgs, required): Generation arguments to pass to the selected model. Same as vLLMTextCompletion.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(str, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    • structured_output_key(str, optional): Key used to store parsed JSON structured output. Defaults to "structured_output".
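
To make the interaction of top_k, top_p and min_p in completion_args more concrete, the following toy sketch applies the three filters to a hand-written token distribution. It is an illustration only: the function `filter_candidates` is hypothetical, the order in which the filters run is a simplification, and vLLM itself operates on logits inside its sampler rather than on a Python dictionary.

```python
def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the renormalised distribution left after top-k, top-p and min-p."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most probable tokens (-1 keeps all of them).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # min_p: drop tokens below min_p times the highest remaining probability.
    if min_p > 0.0:
        threshold = min_p * ranked[0][1]
        ranked = [(token, p) for token, p in ranked if p >= threshold]

    # Renormalise so the surviving probabilities sum to 1.
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


if __name__ == "__main__":
    toy = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
    print(filter_candidates(toy, top_k=3))
    print(filter_candidates(toy, top_p=0.7))
    print(filter_candidates(toy, min_p=0.2))
```

With the defaults (top_k=-1, top_p=1.0, min_p=0.0) no token is filtered, so only temperature shapes the sampling.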

[!TIP] Use CLI command sinapsis info --all-template-names to show a list with all the available Template names installed with Sinapsis vLLM.

[!TIP] Use CLI command sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.

For example, for vLLMTextCompletion use sinapsis info --example-template-config vLLMTextCompletion to produce the following example config:

agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      tokenizer_mode: auto
      trust_remote_code: false
      download_dir: /path/to/.cache/sinapsis
      tensor_parallel_size: 1
      dtype: auto
      quantization: null
      seed: 0
      gpu_memory_utilization: 0.9
      max_num_seqs: 256
      max_model_len: null
      cpu_offload_gb: 0
      enforce_eager: false
      disable_log_stats: false
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      presence_penalty: 0.0
      frequency_penalty: 0.0
      repetition_penalty: 1.0
      min_p: 0.0
      seed: null
      stop: null
      ignore_eos: false
      max_tokens: 16
      min_tokens: 0
      bad_words: null
      response_format:
        type_: text
        schema_:
          properties: '`replace_me:dict[str, str | sinapsis_vllm.helpers.schemas.PropertyDefinition]`'
          required: '`replace_me:list[str]`'
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true
    structured_output_key: structured_output
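
The pattern and keep_before attributes at the end of this config control post-processing of the response. The sketch below shows one plausible reading of their description, assuming the response is split at the first regex match; the function `postprocess` is hypothetical and the actual template may behave differently (for example, in how it handles multiple matches or whitespace).

```python
from __future__ import annotations

import re


def postprocess(response: str, pattern: str | None, keep_before: bool = True) -> str:
    """Keep the text before or after the first match of `pattern`."""
    if not pattern:
        # pattern: null in the config means the response is left untouched.
        return response
    match = re.search(pattern, response)
    if match is None:
        return response
    # Assumption: the matched text itself is discarded in both modes.
    kept = response[: match.start()] if keep_before else response[match.end():]
    return kept.strip()


if __name__ == "__main__":
    print(postprocess("Answer ### internal notes", r"###", keep_before=True))
    print(postprocess("reasoning</think> Answer", r"</think>", keep_before=False))
```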

📚 Usage example

The following agent passes text messages through TextPackets and retrieves responses from an LLM:
Config
agent:
  name: chat_completion
  description: Chatbot agent using Qwen

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?

- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: Qwen/Qwen3-1.7B
      max_model_len: 4096
      dtype: auto
      seed: 42
      gpu_memory_utilization: 0.9
      cpu_offload_gb: 2
      max_num_seqs: 8
      disable_log_stats: true
    completion_args:
      max_tokens: 1024
      temperature: 0.7
      seed: 42
    system_prompt: 'You are a helpful AI assistant'

Multimodal Example

The following agent processes an image and generates a description using a vision-language model:
Multimodal Config
agent:
  name: multimodal_chatbot
  description: Agent with support for multimodal vLLM model for image-to-text

templates:
  - template_name: InputTemplate
    class_name: InputTemplate
    attributes: {}

  - template_name: FolderImageDatasetCV2
    class_name: FolderImageDatasetCV2
    template_input: InputTemplate
    attributes:
      load_on_init: True
      data_dir: "artifacts"
      pattern: "test.png"

  - template_name: TextInput
    class_name: TextInput
    template_input: FolderImageDatasetCV2
    attributes:
      text: "Describe what you see in the image."

  - template_name: vLLMMultiModal
    class_name: vLLMMultiModal
    template_input: TextInput
    attributes:
      init_args:
        llm_model_name: "Qwen/Qwen2-VL-2B-Instruct-AWQ"
        max_model_len: 1024
        dtype: auto
        quantization: awq
        seed: 42
        gpu_memory_utilization: 0.95
        max_num_seqs: 1
        disable_log_stats: true
        enforce_eager: true
        limit_mm_per_prompt:
          image: 1
      completion_args:
        temperature: 0.7
        top_p: 0.8
        top_k: 20
        min_p: 0
        max_tokens: 1024
      system_prompt: "You are a helpful vision-language assistant."

[!NOTE] This example uses an AWQ quantized model for lower GPU memory requirements. For GPUs with limited memory, consider using quantized models (AWQ, GPTQ) or increasing cpu_offload_gb.
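
A back-of-the-envelope calculation shows why quantization helps on small GPUs. The sketch below estimates the memory taken by model weights alone, assuming a model of roughly 2 billion parameters (an assumed round figure, not a measured one). It ignores the KV cache, activations, and the extra scale and zero-point tensors that AWQ stores, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


def executor_budget_gib(total_gpu_gib: float, gpu_memory_utilization: float) -> float:
    """Memory the engine may use, given the gpu_memory_utilization fraction."""
    return total_gpu_gib * gpu_memory_utilization


if __name__ == "__main__":
    params = 2e9  # assumed parameter count, for illustration only
    print(f"16-bit weights: {weight_memory_gib(params, 16):.2f} GiB")
    print(f" 4-bit weights: {weight_memory_gib(params, 4):.2f} GiB")
    # Hypothetical 8 GiB GPU with gpu_memory_utilization: 0.95, as in the config above.
    print(f"engine budget:  {executor_budget_gib(8, 0.95):.2f} GiB")
```

Under these assumptions the 16-bit weights need about 3.7 GiB while 4-bit weights need about 0.9 GiB, leaving more of the budget for the KV cache.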

🌐 Webapps

You can interact with vLLM models using the generic chatbot webapp. The webapp works with any config by setting the AGENT_CONFIG_PATH environment variable.

[!IMPORTANT] To run the app you first need to clone this repository:

git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots

[!NOTE] If you'd like to enable external app sharing in Gradio, export GRADIO_SHARE_APP=True

🐳 Docker

[!IMPORTANT] This Docker image depends on the sinapsis-nvidia:base image. Please refer to the official sinapsis instructions to Build with Docker.

  1. Build the sinapsis-chatbots image:
docker compose -f docker/compose.yaml build
  2. Start the vLLM chatbot container:
docker compose -f docker/compose_apps.yaml up sinapsis-vllm-chatbot -d

Or for the multimodal variant with image upload support:

docker compose -f docker/compose_apps.yaml up sinapsis-vllm-multimodal-chatbot -d
  3. Check the logs:
docker logs -f sinapsis-vllm-chatbot
  4. The logs will display the URL to access the webapp, e.g.:
Running on local URL:  http://127.0.0.1:7860

NOTE: The URL may differ; check the log output for the correct address.

  5. To stop the app:
docker compose -f docker/compose_apps.yaml down

To use a different model, update the AGENT_CONFIG_PATH environment variable to point to the desired YAML file.

💻 UV

To run the webapp using the uv package manager, follow these steps:

  1. Sync the virtual environment:
uv sync --frozen
  2. Install the wheel:
uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
  3. Run the chatbot webapp with vLLM config:
export AGENT_CONFIG_PATH=packages/sinapsis_vllm/src/sinapsis_vllm/configs/text_completion_webapp.yaml
uv run webapps/llama_cpp_simple_chatbot.py

Or for multimodal (image upload support):

export AGENT_CONFIG_PATH=packages/sinapsis_vllm/src/sinapsis_vllm/configs/multimodal_webapp.yaml
uv run webapps/llama_cpp_simple_chatbot.py
  4. The terminal will display the URL to access the webapp, e.g.:
Running on local URL:  http://127.0.0.1:7860

NOTE: The URL may vary; check the terminal output for the correct address.
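
Both the Docker and uv workflows select the agent through the AGENT_CONFIG_PATH environment variable, and Gradio sharing through GRADIO_SHARE_APP. The sketch below shows how a launcher script could read these two variables. It is an assumption about the general approach, not the webapp's actual code: the function names and the set of accepted truthy values are hypothetical.

```python
import os
from pathlib import Path


def resolve_agent_config(env=os.environ) -> Path:
    """Return the agent config path taken from AGENT_CONFIG_PATH."""
    raw = env.get("AGENT_CONFIG_PATH")
    if not raw:
        raise RuntimeError("AGENT_CONFIG_PATH is not set")
    return Path(raw)


def share_enabled(env=os.environ) -> bool:
    """Interpret GRADIO_SHARE_APP as a boolean flag (assumed truthy spellings)."""
    return env.get("GRADIO_SHARE_APP", "").strip().lower() in {"1", "true", "yes"}


if __name__ == "__main__":
    # Use an explicit mapping here so the example runs without exported variables.
    example_env = {
        "AGENT_CONFIG_PATH": "packages/sinapsis_vllm/src/sinapsis_vllm/configs/text_completion_webapp.yaml",
        "GRADIO_SHARE_APP": "True",
    }
    print(resolve_agent_config(example_env))
    print(share_enabled(example_env))
```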

📙 Documentation

Documentation for this and other sinapsis packages is available on the sinapsis website.

Tutorials for different projects within sinapsis are available on the sinapsis tutorials page.

🔍 License

This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.

For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.
