Sinapsis templates for LLM text completion using vLLM

Sinapsis vLLM

Sinapsis Templates for LLM text completion with vLLM

🐍 Installation • 🚀 Features • 📚 Usage example • 🌐 Webapps • 📙 Documentation • 🔍 License

The sinapsis-vllm module provides a suite of templates to run LLMs with vLLM, a high-throughput and memory-efficient inference engine for serving large language models.

🐍 Installation

Install using your package manager of choice. We encourage the use of uv.

Example with uv:

  uv pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech

[!IMPORTANT] Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:

with uv:

  uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
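
After installing, it can be useful to confirm that the package is visible to the active Python environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of sinapsis-vllm.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="sinapsis-vllm"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "sinapsis-vllm is not installed")
```

A None result usually means the install ran in a different virtual environment than the one currently active.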

🚀 Features

Templates Supported

  • vLLMTextCompletion: Template for text completion using vLLM.

    Attributes
    • init_args(vLLMInitArgs, required): vLLM engine configuration arguments.
      • llm_model_name(str, required): The name or path of the LLM model to use (e.g., 'Qwen/Qwen3-1.7B').
      • tokenizer_mode(str, optional): The tokenizer mode. "auto" will use the fast tokenizer if available. Defaults to "auto".
      • trust_remote_code(bool, optional): Whether to allow custom code from the model repository. Defaults to False.
      • download_dir(str, optional): Directory to download and load the weights. Defaults to SINAPSIS_CACHE_DIR.
      • tensor_parallel_size(int, optional): Number of GPUs to use for distributed execution. Defaults to 1.
      • dtype(str, optional): Data type for model weights and activations (auto, half, float16, bfloat16, float, float32). Defaults to "auto".
      • quantization(str, optional): Method used to quantize the weights (awq, fp8, gptq, etc.). Defaults to None.
      • seed(int, optional): Random seed for reproducibility. Defaults to 0.
      • gpu_memory_utilization(float, optional): Fraction of GPU memory to be used for the model executor. Defaults to 0.9.
      • max_num_seqs(int, optional): Maximum number of sequences per iteration. Defaults to 256.
      • max_model_len(int, optional): Maximum sequence length for the model. Defaults to None.
      • cpu_offload_gb(float, optional): Amount of CPU memory (in GB) to offload weights to. Defaults to 0.
      • enforce_eager(bool, optional): Whether to enforce eager execution instead of CUDA graphs. Defaults to False.
      • disable_log_stats(bool, optional): Whether to disable logging of periodic runtime statistics. Defaults to False.
    • completion_args(vLLMCompletionArgs, required): Generation arguments to pass to the selected model.
      • temperature(float, optional): Controls randomness. 0.0 selects greedy (deterministic) decoding; higher values increase randomness. Defaults to 0.7.
      • top_p(float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 1.0.
      • top_k(int, optional): Top-k sampling. Considers only the 'k' most probable tokens; -1 considers all tokens. Defaults to -1.
      • min_p(float, optional): Min-p sampling. Filters out tokens whose probability is below min_p times the probability of the most likely token. Defaults to 0.0.
      • max_tokens(int, optional): Maximum number of tokens to generate per output sequence. Defaults to 16.
      • min_tokens(int, optional): Minimum number of tokens to generate before EOS or stop tokens. Defaults to 0.
      • presence_penalty(float, optional): Penalizes new tokens based on whether they already appear in the generated text. Defaults to 0.0.
      • frequency_penalty(float, optional): Penalizes new tokens based on how often they appear in the generated text. Defaults to 0.0.
      • repetition_penalty(float, optional): Penalizes new tokens based on whether they appear in the prompt or the generated text. Values above 1.0 discourage repetition. Defaults to 1.0.
      • seed(int, optional): Random seed to use for the generation. Defaults to None.
      • stop(str | list[str], optional): String or list of strings that stop the generation when they are generated. Defaults to None.
      • ignore_eos(bool, optional): Whether to ignore the EOS token and continue generating. Defaults to False.
      • bad_words(list[str], optional): List of words that are not allowed to be generated. Defaults to None.
      • response_format(vLLMResponseFormat, optional): Constrains the model output to a specific format.
        • type(str, optional): The output format type ('text' or 'json_object'). Defaults to "text".
        • schema(SchemaDefinition, optional): Schema defining the expected JSON structure when type is 'json_object'.
          • properties(dict, optional): Mapping of field names to type strings or PropertyDefinition objects.
          • required(list[str], optional): List of required field names.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(str, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    • structured_output_key(str, optional): Key used to store parsed JSON structured output in the packet's generic_data when response_format type is 'json_object'. Defaults to "structured_output".
  • vLLMBatchTextCompletion: Template for batched text completion using vLLM's continuous batching engine. Processes multiple conversations in a single batch for improved throughput.

    Attributes

    Inherits all attributes from vLLMTextCompletion. Optimized for processing multiple text packets in parallel using vLLM's continuous batching.

  • vLLMStreamingTextCompletion: Streaming version of vLLMTextCompletion for real-time response generation.

    Attributes

    Inherits all attributes from vLLMTextCompletion. The template yields response chunks as they are generated rather than waiting for the complete response.

  • vLLMMultiModal: Template for multimodal (text + image) completion using vLLM. Supports vision-language models like Qwen-VL.

    Attributes
    • init_args(vLLMMultimodalInitArgs, required): vLLM multimodal engine arguments.
      • llm_model_name(str, required): The name or path of the VLM model to use (e.g., 'Qwen/Qwen2-VL-2B-Instruct-AWQ').
      • trust_remote_code(bool, optional): Whether to allow custom code from the model repository. Defaults to True.
      • limit_mm_per_prompt(dict, optional): Maximum number of multimodal items per prompt. Defaults to {"image": 1}.
      • All other attributes from vLLMInitArgs are also supported.
    • completion_args(vLLMCompletionArgs, required): Generation arguments to pass to the selected model. Same as vLLMTextCompletion.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(str, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    • structured_output_key(str, optional): Key used to store parsed JSON structured output. Defaults to "structured_output".
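
To make the interaction between temperature, top_k, top_p and min_p more concrete, the sketch below applies the three filters to a toy probability distribution in plain Python. It illustrates the general sampling technique only; it is not vLLM's implementation, and the function names softmax and filter_candidates are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be > 0 here.

    A temperature of 0.0 (greedy decoding) would be handled separately
    by simply picking the highest logit.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the token indices that survive top-k, top-p and min-p filtering."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most probable tokens (-1 keeps all of them).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # min_p: drop tokens less likely than min_p times the most likely token.
    threshold = min_p * probs[ranked[0]]
    return [i for i in kept if probs[i] >= threshold]


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.5, -1.0], temperature=0.7)
    print(filter_candidates(probs, top_k=3, top_p=0.9, min_p=0.05))
```

Lowering temperature sharpens the distribution before filtering, so fewer tokens tend to survive a given top_p.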

[!TIP] Use CLI command sinapsis info --all-template-names to show a list with all the available Template names installed with Sinapsis vLLM.

[!TIP] Use CLI command sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.

For example, for vLLMTextCompletion use sinapsis info --example-template-config vLLMTextCompletion to produce the following example config:

agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      tokenizer_mode: auto
      trust_remote_code: false
      download_dir: /path/to/.cache/sinapsis
      tensor_parallel_size: 1
      dtype: auto
      quantization: null
      seed: 0
      gpu_memory_utilization: 0.9
      max_num_seqs: 256
      max_model_len: null
      cpu_offload_gb: 0
      enforce_eager: false
      disable_log_stats: false
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      presence_penalty: 0.0
      frequency_penalty: 0.0
      repetition_penalty: 1.0
      min_p: 0.0
      seed: null
      stop: null
      ignore_eos: false
      max_tokens: 16
      min_tokens: 0
      bad_words: null
      response_format:
        type_: text
        schema_:
          properties: '`replace_me:dict[str, str | sinapsis_vllm.helpers.schemas.PropertyDefinition]`'
          required: '`replace_me:list[str]`'
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true
    structured_output_key: structured_output
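
The pattern and keep_before attributes in the config above control how the response is trimmed after generation. The sketch below shows one plausible reading of that behavior using Python's re module. The exact semantics inside the template (first match versus last, whitespace handling, behavior when nothing matches) are assumptions, and postprocess is a hypothetical helper name.

```python
import re


def postprocess(response, pattern=None, keep_before=True):
    """Trim a response around the first regex match (assumed semantics)."""
    if not pattern:
        # No pattern configured: return the response unchanged.
        return response
    match = re.search(pattern, response)
    if match is None:
        # Pattern not found: leave the response untouched.
        return response
    if keep_before:
        return response[: match.start()].strip()
    return response[match.end():].strip()


if __name__ == "__main__":
    raw = "<think>internal reasoning</think>AI is the study of intelligent agents."
    # Keep only the text after the closing reasoning tag.
    print(postprocess(raw, pattern=r"</think>", keep_before=False))
```

With keep_before set to false, a pattern such as a closing reasoning tag would leave only the final answer in the output.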

📚 Usage example

The following agent passes text messages through TextPackets and retrieves responses from an LLM.
Config
agent:
  name: chat_completion
  description: Chatbot agent using Qwen

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?

- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: Qwen/Qwen3-1.7B
      max_model_len: 4096
      dtype: auto
      seed: 42
      gpu_memory_utilization: 0.9
      cpu_offload_gb: 2
      max_num_seqs: 8
      disable_log_stats: true
    completion_args:
      max_tokens: 1024
      temperature: 0.7
      seed: 42
    system_prompt: 'You are a helpful AI assistant'

Multimodal Example

The following agent processes an image and generates a description using a vision-language model:
Multimodal Config
agent:
  name: multimodal_chatbot
  description: Agent with support for multimodal vLLM model for image-to-text

templates:
  - template_name: InputTemplate
    class_name: InputTemplate
    attributes: {}

  - template_name: FolderImageDatasetCV2
    class_name: FolderImageDatasetCV2
    template_input: InputTemplate
    attributes:
      load_on_init: True
      data_dir: "artifacts"
      pattern: "test.png"

  - template_name: TextInput
    class_name: TextInput
    template_input: FolderImageDatasetCV2
    attributes:
      text: "Describe what you see in the image."

  - template_name: vLLMMultiModal
    class_name: vLLMMultiModal
    template_input: TextInput
    attributes:
      init_args:
        llm_model_name: "Qwen/Qwen2-VL-2B-Instruct-AWQ"
        max_model_len: 1024
        dtype: auto
        quantization: awq
        seed: 42
        gpu_memory_utilization: 0.95
        max_num_seqs: 1
        disable_log_stats: true
        enforce_eager: true
        limit_mm_per_prompt:
          image: 1
      completion_args:
        temperature: 0.7
        top_p: 0.8
        top_k: 20
        min_p: 0
        max_tokens: 1024
      system_prompt: "You are a helpful vision-language assistant."

[!NOTE] This example uses an AWQ quantized model for lower GPU memory requirements. For GPUs with limited memory, consider using quantized models (AWQ, GPTQ) or increasing cpu_offload_gb.
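
As a rough guide to why quantization helps, the arithmetic below compares approximate weight sizes against the memory budget implied by gpu_memory_utilization. The bytes-per-parameter figures are approximations (4-bit AWQ is treated as 0.5 bytes and quantization scales are ignored), the KV cache and activations are not counted, and both function names are hypothetical.

```python
# Approximate storage cost per parameter, in bytes (assumed values).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "awq": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


def engine_budget_gb(total_gpu_gb, gpu_memory_utilization=0.9):
    """GPU memory the engine may use, given the utilization fraction."""
    return total_gpu_gb * gpu_memory_utilization


if __name__ == "__main__":
    params = 2e9  # a model with roughly 2 billion parameters
    budget = engine_budget_gb(8.0, gpu_memory_utilization=0.95)
    for precision in ("float16", "awq"):
        size = weight_memory_gb(params, precision)
        print(f"{precision}: weights ~{size:.2f} GiB of a {budget:.2f} GiB budget")
```

Whatever remains of the budget after the weights is what the engine can use for the KV cache, which is why lowering max_model_len or max_num_seqs also helps on small GPUs.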

🌐 Webapps

You can interact with vLLM models using the generic chatbot webapp. The webapp works with any config by setting the AGENT_CONFIG_PATH environment variable.

[!IMPORTANT] To run the app you first need to clone this repository:

git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots

[!NOTE] If you'd like to enable external app sharing in Gradio, export GRADIO_SHARE_APP=True

🐳 Docker

[!IMPORTANT] This Docker image depends on the sinapsis-nvidia:base image. Please refer to the official sinapsis instructions to Build with Docker.

  1. Build the sinapsis-chatbots image:
docker compose -f docker/compose.yaml build
  2. Start the vLLM chatbot container:
docker compose -f docker/compose_apps.yaml up sinapsis-vllm-chatbot -d

Or for the multimodal variant with image upload support:

docker compose -f docker/compose_apps.yaml up sinapsis-vllm-multimodal-chatbot -d
  3. Check the logs:
docker logs -f sinapsis-vllm-chatbot
  4. The logs will display the URL to access the webapp, e.g.:
Running on local URL:  http://127.0.0.1:7860

NOTE: The URL may differ; check the log output.

  5. To stop the app:
docker compose -f docker/compose_apps.yaml down

To use a different model, update the AGENT_CONFIG_PATH environment variable to point to the desired YAML file.
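
For illustration, a launcher script might resolve AGENT_CONFIG_PATH along the following lines. This is a sketch of the general pattern, not the webapp's actual code; the function resolve_agent_config and the fallback file name agent_config.yaml are hypothetical.

```python
import os
from pathlib import Path


def resolve_agent_config(default="agent_config.yaml", environ=None):
    """Return the config path from AGENT_CONFIG_PATH, or a default."""
    env = os.environ if environ is None else environ
    path = Path(env.get("AGENT_CONFIG_PATH", default))
    if path.suffix not in {".yaml", ".yml"}:
        # Fail early on an obviously wrong value.
        raise ValueError(f"Expected a YAML file, got: {path}")
    return path


if __name__ == "__main__":
    print(f"Using agent config: {resolve_agent_config()}")
```

Passing environ explicitly keeps the function easy to test without modifying the real process environment.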

💻 UV

To run the webapp using the uv package manager, follow these steps:

  1. Sync the virtual environment:
uv sync --frozen
  2. Install the wheel:
uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
  3. Run the chatbot webapp with vLLM config:
export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/vllm_text_completion.yaml
uv run webapps/llama_cpp_simple_chatbot.py

Or for multimodal (image upload support):

export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/vllm_multimodal.yaml
uv run webapps/llama_cpp_simple_chatbot.py
  4. The terminal will display the URL to access the webapp, e.g.:
Running on local URL:  http://127.0.0.1:7860

NOTE: The URL may vary; check the terminal output for the correct address.

📙 Documentation

Documentation for this and other sinapsis packages is available on the sinapsis website.

Tutorials for different projects within sinapsis are available on the sinapsis tutorials page.

🔍 License

This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.

For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.
