
easyvllm

An easy-to-use, lightweight vLLM tool with special support for Large Reasoning Models (LRMs).

| Documentation | Paper | Team Page | Apache License 2.0 |

Features

We encapsulate vLLM and use the OpenAI client to call vllm serve, allowing multiple models to be loaded simultaneously in a multi-GPU environment when working with small models. This approach maximizes batch inference performance. Additionally, we support various LRM inference methods, including:

  • Direct model invocation
  • Inference length control
  • Customizable inference content

We also support multi-node Ray cluster integration.

Installation

Create a new Python environment

To use easyvllm, we recommend creating a new Python environment.

You can create a new Python environment using conda:

conda create -n easyvllm python=3.11 -y
conda activate easyvllm

Install easyvllm

You can install easyvllm from PyPI:

pip install easyvllm

or you can install it locally from source:

git clone https://github.com/OpenRLHF/OpenRLHF.git
cd easyvllm
pip install -e .

Add Reasoning Parsers to vLLM

We have implemented several reasoning parsers for vLLM, including Openthinker and simplescaling.

To enable reasoning outputs, the parsers need to be added to your local vLLM package. You can add them easily with the following steps:

  1. Find the vllm package in your Python environment. It should be at a path like ${conda_path}/envs/{your_env_name}/lib/python3.xx/site-packages/vllm (see the one-liner below).

  2. Find the reasoning_parsers folder. It should be at ${conda_path}/envs/{your_env_name}/lib/python3.xx/site-packages/vllm/entrypoints/openai/reasoning_parsers.

  3. Add from easyvllm.parsers import * to the __init__.py file. After the modification, the content should look like this:

# SPDX-License-Identifier: Apache-2.0

from .abs_reasoning_parsers import ReasoningParser, ReasoningParserManager
from .deepseek_r1_reasoning_parser import DeepSeekR1ReasoningParser
from easyvllm.parsers import *

__all__ = [
    "ReasoningParser", "ReasoningParserManager", "DeepSeekR1ReasoningParser"
]
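
If you are unsure where the vLLM package from step 1 lives, this one-liner prints its location:

python -c "import vllm, os; print(os.path.dirname(vllm.__file__))"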

Quick Start

InferenceModel

Using the encapsulated model is simple. Just import InferenceModel and provide the model path along with relevant parameters for initialization:

from easyvllm import InferenceModel  

model = InferenceModel(model_path="your_model_path")

If you are initializing an LRM, set the enable_reasoning flag and choose a reasoning_parser:

from easyvllm import InferenceModel  

model = InferenceModel(model_path="your_LRM_path", enable_reasoning=True, reasoning_parser='reasoning_parser_name')

We currently support the following reasoning parsers:

  • deepseek_r1
  • openthinker
  • simplescaling

The deepseek_r1 parser is implemented by vLLM and supports parsing QwQ inference outputs. The remaining parsers are located in the ./easyvllm/parsers directory and must be added to the vLLM parser path (as described above) before they can be used.

Once initialized, you can use the following methods for inference:

  • parallel_chat: Dialogue generation (supports LRM)
  • parallel_generate: Direct text generation (supports LRM)
  • parallel_chat_custom: Customizable dialogue generation (supports LRM)
  • parallel_chat_force_reasoning_content: Control reasoning content (for LRM)

These methods return a list of generated content in input order. For an LRM, they return a list of (reasoning content, final response) tuples.
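
As a minimal sketch (assuming messages_list is a batch of single-turn conversations and the model path is a placeholder), LRM inference with parallel_chat looks like this:

from easyvllm import InferenceModel

model = InferenceModel(model_path="your_LRM_path", enable_reasoning=True, reasoning_parser="deepseek_r1")

# One conversation per element; the exact batching shape follows the API docs below.
messages_list = [
    [{"role": "user", "content": "What is 17 * 24?"}],
    [{"role": "user", "content": "Is 97 a prime number?"}],
]

# For an LRM, each result is a (reasoning content, final response) tuple.
for reasoning, response in model.parallel_chat(messages_list):
    print(reasoning)
    print(response)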

CLI

We provide a CLI for users to conveniently invoke the model for inference.

  • decode

You can use the easyvllm decode command for inference with the following basic format:

easyvllm decode --model_path your_model_path --file_path input.json --decode_type query --save_path output.json --query_keys prompt

This reads queries from the prompt field of each record in input.json and saves the results to output.json.
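
For reference, a minimal input.json could look like this (assuming a JSON list of records whose field names match --query_keys):

[
  {"prompt": "Explain the Pythagorean theorem."},
  {"prompt": "What is 12 * 13?"}
]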

For reasoning length control, set decode_type to query_reasoning_ctrl, enable length control, and set reasoning_max_len:

easyvllm decode --model_path your_model_path --file_path input.json --decode_type query_reasoning_ctrl --save_path output.json --query_keys prompt --enable_reasoning True --enable_length_ctrl True --reasoning_max_len 500

To control reasoning content, set decode_type to query_force_reasoning_content and specify force_reasoning_content_keys:

easyvllm decode --model_path your_model_path --file_path input.json --decode_type query_force_reasoning_content --save_path output.json --query_keys prompt --enable_reasoning True --force_reasoning_content_keys reasoning
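
In this mode, each record additionally carries the enforced reasoning text under the key named by --force_reasoning_content_keys (here, reasoning), for example:

[
  {"prompt": "What is 12 * 13?", "reasoning": "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156."}
]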

For more details on available parameters, refer to the decode parameters.

  • decode multask

We provide a multi-task decoding feature via the easyvllm decode multask command, allowing users to execute multiple inference tasks in parallel by specifying a YAML configuration file.

To run multiple inference tasks, use the following command:

easyvllm decode multask --model_path your_model_path --tasks_yaml_path tasks_config.yaml

Here is a simple example of a tasks_config.yaml file:

- file_path: "input1.json"  
  decode_type: "query"  
  save_path: "output1.json"  
  query_keys: "input_text"  
  response_keys: "generated_text"  
  threads: 20  
  max_new_tokens: 8192  
  enable_reasoning: true  
  enable_length_ctrl: false  
  reasoning_max_len: 500   

- file_path: "input2.json"  
  decode_type: "query_reasoning_ctrl"  
  save_path: "output2.json"  
  query_keys: "input_text"  
  response_keys: "generated_text"  
  reasoning_keys: "reasoning_steps"  
  threads: 20  
  enable_reasoning: true  
  reasoning_max_retry: 10  
  force_reasoning_content_keys: "reasoning_text"  
  overwrite: true  

For more details on available parameters, refer to the decode multask parameters and the Task YAML Configuration section.

Documentation

InferenceModel

init

  • Required Parameters

    • model_path (str): Path to the model.
  • Optional Parameters

    • device_ids (list[int], optional): List of CUDA device IDs. If specified, the model runs on selected devices. The number of models initialized is determined by len(device_ids) // tensor_parallel_size. Defaults to None.
    • tensor_parallel_size (int, optional): Tensor parallelism size for initializing the vLLM model. Defaults to 1.
    • pipeline_parallel_size (int, optional): Pipeline parallelism size for initializing the vLLM model. Defaults to 1.
    • port (int, optional): Base port for the vLLM service. To avoid conflicts when initializing multiple models, the actual port used is port + device_id. Defaults to 50000.
    • max_model_len (int, optional): Maximum model input length. Defaults to None.
    • show_vllm_log (bool, optional): Whether to display vLLM logs. Set to False to disable log output. Defaults to True.
    • openai_timeout (int, optional): Timeout value (in seconds) for OpenAI client operations. Increase if the reasoning content is too large. Defaults to 30.
    • enable_reasoning (bool, optional): Enable large reasoning model (LRM) support. Defaults to False.
    • reasoning_parser (str, optional): Parser used for processing reasoning content. Defaults to 'deepseek_r1'.
    • chat_template (str, optional): Path to a chat template file for initializing the vLLM model. Defaults to None.
    • use_ray (bool, optional): Enable distributed inference across multiple nodes using Ray. Requires a Ray cluster to be created first. Defaults to False.
    • ray_host_ip (str, optional): Host IP address of the running Ray cluster. Required when use_ray=True. Defaults to None.
    • enforce_eager (bool, optional): Sets --enforce-eager for vLLM. Defaults to False.
    • gpu_memory_utilization (float, optional): The --gpu-memory-utilization value for vLLM. Defaults to 0.95.
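
For example, given the parameters above, an initialization like the following (model path is a placeholder) launches len(device_ids) // tensor_parallel_size = 2 replicas, each spanning two GPUs and served on a port derived from port + device_id:

from easyvllm import InferenceModel

# 4 device IDs with tensor_parallel_size=2 -> 2 model replicas.
model = InferenceModel(
    model_path="your_model_path",
    device_ids=[0, 1, 2, 3],
    tensor_parallel_size=2,
    port=50000,
    show_vllm_log=False,
)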

parallel_chat

  • Required Parameters

    • messages_list (list[dict[str, str]]): A list of message dictionaries, where each dictionary contains:
      • role (str): The role of the speaker, must be one of "user", "assistant", or "system" (the "system" role is only allowed in the first turn).
      • content (str): The content of the message.
  • Optional Parameters

    • threads (int, default=20): Number of threads for parallel execution.
    • return_dict (bool, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]].
    • reasoning_max_retry (int, default=10): Maximum number of retries for reasoning. Only relevant when using an LRM.
    • param (ChatParam, default=None): Additional chat parameters for customization.
    • ext_param (ChatExtraParam, default=None): Extended chat parameters for further customization.
  • Returns

    • list[str]: If the model is not an LRM, returns a list of generated responses.
    • list[tuple[str, str]]: If the model is an LRM, returns a list of tuples containing both the reasoning process and the final response.
    • list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
      • response (str): The generated response.
      • reasoning (str, optional): The reasoning process (only available in LRM mode).
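
A short sketch of the return_dict variant (key names as documented above):

# model and messages_list as initialized in Quick Start.
results = model.parallel_chat(messages_list, return_dict=True)
for item in results:
    print(item["response"])
    if "reasoning" in item:  # present only in LRM mode
        print(item["reasoning"])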

parallel_generate

  • Required Parameters

    • prompt_list (list[str]): A list of prompts, where each string is a prompt to generate a response for.
  • Optional Parameters

    • threads (int, optional, default=20): Number of threads for parallel execution.
    • return_dict (bool, optional, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]], where each dictionary contains:
      • response (str): The generated response for the prompt.
    • param (GenParam, optional, default=None): Additional parameters for customization.
    • ext_param (GenExtraParam, optional, default=None): Extended parameters for further customization.
  • Returns

    • list[str]: A list of generated responses corresponding to each prompt in prompt_list.
    • list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
      • response (str): The generated response for the prompt.

parallel_chat_custom

  • Required Parameters

    • messages_list (list[dict[str, str]]): A list of message dictionaries, where each dictionary contains:
      • role (str): The role of the speaker, must be one of "user", "assistant", or "system" (the "system" role is only allowed in the first turn).
      • content (str): The content of the message.
  • Optional Parameters

    • threads (int, default=20): Number of threads for parallel execution.
    • reasoning_max_retry (int, default=10): Maximum number of retries for reasoning. Only relevant when using an LRM.
    • add_reasoning_prompt (bool, default=False): Whether to explicitly add a reasoning prompt to guide the model.
    • enable_length_ctrl (bool, default=False): Whether to enable reasoning length control.
    • reasoning_max_len (int, default=None): Maximum length for the reasoning content. Ignored if reasoning_scale is set.
    • reasoning_min_len (int, default=0): Minimum length for the reasoning content. Ignored if reasoning_scale is set.
    • reasoning_scale (float, default=None): Scaling factor for reasoning content. If set, reasoning_max_len and reasoning_min_len are ignored, and the reasoning process is either truncated or extended accordingly.
    • cut_by_sentence (bool, default=False): Whether to truncate reasoning content at sentence boundaries when using reasoning_scale.
    • return_dict (bool, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]].
    • param (GenParam, default=None): Additional chat parameters for customization.
    • ext_param (GenExtraParam, default=None): Extended chat parameters for further customization.
  • Returns

    • list[str]: If the model is not an LRM, returns a list of generated responses.
    • list[tuple[str, str]]: If the model is an LRM, returns a list of tuples containing both the reasoning process and the final response.
    • list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
      • response (str): The generated response.
      • reasoning (str, optional): The reasoning process (only available in LRM mode).
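
For instance, a hedged sketch of reasoning length control with parallel_chat_custom (the bounds are illustrative):

# model and messages_list as initialized in Quick Start.
results = model.parallel_chat_custom(
    messages_list,
    enable_length_ctrl=True,
    reasoning_max_len=500,   # upper bound on reasoning content length
    reasoning_min_len=100,   # lower bound on reasoning content length
)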

parallel_chat_force_reasoning_content

  • Required Parameters

    • messages_list (list[dict[str, str]]): A list of message dictionaries, where each dictionary contains:
      • role (str): The role of the speaker, must be one of "user", "assistant", or "system" (the "system" role is only allowed in the first turn).
      • content (str): The content of the message.
    • reasoning_content (list[str]): A list of predefined reasoning processes corresponding to each input sequence. This ensures that the model follows the provided reasoning when generating responses.
  • Optional Parameters

    • threads (int, default=20): Number of threads for parallel execution.
    • reasoning_scale (float, default=None): Scaling factor for adjusting reasoning content length. If set, the reasoning content is truncated or extended accordingly.
    • cut_by_sentence (bool, default=False): Whether to truncate reasoning content at sentence boundaries when using reasoning_scale.
    • return_dict (bool, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]].
    • param (GenParam, default=None): Additional chat parameters for customization.
    • ext_param (GenExtraParam, default=None): Extended chat parameters for further customization.
  • Returns

    • list[tuple[str, str]]: Returns a list of tuples, each containing:
      • reasoning (str): The provided reasoning process.
      • response (str): The generated response.
    • list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
      • response (str): The generated response.
      • reasoning (str): The provided reasoning process.
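
A minimal sketch (the reasoning text is illustrative and must align one-to-one with messages_list):

# model as initialized in Quick Start.
messages_list = [[{"role": "user", "content": "What is 12 * 13?"}]]
reasoning_content = ["12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156."]

for reasoning, response in model.parallel_chat_force_reasoning_content(messages_list, reasoning_content):
    print(reasoning, "->", response)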

CLI

decode

  • Required Parameters

    • --model_path: Path to the model
    • --file_path: Path to the input data file (supports json, jsonl, csv, and xlsx)
    • --decode_type: Decoding type, available options:
      • query: Standard query
      • query_reasoning_ctrl: Control reasoning length
      • query_force_reasoning_content: Control reasoning content
    • --save_path: Path to save the output results
    • --query_keys: Specify query fields (comma-separated)
  • Optional Parameters

    • --response_keys: Specify response save fields (comma-separated)
    • --reasoning_keys: Reasoning save fields (for reasoning mode, comma-separated)
    • --tensor_parallel_size: Tensor parallelism size for the model (default: 1)
    • --pipeline_parallel_size: Pipeline parallelism size for the model (default: 1)
    • --model_num: Number of models loaded simultaneously
    • --port: Server listening port (default: 50000)
    • --max_model_len: The max_model_len of the vLLM model
    • --show_vllm_log: Whether to display vLLM logs (default: enabled)
    • --openai_timeout: Timeout for OpenAI client (default: 30 seconds)
    • --threads: Number of parallel threads (default: 20)
    • --enable_reasoning: Enable reasoning mode
    • --reasoning_parser: Reasoning parser name (default: deepseek_r1)
    • --system_prompt_file: Specify system prompt file
    • --chat_template_file: Specify chat template file
    • --max_new_tokens: Maximum number of new tokens to generate (default: 8192)
    • --device_ids: Specify GPU devices (comma-separated for multiple devices)
    • --reasoning_max_retry: Maximum number of retries for reasoning (default: 10)
    • --add_reasoning_prompt: Whether to add a reasoning prompt
    • --enable_length_ctrl: Enable reasoning length control
    • --reasoning_max_len: Maximum reasoning length
    • --reasoning_min_len: Minimum reasoning length (default: 0)
    • --reasoning_scale: Scaling factor for reasoning length
    • --cut_by_sentence: Whether to split reasoning by sentence
    • --force_reasoning_content_keys: Fields for enforcing reasoning content (comma-separated)
    • --overwrite: Whether to overwrite existing fields in the input file
    • --use_ray: Enable distributed inference across multiple nodes using Ray. Requires a Ray cluster to be created first. (default: false)
    • --ray_host_ip: Host IP address of the running Ray cluster. Required when use_ray=True.
    • --enforce_eager: Set --enforce-eager for vLLM. (default: false)
    • --gpu_memory_utilization: The --gpu-memory-utilization value for vLLM. (default: 0.95)

Note: Set multiple query_keys for multi-round generation. If response_keys, reasoning_keys, and/or force_reasoning_content_keys are specified, they must have the same length as query_keys. force_reasoning_content_keys must be specified when decode_type is set to query_force_reasoning_content.
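
For example, a two-round generation could look like this (field names are illustrative):

easyvllm decode --model_path your_model_path --file_path input.json --decode_type query --save_path output.json --query_keys question1,question2 --response_keys answer1,answer2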

decode multask

  • Required Parameters
    • --model_path: Path to the model
    • --tasks_yaml_path: Path to a YAML configuration file specifying multiple inference tasks
  • Optional Parameters
    • --tensor_parallel_size: Tensor parallelism size for the model (default: 1)
    • --pipeline_parallel_size: Pipeline parallelism size for the model (default: 1)
    • --max_model_len: Maximum input length
    • --model_num: Number of models loaded simultaneously
    • --port: Server listening port (default: 50000)
    • --openai_timeout: Timeout for OpenAI client (default: 30 seconds)
    • --enable_reasoning: Enable reasoning mode
    • --chat_template_file: Specify chat template file
    • --reasoning_parser: Reasoning parser (default: deepseek_r1)
    • --show_vllm_log: Whether to display vLLM logs (default: enabled)
    • --device_ids: Specify GPU devices (comma-separated for multiple devices)
    • --use_ray: Enable distributed inference across multiple nodes using Ray. Requires a Ray cluster to be created first. (default: false)
    • --ray_host_ip: Host IP address of the running Ray cluster. Required when use_ray=True.
    • --enforce_eager: Set --enforce-eager for vLLM. (default: false)
    • --gpu_memory_utilization: The --gpu-memory-utilization value for vLLM. (default: 0.95)

Task YAML Configuration

  • Required Parameters
    • file_path: Path to the input file
    • decode_type: Decoding mode (query, query_reasoning_ctrl, query_force_reasoning_content)
    • save_path: Path to save the output file
    • query_keys: Key(s) in the input file used as queries
  • Optional Parameters
    • response_keys: Key(s) in the output file for generated responses
    • reasoning_keys: Key(s) storing intermediate reasoning steps
    • threads: Number of threads for processing (default: 20)
    • system_prompt_file: Specify system prompt file
    • max_new_tokens: Maximum number of new tokens to generate (default: 8192)
    • reasoning_max_retry: Maximum number of retries for reasoning (default: 10)
    • add_reasoning_prompt: Whether to add a reasoning prompt
    • enable_length_ctrl: Enable length control for responses (default: false)
    • reasoning_max_len: Maximum length for reasoning content
    • reasoning_min_len: Minimum reasoning length (default: 0)
    • reasoning_scale: Scaling factor for reasoning length
    • cut_by_sentence: Whether to split content by sentence (default: false)
    • overwrite: Whether to overwrite existing fields in the input file (default: true)
    • force_reasoning_content_keys: Key(s) to enforce reasoning content generation
