easyvllm
An easy-to-use, lightweight vLLM tool with special support for Large Reasoning Models (LRMs).
| Documentation | Paper | Team Page | Apache License 2.0 |
Features
easyvllm wraps vLLM and uses the OpenAI client to call vllm serve, so that multiple model instances can be loaded simultaneously in a multi-GPU environment when working with small models. This approach maximizes batch inference performance. In addition, we support several LRM inference methods, including:
- Direct model invocation
- Inference length control
- Customizable inference content
We also support multi-node Ray cluster integration.
Installation
Create a new Python environment
To use easyvllm, we recommend creating a new Python environment.
You can create a new Python environment using conda:
conda create -n easyvllm python=3.11 -y
conda activate easyvllm
Install easyvllm
You can install easyvllm from PyPI:
pip install easyvllm
or install it locally from source:
git clone https://github.com/OpenRLHF/OpenRLHF.git
cd easyvllm
pip install -e .
Add Reasoning Parsers to vLLM
We implemented several reasoning parsers for vLLM, including openthinker and simplescaling.
To enable reasoning outputs, these parsers need to be registered in your local vLLM package. You can add them with the following steps:
- Find the vllm package in your Python environment. It should be in a path like ${conda_path}/envs/{your_env_name}/lib/python3.xx/site-packages/vllm.
- Find the reasoning_parsers folder. It should be in ${conda_path}/envs/{your_env_name}/lib/python3.xx/site-packages/vllm/entrypoints/openai/reasoning_parsers.
- Add from easyvllm.parsers import * to the __init__.py file in that folder. After the modification, the content should be as follows:
# SPDX-License-Identifier: Apache-2.0
from .abs_reasoning_parsers import ReasoningParser, ReasoningParserManager
from .deepseek_r1_reasoning_parser import DeepSeekR1ReasoningParser
from easyvllm.parsers import *
__all__ = [
"ReasoningParser", "ReasoningParserManager", "DeepSeekR1ReasoningParser"
]
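To check that the edit took effect, a quick import sketch like the one below can help. It only verifies that the modified package still imports cleanly; the module path follows the directory layout described above.
# Sanity check (a sketch): if the import line above was added correctly,
# importing vLLM's reasoning parser package should succeed and also pull
# in easyvllm's parsers via the star import.
import vllm.entrypoints.openai.reasoning_parsers as reasoning_parsers

print(reasoning_parsers.__file__)  # should point at the __init__.py you edited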
Quick Start
InferenceModel
Using the encapsulated model is simple. Just import InferenceModel and provide the model path along with relevant parameters for initialization:
from easyvllm import InferenceModel
model = InferenceModel(model_path="your_model_path")
If you are initializing an LRM, set the enable_reasoning flag and choose a reasoning_parser:
from easyvllm import InferenceModel
model = InferenceModel(model_path="your_LRM_path", enable_reasoning=True, reasoning_parser='reasoning_parser_name')
We currently support the following reasoning parsers:
- deepseek_r1
- openthinker
- simplescaling
The deepseek_r1 parser is implemented by vLLM and supports parsing QwQ inference outputs. The remaining parsers are located in the ./easyvllm/parsers directory and must be added to the vLLM parser path (see Add Reasoning Parsers to vLLM above) before they can be used.
Once initialized, you can use the following methods for inference:
- parallel_chat: Dialogue generation (supports LRM)
- parallel_generate: Direct text generation (supports LRM)
- parallel_chat_custom: Customizable dialogue generation (supports LRM)
- parallel_chat_force_reasoning_content: Control reasoning content (for LRM)
These methods return a list of generated content in input order. For an LRM, they return a list of (reasoning content, response) tuples.
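As a minimal sketch (the model path is a placeholder, and the exact shape expected for messages_list should be checked against the parameter documentation below), LRM inference looks roughly like this:
from easyvllm import InferenceModel

# Minimal sketch; "your_LRM_path" is a placeholder.
model = InferenceModel(model_path="your_LRM_path", enable_reasoning=True, reasoning_parser="deepseek_r1")

# One conversation in the role/content message format documented below.
messages_list = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many prime numbers are there below 20?"},
]

results = model.parallel_chat(messages_list, threads=20)
# With enable_reasoning=True, each element is a (reasoning, response) tuple.
for reasoning, response in results:
    print(reasoning)
    print(response)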
CLI
We provide a CLI interface for users to conveniently invoke the model for inference.
- decode
You can use the easyvllm decode command for inference with the following basic format:
easyvllm decode --model_path your_model_path --file_path input.json --decode_type query --save_path output.json --query_keys prompt
This reads queries from input.json (using the prompt key) and saves the results to output.json.
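As an illustration of the expected input (an assumption based on the command above, not a documented schema), input.json can be a list of records, each carrying the field named by --query_keys. A hypothetical way to build it:
import json

# Hypothetical input records; "prompt" matches the --query_keys value above.
records = [
    {"prompt": "What is 17 * 24?"},
    {"prompt": "Explain the Pythagorean theorem in one sentence."},
]
with open("input.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)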
For reasoning length control, use the query_reasoning_ctrl decode type, enable length control, and set reasoning_max_len:
easyvllm decode --model_path your_model_path --file_path input.json --decode_type query_reasoning_ctrl --save_path output.json --query_keys prompt --enable_reasoning True --enable_length_ctrl True --reasoning_max_len 500
To control the reasoning content, use the query_force_reasoning_content decode type and specify force_reasoning_content_keys:
easyvllm decode --model_path your_model_path --file_path input.json --decode_type query_force_reasoning_content --save_path output.json --query_keys prompt --enable_reasoning True --force_reasoning_content_keys reasoning
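For this mode, each input record is assumed to carry both the query field and the field named by --force_reasoning_content_keys (here, reasoning); for example:
import json

# Hypothetical record layout for query_force_reasoning_content.
records = [
    {
        "prompt": "What is 17 * 24?",
        "reasoning": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    }
]
with open("input.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)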
For more details on available parameters, refer to the decode parameters.
- decode multask
We provide a multi-task decoding feature via the easyvllm decode multask command, allowing users to execute multiple inference tasks in parallel by specifying a YAML configuration file.
To run multiple inference tasks, use the following command:
easyvllm decode multask --model_path your_model_path --tasks_yaml_path tasks_config.yaml
Here is a simple example of a tasks_config.yaml file:
- file_path: "input1.json"
  decode_type: "query"
  save_path: "output1.json"
  query_keys: "input_text"
  response_keys: "generated_text"
  threads: 20
  max_new_tokens: 8192
  enable_reasoning: true
  enable_length_ctrl: false
  reasoning_max_len: 500
- file_path: "input2.json"
  decode_type: "query_reasoning_ctrl"
  save_path: "output2.json"
  query_keys: "input_text"
  response_keys: "generated_text"
  reasoning_keys: "reasoning_steps"
  threads: 20
  enable_reasoning: true
  reasoning_max_retry: 10
  force_reasoning_content_keys: "reasoning_text"
  overwrite: true
For more details on available parameters, refer to the decode multask parameters and the task YAML configuration below.
Documentation
InferenceModel
init
- Required Parameters
  - model_path (str): Path to the model.
- Optional Parameters
  - device_ids (list[int], optional): List of CUDA device IDs. If specified, the model runs on the selected devices. The number of model instances initialized is len(device_ids) // tensor_parallel_size. Defaults to None.
  - tensor_parallel_size (int, optional): Tensor parallelism size for initializing the vLLM model. Defaults to 1.
  - pipeline_parallel_size (int, optional): Pipeline parallelism size for initializing the vLLM model. Defaults to 1.
  - port (int, optional): Base port for the vLLM service. To avoid conflicts when initializing multiple models, the actual port used is port + device_id. Defaults to 50000.
  - max_model_len (int, optional): Maximum model input length. Defaults to None.
  - show_vllm_log (bool, optional): Whether to display vLLM logs. Set to False to disable log output. Defaults to True.
  - openai_timeout (int, optional): Timeout (in seconds) for OpenAI client operations. Increase it if the reasoning content is very long. Defaults to 30.
  - enable_reasoning (bool, optional): Enable Large Reasoning Model (LRM) support. Defaults to False.
  - reasoning_parser (str, optional): Parser used for processing reasoning content. Defaults to 'deepseek_r1'.
  - chat_template (str, optional): Path to a chat template file for initializing the vLLM model. Defaults to None.
  - use_ray (bool, optional): Enable distributed inference across multiple nodes using Ray. Requires a Ray cluster to be created first. Defaults to False.
  - ray_host_ip (str, optional): Host IP address of the running Ray cluster. Required when use_ray=True. Defaults to None.
  - enforce_eager (bool, optional): Sets --enforce-eager for vLLM. Defaults to False.
  - gpu_memory_utilization (float, optional): The --gpu-memory-utilization value for vLLM. Defaults to 0.95.
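For example, the following sketch (the model path is a placeholder) starts len(device_ids) // tensor_parallel_size = 2 model instances across four GPUs:
from easyvllm import InferenceModel

# Sketch of multi-GPU initialization: 4 devices with tensor_parallel_size=2
# yields 2 model instances; service ports are derived from `port` as
# described above (port + device_id).
model = InferenceModel(
    model_path="your_model_path",
    device_ids=[0, 1, 2, 3],
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    show_vllm_log=False,
)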
parallel_chat
- Required Parameters
  - messages_list (list[dict[str, str]]): A list of message dictionaries, where each dictionary contains:
    - role (str): The role of the speaker; must be one of "user", "assistant", or "system" (the "system" role is only allowed in the first turn).
    - content (str): The content of the message.
- Optional Parameters
  - threads (int, default=20): Number of threads for parallel execution.
  - return_dict (bool, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]].
  - reasoning_max_retry (int, default=10): Maximum number of retries for reasoning. Only relevant when using an LRM.
  - param (ChatParam, default=None): Additional chat parameters for customization.
  - ext_param (ChatExtraParam, default=None): Extended chat parameters for further customization.
- Returns
  - list[str]: If the model is not an LRM, returns a list of generated responses.
  - list[tuple[str, str]]: If the model is an LRM, returns a list of tuples containing both the reasoning process and the final response.
  - list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
    - response (str): The generated response.
    - reasoning (str, optional): The reasoning process (only available in LRM mode).
parallel_generate
- Required Parameters
  - prompt_list (list[str]): A list of prompts, where each string is a prompt to generate a response for.
- Optional Parameters
  - threads (int, optional, default=20): Number of threads for parallel execution.
  - return_dict (bool, optional, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]], where each dictionary contains:
    - response (str): The generated response for the prompt.
  - param (GenParam, optional, default=None): Additional parameters for customization.
  - ext_param (GenExtraParam, optional, default=None): Extended parameters for further customization.
- Returns
  - list[str]: A list of generated responses corresponding to each prompt in prompt_list.
  - list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
    - response (str): The generated response for the prompt.
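A short sketch of parallel_generate (model initialized as in the Quick Start; prompts are placeholders):
prompts = [
    "Translate 'hello world' into French.",
    "List three prime numbers greater than 100.",
]

responses = model.parallel_generate(prompts, threads=8)
# Each element corresponds to one prompt; per the Quick Start, an LRM may
# return (reasoning, response) tuples instead of plain strings.
for item in responses:
    print(item)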
parallel_chat_custom
- Required Parameters
  - messages_list (list[dict[str, str]]): A list of message dictionaries, where each dictionary contains:
    - role (str): The role of the speaker; must be one of "user", "assistant", or "system" (the "system" role is only allowed in the first turn).
    - content (str): The content of the message.
- Optional Parameters
  - threads (int, default=20): Number of threads for parallel execution.
  - reasoning_max_retry (int, default=10): Maximum number of retries for reasoning. Only relevant when using an LRM.
  - add_reasoning_prompt (bool, default=False): Whether to explicitly add a reasoning prompt to guide the model.
  - enable_length_ctrl (bool, default=False): Whether to enable reasoning length control.
  - reasoning_max_len (int, default=None): Maximum length for the reasoning content. Ignored if reasoning_scale is set.
  - reasoning_min_len (int, default=0): Minimum length for the reasoning content. Ignored if reasoning_scale is set.
  - reasoning_scale (float, default=None): Scaling factor for reasoning content. If set, reasoning_max_len and reasoning_min_len are ignored, and the reasoning process is truncated or extended accordingly.
  - cut_by_sentence (bool, default=False): Whether to truncate reasoning content at sentence boundaries when using reasoning_scale.
  - return_dict (bool, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]].
  - param (GenParam, default=None): Additional chat parameters for customization.
  - ext_param (GenExtraParam, default=None): Extended chat parameters for further customization.
- Returns
  - list[str]: If the model is not an LRM, returns a list of generated responses.
  - list[tuple[str, str]]: If the model is an LRM, returns a list of tuples containing both the reasoning process and the final response.
  - list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
    - response (str): The generated response.
    - reasoning (str, optional): The reasoning process (only available in LRM mode).
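A sketch of reasoning length control with parallel_chat_custom (assumes a model initialized with enable_reasoning=True and a messages_list as in the Quick Start):
# Cap the reasoning content at reasoning_max_len and require at least
# reasoning_min_len; units follow the reasoning_max_len description above.
results = model.parallel_chat_custom(
    messages_list,
    enable_length_ctrl=True,
    reasoning_max_len=500,
    reasoning_min_len=100,
    reasoning_max_retry=10,
)
for reasoning, response in results:
    print(len(reasoning), response)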
parallel_chat_force_reasoning_content
- Required Parameters
  - messages_list (list[dict[str, str]]): A list of message dictionaries, where each dictionary contains:
    - role (str): The role of the speaker; must be one of "user", "assistant", or "system" (the "system" role is only allowed in the first turn).
    - content (str): The content of the message.
  - reasoning_content (list[str]): A list of predefined reasoning processes corresponding to each input sequence. This ensures that the model follows the provided reasoning when generating responses.
- Optional Parameters
  - threads (int, default=20): Number of threads for parallel execution.
  - reasoning_scale (float, default=None): Scaling factor for adjusting reasoning content length. If set, the reasoning content is truncated or extended accordingly.
  - cut_by_sentence (bool, default=False): Whether to truncate reasoning content at sentence boundaries when using reasoning_scale.
  - return_dict (bool, default=False): Whether to return responses as a list of dictionaries. If True, responses are returned as list[dict[str, str]].
  - param (GenParam, default=None): Additional chat parameters for customization.
  - ext_param (GenExtraParam, default=None): Extended chat parameters for further customization.
- Returns
  - list[tuple[str, str]]: Returns a list of tuples, each containing:
    - reasoning (str): The provided reasoning process.
    - response (str): The generated response.
  - list[dict[str, str]]: If return_dict=True, returns a list of dictionaries, where each dictionary contains:
    - response (str): The generated response.
    - reasoning (str): The provided reasoning process.
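A sketch of forcing predefined reasoning (one reasoning string per input conversation; the content is illustrative):
reasoning_content = [
    "First compute 17 * 20 = 340, then 17 * 4 = 68, so the product is 408.",
]

results = model.parallel_chat_force_reasoning_content(
    messages_list,
    reasoning_content=reasoning_content,
    reasoning_scale=0.5,   # optionally shrink the provided reasoning
    cut_by_sentence=True,  # truncate at sentence boundaries when scaling
)
for reasoning, response in results:
    print(response)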
CLI
decode
- Required Parameters
  - --model_path: Path to the model
  - --file_path: Path to the input data file (supports json, jsonl, csv, and xlsx)
  - --decode_type: Decoding type; available options:
    - query: Standard query
    - query_reasoning_ctrl: Control reasoning length
    - query_force_reasoning_content: Control reasoning content
  - --save_path: Path to save the output results
  - --query_keys: Query fields (comma-separated)
- Optional Parameters
  - --response_keys: Response save fields (comma-separated)
  - --reasoning_keys: Reasoning save fields (for reasoning mode, comma-separated)
  - --tensor_parallel_size: Tensor parallelism size for the model (default: 1)
  - --pipeline_parallel_size: Pipeline parallelism size for the model (default: 1)
  - --model_num: Number of models loaded simultaneously
  - --port: Server listening port (default: 50000)
  - --max_model_len: max_model_len of the vLLM model
  - --show_vllm_log: Whether to display vLLM logs (default: enabled)
  - --openai_timeout: Timeout for the OpenAI client (default: 30 seconds)
  - --threads: Number of parallel threads (default: 20)
  - --enable_reasoning: Enable reasoning mode
  - --reasoning_parser: Reasoning parser name (default: deepseek_r1)
  - --system_prompt_file: System prompt file
  - --chat_template_file: Chat template file
  - --max_new_tokens: Maximum number of new tokens to generate (default: 8192)
  - --device_ids: GPU devices (comma-separated for multiple devices)
  - --reasoning_max_retry: Maximum number of retries for reasoning (default: 10)
  - --add_reasoning_prompt: Whether to add a reasoning prompt
  - --enable_length_ctrl: Enable reasoning length control
  - --reasoning_max_len: Maximum reasoning length
  - --reasoning_min_len: Minimum reasoning length (default: 0)
  - --reasoning_scale: Scaling factor for reasoning length
  - --cut_by_sentence: Whether to split reasoning by sentence
  - --force_reasoning_content_keys: Fields for enforcing reasoning content (comma-separated)
  - --overwrite: Whether to overwrite existing fields in the input file
  - --use_ray: Enable distributed inference across multiple nodes using Ray. Requires a Ray cluster to be created first. (default: false)
  - --ray_host_ip: Host IP address of the running Ray cluster. Required when use_ray=True.
  - --enforce_eager: Sets --enforce-eager for vLLM. (default: false)
  - --gpu_memory_utilization: The --gpu-memory-utilization value for vLLM. (default: 0.95)
Note: Set multiple query_keys for multi-round generation (see the example below). If response_keys, reasoning_keys, and/or force_reasoning_content_keys are specified, they must have the same length as query_keys. force_reasoning_content_keys must be specified when decode_type is set to query_force_reasoning_content.
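For instance, a hypothetical two-round invocation (the field names are placeholders) pairs each query key with the response key at the same position:
easyvllm decode --model_path your_model_path --file_path input.json --decode_type query --save_path output.json --query_keys question1,question2 --response_keys answer1,answer2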
decode multask
- Required Parameters
  - --model_path: Path to the model
  - --tasks_yaml_path: Path to a YAML configuration file specifying multiple inference tasks
- Optional Parameters
  - --tensor_parallel_size: Tensor parallelism size for the model (default: 1)
  - --pipeline_parallel_size: Pipeline parallelism size for the model (default: 1)
  - --max_model_len: Maximum input length
  - --model_num: Number of models loaded simultaneously
  - --port: Server listening port (default: 50000)
  - --openai_timeout: Timeout for the OpenAI client (default: 30 seconds)
  - --enable_reasoning: Enable reasoning mode
  - --chat_template_file: Chat template file
  - --reasoning_parser: Reasoning parser (default: deepseek_r1)
  - --show_vllm_log: Whether to display vLLM logs (default: enabled)
  - --device_ids: GPU devices (comma-separated for multiple devices)
  - --use_ray: Enable distributed inference across multiple nodes using Ray. Requires a Ray cluster to be created first. (default: false)
  - --ray_host_ip: Host IP address of the running Ray cluster. Required when use_ray=True.
  - --enforce_eager: Sets --enforce-eager for vLLM. (default: false)
  - --gpu_memory_utilization: The --gpu-memory-utilization value for vLLM. (default: 0.95)
Task YAML Configuration
- Required Parameters
  - file_path: Path to the input file
  - decode_type: Decoding mode (query, query_reasoning_ctrl, query_force_reasoning_content)
  - save_path: Path to save the output file
  - query_keys: Key(s) in the input file used as queries
- Optional Parameters
  - response_keys: Key(s) in the output file for generated responses
  - reasoning_keys: Key(s) storing intermediate reasoning steps
  - threads: Number of threads for processing (default: 20)
  - system_prompt_file: System prompt file
  - max_new_tokens: Maximum number of new tokens to generate (default: 8192)
  - reasoning_max_retry: Maximum number of retries for reasoning (default: 10)
  - add_reasoning_prompt: Whether to add a reasoning prompt
  - enable_length_ctrl: Enable reasoning length control (default: false)
  - reasoning_max_len: Maximum length for reasoning content
  - reasoning_min_len: Minimum reasoning length (default: 0)
  - reasoning_scale: Scaling factor for reasoning length
  - cut_by_sentence: Whether to split content by sentence (default: false)
  - force_reasoning_content_keys: Key(s) to enforce reasoning content generation
  - overwrite: Whether to overwrite existing fields in the input file (default: true)
File details
Details for the file easyvllm-0.1.0.tar.gz.
File metadata
- Download URL: easyvllm-0.1.0.tar.gz
- Upload date:
- Size: 28.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f51209d806c30d831839726fddbae452f5d3c0ef0371d0448946bbf9e526ae60 |
| MD5 | 922ab7c13c2acff1f36dc221d082891b |
| BLAKE2b-256 | 2d436fd8ae992a7d1b0ef1e92e136674cd4038dccc441e0d88507a86e8fe6c44 |
File details
Details for the file easyvllm-0.1.0-py3-none-any.whl.
File metadata
- Download URL: easyvllm-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a79fc6b6af403bbf96503631f4fa503dce63e854ff7fb67581bdc8bc6d53136 |
| MD5 | 9c38f0a2f809160249725e34faa0dea8 |
| BLAKE2b-256 | 1fb5ff571dc9387f41be3181b8f50520335923e9527bd1f4f523e268f30739df |