vlllm: High-Performance Text Generation with vLLM and Multiprocessing
A utility package for text generation using vLLM with multiprocessing support.
vlllm is a Python utility package designed to simplify and accelerate text generation tasks using the powerful vLLM library. It offers a convenient interface for batch processing, chat templating, multiple sampling strategies, and multi-GPU inference with tensor and pipeline parallelism, all wrapped in an easy-to-use `generate` function with multiprocessing support.
Features
- Batch Processing: Efficiently process lists of prompts.
- Flexible Input: Supports both single string prompts and list-based chat message formats (e.g., `[{"role": "user", "content": "Hello!"}]`).
- System Prompts: Easily integrate system-level instructions.
- Multiple Samples (`n`): Generate multiple completions per prompt.
  - Input Duplication Strategy (`use_sample=False`): Duplicates input prompts `n` times for generation.
  - vLLM Native Sampling (`use_sample=True`): Uses vLLM's internal sampling parameter (`SamplingParams(n=n)`) for generating `n` completions.
- Multiprocessing (`worker_num`): Distribute generation tasks across multiple CPU worker processes, each potentially managing its own vLLM instance and GPU(s).
- Tensor Parallelism (`tp` or `gpu_assignments`): Configure tensor parallelism for vLLM instances within each worker.
- Pipeline Parallelism (`pp`): Supports vLLM's pipeline parallelism (requires `pp > 1` and uses `distributed_executor_backend="ray"`).
- Chunking (`chunk_size`): Control the maximum number of prompts processed by a vLLM engine in a single call, useful for managing memory and very large datasets.
- Customizable Output: Specify the key under which results are stored.
- Robust GPU Management: Automatic or manual assignment of GPUs to workers.
Installation
```bash
pip install vlllm
```
Quick Start
```python
from vlllm import generate

# Example data with string prompts
data = [
    {"prompt": "Write a story about a dragon"},
    {"prompt": "Explain quantum computing"},
] * 1000

# Basic usage
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=2,  # Use 2 worker processes
    tp=1,          # 1 GPU per worker
)

# Each item in results will have a new 'results' field with the generated text
print(results[0]["results"])
```
Parameters
Core Parameters
- `model_id` (str): Model identifier or path to load
- `data` (List[Dict]): List of dictionaries containing prompts/messages
- `message_key` (str, default: "prompt"): Key in each dictionary containing the prompt or messages
- `system` (str, optional): Global system prompt to prepend to all messages
- `result_key` (str, default: "results"): Key name for storing generation results
Message Format Handling
The package intelligently handles different input formats:
- String format: If `data[i][message_key]` is a string, it's automatically converted to a chat message format
- List format: If `data[i][message_key]` is a list, it's treated as a chat conversation with roles and content
When a system prompt is provided:
- For string inputs: Creates a message list with system and user messages
- For list inputs: Prepends the system message (unless one already exists)
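For illustration only, the sketch below mirrors the normalization rules described above. `build_messages` is a hypothetical helper written for this example, not part of the vlllm API.
```python
# Hypothetical helper that mirrors the documented normalization rules;
# it is NOT part of the vlllm API and exists only to illustrate the behavior.
def build_messages(value, system=None):
    if isinstance(value, str):
        # String input: wrap it as a single user turn
        messages = [{"role": "user", "content": value}]
    else:
        # List input: treat it as an existing chat conversation
        messages = list(value)
    # Prepend the global system prompt unless one is already present
    if system and not (messages and messages[0].get("role") == "system"):
        messages = [{"role": "system", "content": system}] + messages
    return messages

print(build_messages("Hello!", system="Be brief."))
# [{'role': 'system', 'content': 'Be brief.'}, {'role': 'user', 'content': 'Hello!'}]
```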
Generation Parameters
- `n` (int, default: 1): Number of samples to generate per prompt
- `use_sample` (bool, default: False):
  - If `False`: Duplicates each prompt `n` times in the generation list
  - If `True`: Uses vLLM's native `SamplingParams(n=n)` for efficient sampling
- `temperature` (float, default: 0.7): Sampling temperature
- `max_output_len` (int, default: 1024): Maximum tokens to generate per sample
Result Format
- If `n=1`: The `result_key` field contains a single string
- If `n>1`: The `result_key` field contains a list of strings
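Because the type stored under `result_key` depends on `n`, downstream code may want to normalize it. The snippet below is a minimal sketch assuming `results` came from a `generate` call with the default `result_key="results"`.
```python
# Normalize the result field to a list of strings regardless of n.
for item in results:
    value = item["results"]  # str when n=1, list of str when n>1
    completions = [value] if isinstance(value, str) else value
    for text in completions:
        print(text[:80])
```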
Parallelization Parameters
- `worker_num` (int, default: 1): Number of worker processes
  - If 1: Single-process execution
  - If >1: Multi-process execution with data evenly distributed
- `tp` (int, default: 1): Tensor parallel size per worker
- `pp` (int, default: 1): Pipeline parallel size
  - If >1: Uses the Ray distributed backend (requires `worker_num=1`)
- `gpu_assignments` (List[List[int]], optional): Custom GPU assignments per worker
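When `gpu_assignments` is omitted, each worker needs `tp` GPUs, so `worker_num * tp` GPUs are required in total. The sketch below builds an explicit, contiguous assignment that satisfies this requirement; the package's automatic assignment is not guaranteed to match it, so treat this purely as an illustration.
```python
# Illustration only: contiguous per-worker GPU assignments that satisfy the
# worker_num * tp requirement. vlllm's automatic assignment may differ.
worker_num, tp = 2, 2
gpu_assignments = [list(range(w * tp, (w + 1) * tp)) for w in range(worker_num)]
print(gpu_assignments)  # [[0, 1], [2, 3]]
```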
Performance Parameters
- `chunk_size` (int, optional): Maximum items per generation batch
  - If not set: Each worker processes its entire partition at once
  - If set: Data is processed in chunks of this size
- `max_model_len` (int, default: 4096): Maximum model sequence length
- `gpu_memory_utilization` (float, default: 0.90): Target GPU memory usage
- `dtype` (str, default: "auto"): Model data type
- `trust_remote_code` (bool, default: True): Whether to trust remote code
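As an example of combining these knobs, the call below uses illustrative values (not recommended defaults) to cap sequence length and GPU memory while processing data in small chunks.
```python
# Illustrative values only; tune them for your model and hardware.
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=2,
    tp=1,
    chunk_size=200,               # limit prompts per engine call
    max_model_len=2048,           # cap the model's sequence length
    gpu_memory_utilization=0.85,  # leave some GPU memory headroom
    dtype="bfloat16",             # explicit dtype instead of "auto"
    max_output_len=512,
)
```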
Advanced Usage
Chat Format with Multiple Samples
```python
# Data with chat message format
data = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }
] * 100

# Generate 3 different responses per prompt
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    message_key="messages",                 # Specify the key containing messages
    system="You are a helpful assistant.",  # Global system prompt
    n=3,                                    # Generate 3 samples
    use_sample=True,                        # Use vLLM's native sampling
    temperature=0.8,
    worker_num=2,
    tp=2,                                   # Use 2 GPUs per worker
)

# results[0]["results"] will be a list of 3 different responses
for i, response in enumerate(results[0]["results"]):
    print(f"Response {i+1}: {response}")
```
Processing Large Datasets with Chunking
```python
# Large dataset
data = [{"prompt": f"Question {i}"} for i in range(10000)]

results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=4,
    chunk_size=100,      # Process in chunks of 100 items
    tp=1,
    max_output_len=512,
)
```
Custom GPU Assignment
```python
# Assign specific GPUs to each worker
results = generate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    data=data,
    worker_num=2,
    gpu_assignments=[[0, 1], [2, 3]],  # Worker 0 uses GPUs 0 and 1; worker 1 uses GPUs 2 and 3
)
```
Pipeline Parallelism
```python
# Use pipeline parallelism (requires worker_num=1)
results = generate(
    model_id="meta-llama/Llama-2-70b-chat-hf",
    data=data,
    worker_num=1,
    pp=4,  # 4-way pipeline parallelism
    tp=2,  # 2-way tensor parallelism
)
```
Important Notes
- Pipeline Parallelism: When `pp > 1`, `worker_num` must be 1
- GPU Requirements: Total GPUs needed = `worker_num * tp` (when not using custom assignments)
- Memory Management: The package automatically handles memory cleanup between batches
- Error Handling: Failed generations are marked with error messages in the results
- Process Safety: Uses the spawn start method for multiprocessing on POSIX systems
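Since total GPU demand is `worker_num * tp` when custom assignments are not used, a quick sanity check before launching can catch misconfiguration; this is an optional pattern, not something vlllm requires.
```python
import torch

worker_num, tp = 4, 2
required = worker_num * tp
available = torch.cuda.device_count()
assert available >= required, f"Need {required} GPUs but only {available} are visible"
```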
Example: Batch Processing Pipeline
```python
from vlllm import generate
import json

# Load your dataset
with open("questions.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

# Configure generation
results = generate(
    model_id="meta-llama/Llama-2-13b-chat-hf",
    data=data,
    message_key="question",   # Your data has questions in the 'question' field
    system="Answer concisely and accurately.",
    n=1,
    temperature=0.1,          # Low temperature for consistency
    worker_num=4,             # 4 parallel workers
    tp=2,                     # 2 GPUs per worker
    chunk_size=50,            # Process 50 items at a time
    max_output_len=256,
    result_key="answer",      # Store results in the 'answer' field
)

# Save results
with open("answers.jsonl", "w") as f:
    for item in results:
        f.write(json.dumps(item) + "\n")
```
Requirements
- Python >= 3.8
- vLLM
- PyTorch
- Transformers
- CUDA-capable GPUs (for GPU acceleration)
License
Apache-2.0 License