vLLM Efficient Client

A unified interface for efficient LLM inference with vLLM and OpenAI-compatible APIs
A unified Python package for efficient Large Language Model (LLM) inference, supporting both:
- vLLM offline inference - High-throughput batch inference on local GPUs
- OpenAI-compatible APIs - Remote inference with automatic retry and resume
Features
vLLM Client (Offline Inference)
- 🚀 High-throughput batch inference using vLLM
- 🎯 Direct token feeding for optimal performance
- 💾 Automatic GPU memory management
- 🔄 Model switching without restarting
- 📝 HuggingFace chat template support (illustrated, together with direct token feeding, in the sketch below)
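The last two items are easiest to see with a small illustrative sketch: a HuggingFace tokenizer's chat template turns structured messages into a prompt, and the resulting token IDs can be handed to vLLM directly instead of being re-tokenized inside the engine. This is an illustration of the idea using the `transformers` library, not the client's internal code.

```python
# Illustrative only: roughly what "chat template support" and
# "direct token feeding" mean, not this package's implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

messages = [{"role": "user", "content": "What is the capital of France?"}]

# The model's own chat template converts structured messages into a prompt string...
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# ...and tokenizing the same template yields token IDs that can be fed
# to the engine directly, skipping a second tokenization pass.
prompt_token_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True
)
print(len(prompt_token_ids), "prompt tokens")
```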
OpenAI Client (API Inference)
- 🌐 Works with any OpenAI-compatible API (OpenAI, OpenRouter, DeepSeek, Anthropic, etc.)
- 🔁 Automatic retry with exponential backoff
- 💾 Resume from checkpoint (auto-saves progress)
- 🛡️ Graceful quota exhaustion handling
- 🎯 Provider-specific parameter optimization
Installation
Basic Installation
```bash
pip install vllm-efficient-client
```
With vLLM Support (for offline inference)
```bash
pip install vllm-efficient-client[vllm]
```
With OpenAI Support (for API inference)
```bash
pip install vllm-efficient-client[openai]
```
With All Features
```bash
pip install vllm-efficient-client[all]
```
Development Installation
```bash
git clone https://github.com/yourusername/vllm-efficient-client.git
cd vllm-efficient-client
pip install -e .[dev]
```
Quick Start
Using vLLM Client (Offline Inference)
```python
from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

# Configure vLLM resources
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    disable_log_stats=True,
)

# Optional: auto-scale resources for the model size
config.scale_for_model_size(3)  # for a 3B-parameter model

# Initialize the client
client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1, "category": "geography"},
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing."}
        ],
        "metadata": {"id": 2, "category": "science"},
    },
]

# Run inference
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100),
)

# Each result includes its metadata plus the generated output
for result in results:
    print(f"ID: {result['id']}")
    print(f"Output: {result['output']}")
    print()

# Clean up and free GPU memory
client.delete_client()
```
Using OpenAI Client (API Inference)
```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

# Configure the API client
config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",  # or OpenRouter, DeepSeek, etc.
    enable_retry=True,
    max_retries=5,
)

# Initialize the client
client = OpenAIClient("gpt-4", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"qid": 1, "variant": "base", "seed": 42},
    }
]

# Run inference with auto-save (enables resume)
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json",  # auto-saves progress for resume on failure
)

# If the run is interrupted, just run it again: it resumes from the checkpoint.
```
Advanced Usage
Switching Models (vLLM)
client = VLLMClient("model-1", config)
# ... do some work ...
# Switch to a different model
client.reset_client_to_another_model("model-2")
# ... continue with new model ...
Multiple Completions (OpenAI)
```python
# Generate 5 different responses per prompt
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.8, max_tokens=100, n=5),
    output_path="results.json",
)

# Each result contains a list of 5 outputs
for result in results:
    print(f"Generated {len(result['output'])} responses")
```
Custom Resource Scaling (vLLM)
```python
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=256,
    max_num_batched_tokens=131072,
    block_size=32,
    tensor_parallel_size=2,  # use 2 GPUs
    dtype="float16",
    trust_remote_code=False,
    disable_log_stats=True,
    enable_prefix_caching=True,
)

# Automatically adjust the settings for a 70B model
config.scale_for_model_size(70)
```
Configuration Options
VLLMResourceConfig
| Parameter | Type | Description |
|---|---|---|
| `gpu_memory_utilization` | float | Fraction of GPU memory to use (0.0-1.0) |
| `max_model_len` | int | Maximum sequence length |
| `max_num_seqs` | int | Maximum sequences per iteration |
| `max_num_batched_tokens` | int | Maximum tokens in a batch |
| `block_size` | int | Token block size for paged attention |
| `tensor_parallel_size` | int | Number of GPUs for tensor parallelism |
| `dtype` | str | Data type ("float16", "bfloat16", "float32") |
| `trust_remote_code` | bool | Trust remote code from the model hub |
| `enable_prefix_caching` | bool | Enable KV-cache prefix caching |
OpenAIConfig
| Parameter | Type | Description |
|---|---|---|
| `api_key` | str | API key for authentication |
| `base_url` | str | Base URL of the API endpoint |
| `enable_retry` | bool | Enable automatic retry on errors |
| `max_retries` | int | Maximum retry attempts |
| `initial_retry_delay` | float | Initial delay for exponential backoff (seconds) |
| `max_retry_delay` | float | Maximum delay for exponential backoff (seconds) |
SamplingConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| `temperature` | float | 0.0 | Sampling temperature (0.0 = deterministic) |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `max_tokens` | int | 2 | Maximum tokens to generate |
| `n` | int | 1 | Number of completions (OpenAI only) |
| `seed` | int | None | Random seed for reproducibility |
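As a quick illustration of the table (a sketch, assuming the defaults listed above):

```python
from vllm_efficient_client import SamplingConfig

# With the defaults above, a bare config requests deterministic, greedy
# decoding (temperature=0.0) with a very small output budget.
greedy = SamplingConfig()

# Explicitly request stochastic sampling with a reproducible seed
# and a larger output budget.
sampled = SamplingConfig(
    temperature=0.7,   # enable sampling
    top_p=0.95,        # nucleus sampling threshold
    max_tokens=256,    # generation budget
    n=3,               # three completions per prompt (OpenAI client only)
    seed=123,          # reproducibility
)
```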
Supported Providers (OpenAI Client)
The OpenAI client automatically adapts its request parameters to different providers (see the configuration sketch after this list):
- OpenAI (api.openai.com)
- OpenRouter (openrouter.ai)
- DeepSeek (deepseek.com)
- Anthropic (anthropic.com)
- Google Gemini (google AI)
- Local vLLM servers with OpenAI API compatibility
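Switching providers is then mostly a matter of changing `base_url` and `api_key`. A sketch of an OpenRouter setup (the endpoint, environment-variable name, and model ID below are illustrative assumptions, not part of this package):

```python
import os

from vllm_efficient_client import OpenAIClient, OpenAIConfig

# Hypothetical OpenRouter configuration: only the endpoint and key change.
config = OpenAIConfig(
    api_key=os.environ["OPENROUTER_API_KEY"],   # illustrative env var
    base_url="https://openrouter.ai/api/v1/",   # OpenAI-compatible endpoint
    enable_retry=True,
    max_retries=5,
)

client = OpenAIClient("meta-llama/llama-3.2-3b-instruct", config)
```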
Project Structure
```
vllm_efficient_client/
├── src/
│   └── vllm_efficient_client/
│       ├── __init__.py        # Main exports
│       ├── base.py            # Base classes and interfaces
│       ├── vllm_client.py     # vLLM offline inference client
│       └── openai_client.py   # OpenAI API client
├── pyproject.toml             # Package configuration
├── README.md                  # This file
└── LICENSE                    # MIT License
```
Best Practices
For vLLM (Offline Inference)
- Use `scale_for_model_size()` to automatically adjust parameters
- Set `dtype="bfloat16"` for better performance on modern GPUs
- Enable `enable_prefix_caching=True` for repeated prefixes
- Use `tensor_parallel_size > 1` for large models (>30B params)
- Always call `delete_client()` to free GPU memory (a combined sketch follows this list)
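Put together, a run that follows these points looks roughly like this (a sketch reusing the names from this README, with a `try`/`finally` added so GPU memory is released even if inference fails):

```python
from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",              # good default on modern GPUs
    enable_prefix_caching=True,    # helps when prompts share prefixes
    trust_remote_code=True,
    disable_log_stats=True,
)
config.scale_for_model_size(3)     # let the config adapt to a 3B model

prompts = [
    {"messages": [{"role": "user", "content": "Hello"}], "metadata": {"id": 1}}
]

client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)
try:
    results = client.run_batch(
        prompts, SamplingConfig(temperature=0.7, max_tokens=100)
    )
finally:
    client.delete_client()         # always free GPU memory
```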
For OpenAI (API Inference)
- Always set `output_path` to enable resume on failure
- Use `enable_retry=True` for production workloads
- Include `qid`, `variant`, and `seed` in metadata for proper resume
- Monitor logs for quota exhaustion warnings
- Set appropriate `max_retries` based on your rate limits (see the sketch after this list)
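A sketch of a setup that follows these points (the metadata values, file name, and retry count are placeholders to adapt to your workload):

```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",
    enable_retry=True,     # retry transient API errors with backoff
    max_retries=5,         # tune to your provider's rate limits
)
client = OpenAIClient("gpt-4", config)

prompts = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        # qid / variant / seed let the checkpoint identify each request on resume
        "metadata": {"qid": 1, "variant": "base", "seed": 42},
    }
]

results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json",   # checkpoint file; rerun the script to resume
)
```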
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this package in your research, please cite:
```bibtex
@software{vllm_efficient_client,
  title  = {vLLM Efficient Client: Unified Interface for LLM Inference},
  author = {Your Name},
  year   = {2025},
  url    = {https://github.com/yourusername/vllm-efficient-client}
}
```
Acknowledgments
- Built on top of vLLM
- Uses OpenAI Python SDK
- Inspired by the need for a unified LLM inference interface
Support
- 📧 Email: your.email@example.com
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions