vLLM Efficient Client

A unified interface for efficient LLM inference with vLLM and OpenAI-compatible APIs
A unified Python package for efficient Large Language Model (LLM) inference, supporting both:
- vLLM offline inference - High-throughput batch inference on local GPUs
- OpenAI-compatible APIs - Remote inference with automatic retry and resume
Features
vLLM Client (Offline Inference)
- 🚀 High-throughput batch inference using vLLM
- 🎯 Direct token feeding for optimal performance
- 💾 Automatic GPU memory management
- 🔄 Model switching without restarting
- 📝 HuggingFace chat template support (illustrated, together with direct token feeding, in the sketch below)
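The last two items are easiest to see with a small illustrative sketch: a HuggingFace tokenizer's chat template turns structured messages into a prompt, and the resulting token IDs can be handed to vLLM directly instead of being re-tokenized inside the engine. This is an illustration of the idea using the `transformers` library, not the client's internal code.

```python
# Illustrative only: roughly what "chat template support" and
# "direct token feeding" mean, not this package's implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

messages = [{"role": "user", "content": "What is the capital of France?"}]

# The model's own chat template converts structured messages into a prompt string...
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# ...and tokenizing the same template yields token IDs that can be fed
# to the engine directly, skipping a second tokenization pass.
prompt_token_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True
)
print(len(prompt_token_ids), "prompt tokens")
```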
OpenAI Client (API Inference)
- 🌐 Works with any OpenAI-compatible API (OpenAI, OpenRouter, DeepSeek, Anthropic, etc.)
- 🔁 Automatic retry with exponential backoff
- 💾 Resume from checkpoint (auto-saves progress)
- 🛡️ Graceful quota exhaustion handling
- 🎯 Provider-specific parameter optimization
Installation
Basic Installation
```bash
pip install vllm-efficient-client
```
With vLLM Support (for offline inference)
```bash
pip install vllm-efficient-client[vllm]
```
With OpenAI Support (for API inference)
```bash
pip install vllm-efficient-client[openai]
```
With All Features
```bash
pip install vllm-efficient-client[all]
```
Development Installation
```bash
git clone https://github.com/yourusername/vllm-efficient-client.git
cd vllm-efficient-client
pip install -e .[dev]
```
Quick Start
Using vLLM Client (Offline Inference)
```python
from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

# Configure vLLM resources
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    disable_log_stats=True,
)

# Optional: auto-scale resources for the model size
config.scale_for_model_size(3)  # for a 3B-parameter model

# Initialize the client
client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1, "category": "geography"},
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing."}
        ],
        "metadata": {"id": 2, "category": "science"},
    },
]

# Run inference
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100),
)

# Each result includes its metadata plus the generated output
for result in results:
    print(f"ID: {result['id']}")
    print(f"Output: {result['output']}")
    print()

# Clean up and free GPU memory
client.delete_client()
```
Using OpenAI Client (API Inference)
```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

# Configure the API client
config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",  # or OpenRouter, DeepSeek, etc.
    enable_retry=True,
    max_retries=5,
)

# Initialize the client
client = OpenAIClient("gpt-4", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"qid": 1, "variant": "base", "seed": 42},
    }
]

# Run inference with auto-save (enables resume)
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json",  # auto-saves progress for resume on failure
)

# If the run is interrupted, just run it again: it resumes from the checkpoint.
```
Advanced Usage
Switching Models (vLLM)
client = VLLMClient("model-1", config)
# ... do some work ...
# Switch to a different model
client.reset_client_to_another_model("model-2")
# ... continue with new model ...
Multiple Completions (OpenAI)
```python
# Generate 5 different responses per prompt
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.8, max_tokens=100, n=5),
    output_path="results.json",
)

# Each result contains a list of 5 outputs
for result in results:
    print(f"Generated {len(result['output'])} responses")
```
Custom Resource Scaling (vLLM)
```python
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=256,
    max_num_batched_tokens=131072,
    block_size=32,
    tensor_parallel_size=2,  # use 2 GPUs
    dtype="float16",
    trust_remote_code=False,
    disable_log_stats=True,
    enable_prefix_caching=True,
)

# Automatically adjust the settings for a 70B model
config.scale_for_model_size(70)
```
Configuration Options
VLLMResourceConfig
| Parameter | Type | Description |
|---|---|---|
| `gpu_memory_utilization` | float | Fraction of GPU memory to use (0.0-1.0) |
| `max_model_len` | int | Maximum sequence length |
| `max_num_seqs` | int | Maximum sequences per iteration |
| `max_num_batched_tokens` | int | Maximum tokens in a batch |
| `block_size` | int | Token block size for paged attention |
| `tensor_parallel_size` | int | Number of GPUs for tensor parallelism |
| `dtype` | str | Data type ("float16", "bfloat16", "float32") |
| `trust_remote_code` | bool | Trust remote code from the model hub |
| `enable_prefix_caching` | bool | Enable KV-cache prefix caching |
OpenAIConfig
| Parameter | Type | Description |
|---|---|---|
| `api_key` | str | API key for authentication |
| `base_url` | str | Base URL of the API endpoint |
| `enable_retry` | bool | Enable automatic retry on errors |
| `max_retries` | int | Maximum retry attempts |
| `initial_retry_delay` | float | Initial delay for exponential backoff (seconds) |
| `max_retry_delay` | float | Maximum delay for exponential backoff (seconds) |
SamplingConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| `temperature` | float | 0.0 | Sampling temperature (0.0 = deterministic) |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `max_tokens` | int | 2 | Maximum tokens to generate |
| `n` | int | 1 | Number of completions (OpenAI only) |
| `seed` | int | None | Random seed for reproducibility |
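As a quick illustration of the table (a sketch, assuming the defaults listed above):

```python
from vllm_efficient_client import SamplingConfig

# With the defaults above, a bare config requests deterministic, greedy
# decoding (temperature=0.0) with a very small output budget.
greedy = SamplingConfig()

# Explicitly request stochastic sampling with a reproducible seed
# and a larger output budget.
sampled = SamplingConfig(
    temperature=0.7,   # enable sampling
    top_p=0.95,        # nucleus sampling threshold
    max_tokens=256,    # generation budget
    n=3,               # three completions per prompt (OpenAI client only)
    seed=123,          # reproducibility
)
```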
Supported Providers (OpenAI Client)
The OpenAI client automatically adapts its request parameters to different providers (see the configuration sketch after this list):
- OpenAI (api.openai.com)
- OpenRouter (openrouter.ai)
- DeepSeek (deepseek.com)
- Anthropic (anthropic.com)
- Google Gemini (google AI)
- Local vLLM servers with OpenAI API compatibility
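Switching providers is then mostly a matter of changing `base_url` and `api_key`. A sketch of an OpenRouter setup (the endpoint, environment-variable name, and model ID below are illustrative assumptions, not part of this package):

```python
import os

from vllm_efficient_client import OpenAIClient, OpenAIConfig

# Hypothetical OpenRouter configuration: only the endpoint and key change.
config = OpenAIConfig(
    api_key=os.environ["OPENROUTER_API_KEY"],   # illustrative env var
    base_url="https://openrouter.ai/api/v1/",   # OpenAI-compatible endpoint
    enable_retry=True,
    max_retries=5,
)

client = OpenAIClient("meta-llama/llama-3.2-3b-instruct", config)
```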
Project Structure
```
vllm_efficient_client/
├── src/
│   └── vllm_efficient_client/
│       ├── __init__.py        # Main exports
│       ├── base.py            # Base classes and interfaces
│       ├── vllm_client.py     # vLLM offline inference client
│       └── openai_client.py   # OpenAI API client
├── pyproject.toml             # Package configuration
├── README.md                  # This file
└── LICENSE                    # MIT License
```
Best Practices
For vLLM (Offline Inference)
- Use `scale_for_model_size()` to automatically adjust parameters
- Set `dtype="bfloat16"` for better performance on modern GPUs
- Enable `enable_prefix_caching=True` for repeated prefixes
- Use `tensor_parallel_size > 1` for large models (>30B params)
- Always call `delete_client()` to free GPU memory (a combined sketch follows this list)
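Put together, a run that follows these points looks roughly like this (a sketch reusing the names from this README, with a `try`/`finally` added so GPU memory is released even if inference fails):

```python
from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",              # good default on modern GPUs
    enable_prefix_caching=True,    # helps when prompts share prefixes
    trust_remote_code=True,
    disable_log_stats=True,
)
config.scale_for_model_size(3)     # let the config adapt to a 3B model

prompts = [
    {"messages": [{"role": "user", "content": "Hello"}], "metadata": {"id": 1}}
]

client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)
try:
    results = client.run_batch(
        prompts, SamplingConfig(temperature=0.7, max_tokens=100)
    )
finally:
    client.delete_client()         # always free GPU memory
```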
For OpenAI (API Inference)
- Always set `output_path` to enable resume on failure
- Use `enable_retry=True` for production workloads
- Include `qid`, `variant`, and `seed` in metadata for proper resume
- Monitor logs for quota exhaustion warnings
- Set appropriate `max_retries` based on your rate limits (see the sketch after this list)
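A sketch of a setup that follows these points (the metadata values, file name, and retry count are placeholders to adapt to your workload):

```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",
    enable_retry=True,     # retry transient API errors with backoff
    max_retries=5,         # tune to your provider's rate limits
)
client = OpenAIClient("gpt-4", config)

prompts = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        # qid / variant / seed let the checkpoint identify each request on resume
        "metadata": {"qid": 1, "variant": "base", "seed": 42},
    }
]

results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json",   # checkpoint file; rerun the script to resume
)
```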
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this package in your research, please cite:
```bibtex
@software{vllm_efficient_client,
  title  = {vLLM Efficient Client: Unified Interface for LLM Inference},
  author = {Your Name},
  year   = {2025},
  url    = {https://github.com/yourusername/vllm-efficient-client}
}
```
Acknowledgments
- Built on top of vLLM
- Uses OpenAI Python SDK
- Inspired by the need for a unified LLM inference interface
Support
- 📧 Email: your.email@example.com
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions