vLLM Efficient Client

A unified Python package for efficient Large Language Model (LLM) inference, supporting both:

  • vLLM offline inference - High-throughput batch inference on local GPUs
  • OpenAI-compatible APIs - Remote inference with automatic retry and resume

Features

vLLM Client (Offline Inference)

  • 🚀 High-throughput batch inference using vLLM
  • 🎯 Direct token feeding for optimal performance
  • 💾 Automatic GPU memory management
  • 🔄 Model switching without restarting
  • 📝 HuggingFace chat template support

OpenAI Client (API Inference)

  • 🌐 Works with any OpenAI-compatible API (OpenAI, OpenRouter, DeepSeek, Anthropic, etc.)
  • 🔁 Automatic retry with exponential backoff
  • 💾 Resume from checkpoint (auto-saves progress)
  • 🛡️ Graceful quota exhaustion handling
  • 🎯 Provider-specific parameter optimization

Installation

Basic Installation

pip install vllm-efficient-client

With vLLM Support (for offline inference)

pip install vllm-efficient-client[vllm]

With OpenAI Support (for API inference)

pip install vllm-efficient-client[openai]

With All Features

pip install vllm-efficient-client[all]

Development Installation

git clone https://github.com/yourusername/vllm-efficient-client.git
cd vllm-efficient-client
pip install -e .[dev]

Quick Start

Using vLLM Client (Offline Inference)

from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

# Configure vLLM resources
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    disable_log_stats=True,
)

# Optional: Auto-scale for model size
config.scale_for_model_size(3)  # For 3B parameter model

# Initialize client
client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1, "category": "geography"}
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing."}
        ],
        "metadata": {"id": 2, "category": "science"}
    }
]

# Run inference
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100)
)

# Results include metadata + generated output
for result in results:
    print(f"ID: {result['id']}")
    print(f"Output: {result['output']}")
    print()

# Clean up
client.delete_client()

Using OpenAI Client (API Inference)

from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

# Configure API client
config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",  # or OpenRouter, DeepSeek, etc.
    enable_retry=True,
    max_retries=5,
)

# Initialize client
client = OpenAIClient("gpt-4", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"qid": 1, "variant": "base", "seed": 42}
    }
]

# Run inference with auto-save (enables resume)
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json"  # Auto-saves for resume on failure
)

# If interrupted, just run again - it will resume from checkpoint!

Advanced Usage

Switching Models (vLLM)

client = VLLMClient("model-1", config)
# ... do some work ...

# Switch to a different model
client.reset_client_to_another_model("model-2")
# ... continue with new model ...

Multiple Completions (OpenAI)

# Generate 5 different responses per prompt
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.8, max_tokens=100, n=5),
    output_path="results.json"
)

# Each result contains a list of 5 outputs
for result in results:
    print(f"Generated {len(result['output'])} responses")

Custom Resource Scaling (vLLM)

config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=256,
    max_num_batched_tokens=131072,
    block_size=32,
    tensor_parallel_size=2,  # Use 2 GPUs
    dtype="float16",
    trust_remote_code=False,
    disable_log_stats=True,
    enable_prefix_caching=True,
)

# Automatically adjust for a 70B model
config.scale_for_model_size(70)

Configuration Options

VLLMResourceConfig

Parameter | Type | Description
gpu_memory_utilization | float | Fraction of GPU memory to use (0.0-1.0)
max_model_len | int | Maximum sequence length
max_num_seqs | int | Maximum sequences per iteration
max_num_batched_tokens | int | Maximum tokens in a batch
block_size | int | Token block size for paged attention
tensor_parallel_size | int | Number of GPUs for tensor parallelism
dtype | str | Data type ("float16", "bfloat16", "float32")
trust_remote_code | bool | Trust remote code from the model hub
enable_prefix_caching | bool | Enable KV cache prefix caching

OpenAIConfig

Parameter | Type | Description
api_key | str | API key for authentication
base_url | str | Base URL for the API endpoint
enable_retry | bool | Enable automatic retry on errors
max_retries | int | Maximum retry attempts
initial_retry_delay | float | Initial delay for exponential backoff (seconds)
max_retry_delay | float | Maximum delay for exponential backoff (seconds)
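
The two delay parameters bound the retry schedule. The client's exact schedule is internal to the package, but a standard exponential backoff capped at max_retry_delay behaves like the sketch below (the concrete values are illustrative only):

# Illustrative backoff schedule, not the package's internal code
initial_retry_delay = 1.0   # seconds
max_retry_delay = 60.0      # seconds
max_retries = 5

for attempt in range(max_retries):
    delay = min(max_retry_delay, initial_retry_delay * (2 ** attempt))
    print(f"retry {attempt + 1}: wait {delay:.0f}s")
# prints 1s, 2s, 4s, 8s, 16s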

SamplingConfig

Parameter | Type | Default | Description
temperature | float | 0.0 | Sampling temperature (0.0 = deterministic)
top_p | float | 1.0 | Nucleus sampling threshold
max_tokens | int | 2 | Maximum tokens to generate
n | int | 1 | Number of completions (OpenAI client only)
seed | int | None | Random seed for reproducibility
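
For reference, a sketch constructing a SamplingConfig that sets every field in the table (the values are illustrative):

from vllm_efficient_client import SamplingConfig

sampling = SamplingConfig(
    temperature=0.7,   # > 0.0 enables stochastic sampling
    top_p=0.95,        # nucleus sampling threshold
    max_tokens=256,    # cap on generated tokens per completion
    n=3,               # number of completions per prompt (OpenAI client only)
    seed=42,           # reproducibility where the backend supports it
)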

Supported Providers (OpenAI Client)

The OpenAI client automatically adapts to different providers:

  • OpenAI (api.openai.com)
  • OpenRouter (openrouter.ai)
  • DeepSeek (deepseek.com)
  • Anthropic (anthropic.com)
  • Google Gemini (Google AI)
  • Local vLLM servers with OpenAI API compatibility
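
Switching providers typically only means changing api_key, base_url, and the model name. A sketch for OpenRouter, assuming the key is stored in an OPENROUTER_API_KEY environment variable and using OpenRouter's model naming scheme:

import os

from vllm_efficient_client import OpenAIClient, OpenAIConfig

config = OpenAIConfig(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1/",  # OpenRouter's OpenAI-compatible endpoint
    enable_retry=True,
    max_retries=5,
)
client = OpenAIClient("meta-llama/llama-3.2-3b-instruct", config)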

Project Structure

vllm_efficient_client/
├── src/
│   └── vllm_efficient_client/
│       ├── __init__.py          # Main exports
│       ├── base.py              # Base classes and interfaces
│       ├── vllm_client.py       # vLLM offline inference client
│       └── openai_client.py     # OpenAI API client
├── pyproject.toml               # Package configuration
├── README.md                    # This file
└── LICENSE                      # MIT License

Best Practices

For vLLM (Offline Inference)

  1. Use scale_for_model_size() to automatically adjust parameters
  2. Set dtype="bfloat16" for better performance on modern GPUs
  3. Set enable_prefix_caching=True when prompts share repeated prefixes
  4. Use tensor_parallel_size > 1 for large models (>30B params)
  5. Always call delete_client() to free GPU memory (a sketch combining these points follows below)
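
A minimal sketch combining these recommendations; the parameter names come from the configuration table above, while the 70B model name and concrete values are illustrative only:

from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=4,        # >30B models benefit from tensor parallelism
    dtype="bfloat16",              # good default on modern GPUs
    trust_remote_code=True,
    disable_log_stats=True,
    enable_prefix_caching=True,    # helps when prompts share a long prefix
)
config.scale_for_model_size(70)    # auto-adjust for a 70B-parameter model

client = VLLMClient("meta-llama/Llama-3.1-70B-Instruct", config)
try:
    prompts = [
        {
            "messages": [{"role": "user", "content": "Summarize the theory of relativity."}],
            "metadata": {"id": 1},
        }
    ]
    results = client.run_batch(prompts, SamplingConfig(temperature=0.7, max_tokens=100))
finally:
    client.delete_client()         # always free GPU memory, even on errors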

For OpenAI (API Inference)

  1. Always set output_path to enable resume on failure
  2. Use enable_retry=True for production workloads
  3. Include qid, variant, and seed in metadata for proper resume
  4. Monitor logs for quota exhaustion warnings
  5. Set appropriate max_retries based on your rate limits (see the sketch after this list)
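
A minimal sketch that applies these practices together, assuming the API key lives in an OPENAI_API_KEY environment variable (the metadata keys mirror the quick-start example):

import os

from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

config = OpenAIConfig(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1/",
    enable_retry=True,       # retry transient errors with exponential backoff
    max_retries=5,           # tune to your provider's rate limits
)
client = OpenAIClient("gpt-4", config)

prompts = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        # qid/variant/seed let the client match finished prompts when resuming
        "metadata": {"qid": 1, "variant": "base", "seed": 42},
    }
]

results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json",  # checkpoint file; rerun the same call to resume
)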

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this package in your research, please cite:

@software{vllm_efficient_client,
  title = {vLLM Efficient Client: Unified Interface for LLM Inference},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/yourusername/vllm-efficient-client}
}
