A minimal API server for local HuggingFace or vLLM LLMs
Project description
Minimal LLM server for API calls
The simplest possible Python code for running local LLM inference as a REST API server (with a simple client).
This package lets you start an inference server for Hugging Face–compatible models (like LLaMA, Qwen, GPT-OSS, etc.) on your own computer or server, and make it accessible to applications via HTTP.
NEW: Now supports both standard HuggingFace Transformers and high-performance vLLM backends!
See the Tutorial page for extended info.
Backend Options
This package now supports two inference backends:
1. HuggingFace Transformers (Standard)
- ✓ Widely compatible
- ✓ CPU support available
- ✓ Smaller installation size
- ✓ Good for development and testing
2. vLLM Optimized (High-Performance) 🚀
- ✓ Up to 24x faster throughput than standard transformers
- ✓ Lower latency for single requests
- ✓ Better GPU memory utilization with PagedAttention
- ✓ Automatic multi-GPU support with tensor parallelism
- ✓ Continuous batching for higher throughput
- ⚠ Requires CUDA GPUs (no CPU support)
- ⚠ Best for production deployments
Installation
Option 1: Installation via pip
Standard Installation (HuggingFace):
pip install min-llm-server-client
With vLLM Support:
pip install "min-llm-server-client[vllm]"
Option 2: Installation From Source:
git clone https://github.com/afshinsadeghi/min_llm_server_client.git
cd min_llm_server_client
# Standard installation
pip install .
# Or with vLLM support
pip install ".[vllm]"
Usage
Starting the Server
Standard Server (HuggingFace Transformers)
min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:0
vLLM Optimized Server (High-Performance) 🚀
min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device auto
Command Options:
- --model_name: Hugging Face model name or local path. Suggested models:
  - openai/gpt-oss-20b
  - openai/gpt-oss-120b
  - meta-llama/Llama-3.3-70B-Instruct
  - meta-llama/Llama-3.1-8B
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2-VL-72B-Instruct-AWQ
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
  It can also use a local model on your device via /path/to/model.
- --max_new_tokens: maximum number of tokens to generate in a response.
- --device: Device selection:
  - auto - auto-detect available GPUs (default)
  - cpu - force CPU (HuggingFace backend only; vLLM requires a GPU)
  - cuda:0, cuda:1, or a list of GPU cores such as cuda:2,3,4,5,6,7
If the device parameter is omitted or set to auto, the server detects and uses all available GPU cores; if no GPU is available, it falls back to CPU.
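The auto device selection described above roughly amounts to the following check (an illustrative sketch using torch, not the package's actual code):
import torch

# Illustrative only: use every visible CUDA device, otherwise fall back to CPU.
if torch.cuda.is_available():
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
else:
    devices = ["cpu"]
print(devices)  # e.g. ['cuda:0', 'cuda:1'] or ['cpu']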
Example run:
Standard server with default settings (auto GPU detection):
min-llm-server
Standard server on a specific GPU (e.g., GPU 0):
min-llm-server --model_name openai/gpt-oss-20b --device cuda:0
Standard server on a specific GPU (e.g., GPU 1):
min-llm-server --model_name openai/gpt-oss-120b --device cuda:1
Standard server forced on CPU:
min-llm-server --model_name openai/gpt-oss-20b --max_new_tokens 50 --device cpu
vLLM server with auto GPU detection (uses all available GPUs):
min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct
vLLM server on a specific GPU (e.g., GPU 2):
min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:2
Standard server on several GPUs:
min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:2,3,4,5,6,7
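The server entry points can also be launched from a script. A minimal sketch using the documented CLI flags (the model name and wait time are placeholders; adjust them for your setup):
import subprocess
import time

# Start the standard server in the background, using the CLI flags documented above.
server = subprocess.Popen([
    "min-llm-server",
    "--model_name", "openai/gpt-oss-20b",
    "--max_new_tokens", "100",
    "--device", "auto",
])

# Give the model time to load before sending the first query.
time.sleep(60)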
Sending Queries
Once the server is running (default: http://127.0.0.1:5000/llm/q), you can query it with curl or Python.
Curl:
curl -X POST http://127.0.0.1:5000/llm/q \
-H "Content-Type: application/json" \
-d '{"query": "What is Earth?", "key": "key1"}'
Python client:
from min_llm_server_client.local_llm_inference_api_client import send_query
response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)
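If you prefer plain HTTP from Python, the same request can be sent with requests, mirroring the endpoint and JSON payload from the curl example above:
import requests

# POST a query to the default endpoint with the same payload as the curl example.
resp = requests.post(
    "http://127.0.0.1:5000/llm/q",
    json={"query": "What is Earth?", "key": "key1"},
)
print(resp.status_code, resp.text)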
Performance Comparison
LLaMA 3.1 8B - Standard HuggingFace Backend:
- Intel CPU → ~30 seconds per request, ~2.4 GB RAM
- A100 GPU → <1 second per request, ~34 GB GPU memory, ~4.8 GB CPU RAM
LLaMA 3.1 8B - vLLM Optimized Backend:
- A100 GPU → ~0.1-0.3 seconds per request (3-10x faster)
- Better memory efficiency with PagedAttention
- Supports higher concurrent request throughput
Performance Tips:
- Use vLLM for production deployments with high request volumes
- Use standard backend for development, testing, or CPU-only environments
- vLLM automatically utilizes multiple GPUs with tensor parallelism
- Both backends support the same API, making it easy to switch
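Because both backends expose the same API, you can time the same client call against either server to compare latency on your own hardware. A small sketch using the bundled client (arguments as in the example above):
import time
from min_llm_server_client.local_llm_inference_api_client import send_query

# Time one request against whichever server (standard or vLLM) is currently running.
start = time.perf_counter()
response = send_query("What is the capital of France?", user="user1", key="key1")
print(f"latency: {time.perf_counter() - start:.2f}s")
print(response)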
Project Structure
min_llm_server_client/
├── src/
│ ├── local_llm_inference_api_client.py
│ ├── local_llm_inference_server_api.py
│ └── ...
└── README.md
License
This project is open source under the Apache 2.0 License.
Author
Afshin Sadeghi
🔗 GitHub
🔗 Google Scholar
🔗 LinkedIn
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file min_llm_server_client-0.4.0.tar.gz.
File metadata
- Download URL: min_llm_server_client-0.4.0.tar.gz
- Upload date:
- Size: 20.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 61b1145c534d1ca40bf73b79d522f1baea20a55524259d3d1a824260fbf1b18b |
| MD5 | d6b866d4b157a4e9c95754bd2e12bd33 |
| BLAKE2b-256 | 5c6ad7c0134d38a80c5ea0d00077a37cae2cd04b47e7d9cb9f5897327786b28d |
File details
Details for the file min_llm_server_client-0.4.0-py3-none-any.whl.
File metadata
- Download URL: min_llm_server_client-0.4.0-py3-none-any.whl
- Upload date:
- Size: 19.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b6b560f414994e951c8c8d2309672fa66971e75a8dba686208bef7caee0b562 |
| MD5 | 9d4f51d59d439f7717106e825284cdb5 |
| BLAKE2b-256 | a835daf7c8be05457c2c244eea17bf7daf4a459dc2733adf8af796d3b78d57da |