A minimal API server for local HuggingFace or vLLM LLMs

Minimal LLM Server, for API calls

The simplest possible Python code for running local LLM inference as a REST API server and a simple client.

This package lets you start an inference server for Hugging Face–compatible models (like LLaMA, Qwen, GPT-OSS, etc.) on your own computer or server, and make it accessible to applications via HTTP. It supports both standard HuggingFace Transformers and high-performance vLLM backends.

See the Tutorial page for extended information.

Backend Options

This package now supports two inference backends:

1. HuggingFace Transformers (Standard)

  • ✓ Widely compatible
  • ✓ CPU support available
  • ✓ Smaller installation size
  • ✓ Good for development and testing

2. vLLM Optimized (High-Performance)

  • ✓ Up to 24x faster throughput than standard transformers
  • ✓ Lower latency for single requests
  • ✓ Better GPU memory utilization with PagedAttention
  • ✓ Automatic multi-GPU support with tensor parallelism
  • ✓ Continuous batching for higher throughput
  • ⚠ Requires CUDA GPUs (no CPU support)
  • ⚠ Best for production deployments

Installation by pip

Prerequisite

uv venv --python 3.12
source .venv/bin/activate

Standard lightweight installation (HuggingFace):

uv pip install min-llm-server-client

With vLLM Support:

uv pip install "min-llm-server-client[vllm]"

Installation From Source:

git clone https://github.com/afshinsadeghi/min_llm_server_client.git
cd min_llm_server_client

# Standard installation
uv pip install .

# Or with vLLM support
uv pip install ".[vllm]"

Usage

Starting the Server

Standard HuggingFace Transformers Server

uv run min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:0

vLLM Optimized Inference Server

uv run min-llm-server-vllm --model_name openai/gpt-oss-20b --max_new_tokens 100 --device cuda:2

Command Options:

  • --model_name : Hugging Face model name, or a local path on your device such as /path/to/model. Suggested models:

    • openai/gpt-oss-20b
    • openai/gpt-oss-120b
    • meta-llama/Llama-3.3-70B-Instruct
    • meta-llama/Llama-3.1-8B
    • Qwen/Qwen3-0.6B
    • Qwen/Qwen2-VL-72B-Instruct-AWQ
    • deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

  • --max_new_tokens : Maximum number of tokens to generate per response.

  • --device : Device selection

    • auto - Auto-detect available GPUs (default)
    • cpu - Force CPU (HuggingFace only; vLLM requires a GPU)
    • cuda:0, cuda:1, or a list of GPU indices, e.g. cuda:2,3,4,5,6,7

  • --max_model_len (vLLM only) : Maximum model context length. If not specified, it is auto-detected from the model config. Example: 8192

If --device is omitted or set to auto, the server detects the available GPUs and uses them. If no GPU is available, it uses the CPU instead.
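To make the accepted --device strings concrete, here is a minimal sketch of how such a value could be split into a device kind and a list of GPU indices. The function name parse_device and its return shape are hypothetical and chosen for illustration; this is not the package's own parsing code, which may behave differently.

```python
def parse_device(device="auto"):
    """Split a --device style string into (kind, gpu_indices).

    Hypothetical helper for illustration only; not taken from the package.
    """
    spec = device.strip().lower()
    if spec in ("auto", "cpu"):
        # No explicit GPU indices for auto-detection or forced CPU.
        return spec, []
    if spec.startswith("cuda:"):
        # "cuda:2,3,4" -> ["2", "3", "4"]
        parts = [p.strip() for p in spec[len("cuda:"):].split(",")]
        if all(p.isdigit() for p in parts):
            return "cuda", [int(p) for p in parts]
    raise ValueError(f"unrecognized device string: {device!r}")
```

For example, parse_device("cuda:2,3,4,5,6,7") yields the kind "cuda" with indices 2 through 7, while parse_device("cpu") yields "cpu" with an empty index list.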

Example runs:

Standard server with default settings (auto GPU detection):

min-llm-server 

Standard server on a specific GPU (e.g., GPU 0):

min-llm-server --model_name openai/gpt-oss-20b --device cuda:0

Standard server on a specific GPU (e.g., GPU 1):

min-llm-server --model_name openai/gpt-oss-120b --device cuda:1

Standard server forced on CPU:

min-llm-server --model_name openai/gpt-oss-20b --max_new_tokens 50 --device cpu

vLLM server with auto GPU detection (uses all available GPUs):

min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct

vLLM server on a specific GPU (e.g., GPU 2):

min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:2

Standard server on several GPUs:

min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:2,3,4,5,6,7

Sending Queries

Once the server is running (default: http://127.0.0.1:5000/llm/q), you can query it with curl or Python.

Curl:

curl -X POST http://127.0.0.1:5000/llm/q \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Earth?", "key": "key1"}'

Python client:

from min_llm_server_client.local_llm_inference_api_client import send_query

response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)
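If you prefer not to import the bundled client, the curl call above can also be reproduced with the Python standard library alone. The sketch below assumes only what the curl example shows: a POST to the default URL with a JSON body containing "query" and "key". The helper names build_request and post_query are hypothetical, and the response is returned as raw text because its format is not described here.

```python
import json
import urllib.request

# Default endpoint as stated in the text above.
DEFAULT_URL = "http://127.0.0.1:5000/llm/q"


def build_request(query, key, url=DEFAULT_URL):
    """Build a POST request mirroring the curl example's header and JSON body."""
    body = json.dumps({"query": query, "key": key}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_query(query, key, url=DEFAULT_URL, timeout=120):
    """Send the query and return the raw response text (format not assumed)."""
    request = build_request(query, key, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, with a server already running:
#   print(post_query("What is Earth?", "key1"))
```

The timeout default of 120 seconds is an arbitrary choice to leave room for slow CPU inference; adjust it to your setup.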

Performance Comparison

LLaMA 3.1 8B - Standard HuggingFace Backend:

  • Intel CPU → ~30 seconds per request, ~2.4 GB RAM
  • A100 GPU → <1 second per request, ~34 GB GPU memory, ~4.8 GB CPU RAM

LLaMA 3.1 8B - vLLM Optimized Backend:

  • A100 GPU → ~0.1-0.3 seconds per request (3-10x faster)
  • Better memory efficiency with PagedAttention
  • Supports higher concurrent request throughput

Performance Tips:

  • Use vLLM for production deployments with high request volumes
  • Use standard backend for development, testing, or CPU-only environments
  • Both backends automatically use multiple GPUs when available; vLLM does so with tensor parallelism
  • Both backends support the same API, making it easy to switch
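Since both backends expose the same API, one way to check the figures above on your own hardware is to time the same queries against each server in turn. The following is a rough sketch, not a rigorous benchmark: time_requests is a hypothetical helper, and send stands for any callable that submits one query, such as a wrapper around the send_query client shown earlier.

```python
import statistics
import time


def time_requests(send, queries):
    """Call send(query) for each query and summarize wall-clock latency.

    Hypothetical helper for illustration; `send` is supplied by the caller.
    """
    if not queries:
        raise ValueError("queries must not be empty")
    latencies = []
    for query in queries:
        start = time.perf_counter()
        send(query)  # response is discarded; only the elapsed time is kept
        latencies.append(time.perf_counter() - start)
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }
```

Run it once with the standard server started and once with the vLLM server started, using the same model and queries, and compare the two summaries. Sequential timing like this measures single-request latency only; it does not capture the throughput benefit of continuous batching under concurrent load.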

Project Structure

min_llm_server_client/
├── src/
│   ├── local_llm_inference_api_client.py
│   ├── local_llm_inference_server_api.py
│   └── ...
└── README.md

License

This project is open source under the Apache 2.0 License.


Author

Afshin Sadeghi
🔗 GitHub
🔗 Google Scholar
🔗 LinkedIn
