A minimal Flask API server for local Hugging Face LLMs
LLM REST API
The simplest possible Python code for running local LLM inference as a REST API server (with a simple client).
This package lets you start an inference server for Hugging Face–compatible models (e.g. LLaMA, Qwen, GPT-OSS) on your own computer or server and make it accessible to applications via HTTP.
See the Tutorial page for extended information.
Installation
From PyPI (recommended):
```bash
pip install min-llm-server-client
```
From source:
```bash
git clone https://github.com/afshinsadeghi/min_llm_server_client.git
cd min_llm_server_client
pip install .
```
Usage
Starting the Server
After installation, you can launch the server with the provided CLI entrypoint:
```bash
min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:0
```
Options:
- `--model_name`: Hugging Face model name or local path (e.g. `openai/gpt-oss-20b`, `openai/gpt-oss-120b`, `meta-llama/Llama-3.3-70B-Instruct`, or a local path such as `model/path/to/model`).
- `--max_new_tokens`: maximum number of new tokens to generate per response.
- `--device`: `cpu`, `cuda:0`, `cuda:1`, etc.
Example (CPU run):
```bash
min-llm-server --model_name openai/gpt-oss-20b --max_new_tokens 50 --device cpu
```
Sending Queries
Once the server is running (default: http://127.0.0.1:5000/llm/q), you can query it with curl or Python.
Curl:
```bash
curl -X POST http://127.0.0.1:5000/llm/q \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Earth?", "key": "key1"}'
```
Python client:
```python
from min_llm_server_client.local_llm_inference_api_client import send_query

response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)
```
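For batch use, the same client call can be wrapped in a loop. A sketch, assuming `send_query` raises an ordinary Python exception on network or server errors (the exact exception type is not documented here):

```python
from min_llm_server_client.local_llm_inference_api_client import send_query

questions = [
    "What is the capital of France?",
    "Name three moons of Jupiter.",
]

for q in questions:
    try:
        # user/key are the same illustrative credentials used above.
        answer = send_query(q, user="user1", key="key1")
        print(f"Q: {q}\nA: {answer}\n")
    except Exception as exc:  # e.g. server down or unreachable
        print(f"Query failed for {q!r}: {exc}")
```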
Performance notes
- Running LLaMA 3.1 8B:
  - Intel CPU → ~30 seconds per request, ~2.4 GB RAM
  - A100 GPU → <1 second per request, ~34 GB GPU memory, ~4.8 GB CPU RAM
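To see how your own hardware compares, you can time a single request with the client shown above. A rough sketch; it measures end-to-end latency, including network overhead:

```python
import time

from min_llm_server_client.local_llm_inference_api_client import send_query

start = time.perf_counter()
send_query("What is Earth?", user="user1", key="key1")
print(f"Request took {time.perf_counter() - start:.1f} s")
```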
Project Structure
```text
min_llm_server_client/
├── src/
│   ├── local_llm_inference_api_client.py
│   ├── local_llm_inference_server_api.py
│   └── ...
└── README.md
```
License
This project is open source under the Apache 2.0 License.
Author
Afshin Sadeghi
📧 sadeghi.afshin@gmail.com
🔗 GitHub
🔗 Google Scholar
🔗 LinkedIn
Download files
- Source distribution: min_llm_server_client-0.3.7.1.tar.gz
- Built distribution: min_llm_server_client-0.3.7.1-py3-none-any.whl
File details
Details for the file min_llm_server_client-0.3.7.1.tar.gz.
File metadata
- Download URL: min_llm_server_client-0.3.7.1.tar.gz
- Upload date:
- Size: 9.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f899dac5429e5d9e2125243501123d2f94a16980e84548206679443392e2b931` |
| MD5 | `4fdb6be2649b894faba765bd1c5a14cd` |
| BLAKE2b-256 | `1204189fbfa5e7de95de0300738b2fb8ba368ddfaef7831d7b60640a6ed42c38` |
File details
Details for the file min_llm_server_client-0.3.7.1-py3-none-any.whl.
File metadata
- Download URL: min_llm_server_client-0.3.7.1-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `3563be75867f08312ccf63d4907b231738f67d0bf2b3addc5ef0d83e28ad84de` |
| MD5 | `128570f5c61e7e120f1b8acbd9427a8a` |
| BLAKE2b-256 | `fbd585744bff3ed35596887fc9d39b2fb40b6dd8e13e773716bdd85afed73fae` |