A minimal Flask API server for local HuggingFace LLMs

LLM REST API

The simplest possible Python code for running local LLM inference as a REST API server (with a simple client).

This package lets you start an inference server for Hugging Face–compatible models (like LLaMA, Qwen, GPT-OSS, etc.) on your own computer or server, and make it accessible to applications via HTTP.

See the Tutorial page for extended information.

Installation

From PyPI (recommended):

pip install min-llm-server-client

From source:

git clone https://github.com/afshinsadeghi/min_llm_server_client.git
cd min_llm_server_client
pip install .

Usage

Starting the Server

After installation, you can launch the server with the provided CLI entrypoint:

min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:0

Options:

  • --model_name : Hugging Face model name or local path (e.g. openai/gpt-oss-20b, openai/gpt-oss-120b, meta-llama/Llama-3.3-70B-Instruct, or a local model directory such as /path/to/model).
  • --max_new_tokens : maximum number of tokens to generate per response.
  • --device : cpu, cuda:0, cuda:1, etc.

Example (CPU run):

min-llm-server --model_name openai/gpt-oss-20b --max_new_tokens 50 --device cpu
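
Under the hood, a server of this kind is little more than a Flask route wrapped around the transformers generation API. The sketch below illustrates the idea only; it is not this package's actual source, and the response field name is an assumption:

# Illustrative sketch of a minimal Flask inference server; NOT this
# package's actual source. Assumes flask, transformers, and torch are installed.
from flask import Flask, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "openai/gpt-oss-20b"  # corresponds to --model_name
MAX_NEW_TOKENS = 50                # corresponds to --max_new_tokens
DEVICE = "cpu"                     # corresponds to --device

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(DEVICE)

app = Flask(__name__)

@app.route("/llm/q", methods=["POST"])
def generate():
    data = request.get_json()
    inputs = tokenizer(data["query"], return_tensors="pt").to(DEVICE)
    output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
    return jsonify({"response": tokenizer.decode(output_ids[0], skip_special_tokens=True)})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)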

Sending Queries

Once the server is running (default endpoint: http://127.0.0.1:5000/llm/q), you can query it with curl or Python.

Curl:

curl -X POST http://127.0.0.1:5000/llm/q \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Earth?", "key": "key1"}'

Python client:

from min_llm_server_client.local_llm_inference_api_client import send_query

response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)
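
Assuming the send_query signature shown above, the client can be called in a loop to run several prompts sequentially:

from min_llm_server_client.local_llm_inference_api_client import send_query

questions = ["What is Earth?", "What is the capital of France?"]
for q in questions:
    # Each call blocks until the server finishes generating.
    print(send_query(q, user="user1", key="key1"))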

Performance notes

  • Running LLaMA 3.1 8B:
    • Intel CPU → ~30 seconds per request, ~2.4 GB RAM
    • A100 GPU → <1 second per request, ~34 GB GPU memory, ~4.8 GB CPU RAM
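
To check these numbers on your own hardware, you can time a request end to end against the endpoint shown earlier; a minimal sketch:

# Rough end-to-end latency check against a running server.
import time

import requests

start = time.perf_counter()
requests.post(
    "http://127.0.0.1:5000/llm/q",
    json={"query": "What is Earth?", "key": "key1"},
    timeout=300,  # generous timeout: CPU runs can take ~30 s per request
)
print(f"End-to-end latency: {time.perf_counter() - start:.1f} s")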

Project Structure

min_llm_server_client/
├── src/
│   ├── local_llm_inference_api_client.py
│   ├── local_llm_inference_server_api.py
│   └── ...
└── README.md

License

This project is open source under the Apache 2.0 License.


Author

Afshin Sadeghi
📧 sadeghi.afshin@gmail.com
🔗 GitHub
🔗 Google Scholar
🔗 LinkedIn
