A minimal API server for local HuggingFace LLMs

Project description

Minimal LLM Server for API calls

The simplest possible Python code for running local LLM inference as a REST API server (with a simple client).

This package lets you start an inference server for Hugging Face–compatible models (like LLaMA, Qwen, GPT-OSS, etc.) on your own computer or server, and make it accessible to applications via HTTP.

See the Tutorial page for extended info.

Installation

From PyPI (recommended):

pip install min-llm-server-client

From source:

git clone https://github.com/afshinsadeghi/min_llm_server_client.git
cd min_llm_server_client
pip install .

Usage

Starting the Server

After installation, you can launch the server with the provided CLI entrypoint:

min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:0

Options:

  • --model_name : Hugging Face model name or a local path (e.g. /path/to/model). Suggested models:
    • openai/gpt-oss-20b
    • openai/gpt-oss-120b
    • meta-llama/Llama-3.3-70B-Instruct
    • meta-llama/Llama-3.1-8B
    • Qwen/Qwen3-0.6B
    • Qwen/Qwen2-VL-72B-Instruct-AWQ
    • deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

  • --max_new_tokens : maximum number of tokens to generate in the response.

  • --device : cpu, cuda:0, cuda:1, etc. If no device is given, the server picks an available GPU automatically and falls back to CPU otherwise (see the sketch below).
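
A minimal sketch of how such a device fallback can be implemented with PyTorch; the package's actual detection logic may differ:

import torch

def pick_device(requested=None):
    """Return the requested device string, or auto-detect one."""
    if requested:
        return requested
    # Sketch only: prefer the first CUDA GPU when available, else CPU.
    return "cuda:0" if torch.cuda.is_available() else "cpu"

print(pick_device())       # e.g. "cuda:0" on a GPU machine
print(pick_device("cpu"))  # always "cpu"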

Example (CPU run):

min-llm-server --model_name openai/gpt-oss-20b --max_new_tokens 50 --device cpu

Sending Queries

Once the server is running (default: http://127.0.0.1:5000/llm/q), you can query it with curl or Python.

Curl:

curl -X POST http://127.0.0.1:5000/llm/q \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Earth?", "key": "key1"}'

Python client:

from min_llm_server_client.local_llm_inference_api_client import send_query

response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)

Performance notes

  • Running LLaMA 3.1 8B:
    • Intel CPU → ~30 seconds per request, ~2.4 GB RAM
    • A100 GPU → <1 second per request, ~34 GB GPU memory, ~4.8 GB CPU RAM
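
To reproduce such per-request timings on your own hardware, you can time the client call directly. A minimal sketch using the send_query client shown above:

import time
from min_llm_server_client.local_llm_inference_api_client import send_query

start = time.perf_counter()
response = send_query("What is the capital of France?", user="user1", key="key1")
elapsed = time.perf_counter() - start

print(f"Response: {response}")
print(f"Latency: {elapsed:.2f} s")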

Project Structure

min_llm_server_client/
├── src/
│   ├── local_llm_inference_api_client.py
│   ├── local_llm_inference_server_api.py
│   └── ...
└── README.md

License

This project is open source under the Apache 2.0 License.


Author

Afshin Sadeghi
📧 sadeghi.afshin@gmail.com
🔗 GitHub
🔗 Google Scholar
🔗 LinkedIn

Download files

Download the file for your platform.

Source Distribution

min_llm_server_client-0.3.10.tar.gz (9.9 kB)

Built Distribution

min_llm_server_client-0.3.10-py3-none-any.whl (10.5 kB)

File details

Details for the file min_llm_server_client-0.3.10.tar.gz.

File metadata

  • Download URL: min_llm_server_client-0.3.10.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.12

File hashes

Hashes for min_llm_server_client-0.3.10.tar.gz:

  • SHA256: 0d85747adbd817c9c8feed9880f0f61b38b4da431a723f2b558a8aabdb8adad6
  • MD5: e0e6895f24ce654b0dab38fe3559a9de
  • BLAKE2b-256: 40f191ac2bd1935c032f2bc96a5d32c0cfd4946464732b73d6416bc8a22d293a

These hashes can be used to verify the integrity of a downloaded file.
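
For example, a minimal sketch checking the SHA256 digest of the downloaded sdist with Python's hashlib (the expected value is the SHA256 listed above):

import hashlib

EXPECTED = "0d85747adbd817c9c8feed9880f0f61b38b4da431a723f2b558a8aabdb8adad6"

def sha256_of(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("min_llm_server_client-0.3.10.tar.gz") == EXPECTED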

File details

Details for the file min_llm_server_client-0.3.10-py3-none-any.whl.

File metadata

File hashes

Hashes for min_llm_server_client-0.3.10-py3-none-any.whl:

  • SHA256: e65125c0ea2c51f12fd54f03ec197cd1f08521f780b14708c635ff9d150ad140
  • MD5: 97fdf20ac634f4d3f07c5ae7771e9900
  • BLAKE2b-256: bcc7ff07c12e50c7f6660f00c4a55cdeec13f95b39a79d61079b892e072c9c40
