A minimal Flask API server for local HuggingFace LLMs

LLM REST API

The simplest possible Python code for serving LLM inference calls as a REST API, together with a simple client for it.

In this setup, both the server and the client are written in Python and run on the same computer.

This is the basic code to use if you want to run an LLM on your own server or computer and make it accessible to local applications and code.

Installation

To install the package, clone the repository and run the following command in the project root directory:

pip install .

You can install the dependencies using:

pip install -r requirements.txt

Usage

Configuration

Configuration settings, such as the model path and device, are defined in the server's settings module (src/setting.py; see the project structure below). Make sure to update them for your environment.
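
For illustration, the defaults collected there might look like the following sketch; the variable names here are assumptions, not the project's actual code.

# Hypothetical sketch of src/setting.py; the real names may differ.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # HuggingFace model id to load
MAX_NEW_TOKENS = 100  # cap on tokens generated per request
DEVICE = "cpu"        # "cpu", a GPU index like 0, or "cuda:0"
HOST = "127.0.0.1"    # address the Flask server binds to
PORT = 5000           # port the Flask server listens on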

Starting the Server

To start the LLM inference server, run:

 python -m min_llm_server_client.src.local_llm_inference_server_api

You can pass arguments such as:

  • --model_name
  • --max_new_tokens
  • --device

For example:

 python -m min_llm_server_client.src.local_llm_inference_server_api --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:1

--device can be cpu, or a GPU index such as 0 or 1 to select which GPU to use (the cuda:1 form above also works).

Running on CPU:

 python -m min_llm_server_client.src.local_llm_inference_server_api --model_name openai/gpt-oss-20b --max_new_tokens 100 --device cpu

Testing from a browser or the command line:

GET test: curl http://127.0.0.1:5000/llm/q

Or a POST test without a key: curl -X POST http://127.0.0.1:5000/llm/q -H "Content-Type: application/json" -d '{"query": "what is earth?"}'

POST test with an API key: curl -X POST http://127.0.0.1:5000/llm/q -H "Content-Type: application/json" -d '{"query": "what is earth?", "key": "key1"}'
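
If you prefer Python to curl, the same POST can be issued with the requests library (a sketch; requests is not among the listed dependencies, and the shape of the JSON reply depends on the server):

import requests

# Mirror of the curl POST above: send a query plus an API key.
resp = requests.post(
    "http://127.0.0.1:5000/llm/q",
    json={"query": "what is earth?", "key": "key1"},
    timeout=120,  # CPU inference can take tens of seconds (see timings below)
)
print(resp.json())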

Local test runs using Llama 3.1 8B:

On an Intel CPU, a query takes about 30 seconds and uses about 2.4 GB of CPU memory. On an A100 GPU, it takes less than a second and uses about 34 GB of GPU memory and 4.8 GB of CPU memory.

Author's contact:

sadeghi.afshin@gmail.com

License

This project is open source, licensed under the Apache 2.0 License. See the LICENSE file for more details.

Explanation

This project provides a simple REST API server and client for interacting with a local language model (LLM) inference server. The server is built using Flask and allows users to send queries to the model and receive generated responses.
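
To make that pattern concrete, here is a minimal sketch of such a Flask server wrapping a HuggingFace pipeline behind /llm/q. It is an illustration under assumed names, not the project's actual implementation (which adds argument parsing and key handling):

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# A small model keeps the sketch runnable on CPU; the real server loads the
# model chosen via --model_name instead.
generator = pipeline("text-generation", model="gpt2", device=-1)  # -1 = CPU

@app.route("/llm/q", methods=["GET", "POST"])
def llm_query():
    if request.method == "GET":
        # Liveness check, as in the curl GET test above.
        return jsonify({"status": "ok"})
    data = request.get_json(force=True)
    out = generator(data["query"], max_new_tokens=100)
    return jsonify({"response": out[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)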

Project Structure
min_llm_server_client
├── src
│   ├── __init__.py
│   ├── local_llm_inference_api_client.py
│   ├── local_llm_inference_server_api.py
│   └── setting.py
├── setup.py
└── README.md

Using it in third-party code

Sending Queries:

To interact with the server, you can use the client provided in src/local_llm_inference_api_client.py. This client includes functions to send queries to the server and handle responses.

Example usage

Here is a simple example of how to send a query to the server:

from min_llm_server_client.src.local_llm_inference_api_client import send_query

# Query the local server with an optional user id and API key.
response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)

Dependencies

This project requires the following Python packages:

  • Flask
  • transformers
  • sentencepiece
