A minimal Flask API server for local HuggingFace LLMs
LLM REST API
The simplest possible Python code for serving LLM inference calls as a REST API, plus a simple client for it.
In this setup, both the server and the client are written in Python and run on the same computer.
This is the basic code you need if you want to serve LLMs on your own server or computer and make them accessible to local applications and code.
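To make the setup concrete, below is a minimal sketch of what such a server can look like. This is an illustrative example, not the project's actual source: the /llm/q endpoint and the "query" field follow the curl examples further down, and the model id is only a placeholder.

```python
# Minimal illustrative Flask LLM server (not this project's source code).
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
# Any HuggingFace model id works here; device=-1 selects the CPU.
generator = pipeline("text-generation", model="openai/gpt-oss-20b", device=-1)

@app.route("/llm/q", methods=["GET", "POST"])
def llm_query():
    if request.method == "POST":
        prompt = request.get_json(force=True).get("query", "")
    else:
        prompt = "Hello"  # simple GET smoke test
    out = generator(prompt, max_new_tokens=100)
    return jsonify({"response": out[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)  # Flask dev server; local use only
```

Flask's built-in development server is enough for local, single-user use, which is exactly the scope of this project.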
Installation
To install the package, clone the repository and run the following command in the project root directory:
pip install .
You can install the dependencies using:
pip install -r requirements.txt
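For reference, a requirements.txt consistent with the Dependencies section below would contain the following (unpinned here; note that transformers also needs a backend such as PyTorch for inference, which you should install to match your CUDA setup):

```
Flask
transformers
sentencepiece
```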
Usage
Configuration
The configuration settings, such as the model path and device, live in the server's settings module (src/setting.py). Make sure to update these settings for your environment.
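As an illustration only, a settings module consistent with the CLI flags documented below could look like this hypothetical sketch (the shipped src/setting.py may differ):

```python
# Hypothetical sketch of src/setting.py; the real file may differ.
MODEL_NAME = "openai/gpt-oss-20b"  # any HuggingFace model id
MAX_NEW_TOKENS = 100               # cap on generated tokens
DEVICE = "cpu"                     # "cpu", a GPU index like 0, or "cuda:1"
HOST = "127.0.0.1"                 # assumed bind address
PORT = 5000                        # matches the curl examples below
```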
Starting the Server
To start the LLM inference server, run:
python -m min_llm_server_client.src.local_llm_inference_server_api
You can set arguments such as:
- --model_name
- --max_new_tokens
- --device
For example:
python -m min_llm_server_client.src.local_llm_inference_server_api --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:1
The --device value can be cpu or a GPU identifier: either a bare index such as 0 or 1 (the number of the GPU to use) or the cuda:N form shown above; the sketch after the CPU example shows one way this maps onto a transformers device.
Running on CPU:
python -m min_llm_server_client.src.local_llm_inference_server_api --model_name openai/gpt-oss-20b --max_new_tokens 100 --device cpu
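For illustration, one way a server can translate this --device convention into a transformers pipeline device is sketched below (an assumption, not the project's actual argument handling):

```python
# Sketch: mapping --device ("cpu", "0", "1", or "cuda:1") to a pipeline device.
import argparse
from transformers import pipeline

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="openai/gpt-oss-20b")
parser.add_argument("--max_new_tokens", type=int, default=100)
parser.add_argument("--device", default="cpu")
args = parser.parse_args()

# transformers accepts -1 for CPU, a bare GPU index, or a "cuda:N" string.
device = -1 if args.device == "cpu" else (
    int(args.device) if args.device.isdigit() else args.device
)
generator = pipeline("text-generation", model=args.model_name, device=device)
```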
Testing in a browser or with curl:
GET test: curl http://127.0.0.1:5000/llm/q
POST test without a user field: curl -X POST http://127.0.0.1:5000/llm/q -H "Content-Type: application/json" -d '{"query": "what is earth?"}'
POST test without a user field, with a key: curl -X POST http://127.0.0.1:5000/llm/q -H "Content-Type: application/json" -d '{"query": "what is earth?", "key": "key1"}'
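The same POST test from Python, using the requests package (the exact shape of the response JSON is not specified here, so the example just prints whatever comes back):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/llm/q",
    json={"query": "what is earth?", "key": "key1"},
)
print(resp.json())  # inspect the returned JSON
```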
Local test runs using Llama 3.1 8B:
Intel CPU: about 30 seconds per query, with 2.4 GB of CPU memory used. A100 GPU: under a second per query, with 34 GB of GPU memory and 4.8 GB of CPU memory used.
Author's contact:
sadeghi.afshin@gmail.com
License
This project is open source, licensed under the Apache 2.0 License. See the LICENSE file for more details.
Explanation:
This project provides a simple REST API server and client for interacting with a local language model (LLM) inference server. The server is built using Flask and allows users to send queries to the model and receive generated responses.
Project Structure
min_llm_server_client
├── src
│ ├── __init__.py
│ ├── local_llm_inference_api_client.py
│ ├── local_llm_inference_server_api.py
│ └── setting.py
├── setup.py
└── README.md
Using in third-party code
Sending Queries:
To interact with the server, you can use the client provided in src/local_llm_inference_api_client.py. This client includes functions to send queries to the server and handle responses.
Example usage
Here is a simple example of how to send a query to the server:
from src.local_llm_inference_api_client import send_query
response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)
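The client's source is not reproduced in this README; for orientation, a plausible send_query built on requests and consistent with the endpoint and fields in the curl examples would look like this sketch (not the file's verbatim contents):

```python
# Hypothetical sketch of send_query; the shipped client may differ.
import requests

def send_query(query, user=None, key=None,
               url="http://127.0.0.1:5000/llm/q"):
    payload = {"query": query}
    if user is not None:
        payload["user"] = user
    if key is not None:
        payload["key"] = key
    resp = requests.post(url, json=payload)
    resp.raise_for_status()
    return resp.json()
```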
Dependencies
This project requires the following Python packages:
- Flask
- transformers
- sentencepiece
Download files
Source Distribution: min_llm_server_client-0.1.0.tar.gz
Built Distribution: min_llm_server_client-0.1.0-py3-none-any.whl
File details
Details for the file min_llm_server_client-0.1.0.tar.gz.
File metadata
- Download URL: min_llm_server_client-0.1.0.tar.gz
- Upload date:
- Size: 6.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bcbb5287ac73ca1aff8127b37bb4d5a2f14b81358cdc3402f8f673c5908529b5 |
| MD5 | 9b45e6f02b59e899d52daae6101d103e |
| BLAKE2b-256 | f832a8861f98663c641634daf26824e1b5092bd376fd568c39166e8cf7a846de |
File details
Details for the file min_llm_server_client-0.1.0-py3-none-any.whl.
File metadata
- Download URL: min_llm_server_client-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f37c5e62bb35e4967f85878496105a13bab50c8f649506d4aa640ef42dd4f261 |
| MD5 | 0dd8813656089cf63ddc7116dc58eb76 |
| BLAKE2b-256 | 73f2ee81d14ebe6981bf141f4e76476031e80faaa2eadcba1019440423327be3 |
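To verify a download against the SHA256 digests above, a short standard-library check suffices:

```python
import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "f37c5e62bb35e4967f85878496105a13bab50c8f649506d4aa640ef42dd4f261"
print(sha256_of("min_llm_server_client-0.1.0-py3-none-any.whl") == expected)
```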