Python llama.cpp HTTP Server and LangChain LLM Client
Project description
python-llama-cpp-http
Python HTTP Server and LangChain LLM Client for llama.cpp.
The server exposes three routes:
- call: get the whole text completion for a prompt at once (see the sketch below the list):
POST /api/1.0/text/completion
- stream: get text completion chunks for a prompt via WebSocket:
GET /api/1.0/text/completion
- embeddings: get text embeddings for a prompt:
POST /api/1.0/text/embeddings
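As an illustration, here is a minimal sketch of the call route using the requests package. The JSON field name ("prompt") and port 5000 are assumptions; see misc/example_client_call.py for the authoritative request format.

```python
# Minimal sketch of a single-shot completion call.
# The "prompt" field name and port are assumptions; see
# misc/example_client_call.py for the actual request format.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/api/1.0/text/completion",
    json={"prompt": "Building a website can be done in 10 simple steps:"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```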
The LangChain LLM Client supports synchronous calls only and is built on the Python packages requests and websockets.
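For the stream route, the sketch below shows roughly what such a client does with the websockets package. The message format (a JSON object carrying the prompt, with chunks returned as messages until the connection closes) is an assumption; refer to misc/example_client_stream.py for the real protocol.

```python
# Rough sketch of streaming completion chunks over WebSocket.
# The message format is an assumption; see misc/example_client_stream.py
# for the real protocol.
import asyncio
import json

import websockets


async def stream_completion(prompt: str) -> None:
    uri = "ws://127.0.0.1:5000/api/1.0/text/completion"

    async with websockets.connect(uri) as ws:
        # Send the prompt as a JSON message (assumed shape).
        await ws.send(json.dumps({"prompt": prompt}))

        # Print each chunk until the server closes the connection.
        async for message in ws:
            print(message)


asyncio.run(stream_completion("Building a website can be done in 10 simple steps:"))
```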
Install
pip install llama_cpp_http
Manual install
The assumption is that the GPU driver and OpenCL/CUDA libraries are already installed.
Make sure you follow the instructions from LLAMA_CPP.md below for one of the following:
- CPU - including Apple, recommended for beginners
- OpenCL for AMDGPU/NVIDIA CLBlast
- HIP/ROCm for AMDGPU hipBLAS
- CUDA for NVIDIA cuBLAS
It is easiest to start with the CPU-only build of llama.cpp if you do not want to deal with GPU drivers and libraries.
Install build packages
- Arch/Manjaro:
sudo pacman -Sy base-devel python git jq
- Debian/Ubuntu:
sudo apt install build-essential python3-dev python3-venv python3-pip libffi-dev libssl-dev git jq
Clone repo
git clone https://github.com/mtasic85/python-llama-cpp-http.git
cd python-llama-cpp-http
Make sure you are inside the cloned repo directory python-llama-cpp-http.
Set up Python venv
python -m venv venv
source venv/bin/activate
python -m ensurepip --upgrade
pip install -U .
Clone and compile llama.cpp
git clone https://github.com/ggerganov/llama.cpp llama.cpp
cd llama.cpp
make -j
Download Meta's Llama 2 7B Model
Download a GGUF model from https://huggingface.co/TheBloke/Llama-2-7B-GGUF to the local directory models.
Our advice is to use the model https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q2_K.gguf, which has minimal requirements and can fit in both RAM and VRAM.
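If you prefer to script the download, one option is the huggingface_hub package (an extra dependency, not required by this project); the sketch below fetches the recommended file into ./models. Downloading manually from the browser works just as well.

```python
# Optional: download the recommended quantized model into ./models.
# Requires `pip install huggingface_hub`; manual download works too.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q2_K.gguf",
    local_dir="./models",
)
```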
Run Server
python -m llama_cpp_http.server --backend cpu --models-path ./models --llama-cpp-path ./llama.cpp
Experimental:
gunicorn 'llama_cpp_http.server:get_gunicorn_app(backend="clblast", models_path="~/models", llama_cpp_path="~/llama.cpp-clblast", platforms_devices="0:0")' --reload --bind '0.0.0.0:5000' --worker-class aiohttp.GunicornWebWorker
Run Client Examples
- Simple text completion call to /api/1.0/text/completion:
python -B misc/example_client_call.py | jq .
- WebSocket stream from /api/1.0/text/completion:
python -B misc/example_client_stream.py | jq -R '. as $line | try (fromjson) catch $line'
- Simple text embeddings call to /api/1.0/text/embeddings (a direct-request sketch follows the list):
python -B misc/example_client_langchain_embedding.py
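The embeddings route can also be exercised directly with requests, as in the sketch below; the JSON field name ("prompt") is an assumption, so check misc/example_client_langchain_embedding.py for the exact format.

```python
# Sketch of a direct embeddings request.
# The "prompt" field name is an assumption; see
# misc/example_client_langchain_embedding.py for the exact format.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/api/1.0/text/embeddings",
    json={"prompt": "The quick brown fox jumps over the lazy dog."},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```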
Licensing
python-llama-cpp-http is licensed under the MIT license. Check the LICENSE file for details.
Download files
Download the file for your platform.
File details
Details for the file llama_cpp_http-0.3.3.tar.gz.
File metadata
- Download URL: llama_cpp_http-0.3.3.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.11.5 Linux/6.5.9-arch2-1
File hashes
Algorithm | Hash digest
---|---
SHA256 | f4b033115391bebca744d396c925a56bc1a7763ff20bb068cec1c31f924fe0ed
MD5 | bf629af0a5bfce2ff114226a188be453
BLAKE2b-256 | a9547a04fd32afca116a4f35f998ad0110cf1a114c715dca8ec9ab6a812deb20
File details
Details for the file llama_cpp_http-0.3.3-py3-none-any.whl.
File metadata
- Download URL: llama_cpp_http-0.3.3-py3-none-any.whl
- Upload date:
- Size: 11.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.11.5 Linux/6.5.9-arch2-1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 653f87e553b3c0a42ec4125e137948f825dfe57a003ce5c7344e8f05215813cb
MD5 | 4d3528fa60bc3c2656a018740f7bb7b3
BLAKE2b-256 | 209615fc59231f310f402269400528ebeb1752118e3ac0ec5c9b4510b6030fca