Python llama.cpp HTTP Server and LangChain LLM Client

python-llama-cpp-http

Python HTTP Server and LangChain LLM Client for llama.cpp.

The server exposes three routes:

  • call: send a prompt and receive the whole text completion at once: POST /api/1.0/text/completion
  • stream: send a prompt and receive text chunks over WebSocket: GET /api/1.0/text/completion
  • embeddings: send a prompt and receive its text embeddings: POST /api/1.0/text/embeddings
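
As a rough sketch of how these routes map onto client requests, the snippet below builds (but does not send) a request for each route using only the standard library. The base URL `http://127.0.0.1:5000` and the JSON field `prompt` are assumptions made for illustration, not the documented request schema; the example clients in misc/ show the real payloads.

```python
import json
import urllib.request

API_PREFIX = "/api/1.0/text"  # common prefix of the routes listed above


def route_url(base_url: str, route: str, websocket: bool = False) -> str:
    """Return the full URL for the 'completion' or 'embeddings' route."""
    if route not in ("completion", "embeddings"):
        raise ValueError(f"unknown route: {route}")
    url = base_url.rstrip("/") + API_PREFIX + "/" + route
    if websocket:
        # The stream route shares the completion path but uses a WebSocket scheme.
        if url.startswith("https://"):
            url = "wss://" + url[len("https://"):]
        elif url.startswith("http://"):
            url = "ws://" + url[len("http://"):]
    return url


def build_post(base_url: str, route: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request; the payload keys are hypothetical."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        route_url(base_url, route),
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; the WebSocket URL from `route_url(..., websocket=True)` is what a WebSocket client would connect to for streaming.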

The LangChain LLM client supports synchronous calls only. It is built on the Python packages requests and websockets.

Install

pip install llama_cpp_http

Manual install

This assumes that the GPU driver and the OpenCL / CUDA libraries are already installed.

Make sure you follow the instructions in LLAMA_CPP.md for one of the following backends:

  • CPU, including Apple (recommended for beginners)
  • OpenCL for AMDGPU/NVIDIA (CLBlast)
  • HIP/ROCm for AMDGPU (hipBLAS)
  • CUDA for NVIDIA (cuBLAS)

The easiest way to start is with the CPU-only build of llama.cpp, which avoids dealing with GPU drivers and libraries.

Install build packages

  • Arch/Manjaro: sudo pacman -Sy base-devel python git jq
  • Debian/Ubuntu: sudo apt install build-essential python3-dev python3-venv python3-pip libffi-dev libssl-dev git jq

Clone repo

git clone https://github.com/mtasic85/python-llama-cpp-http.git
cd python-llama-cpp-http

Make sure you are inside the cloned repo directory python-llama-cpp-http.

Setup python venv

python -m venv venv
source venv/bin/activate
python -m ensurepip --upgrade
pip install -U .

Clone and compile llama.cpp

git clone https://github.com/ggerganov/llama.cpp llama.cpp
cd llama.cpp
make -j

Download Meta's Llama 2 7B Model

Download a GGUF model from https://huggingface.co/TheBloke/Llama-2-7B-GGUF into a local directory named models.

We recommend the model https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q2_K.gguf: it has the lowest memory requirements, so it is the most likely to fit in RAM or VRAM.
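
A quick sanity check after downloading can catch a common mistake, such as saving the HTML page behind the blob/main link instead of the model file itself. The sketch below relies on the fact that GGUF files start with the four magic bytes `GGUF`; the helper names are hypothetical and not part of this package.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def is_gguf(path) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


def list_gguf_models(models_dir) -> list:
    """Return sorted names of *.gguf files in models_dir that pass the magic check."""
    return sorted(
        p.name for p in Path(models_dir).glob("*.gguf") if is_gguf(p)
    )
```

For example, `list_gguf_models("models")` should include `llama-2-7b.Q2_K.gguf` once the download has completed correctly.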

Run Server

python -m llama_cpp_http.server --backend cpu --models-path ./models --llama-cpp-path ./llama.cpp

Experimental:

gunicorn 'llama_cpp_http.server:get_gunicorn_app(backend="clblast", models_path="~/models", llama_cpp_path="~/llama.cpp-clblast", platforms_devices="0:0")' --reload --bind '0.0.0.0:5000' --worker-class aiohttp.GunicornWebWorker
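
Before running the client examples, it can help to confirm that the server is accepting connections. The following is a minimal sketch that polls a TCP port with the standard library. The host and port are assumptions: port 5000 is taken from the gunicorn `--bind` example above, and the port used by `python -m llama_cpp_http.server` may differ.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A call such as `wait_for_port("127.0.0.1", 5000)` returns True once the port is open. It only checks that something is listening, not that the model has finished loading.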

Run Client Examples

  1. Simple text completion call /api/1.0/text/completion:
python -B misc/example_client_call.py | jq .
  2. WebSocket stream /api/1.0/text/completion:
python -B misc/example_client_stream.py | jq -R '. as $line | try (fromjson) catch $line'
  3. Simple text embeddings call /api/1.0/text/embeddings:
python -B misc/example_client_langchain_embedding.py
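
The jq filter in the stream example parses each output line as JSON and falls back to the raw line when parsing fails. For readers who prefer to post-process the output in Python, the same idea can be sketched as follows; the function name is hypothetical and the shape of the streamed chunks is not assumed.

```python
import json


def parse_stream_lines(lines):
    """Yield each line parsed as JSON, or the raw line if it is not valid JSON."""
    for line in lines:
        line = line.rstrip("\n")
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Mirrors the jq `catch $line` branch: pass the text through unchanged.
            yield line
```

Feeding it `sys.stdin` turns the piped output of the stream example into a mix of decoded objects and plain strings.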

Licensing

python-llama-cpp-http is licensed under the MIT license. See the LICENSE file for details.

