Python bindings for the llama.cpp library

🦙 Python Bindings for llama.cpp

Simple Python bindings for @ggerganov's llama.cpp library. This package provides:

  • Low-level access to C API via ctypes interface.
  • High-level Python API for text completion
    • OpenAI-like API
    • LangChain compatibility

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.

Warning: Starting with version 0.1.79, the model format has changed from ggmlv3 to gguf. Old model files can be converted using the convert-llama-ggmlv3-to-gguf.py script in llama.cpp.
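
Before loading a model, it can help to check which format a file is in. The sketch below is not part of this package; it assumes that gguf files begin with the four ASCII bytes "GGUF", and the helper names are hypothetical.

```python
from pathlib import Path

# Assumption: a gguf file starts with these four ASCII bytes.
GGUF_MAGIC = b"GGUF"


def has_gguf_magic(header):
    """Return True if the given leading bytes carry the gguf magic."""
    return header[:4] == GGUF_MAGIC


def looks_like_gguf(path):
    """Read the first four bytes of a file and test them for the gguf magic."""
    with Path(path).open("rb") as f:
        return has_gguf_magic(f.read(4))
```

If looks_like_gguf("./models/7B/llama-model.gguf") returns False, the file is probably in an older format and needs converting first.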

Installation from PyPI

Install from PyPI (requires a C compiler):

pip install llama-cpp-python

The above command will attempt to install the package and build llama.cpp from source. This is the recommended installation method as it ensures that llama.cpp is built with the available optimizations for your system.

If you have previously installed llama-cpp-python through pip and want to upgrade your version or rebuild the package with different compiler options, please add the following flags to ensure that the package is rebuilt correctly:

pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh

Otherwise, the installation will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
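
To confirm which architecture the current interpreter runs as, a small check like the following can be run before installing. This is a sketch, not part of this package: the function name is hypothetical, and it assumes that an x86 Python running under Rosetta reports "x86_64" from platform.machine().

```python
import platform


def python_is_arm64_macos(system=None, machine=None):
    """Return True when this interpreter is a native arm64 build on macOS.

    Both arguments default to the values reported by the running interpreter;
    they can be overridden to test the logic on another machine.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"
```

Calling python_is_arm64_macos() with no arguments inspects the running interpreter.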

Installation with Hardware Acceleration

llama.cpp supports multiple BLAS backends for faster processing.

To install with OpenBLAS, pass the LLAMA_BLAS and LLAMA_BLAS_VENDOR options through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

To install with cuBLAS, pass the LLAMA_CUBLAS=on option through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

To install with CLBlast, pass the LLAMA_CLBLAST=on option through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python

To install with Metal (MPS), pass the LLAMA_METAL=on option through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

To install with hipBLAS / ROCm support for AMD cards, pass the LLAMA_HIPBLAS=on option through the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
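
The same installs can be scripted from Python, which avoids shell-specific syntax for setting environment variables. The following is a minimal sketch: the flag strings are copied from the commands above, while the BACKEND_FLAGS table and the build_env and install helpers are hypothetical names, not part of this package.

```python
import os
import subprocess
import sys

# CMake flags copied from the install commands in this section.
BACKEND_FLAGS = {
    "openblas": "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS",
    "cublas": "-DLLAMA_CUBLAS=on",
    "clblast": "-DLLAMA_CLBLAST=on",
    "metal": "-DLLAMA_METAL=on",
    "hipblas": "-DLLAMA_HIPBLAS=on",
}


def build_env(backend, base_env=None):
    """Return a copy of the environment with CMAKE_ARGS set for `backend`."""
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = BACKEND_FLAGS[backend]
    return env


def install(backend):
    """Rebuild and reinstall llama-cpp-python with the chosen backend."""
    cmd = [
        sys.executable, "-m", "pip", "install", "llama-cpp-python",
        "--force-reinstall", "--upgrade", "--no-cache-dir",
    ]
    subprocess.run(cmd, env=build_env(backend), check=True)
```

For example, install("cublas") would run pip with CMAKE_ARGS set to -DLLAMA_CUBLAS=on.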

Windows remarks

To set the CMAKE_ARGS variable in PowerShell, follow these steps (example using OpenBLAS):

$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

Then, call pip after setting the variables:

pip install llama-cpp-python

See the above instructions and set CMAKE_ARGS to the BLAS backend you want to use.

macOS remarks

Detailed macOS Metal GPU install documentation is available at docs/install/macos.md

High-level API

The high-level API provides a simple managed interface through the Llama class.

Below is a short example demonstrating how to use the high-level API to generate text:

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
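
The returned value is a plain Python dict, so the generated text and token counts can be read with ordinary indexing. The sketch below works on a trimmed copy of the output shown above; completion_text is a hypothetical helper, not part of the Llama class.

```python
def completion_text(output, prompt=""):
    """Return the generated text of the first choice.

    With echo=True the prompt is included in the text, so it is stripped
    when it appears at the start.
    """
    text = output["choices"][0]["text"]
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Trimmed copy of the example output above.
prompt = "Q: Name the planets in the solar system? A: "
sample = {
    "choices": [
        {
            "text": prompt + "Mercury, Venus, Earth, Mars, Jupiter, "
                             "Saturn, Uranus, Neptune and Pluto.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}

answer = completion_text(sample, prompt)
usage = sample["usage"]
```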

Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.

For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
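
As a rough way to pick n_ctx, the prompt and the tokens to be generated can be budgeted against the window. This is only a sketch, assuming prompt tokens and generated tokens share the same window; fits_in_context is a hypothetical helper.

```python
def fits_in_context(prompt_tokens, max_tokens, n_ctx=512):
    """Return True if the prompt plus the requested completion fit in n_ctx."""
    return prompt_tokens + max_tokens <= n_ctx


# The example above used 14 prompt tokens and max_tokens=32.
small_ok = fits_in_context(14, 32)                # default window of 512
long_ok = fits_in_context(500, 32)                # too large for 512
long_ok_2048 = fits_in_context(500, 32, n_ctx=2048)
```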

Loading llama-2 70b

When loading Llama 2 70B, set the n_gqa parameter (grouped-query attention factor) to 8:

llm = Llama(model_path="./models/70B/llama-model.gguf", n_gqa=8)

Web Server

llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp-compatible models with any OpenAI-compatible client (language libraries, services, etc.).

To install the server package and get started:

pip install llama-cpp-python[server]
python3 -m llama_cpp.server --model models/7B/llama-model.gguf

As in the Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
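
Once the server is running, it can be queried with nothing but the standard library. The sketch below assumes the server exposes the OpenAI-style /v1/completions route on the default address; the helper names are hypothetical, and the OpenAPI page above is the place to confirm the exact routes and fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the text above


def build_completion_request(prompt, max_tokens=32, stop=None, base_url=BASE_URL):
    """Build a POST request for the (assumed) OpenAI-style completions route."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    if stop:
        payload["stop"] = stop
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server up, complete("Q: Name the planets in the solar system? A: ", stop=["Q:", "\n"]) should return a dict shaped like the high-level API output.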

Docker image

A Docker image is available on GHCR. To run the server:

docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest

Docker on Termux (requires root) is currently the only known way to run this on phones; see the Termux support issue.

Low-level API

The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

>>> import llama_cpp
>>> import ctypes
>>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
>>> ctx = llama_cpp.llama_new_context_with_model(model, params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
>>> llama_cpp.llama_free(ctx)

Check out the examples folder for more examples of using the low-level API.
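
The ctypes array pattern used for the tokens buffer can be tried without a model. The sketch below assumes a 32-bit integer as a stand-in for llama_cpp.llama_token and uses a hypothetical fake_tokenize function in place of llama_tokenize; it only illustrates allocating a fixed-size C array and reading back the filled prefix.

```python
import ctypes

# Assumption: stand-in for llama_cpp.llama_token.
TokenType = ctypes.c_int32


def fake_tokenize(text, out, capacity):
    """Hypothetical stand-in for llama_tokenize: writes one id per byte.

    Like the C function, it fills a caller-provided buffer and returns
    the number of entries written.
    """
    data = text[:capacity]
    for i, byte in enumerate(data):
        out[i] = byte
    return len(data)


max_tokens = 16
tokens = (TokenType * max_tokens)()   # zero-initialised fixed-size C array
n_tokens = fake_tokenize(b"Q: hi", tokens, max_tokens)
token_ids = list(tokens[:n_tokens])   # keep only the filled prefix
```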

Documentation

Documentation is available at https://abetlen.github.io/llama-cpp-python. If you find any issues with the documentation, please open an issue or submit a PR.

Development

This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in editable / development mode:

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e .[server]

# to install all optional dependencies
pip install -e .[all]

# to clear the local build cache
make clean

How does this compare to other Python bindings of llama.cpp?

I originally wrote this package for my own use with two goals in mind:

  • Provide a simple process to install llama.cpp and access the full C API in llama.h from Python
  • Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use llama.cpp

Any contributions and changes to this package will be made with these goals in mind.

License

This project is licensed under the terms of the MIT license.
