A Python wrapper of llama.cpp

xllamacpp - a Python wrapper of llama.cpp

This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. It was spun off from my earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using cython, pybind11, and nanobind. Further development of xllamacpp, the cython wrapper from llamalib, will continue in this project.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp (last stable build with llama.cpp b4381)

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp

  • Learn about the internals of this popular C++/C LLM inference engine along the way. For me at least, this is definitely the most efficient way to learn about the underlying technologies.

Given that there is a fairly mature, well-maintained and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is gpu-driven rather than cpu-driven, this all may seem quite redundant. Nonetheless, we anticipate some benefits to using a compiled cython-based wrapper instead of ctypes:

  • Cython functions and extension classes can enforce strong type checking.

  • Packaging benefits with respect to self-contained statically compiled extension modules, which include simpler compilation and reduced package size.

  • There may be some performance improvements in the use of compiled wrappers over the use of ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It may be useful in case one wants to de-couple the python frontend and wrapper backends to existing frameworks: for example, to just replace the ctypes wrapper part in llama-cpp-python with compiled cython wrappers and contribute it back as a PR.
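
For contrast with the type-checking point above, here is a minimal sketch of what the ctypes approach looks like. It does not use llama.cpp at all: it wraps `strlen` from the C standard library, and it assumes a POSIX system (Linux or macOS). The helper name `c_strlen` is made up for illustration.

```python
import ctypes
import ctypes.util

# Load the C standard library. Assumption: POSIX. If find_library returns
# None, CDLL(None) falls back to symbols already loaded in the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# With ctypes, argument and return types are declared by hand at runtime;
# nothing checks them against the C header.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical helper: length of a byte string as computed by C strlen."""
    return libc.strlen(data)


print(c_strlen(b"llama"))
```

Here a wrong argument type (for example a `str` instead of `bytes`) is only rejected when the call is made, whereas a cython extension declares its C types in the source, so they are checked when the extension is compiled.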

Status

Development is done only on macOS to keep things simple, with intermittent testing to ensure it works on Linux.

The following table provides an overview of the current wrapping/dev status:

status                          xllamacpp
wrapper-type                    cython
wrap llama.h + other headers    yes
wrap high-level simple-cli      yes
wrap low-level simple-cli       yes
wrap low-level llama-cli        WIP

The initial milestone entailed creating a high-level wrapper of the simple.cpp llama.cpp example, followed by a low-level one. The next objective, which is ongoing, is to fully wrap the functionality of llama-cli (see: xllamacpp.__init__.py).

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of the original c++ library as well as python, the approach to wrapping intentionally adopts the following guidelines:

  • In general, key structs are implemented as cython extension classes with related functions implemented as methods of said classes.

  • Be as consistent as possible with llama.cpp's naming of its api elements, except when it makes sense to shorten function names which are used as methods.

  • Minimize non-wrapper python code.
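
The guidelines above can be sketched in plain Python. This is only an illustration of the pattern (struct becomes a class, prefixed free functions become short method names), not the actual xllamacpp code, which is written in cython. Every name below (`_CModel`, `c_model_n_vocab`, `c_model_n_ctx_train`, `Model`) is a hypothetical stand-in.

```python
from dataclasses import dataclass


@dataclass
class _CModel:
    """Stand-in for a C struct; a real wrapper would hold a pointer instead."""
    n_vocab: int
    n_ctx_train: int


# Stand-ins for C free functions that take the struct as their first argument.
def c_model_n_vocab(model: _CModel) -> int:
    return model.n_vocab


def c_model_n_ctx_train(model: _CModel) -> int:
    return model.n_ctx_train


class Model:
    """Thin wrapper: owns the struct and forwards each call unchanged."""

    def __init__(self, n_vocab: int, n_ctx_train: int) -> None:
        self._ptr = _CModel(n_vocab, n_ctx_train)

    # The "c_model_" prefix is dropped because the class already supplies it.
    def n_vocab(self) -> int:
        return c_model_n_vocab(self._ptr)

    def n_ctx_train(self) -> int:
        return c_model_n_ctx_train(self._ptr)


model = Model(n_vocab=32000, n_ctx_train=4096)
print(model.n_vocab(), model.n_ctx_train())
```

The wrapper adds no logic of its own, which is the point of the "minimize non-wrapper python code" guideline.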

Setup

To build xllamacpp:

  1. Install a recent version of python3 (tested on python 3.12)

  2. Git clone the latest version of xllamacpp:

git clone https://github.com/shakfu/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update
  3. Install dependencies of cython, setuptools, and pytest for testing:
pip install -r requirements.txt
  4. Type make in the terminal.

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned xllamacpp folder
  3. Build xllamacpp
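
After the build, a quick way to confirm that the install step produced the expected folders is a small check like the one below. It is a sketch that relies only on the directory names listed above (bin, include, lib); the function name `missing_build_dirs` is made up for illustration, and the check says nothing about whether the files inside are complete.

```python
from pathlib import Path

# Directory names taken from the build description above.
EXPECTED_DIRS = ("bin", "include", "lib")


def missing_build_dirs(root="."):
    """Return the expected build directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_build_dirs(".")
    if missing:
        print("missing build directories:", ", ".join(missing))
    else:
        print("bin, include, and lib are all present")
```

Run it from the cloned xllamacpp folder.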

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in the .gguf format from huggingface. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. So to create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
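
If wget is not available, the same steps can be approximated with the Python standard library. This is a sketch and not what the Makefile runs: the function names `model_destination` and `download_model` are invented for illustration, and the download is skipped when the file already exists.

```python
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def model_destination(root=".", url=MODEL_URL):
    """Create <root>/models if needed and return the target file path."""
    models_dir = Path(root) / "models"
    models_dir.mkdir(parents=True, exist_ok=True)
    return models_dir / url.rsplit("/", 1)[-1]


def download_model(root=".", url=MODEL_URL):
    """Download the model into <root>/models unless it is already there."""
    dest = model_destination(root, url)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


if __name__ == "__main__":
    print(download_model("."))
```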

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"
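
The same invocation can be scripted from Python, for example to try several prompts in a row. The sketch below only assembles the flags shown above (-c, -n, -m, -p) and assumes the binary lives at bin/llama-cli relative to the current directory; the function names are hypothetical.

```python
import subprocess


def llama_cli_command(model_path, prompt, n_ctx=512, n_predict=32,
                      binary="bin/llama-cli"):
    """Build the argument list for the llama-cli call shown above."""
    return [
        binary,
        "-c", str(n_ctx),      # context size
        "-n", str(n_predict),  # number of tokens to predict
        "-m", model_path,
        "-p", prompt,
    ]


def run_llama_cli(model_path, prompt, **kwargs):
    """Run llama-cli and return its standard output as text."""
    result = subprocess.run(
        llama_cli_command(model_path, prompt, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_llama_cli(
        "models/Llama-3.2-1B-Instruct-Q8_0.gguf",
        "Is mathematics discovered or invented?",
    ))
```

Passing the arguments as a list avoids shell quoting issues with prompts that contain quotes or question marks.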

You can also run the test suite with pytest by typing pytest or:

make test

If all tests pass, you can type python3 -i scripts/start.py or ipython -i scripts/start.py and explore the xllamacpp library with a pre-configured repl:

>>> from xllamacpp import Llama
>>> llm = Llama(model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
>>> llm.ask("what is the age of the universe?")
'estimated age of the universe\nThe estimated age of the universe is around 13.8 billion years'

TODO

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

File                                                            Size      Python         Platform
xllamacpp_cuda12x-0.1.4-cp312-cp312-win_amd64.whl               22.0 MB   CPython 3.12   Windows x86-64
xllamacpp_cuda12x-0.1.4-cp312-cp312-manylinux_2_35_x86_64.whl   22.4 MB   CPython 3.12   manylinux: glibc 2.35+ x86-64
xllamacpp_cuda12x-0.1.4-cp311-cp311-win_amd64.whl               22.0 MB   CPython 3.11   Windows x86-64
xllamacpp_cuda12x-0.1.4-cp311-cp311-manylinux_2_35_x86_64.whl   22.4 MB   CPython 3.11   manylinux: glibc 2.35+ x86-64
xllamacpp_cuda12x-0.1.4-cp310-cp310-win_amd64.whl               22.0 MB   CPython 3.10   Windows x86-64
xllamacpp_cuda12x-0.1.4-cp310-cp310-manylinux_2_35_x86_64.whl   22.4 MB   CPython 3.10   manylinux: glibc 2.35+ x86-64
xllamacpp_cuda12x-0.1.4-cp39-cp39-win_amd64.whl                 22.0 MB   CPython 3.9    Windows x86-64
xllamacpp_cuda12x-0.1.4-cp39-cp39-manylinux_2_35_x86_64.whl     22.4 MB   CPython 3.9    manylinux: glibc 2.35+ x86-64
