A Python wrapper of llama.cpp

xllamacpp - a Python wrapper of llama.cpp

This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. It was spun off from my earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using cython, pybind11, and nanobind. Further development of xllamacpp, the cython wrapper from llamalib, will continue in this project.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp (last stable build with llama.cpp b4381)

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp

  • Learn about the internals of this popular C++/C LLM inference engine along the way. For me at least, this is definitely the most efficient way to learn about the underlying technologies.

Given that there is a fairly mature, well-maintained and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is GPU-driven rather than CPU-driven, this may all seem quite redundant. Nonetheless, we anticipate some benefits to using a compiled cython-based wrapper instead of ctypes:

  • Cython functions and extension classes can enforce strong type checking.

  • Packaging benefits with respect to self-contained statically compiled extension modules, which include simpler compilation and reduced package size.

  • There may be some performance improvements in the use of compiled wrappers over the use of ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It may be useful in case one wants to de-couple the python frontend and wrapper backends to existing frameworks: for example, to just replace the ctypes wrapper part in llama-cpp-python with compiled cython wrappers and contribute it back as a PR.
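
To make the type-checking point concrete, here is a minimal sketch of how a ctypes binding has to declare a C signature by hand at runtime. It is not taken from llama-cpp-python or xllamacpp: it assumes CPython and uses `ctypes.pythonapi` (the C API of the running interpreter) only because it is always available, and the helper names `interpreter_version` and `checked_long` are made up for illustration.

```python
import ctypes


def interpreter_version() -> str:
    """Call the C function Py_GetVersion() through ctypes (illustrative helper)."""
    fn = ctypes.pythonapi.Py_GetVersion
    # ctypes assumes a function returns a C int unless told otherwise,
    # so the return type has to be declared by hand at runtime.
    fn.restype = ctypes.c_char_p
    fn.argtypes = []
    return fn().decode("utf-8")


def checked_long(value):
    """Call PyLong_FromLong() with a declared signature (illustrative helper)."""
    fn = ctypes.pythonapi.PyLong_FromLong
    fn.restype = ctypes.py_object
    fn.argtypes = [ctypes.c_long]
    # With argtypes set, a wrong argument type raises ctypes.ArgumentError
    # instead of being handed to the C function unchecked.
    return fn(value)


if __name__ == "__main__":
    print(interpreter_version())
    print(checked_long(42))
    try:
        checked_long("not a number")
    except ctypes.ArgumentError as exc:
        print("rejected:", exc)
```

In a Cython wrapper the equivalent declarations live in the compiled source, so the signature is fixed when the extension is built instead of being attached to each function at runtime.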

Status

Development is done only on macOS to keep things simple, with intermittent testing to ensure it works on Linux.

The following table provides an overview of the current wrapping/dev status:

status                         xllamacpp
wrapper-type                   cython
wrap llama.h + other headers   yes
wrap high-level simple-cli     yes
wrap low-level simple-cli      yes
wrap low-level llama-cli       WIP

The initial milestone entailed creating a high-level wrapper of the simple.cpp llama.cpp example, followed by a low-level one. The next objective, which is ongoing, is to fully wrap the functionality of llama-cli (see: xllamacpp.__init__.py).

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of the original c++ library as well as python, the approach to wrapping intentionally adopts the following guidelines:

  • In general, key structs are implemented as cython extension classes with related functions implemented as methods of said classes.

  • Be as consistent as possible with llama.cpp's naming of its api elements, except when it makes sense to shorten function names which are used as methods.

  • Minimize non-wrapper python code.
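
The first two guidelines can be pictured with a small plain-Python analogy. In the project itself the class would be a cython extension class wrapping a C struct; here the "C API" is a set of toy functions (`counter_new`, `counter_add`, `counter_value`) invented for this sketch, and none of these names come from llama.cpp or xllamacpp.

```python
# Toy stand-in for a C API that operates on a struct (here a plain dict).
# These functions are hypothetical and exist only for this illustration.
def counter_new(start):
    return {"value": start}


def counter_add(counter, amount):
    counter["value"] += amount


def counter_value(counter):
    return counter["value"]


class Counter:
    """Thin wrapper: the struct becomes a class, its functions become methods."""

    def __init__(self, start=0):
        # Keep a reference to the underlying "struct".
        self._handle = counter_new(start)

    def add(self, amount):
        # counter_add(counter, amount) becomes Counter.add(amount):
        # the prefix is dropped because the class already names the type.
        counter_add(self._handle, amount)

    def value(self):
        return counter_value(self._handle)


if __name__ == "__main__":
    c = Counter(start=1)
    c.add(2)
    print(c.value())
```

The wrapper class holds no logic of its own beyond forwarding calls, which is what keeps the layer thin.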

Setup

To build xllamacpp:

  1. Install a recent version of python3 (tested on python 3.12).

  2. Git clone the latest version of xllamacpp:

git clone https://github.com/shakfu/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update
  3. Install the dependencies (cython, setuptools, and pytest for testing):

pip install -r requirements.txt

  4. Type make in the terminal.

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned xllamacpp folder
  3. Build xllamacpp
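
A quick way to confirm that the build left the expected folders behind is sketched below. It only assumes the three directory names listed above; the function name `missing_build_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directories the build is described as populating in the cloned folder.
EXPECTED_DIRS = ("bin", "include", "lib")


def missing_build_dirs(root="."):
    """Return the expected build directories that are absent under root."""
    base = Path(root)
    return [name for name in EXPECTED_DIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_build_dirs()
    if missing:
        print("missing after build:", ", ".join(missing))
    else:
        print("bin, include, and lib are all present")
```

Run it from the cloned xllamacpp folder after make finishes; an empty result means all three directories exist.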

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in the .gguf format from huggingface. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. So, to create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
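
If wget is not available, the same download can be approximated from Python with the standard library. This is a sketch, not the project's own tooling: the function name `download_model` and its `fetch` parameter are assumptions made here so the logic can be exercised without network access. Calling `download_model()` with no arguments fetches the model above into ./models.

```python
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def download_model(url=MODEL_URL, models_dir="models",
                   fetch=urllib.request.urlretrieve):
    """Download url into models_dir unless the file is already there."""
    target_dir = Path(models_dir)
    # No error if the directory already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    # The file name is the last path component of the URL.
    target = target_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # `fetch` is injectable (hypothetical parameter) for offline testing.
        fetch(url, str(target))
    return target
```

Skipping the download when the file already exists makes the call safe to repeat.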

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"
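
The same invocation can be driven from a Python script, for example to compare the CLI output with the wrapper's output. The sketch below only reuses the flags shown in the command above; the helper names `llama_cli_command` and `run_llama_cli` are hypothetical.

```python
import subprocess


def llama_cli_command(model, prompt, ctx_size=512, n_predict=32,
                      binary="bin/llama-cli"):
    """Build the argument list for the llama-cli invocation shown above."""
    return [
        binary,
        "-c", str(ctx_size),   # context size
        "-n", str(n_predict),  # number of tokens to predict
        "-m", str(model),      # path to the .gguf model
        "-p", prompt,          # prompt text
    ]


def run_llama_cli(model, prompt, **kwargs):
    """Run llama-cli and return its standard output as text."""
    cmd = llama_cli_command(model, prompt, **kwargs)
    # Passing a list (not a shell string) avoids quoting problems in the prompt.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_llama_cli("models/Llama-3.2-1B-Instruct-Q8_0.gguf", "Is mathematics discovered or invented?")` mirrors the shell command, assuming it is run from the cloned xllamacpp folder after a successful build.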

You can also run the test suite with pytest by typing pytest or:

make test

If all tests pass, you can type python3 -i scripts/start.py or ipython -i scripts/start.py and explore the xllamacpp library with a pre-configured repl:

>>> from xllamacpp import Llama
>>> llm = Llama(model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
>>> llm.ask("what is the age of the universe?")
'estimated age of the universe\nThe estimated age of the universe is around 13.8 billion years'

TODO

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

xllamacpp_cuda12x-0.1.6-cp312-cp312-win_amd64.whl (22.0 MB)
CPython 3.12, Windows x86-64

xllamacpp_cuda12x-0.1.6-cp312-cp312-manylinux_2_35_x86_64.whl (22.4 MB)
CPython 3.12, manylinux: glibc 2.35+ x86-64

xllamacpp_cuda12x-0.1.6-cp311-cp311-win_amd64.whl (22.0 MB)
CPython 3.11, Windows x86-64

xllamacpp_cuda12x-0.1.6-cp311-cp311-manylinux_2_35_x86_64.whl (22.4 MB)
CPython 3.11, manylinux: glibc 2.35+ x86-64

xllamacpp_cuda12x-0.1.6-cp310-cp310-win_amd64.whl (22.0 MB)
CPython 3.10, Windows x86-64

xllamacpp_cuda12x-0.1.6-cp310-cp310-manylinux_2_35_x86_64.whl (22.4 MB)
CPython 3.10, manylinux: glibc 2.35+ x86-64

xllamacpp_cuda12x-0.1.6-cp39-cp39-win_amd64.whl (22.0 MB)
CPython 3.9, Windows x86-64

xllamacpp_cuda12x-0.1.6-cp39-cp39-manylinux_2_35_x86_64.whl (22.4 MB)
CPython 3.9, manylinux: glibc 2.35+ x86-64
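
The file names above follow the standard wheel naming scheme: distribution, version, Python tag, ABI tag, and platform tag, separated by hyphens. The sketch below splits a name into those fields to help pick the right file. It assumes names without the optional build tag, which holds for every file listed here; `WheelName` and `parse_wheel_name` are illustrative names, not part of any packaging library.

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename):
    """Split a wheel file name (without a build tag) into its five fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    stem = filename[: -len(".whl")]
    # Fields are separated by "-"; underscores inside a field are kept,
    # so the platform tag "manylinux_2_35_x86_64" stays in one piece.
    parts = stem.split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected number of fields in {filename}")
    return WheelName(*parts)
```

For instance, parsing `xllamacpp_cuda12x-0.1.6-cp312-cp312-win_amd64.whl` yields the Python tag `cp312` and the platform tag `win_amd64`.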

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp312-cp312-win_amd64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 5b3ba27a0448e3fa557dae95e1a74a980f5bfe78fc00682615ee6a4dc7c148b3
MD5 95c0467dd9bed2a45c3424a121d2e537
BLAKE2b-256 be12958a6b79b422119061c70e04081a7963af1d5e69839373d00f6aae8edb3a

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp312-cp312-manylinux_2_35_x86_64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp312-cp312-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 b7403101a0ed5eb4f817d3dff02b3229952e7989eacb866ce8a5c64690397f14
MD5 3164bf16e0ee15a8385890afb5f815a6
BLAKE2b-256 6a0a73feffa09452c310e8c4f3c05f96eefe5409a1243dfa366d9e16b5d0f75e

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp311-cp311-win_amd64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 d54959c232e31fe40d4d2882a0eaaf056a1150583941404e6e853093cfba7ab4
MD5 f79ba7e003d1be445bc9fd47d70d2366
BLAKE2b-256 0448b486fc39424c59c3edcd6fcd46f19a36ecce8faf993d6e3881a0aec01686

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp311-cp311-manylinux_2_35_x86_64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp311-cp311-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 722b5161e58773ddfcf51caa3eb7fb7d910f7a9b3131c9b0136ae058dde58efa
MD5 568733df46716f0404ab787319d7ee27
BLAKE2b-256 49f546530fe60a81dbe3fa7c10928ebccb12c56677644945cad43acad01d02f9

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp310-cp310-win_amd64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 c455879baa71e4755aaa146aa78e9f122a4f5c49512d450f7192446a22e9bdc5
MD5 6439537a3c907ccebf15694132cb2fe3
BLAKE2b-256 603a51e242ad4dcbfb9707889a924b37a1867ab29a5a6fe0fbee9c29e504320d

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp310-cp310-manylinux_2_35_x86_64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 b5be683fc64cfabd803d5af594b6ae697ae0fb42f643511fcb2029d1f739e92b
MD5 ce50cf9300117021f09859b93dc4594e
BLAKE2b-256 42b7825d80db649abb5437388bbfc80606a19e15e1856c47be1967a8b0dc372c

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp39-cp39-win_amd64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 c8658c1602fed15b0afb43ca62ed80a416f59822d8c676e37100c9feadf59744
MD5 77d02755413c706ae66faee8fdde7ed8
BLAKE2b-256 fd1e1eed33482c9450f076fb5c70da016ded36aaa2fa80e1b2f8515358f11695

File details

Details for the file xllamacpp_cuda12x-0.1.6-cp39-cp39-manylinux_2_35_x86_64.whl.

File hashes

Hashes for xllamacpp_cuda12x-0.1.6-cp39-cp39-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 5b0829b1b6f3a922716026ffc6a7f1a5e14c1332e640da31f6be2a1e41f86aab
MD5 58cf12e191e4800813454a83298cdf9a
BLAKE2b-256 960138f360cc99ff03bd917a7dd702863abc8f7547cd01bae29ad0772da832d5
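
The digests listed for each file can be checked against a downloaded wheel with Python's hashlib. This is a minimal sketch under the assumption that BLAKE2b-256 means BLAKE2b with a 32-byte digest; the function names `file_digests` and `matches` are made up for this example.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return the SHA256, MD5 and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in chunks so large wheels do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}


def matches(path, algorithm, expected_hex):
    """Check one digest of the file against a published value."""
    return file_digests(path)[algorithm] == expected_hex.lower()
```

To use it, pass the path of the downloaded wheel, the algorithm name as written in the tables above, and the corresponding digest; a result of False means the file differs from the published one.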
