xllamacpp - a Python wrapper of llama.cpp

This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. It was spun off from my earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using cython, pybind11, and nanobind. Further development of xllamacpp, the cython wrapper from llamalib, will continue in this project.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp (last stable build with llama.cpp b4381)

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp

  • Learn about the internals of this popular C++/C LLM inference engine along the way. For me at least, this is definitely the most efficient way to learn about the underlying technologies.

Given that there is a fairly mature, well-maintained and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is gpu-driven rather than cpu-driven, this may all seem quite redundant. Nonetheless, we anticipate some benefits to using a compiled cython-based wrapper instead of ctypes:

  • Cython functions and extension classes can enforce strong type checking.

  • Packaging benefits with respect to self-contained statically compiled extension modules, which include simpler compilation and reduced package size.

  • There may be some performance improvements in the use of compiled wrappers over the use of ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It may be useful in case one wants to de-couple the python frontend and wrapper backends to existing frameworks: for example, to just replace the ctypes wrapper part in llama-cpp-python with compiled cython wrappers and contribute it back as a PR.
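To make the type-checking point more concrete, here is a minimal sketch of the ctypes style of wrapping. It uses the C library's strlen rather than llama.cpp (an assumption made purely so that the example runs without a build or a model), and it assumes a POSIX system. With ctypes, the argument and return types have to be declared by hand at runtime; a cython wrapper would instead take them from the C declarations when the extension is compiled.

```python
import ctypes
import ctypes.util

# Load the platform C library (POSIX assumption). find_library may return
# None, in which case CDLL(None) exposes the symbols of the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# ctypes knows nothing about the C signature until it is declared by hand.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C strlen function on a bytes object."""
    return libc.strlen(data)


def rejects_str() -> bool:
    """Return True if ctypes refuses a str where char* was declared."""
    try:
        libc.strlen("not bytes")
    except ctypes.ArgumentError:
        return True
    return False
```

If the argtypes line is left out, ctypes falls back to guessing conversions from the Python values passed in, which is the kind of loose typing the first bullet refers to.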

Status

Development is done only on macOS to keep things simple, with intermittent testing to ensure it works on Linux.

The following table provides an overview of the current wrapping/dev status:

status                          xllamacpp
wrapper-type                    cython
wrap llama.h + other headers    yes
wrap high-level simple-cli      yes
wrap low-level simple-cli       yes
wrap low-level llama-cli        WIP

The initial milestone entailed creating a high-level wrapper of the simple.cpp llama.cpp example, followed by a low-level one. The next objective is to fully wrap the functionality of llama-cli which is ongoing (see: xllamacpp.__init__.py).

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of the original c++ library as well as python, the approach to wrapping intentionally adopts the following guidelines:

  • In general, key structs are implemented as cython extension classes with related functions implemented as methods of said classes.

  • Be as consistent as possible with llama.cpp's naming of its api elements, except when it makes sense to shorten function names which are used as methods.

  • Minimize non-wrapper python code.
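The guidelines above can be sketched in plain Python. Everything in this example is hypothetical: the demo_model_* functions stand in for C-level functions that operate on a struct, and DemoModel stands in for a cython extension class. None of these names are part of llama.cpp or xllamacpp; the point is only the shape of the mapping, where the struct prefix is dropped once the function becomes a method.

```python
# Hypothetical stand-ins for C functions that take a struct handle.
# In a real cython wrapper these would be declarations of C functions.
def demo_model_load(path):
    """Pretend to load a model and return an opaque handle."""
    return {"path": path}


def demo_model_desc(handle):
    """Pretend C function: describe the model behind the handle."""
    return f"demo model at {handle['path']}"


class DemoModel:
    """Thin wrapper: owns the handle, exposes functions as methods."""

    def __init__(self, path):
        self._handle = demo_model_load(path)

    def desc(self):
        # demo_model_desc(handle) becomes model.desc(): the prefix is
        # redundant once the function is a method of the class.
        return demo_model_desc(self._handle)
```

The class adds no logic of its own, in keeping with the goal of minimizing non-wrapper python code.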

Setup

To build xllamacpp:

  1. Install a recent version of python3 (testing is done on python 3.12).

  2. Git clone the latest version of xllamacpp:

git clone https://github.com/shakfu/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update

  3. Install the dependencies (cython, setuptools, and pytest for testing):

pip install -r requirements.txt

  4. Type make in the terminal.

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned xllamacpp folder
  3. Build xllamacpp
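As a quick sanity check after make finishes, you could confirm that the three install folders mentioned above exist. This is a small sketch, not part of the project: the helper name missing_build_dirs is hypothetical, and it only checks that the folders are present, not that their contents are complete.

```python
from pathlib import Path


def missing_build_dirs(repo_root="."):
    """Return the expected install folders that are absent under repo_root."""
    root = Path(repo_root)
    # bin, include, and lib are the folders the build installs into.
    return [name for name in ("bin", "include", "lib")
            if not (root / name).is_dir()]
```

Called from the cloned xllamacpp folder, an empty list suggests that the llama.cpp install step ran.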

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish llm in the .gguf format from huggingface. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. So to create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
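If wget is not available, the same download can be sketched with the Python standard library. The function name fetch_model is hypothetical and this is not how make download is implemented; it simply mirrors the shell steps above, creating the models folder if needed and skipping the download when the file is already there.

```python
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def fetch_model(url=MODEL_URL, models_dir="models"):
    """Download url into models_dir unless the file already exists."""
    dest_dir = Path(models_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # like mkdir -p
    dest = dest_dir / url.rsplit("/", 1)[-1]     # keep the remote file name
    if not dest.exists():
        # Note: an interrupted download leaves a partial file behind,
        # which would then be skipped on the next call.
        urllib.request.urlretrieve(url, dest)
    return dest
```

Run from the cloned xllamacpp folder, fetch_model() returns the path models/Llama-3.2-1B-Instruct-Q8_0.gguf.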

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

You can also run the test suite with pytest by typing pytest or:

make test

If all tests pass, you can type python3 -i scripts/start.py or ipython -i scripts/start.py and explore the xllamacpp library with a pre-configured repl:

from xllamacpp import Llama
llm = Llama(model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
llm.ask("what is the age of the universe?")
'estimated age of the universe\nThe estimated age of the universe is around 13.8 billion years'
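The same session can be turned into a small script. This sketch uses only the calls shown above (Llama(model_path=...) and llm.ask(...)); the helper name ask_once and the up-front file check are additions for illustration, and the import is done lazily so that a missing model is reported before the extension module is loaded.

```python
from pathlib import Path

# Model path assumed by the tests, relative to the cloned xllamacpp folder.
MODEL = Path("models/Llama-3.2-1B-Instruct-Q8_0.gguf")


def ask_once(prompt, model_path=MODEL):
    """Load the model, ask a single question, and return the answer."""
    model_path = Path(model_path)
    if not model_path.is_file():
        raise FileNotFoundError(
            f"{model_path} not found; try 'make download' first"
        )
    # Imported here so the check above runs even if xllamacpp is not built.
    from xllamacpp import Llama

    llm = Llama(model_path=str(model_path))
    return llm.ask(prompt)


if __name__ == "__main__":
    print(ask_once("what is the age of the universe?"))
```

Note that this reloads the model on every call, so for more than one question it is better to keep a single Llama instance around, as in the repl session.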

TODO

Built Distributions

No source distribution files are available for this release. The following wheels are available for xllamacpp_cuda12x 0.1.2:

  • xllamacpp_cuda12x-0.1.2-cp312-cp312-win_amd64.whl (22.0 MB), CPython 3.12, Windows x86-64

  • xllamacpp_cuda12x-0.1.2-cp312-cp312-manylinux_2_35_x86_64.whl (22.4 MB), CPython 3.12, manylinux: glibc 2.35+ x86-64

  • xllamacpp_cuda12x-0.1.2-cp311-cp311-win_amd64.whl (22.0 MB), CPython 3.11, Windows x86-64

  • xllamacpp_cuda12x-0.1.2-cp311-cp311-manylinux_2_35_x86_64.whl (22.4 MB), CPython 3.11, manylinux: glibc 2.35+ x86-64

  • xllamacpp_cuda12x-0.1.2-cp310-cp310-win_amd64.whl (22.0 MB), CPython 3.10, Windows x86-64

  • xllamacpp_cuda12x-0.1.2-cp310-cp310-manylinux_2_35_x86_64.whl (22.4 MB), CPython 3.10, manylinux: glibc 2.35+ x86-64

  • xllamacpp_cuda12x-0.1.2-cp39-cp39-win_amd64.whl (22.0 MB), CPython 3.9, Windows x86-64

  • xllamacpp_cuda12x-0.1.2-cp39-cp39-manylinux_2_35_x86_64.whl (22.4 MB), CPython 3.9, manylinux: glibc 2.35+ x86-64
