xllamacpp - a Python wrapper of llama.cpp

This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. It was spun off from my earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using cython, pybind11, and nanobind. Further development of xllamacpp, the cython wrapper from llamalib, will continue in this project.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp (last stable build with llama.cpp b4381)

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp

  • Learn about the internals of this popular C++/C LLM inference engine along the way. For me at least, this is definitely the most efficient way to learn about the underlying technologies.

Given that there is a fairly mature, well-maintained and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is gpu-driven rather than cpu-driven, this may all seem quite redundant. Nonetheless, we anticipate some benefits to using a compiled cython-based wrapper instead of ctypes:

  • Cython functions and extension classes can enforce strong type checking.

  • Packaging benefits with respect to self-contained statically compiled extension modules, which include simpler compilation and reduced package size.

  • There may be some performance improvements in the use of compiled wrappers over the use of ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It may be useful in case one wants to de-couple the python frontend and wrapper backends to existing frameworks: for example, to just replace the ctypes wrapper part in llama-cpp-python with compiled cython wrappers and contribute it back as a PR.
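To make the type-checking point concrete, here is a small sketch of how a ctypes-based binding declares a C function at runtime. It uses the C library's strlen rather than anything from llama.cpp, and it assumes a POSIX system where ctypes.CDLL(None) exposes libc symbols. The helper names c_strlen and rejects_str are made up for illustration.

```python
import ctypes

# Assumption: POSIX (Linux/macOS), where CDLL(None) exposes the symbols
# already loaded into the process, including libc's strlen.
libc = ctypes.CDLL(None)

strlen = libc.strlen
# With ctypes, the C signature is declared by hand, in Python, at runtime.
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical helper: call C strlen on a bytes object."""
    return strlen(data)


def rejects_str() -> bool:
    """Return True if ctypes refuses a str where char* was declared."""
    try:
        strlen("not bytes")
    except ctypes.ArgumentError:
        return True
    return False


if __name__ == "__main__":
    print(c_strlen(b"llama"))
    print(rejects_str())
```

In this style the signature lives in Python and is only checked when the call is made. A cython extension is instead compiled against the C header, so many signature mismatches would be expected to surface at build time.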

Status

Development is done only on macOS to keep things simple, with intermittent testing to ensure it works on Linux.

The following table provides an overview of the current wrapping/dev status:

status                          xllamacpp
------------------------------  ---------
wrapper-type                    cython
wrap llama.h + other headers    yes
wrap high-level simple-cli      yes
wrap low-level simple-cli       yes
wrap low-level llama-cli        WIP

The initial milestone entailed creating a high-level wrapper of the simple.cpp llama.cpp example, followed by a low-level one. The next objective, which is ongoing, is to fully wrap the functionality of llama-cli (see: xllamacpp.__init__.py).

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of the original c++ library as well as python, the approach to wrapping intentionally adopts the following guidelines:

  • In general, key structs are implemented as cython extension classes with related functions implemented as methods of said classes.

  • Be as consistent as possible with llama.cpp's naming of its api elements, except when it makes sense to shorten function names which are used as methods.

  • Minimize non-wrapper python code.
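The guidelines above can be pictured with a pure-Python analogy. This is only a sketch of the pattern, not xllamacpp's actual code: the _CContext dataclass stands in for a C struct, the two free functions stand in for C functions that take that struct as their first argument, and the LlamaContext class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class _CContext:
    """Stand-in for a C struct (in the real wrapper this lives in C++)."""
    n_ctx: int
    n_batch: int


# Stand-ins for C free functions that take the struct as first argument.
def llama_n_ctx(ctx: _CContext) -> int:
    return ctx.n_ctx


def llama_n_batch(ctx: _CContext) -> int:
    return ctx.n_batch


class LlamaContext:
    """Hypothetical thin wrapper: owns the struct, exposes functions as methods."""

    def __init__(self, n_ctx: int = 512, n_batch: int = 64):
        self._ptr = _CContext(n_ctx=n_ctx, n_batch=n_batch)

    # Names follow the C api, minus the redundant "llama_" prefix.
    @property
    def n_ctx(self) -> int:
        return llama_n_ctx(self._ptr)

    @property
    def n_batch(self) -> int:
        return llama_n_batch(self._ptr)


if __name__ == "__main__":
    ctx = LlamaContext(n_ctx=1024)
    print(ctx.n_ctx, ctx.n_batch)
```

The wrapper class holds no logic of its own; each attribute simply forwards to the underlying function, which is the sense in which non-wrapper python code is kept to a minimum.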

Setup

To build xllamacpp:

  1. Install a recent version of python3 (tested with python 3.12).

  2. Git clone the latest version of xllamacpp:

git clone https://github.com/shakfu/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update

  3. Install the dependencies (cython and setuptools, plus pytest for testing):

pip install -r requirements.txt

  4. Type make in the terminal.

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned xllamacpp folder
  3. Build xllamacpp
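Before running make, it can help to confirm that the interpreter and the build dependencies are in place. The following is a minimal sketch of such a check. The function name check_prerequisites and the default minimum version of 3.9 are assumptions made for illustration and do not come from the project.

```python
import importlib.util
import sys

# Import names of the dependencies listed in the setup steps.
REQUIRED = ("Cython", "setuptools", "pytest")


def check_prerequisites(min_version=(3, 9), modules=REQUIRED):
    """Return a list of problems; an empty list means ready to build.

    min_version is an assumed lower bound, not a documented requirement.
    """
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"python {wanted}+ required")
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Any reported missing package should be resolved by the pip install -r requirements.txt step.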

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in the .gguf format from huggingface. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. To create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
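On systems without make or wget, the same download can be sketched in Python using only the standard library. This is not part of the project: the function name ensure_model and its fetch parameter are hypothetical, and the URL is the one shown above.

```python
from pathlib import Path
from urllib.request import urlretrieve

MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def ensure_model(models_dir="models", url=MODEL_URL, fetch=urlretrieve):
    """Create models_dir if needed and download the model once.

    fetch is injectable so the logic can be exercised without network access.
    """
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    # The file name is the last path component of the URL.
    target = models_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        fetch(url, str(target))
    return target


if __name__ == "__main__":
    # Run from the cloned xllamacpp directory; the download is large.
    print(ensure_model())
```

Calling ensure_model() a second time does nothing if the file is already present, which mirrors the "create if it doesn't exist" intent of the make target.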

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"
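The same smoke test can be driven from Python, which is convenient when scripting checks across several models. The sketch below only builds and runs the command line shown above. The function names llama_cli_command and run_llama_cli are hypothetical, and the default binary path assumes the bin folder produced by the build.

```python
import subprocess


def llama_cli_command(model, prompt, n_ctx=512, n_predict=32,
                      binary="bin/llama-cli"):
    """Build the argument list for the llama-cli invocation shown above."""
    return [
        binary,
        "-c", str(n_ctx),      # context size
        "-n", str(n_predict),  # number of tokens to predict
        "-m", str(model),      # path to the .gguf model
        "-p", prompt,          # prompt text
    ]


def run_llama_cli(model, prompt, **kwargs):
    """Run llama-cli and return its stdout; raises if the process fails."""
    cmd = llama_cli_command(model, prompt, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(run_llama_cli(
        "models/Llama-3.2-1B-Instruct-Q8_0.gguf",
        "Is mathematics discovered or invented?",
    ))
```

Passing the arguments as a list avoids shell quoting issues with prompts that contain spaces or punctuation.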

You can also run the test suite with pytest by typing pytest or:

make test

If all tests pass, you can type python3 -i scripts/start.py or ipython -i scripts/start.py and explore the xllamacpp library with a pre-configured repl:

>>> from xllamacpp import Llama
>>> llm = Llama(model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
>>> llm.ask("what is the age of the universe?")
'estimated age of the universe\nThe estimated age of the universe is around 13.8 billion years'

TODO
