xllamacpp - a Python wrapper of llama.cpp

This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. It was spun off from my earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using cython, pybind11, and nanobind. Further development of xllamacpp, the cython wrapper from llamalib, will continue in this project.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp (last stable build with llama.cpp b4381)

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp

  • Learn about the internals of this popular C++/C LLM inference engine along the way. For me at least, this is definitely the most efficient way to learn about the underlying technologies.

Given that there is a fairly mature, well-maintained and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is gpu-driven rather than cpu-driven, this all may seem quite redundant. Nonetheless, we anticipate some benefits to using a compiled cython-based wrapper instead of ctypes:

  • Cython functions and extension classes can enforce strong type checking.

  • Packaging benefits with respect to self-contained statically compiled extension modules, which include simpler compilation and reduced package size.

  • There may be some performance improvements in the use of compiled wrappers over the use of ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It may be useful in case one wants to de-couple the python frontend and wrapper backends to existing frameworks: for example, to just replace the ctypes wrapper part in llama-cpp-python with compiled cython wrappers and contribute it back as a PR.
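As a rough illustration of the type-checking point, here is a small sketch of the ctypes side of the comparison. With ctypes, argument types are opt-in and are checked only at call time once argtypes has been declared, whereas a cython wrapper states them in the function signature. The example deliberately uses the C library's abs rather than anything from llama.cpp, assumes a POSIX system (Linux or macOS), and the helper name checked_abs is made up for this sketch.

```python
import ctypes
import ctypes.util

# Load the C standard library (POSIX assumption). If find_library returns
# None, CDLL(None) exposes the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Without these declarations ctypes guesses how to convert each argument.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def checked_abs(value):
    """Call C abs(); return None if ctypes rejects the argument type."""
    try:
        return libc.abs(value)
    except ctypes.ArgumentError:
        return None


print(checked_abs(-5))      # an int converts cleanly
print(checked_abs("five"))  # a str is rejected at call time
```

The check only exists because argtypes was set by hand; forgetting that step is the kind of mistake a typed, compiled wrapper is meant to rule out.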

Status

Development is done only on macOS to keep things simple, with intermittent testing to ensure it works on Linux.

The following table provides an overview of the current wrapping/dev status:

status                          xllamacpp
wrapper-type                    cython
wrap llama.h + other headers    yes
wrap high-level simple-cli      yes
wrap low-level simple-cli       yes
wrap low-level llama-cli        WIP

The initial milestone entailed creating a high-level wrapper of the simple.cpp llama.cpp example, followed by a low-level one. The next objective, which is ongoing, is to fully wrap the functionality of llama-cli (see: xllamacpp.__init__.py).

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of the original c++ library as well as python, the approach to wrapping intentionally adopts the following guidelines:

  • In general, key structs are implemented as cython extension classes, with related functions implemented as methods of said classes.

  • Be as consistent as possible with llama.cpp's naming of its api elements, except when it makes sense to shorten function names which are used as methods.

  • Minimize non-wrapper python code.
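The guidelines above can be sketched in plain Python. This is not xllamacpp code: the counter_* functions stand in for a hypothetical C-style api (free functions that take the struct as their first argument), and Counter shows how such a struct might become a class whose methods keep the original names minus the prefix.

```python
# Hypothetical C-style api: free functions operating on a "struct".
def counter_new(label):
    return {"label": label, "items": []}


def counter_add_item(counter, item):
    counter["items"].append(item)


def counter_n_items(counter):
    return len(counter["items"])


class Counter:
    """Thin wrapper: one struct per instance, one method per function."""

    def __init__(self, label):
        # In a cython extension class this would be a pointer to the struct.
        self._ptr = counter_new(label)

    def add_item(self, item):
        # counter_add_item(counter, item) becomes counter.add_item(item)
        counter_add_item(self._ptr, item)

    def n_items(self):
        # counter_n_items(counter) becomes counter.n_items()
        return counter_n_items(self._ptr)


c = Counter("demo")
c.add_item("a")
c.add_item("b")
print(c.n_items())
```

Each method is a one-line delegation, which is what keeps the non-wrapper python code to a minimum.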

Setup

To build xllamacpp:

  1. Install a recent version of python3 (tested on python 3.12).

  2. Git clone the latest version of xllamacpp and initialize its submodules:

git clone https://github.com/shakfu/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update

  3. Install the dependencies (cython and setuptools, plus pytest for testing):

pip install -r requirements.txt

  4. Type make in the terminal.

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned xllamacpp folder
  3. Build xllamacpp
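After make finishes, a quick way to confirm the second step is to check that the three install folders exist. The snippet below is only a sketch: the folder names come from the list above, while the helper name missing_build_dirs is an assumption made for illustration and is not part of xllamacpp.

```python
from pathlib import Path

# Folders that the build is described as installing llama.cpp into.
EXPECTED_DIRS = ("bin", "include", "lib")


def missing_build_dirs(root="."):
    """Return the expected folders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_build_dirs()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("build layout looks complete")
```

Run it from the cloned xllamacpp folder; an empty result only means the folders exist, not that the build itself succeeded.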

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in .gguf format from huggingface. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. To create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
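For systems without wget, the same steps can be approximated in Python. This is a sketch rather than what make download actually runs: the URL and folder name are taken from the commands above, while ensure_model and its fetch parameter are names invented here (fetch exists so the download can be swapped out, for example in tests).

```python
from pathlib import Path
from urllib.request import urlretrieve

MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def ensure_model(models_dir="models", url=MODEL_URL, fetch=urlretrieve):
    """Create models_dir if needed and download the model once."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    # The local file name is the last path component of the URL.
    target = models_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        fetch(url, str(target))
    return target
```

Calling ensure_model() from the cloned xllamacpp folder downloads the file on the first call and returns the existing path on later calls.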

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"
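If you would rather drive the same check from Python, the command can be assembled as an argument list and passed to subprocess. The flags and paths below are copied from the command above; build_cli_command and run_cli are hypothetical helper names used only for this sketch.

```python
import subprocess


def build_cli_command(
    prompt,
    model="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    ctx_size=512,
    n_predict=32,
    cli="bin/llama-cli",
):
    """Return the llama-cli invocation as a list, avoiding shell quoting."""
    return [
        cli,
        "-c", str(ctx_size),
        "-n", str(n_predict),
        "-m", str(model),
        "-p", prompt,
    ]


def run_cli(prompt, **kwargs):
    """Run llama-cli and return its standard output as text."""
    cmd = build_cli_command(prompt, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


print(build_cli_command("Is mathematics discovered or invented?"))
```

Passing a list instead of a single string means the prompt needs no extra escaping, even when it contains quotes.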

You can also run the test suite with pytest by typing pytest or:

make test

If all tests pass, you can type python3 -i scripts/start.py or ipython -i scripts/start.py and explore the xllamacpp library with a pre-configured repl:

>>> from xllamacpp import Llama
>>> llm = Llama(model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
>>> llm.ask("what is the age of the universe?")
'estimated age of the universe\nThe estimated age of the universe is around 13.8 billion years'
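Outside the repl, the same calls can be wrapped in a small function that fails early when the model file is missing. This is a sketch under assumptions: Llama(model_path=...) and ask(...) are used exactly as in the session above, while the ask_model helper and its error message are made up for illustration.

```python
from pathlib import Path

DEFAULT_MODEL = Path("models/Llama-3.2-1B-Instruct-Q8_0.gguf")


def ask_model(question, model_path=DEFAULT_MODEL):
    """Check that the model file exists, then ask a single question."""
    model_path = Path(model_path)
    if not model_path.is_file():
        raise FileNotFoundError(
            f"model not found: {model_path} (try `make download`)"
        )
    # Imported lazily so the path check works before xllamacpp is built.
    from xllamacpp import Llama

    llm = Llama(model_path=str(model_path))
    return llm.ask(question)
```

A call such as ask_model("what is the age of the universe?") loads the model on every call, so for repeated questions it is better to keep a single Llama instance around, as the repl session does.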

TODO

Built Distributions

  • xllamacpp_cuda12x-0.1.5-cp312-cp312-win_amd64.whl (22.0 MB): CPython 3.12, Windows x86-64

  • xllamacpp_cuda12x-0.1.5-cp312-cp312-manylinux_2_35_x86_64.whl (22.4 MB): CPython 3.12, manylinux (glibc 2.35+), x86-64

  • xllamacpp_cuda12x-0.1.5-cp311-cp311-win_amd64.whl (22.0 MB): CPython 3.11, Windows x86-64

  • xllamacpp_cuda12x-0.1.5-cp311-cp311-manylinux_2_35_x86_64.whl (22.4 MB): CPython 3.11, manylinux (glibc 2.35+), x86-64

  • xllamacpp_cuda12x-0.1.5-cp310-cp310-win_amd64.whl (22.0 MB): CPython 3.10, Windows x86-64

  • xllamacpp_cuda12x-0.1.5-cp310-cp310-manylinux_2_35_x86_64.whl (22.4 MB): CPython 3.10, manylinux (glibc 2.35+), x86-64

  • xllamacpp_cuda12x-0.1.5-cp39-cp39-win_amd64.whl (22.0 MB): CPython 3.9, Windows x86-64

  • xllamacpp_cuda12x-0.1.5-cp39-cp39-manylinux_2_35_x86_64.whl (22.4 MB): CPython 3.9, manylinux (glibc 2.35+), x86-64
