xllamacpp - a Python wrapper of llama.cpp

This project forks from cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. It was spun off from my earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using cython, pybind11, and nanobind. Further development of xllamacpp, the cython wrapper from llamalib, will continue in this project.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp (last stable build with llama.cpp b4381)

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp

  • Learn about the internals of this popular C++/C LLM inference engine along the way. For me at least, this is definitely the most efficient way to learn about the underlying technologies.

Given that there is a fairly mature, well-maintained and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is GPU-driven rather than CPU-driven, this may all seem quite redundant. Nonetheless, we anticipate some benefits to using a compiled cython-based wrapper instead of ctypes:

  • Cython functions and extension classes can enforce strong type checking.

  • Packaging benefits with respect to self-contained statically compiled extension modules, which include simpler compilation and reduced package size.

  • There may be some performance improvements in the use of compiled wrappers over the use of ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It may be useful in case one wants to de-couple the python frontend and wrapper backends to existing frameworks: for example, to just replace the ctypes wrapper part in llama-cpp-python with compiled cython wrappers and contribute it back as a PR.
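For comparison, here is a rough sketch of what the ctypes approach looks like in plain Python. It wraps the C library's strlen rather than anything from llama.cpp, and it assumes a POSIX system where the C library can be loaded this way. The helper names c_strlen and rejects_str are made up for illustration. The point is that argument types are declared by hand at runtime and a mismatch only surfaces when the call is made, whereas a Cython wrapper carries the types in its compiled signature.

```python
import ctypes
import ctypes.util

# Assumption: POSIX. If find_library returns None, CDLL(None) falls back
# to the symbols already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# The wrapper author has to declare the C signature by hand.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data):
    """Return the length of a bytes object as reported by C strlen."""
    return libc.strlen(data)


def rejects_str():
    """Return True if passing str instead of bytes is rejected at call time."""
    try:
        libc.strlen("hello")
    except ctypes.ArgumentError:
        return True
    return False
```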

Status

Development is done only on macOS to keep things simple, with intermittent testing to ensure it works on Linux.

The following table provides an overview of the current wrapping/dev status:

status                         xllamacpp
wrapper-type                   cython
wrap llama.h + other headers   yes
wrap high-level simple-cli     yes
wrap low-level simple-cli      yes
wrap low-level llama-cli       WIP

The initial milestone entailed creating a high-level wrapper of the simple.cpp llama.cpp example, followed by a low-level one. The next objective is to fully wrap the functionality of llama-cli which is ongoing (see: xllamacpp.__init__.py).

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of the original C++ library as well as Python, the approach to wrapping intentionally adopts the following guidelines:

  • In general, key structs are implemented as cython extension classes, with related functions implemented as methods of said classes.

  • Be as consistent as possible with llama.cpp's naming of its API elements, except when it makes sense to shorten function names which are used as methods.

  • Minimize non-wrapper python code.
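The first two guidelines can be sketched in plain Python. Everything below is hypothetical: the widget_* functions stand in for a C-style API and are not llama.cpp functions, and the real wrappers are cython extension classes rather than Python classes. The sketch only shows the naming pattern, where a struct becomes a class and a prefixed function becomes a shorter method.

```python
# Hypothetical stand-ins for a C-style API; these are NOT llama.cpp functions.
def widget_init(size):
    return {"size": size, "open": True}


def widget_get_size(handle):
    return handle["size"]


def widget_free(handle):
    handle["open"] = False


class Widget:
    """Wraps the 'widget' struct; related functions become methods."""

    def __init__(self, size):
        self._handle = widget_init(size)

    def get_size(self):
        # widget_get_size(w) becomes w.get_size(): the struct prefix is dropped.
        return widget_get_size(self._handle)

    def free(self):
        # widget_free(w) becomes w.free().
        widget_free(self._handle)

    @property
    def is_open(self):
        return self._handle["open"]
```

The class holds no logic of its own beyond forwarding calls, which is in keeping with minimizing non-wrapper code.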

Setup

To build xllamacpp:

  1. Make sure you have a recent version of python3 (tested on Python 3.12).

  2. Git clone the latest version of xllamacpp:

git clone https://github.com/shakfu/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update

  3. Install the dependencies (cython and setuptools, plus pytest for testing):

pip install -r requirements.txt

  4. Type make in the terminal.

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned xllamacpp folder
  3. Build xllamacpp
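If you want a quick sanity check after make finishes, something along these lines may help. It is only a sketch: the function name missing_build_dirs is made up here, and it checks nothing more than the presence of the bin, include, and lib folders mentioned above.

```python
from pathlib import Path

# Directory names taken from the build steps described above.
EXPECTED_DIRS = ("bin", "include", "lib")


def missing_build_dirs(root="."):
    """Return the expected build directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling missing_build_dirs() from the cloned xllamacpp folder should return an empty list once the build has installed everything.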

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in .gguf format from Hugging Face. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. To create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
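If wget is not available, the same steps can be approximated from Python with the standard library. This is a sketch, not part of the project: ensure_model and its fetch parameter are names invented for this example, and the URL is the one shown above. The download is skipped when the file is already present.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL copied from the wget command above.
MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def ensure_model(models_dir="models", url=MODEL_URL, fetch=urlretrieve):
    """Create models_dir if needed and download the model if it is missing."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    # The local file name is the last component of the URL.
    target = models_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        fetch(url, str(target))
    return target
```

The fetch argument exists only so the download step can be swapped out, for example in a test.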

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"
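The same invocation can be scripted from Python, which may be handy for quick smoke tests. The sketch below assumes the binary was installed to bin/llama-cli as described in Setup. The function names are made up for this example, and only the flags from the command above (-c, -n, -m, -p) are used.

```python
import subprocess


def llama_cli_command(model, prompt, ctx_size=512, n_predict=32,
                      binary="bin/llama-cli"):
    """Build the argument list for the llama-cli call shown above."""
    return [
        binary,
        "-c", str(ctx_size),
        "-n", str(n_predict),
        "-m", str(model),
        "-p", prompt,
    ]


def run_llama_cli(model, prompt, **kwargs):
    """Run llama-cli and return its standard output as text."""
    cmd = llama_cli_command(model, prompt, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.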

You can also run the test suite with pytest by typing pytest or:

make test

If all tests pass, you can type python3 -i scripts/start.py or ipython -i scripts/start.py and explore the xllamacpp library with a pre-configured REPL:

>>> from xllamacpp import Llama
>>> llm = Llama(model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
>>> llm.ask("what is the age of the universe?")
'estimated age of the universe\nThe estimated age of the universe is around 13.8 billion years'
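Outside the REPL, the same calls can be wrapped in a small helper. This is a sketch that relies only on the Llama(model_path=...) constructor and the ask method shown above. The helper name ask and the up-front file check are additions for illustration, not part of xllamacpp.

```python
from pathlib import Path

# Model path as used in the session above.
DEFAULT_MODEL = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"


def ask(prompt, model_path=DEFAULT_MODEL):
    """Load the model and return the answer to a single prompt."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"model not found: {path} (try 'make download')"
        )
    # Imported lazily so the path check runs even if the extension
    # module has not been built yet.
    from xllamacpp import Llama

    llm = Llama(model_path=str(path))
    return llm.ask(prompt)
```

Checking for the file first gives a clearer error than whatever the underlying loader would report for a missing model.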

TODO
