A Python wrapper of llama.cpp

Project description

xllamacpp - a Python wrapper of llama.cpp


This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine.

Comparison with llama-cpp-python

The following table provides an overview of the current implementations / features:

implementations / features   xllamacpp             llama-cpp-python
Wrapper type                 Cython                ctypes
API                          Server & Params API   Llama API
Server implementation        C++                   Python through the wrapped Llama API
Continuous batching          yes                   no
Thread safe                  yes                   no
Release package              prebuilt              built during installation

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!
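Because the Server is thread safe and supports continuous batching, requests can be issued concurrently from multiple threads. Below is a minimal sketch of that pattern; it uses a stand-in function in place of a real server.handle_embeddings so the snippet is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for server.handle_embeddings; with xllamacpp installed you would
# pass the bound method of a real Server instance instead.
def handle_embeddings(payload):
    return {"n_inputs": len(payload["input"])}

batches = [
    {"input": ["I believe the meaning of life is", "This is a test"]},
    {"input": ["Another batch"]},
]

# The Server is thread safe, so concurrent calls from worker threads are fine.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(handle_embeddings, batches))

print(results)  # [{'n_inputs': 2}, {'n_inputs': 1}]
```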

Wrapping Guidelines

As the intent is to provide a very thin wrapping layer and play to the strengths of both the original C++ library and Python, the wrapping approach intentionally adopts the following guidelines:

  • In general, key structs are implemented as Cython extension classes, with related functions implemented as methods of those classes.

  • Be as consistent as possible with llama.cpp's naming of its API elements, except when it makes sense to shorten function names that are used as methods.

  • Minimize non-wrapper Python code.

Usage

Here is a simple example of how to use xllamacpp to get embeddings for a list of texts. For this example, you'll need an embedding model like Qwen3-Embedding-0.6B-Q8_0.gguf.

import xllamacpp as xlc

params = xlc.CommonParams()

params.model.path = "Qwen3-Embedding-0.6B-Q8_0.gguf"
params.embedding = True
params.pooling_type = xlc.llama_pooling_type.LLAMA_POOLING_TYPE_LAST

server = xlc.Server(params)

embedding_input = {
    "input": [
        "I believe the meaning of life is",
        "This is a test",
    ],
    "model": "My Qwen3 Model",
}

result = server.handle_embeddings(embedding_input)

print(result)

Output:

{'data': [{'embedding': [-0.006413215305656195,
                         -0.05906733125448227,
                         ...
                         -0.05887744203209877],
           'index': 0,
           'object': 'embedding'},
          {'embedding': [0.041170503944158554,
                         -0.004472420550882816,
                         ...
                         0.008314250037074089],
           'index': 1,
           'object': 'embedding'}],
 'model': 'My Qwen3 Model',
 'object': 'list',
 'usage': {'prompt_tokens': 11, 'total_tokens': 11}}
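Each entry in result['data'] carries an embedding vector, and a common next step is to compare two of them with cosine similarity. Here is a pure-Python helper, shown on toy vectors so it runs standalone; in practice you would pass the vectors taken from result['data']:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In practice: vecs = [item["embedding"] for item in result["data"]]
v0 = [1.0, 0.0, 1.0]
v1 = [1.0, 1.0, 0.0]
print(round(cosine_similarity(v0, v1), 3))  # 0.5
```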

OpenAI API Compatible HTTP Server

The server provides OpenAI API compatible endpoints. For a complete list of available API endpoints, see the llama.cpp server documentation. You can use the OpenAI Python client:

import xllamacpp as xlc
from openai import OpenAI

# Start server
params = xlc.CommonParams()
params.model.path = "Llama-3.2-1B-Instruct-Q8_0.gguf"
server = xlc.Server(params)

# Connect using OpenAI client
client = OpenAI(
    base_url=server.listening_address + "/v1",
    api_key="not-required"  # No API key needed for local server
)

# Make chat completion request
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=10
)

print(response.choices[0].message.content)

Prerequisites for Prebuilt Wheels

Before pip installing xllamacpp, please ensure your system meets the following requirements based on your build type:

  • CPU (aarch64):

    • Requires ARMv8-A or later architecture
    • For best performance, build from source if your CPU supports advanced instruction sets
  • CUDA (Linux):

    • Requires glibc 2.35 or later
    • Compatible NVIDIA GPU with appropriate drivers (CUDA 12.4 or 12.8)
  • ROCm (Linux):

    • Requires glibc 2.35 or later
    • Requires gcc 10 or later (ROCm libraries have this dependency)
    • Compatible AMD GPU with ROCm support (ROCm 6.3.4 or 6.4.1)
  • Vulkan (Linux/Windows, Intel/AMD/NVIDIA where supported):

    • Install the Vulkan SDK and GPU drivers with Vulkan support
    • Linux users may need distro packages and the LunarG SDK
    • macOS Intel is supported via Vulkan; Apple Silicon Vulkan is not supported in this project
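On Linux, the glibc requirement for the CUDA and ROCm wheels can be checked from Python's standard library (on non-Linux platforms platform.libc_ver() returns empty strings):

```python
import platform

# Returns a (library, version) tuple, e.g. ('glibc', '2.35') on Linux.
# The CUDA and ROCm wheels require glibc 2.35 or later.
lib, version = platform.libc_ver()
print(lib, version)
```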

Install

Note on Performance and Compatibility

For maximum performance, you can build xllamacpp from source to optimize for your specific native CPU architecture. The pre-built wheels are designed for broad compatibility.

Specifically, the aarch64 wheels are built for the armv8-a architecture. This ensures they run on a wide range of ARM64 devices, but it means that more advanced CPU instruction sets (like SVE) are not enabled. If your CPU supports these advanced features, building from source will provide better performance.
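To check which architecture you are on, and hence whether the generic armv8-a aarch64 wheel is the one you would receive, the standard library suffices:

```python
import platform

# 'aarch64' (or 'arm64' on macOS) means the generic armv8-a wheel applies;
# building from source can then enable newer instruction sets such as SVE.
print(platform.machine())
```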

  • From PyPI for CPU or Mac:
pip install -U xllamacpp
  • From the GitHub-hosted package index for CUDA (use --force-reinstall to replace the installed CPU version):

    • CUDA 12.4

      pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
      
    • CUDA 12.8

      pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu128
      
  • From the GitHub-hosted package index for HIP AMD GPUs (use --force-reinstall to replace the installed CPU version):

    • ROCm 6.3.4

      pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.3.4
      
    • ROCm 6.4.1

      pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.4.1
      
  • From the GitHub-hosted package index for Vulkan (use --force-reinstall to replace the installed CPU version):

    pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/vulkan
    

Build from Source

(Optional) Preparation

Build xllamacpp

  1. Ensure a recent version of Python 3 (tested on Python 3.12)

  2. Install Rust toolchain (required for building):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

For more installation options, see the rustup installation guide.

  3. Git clone the latest version of xllamacpp:
git clone git@github.com:xorbitsai/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update
  4. Install the dependencies (cython, setuptools, and pytest for testing):
pip install -r requirements.txt
  5. Select a backend via environment variables and build. Examples:

    • CPU (default):

      make
      
    • CUDA:

      export XLLAMACPP_BUILD_CUDA=1
      make
      
    • HIP (AMD):

      export XLLAMACPP_BUILD_HIP=1
      make
      
    • Vulkan:

      export XLLAMACPP_BUILD_VULKAN=1
      make
      
    • Enable BLAS (optional):

      export CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
      make
      

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a small LLM in .gguf format from Hugging Face. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. To create the models directory if it doesn't exist and download this model, you can just type:

make download

This basically just does:

cd xllamacpp
mkdir models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf 

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

You can also run the test suite with pytest by typing pytest or:

make test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xllamacpp-0.2.14.tar.gz (29.5 MB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

xllamacpp-0.2.14-cp310-abi3-win_amd64.whl (6.0 MB)

Uploaded: CPython 3.10+, Windows x86-64

xllamacpp-0.2.14-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (28.2 MB)

Uploaded: CPython 3.10+, manylinux: glibc 2.17+, x86-64

xllamacpp-0.2.14-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl (27.4 MB)

Uploaded: CPython 3.10+, manylinux: glibc 2.17+, ARM64

xllamacpp-0.2.14-cp310-abi3-macosx_11_0_arm64.whl (8.5 MB)

Uploaded: CPython 3.10+, macOS 11.0+, ARM64

xllamacpp-0.2.14-cp310-abi3-macosx_10_9_x86_64.whl (8.6 MB)

Uploaded: CPython 3.10+, macOS 10.9+, x86-64

File details

Details for the file xllamacpp-0.2.14.tar.gz.

File metadata

  • Download URL: xllamacpp-0.2.14.tar.gz
  • Upload date:
  • Size: 29.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-0.2.14.tar.gz
Algorithm Hash digest
SHA256 b3816a7729af5817117c30c47f406ec28f30c82b12553135e02585199d4501d9
MD5 2667b2acef966183955f42db1cb31292
BLAKE2b-256 f338a098f3b23f2d09b75ea0505a3ef7f954c36f8ad01a939d07b4064ae27134

See more details on using hashes here.

File details

Details for the file xllamacpp-0.2.14-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: xllamacpp-0.2.14-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 6.0 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-0.2.14-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 bd6a57bf5fb144ece9b79deac0041da9aeac31ffa3e37830a8bc4af0d1a184be
MD5 2ec60e628569a3bdc43e1b13010c992a
BLAKE2b-256 570421a26df81deebd4a7f3f89bce914f98038292360df85bdb50f7149d8ee08

File details

Details for the file xllamacpp-0.2.14-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for xllamacpp-0.2.14-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 3584723d3c026c7233512e629f467194245490bde79e64eb15ce22a02adc8654
MD5 a6c1f2fe9616a84d75f90b4e5b72db64
BLAKE2b-256 d8c5e524cb68f1756d183b2085b9951a5c69f3d9a82f7fb4932d322cd7cda46e

File details

Details for the file xllamacpp-0.2.14-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl.

File metadata

File hashes

Hashes for xllamacpp-0.2.14-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl
Algorithm Hash digest
SHA256 16d7916ac48ac05b38381499a8d7fe4188a2a4422b49ef7234e319d034e8bdb6
MD5 4a6f5c2faaf661301277f5c04cb6df73
BLAKE2b-256 7bbd55ff60e8a91c4aa4dda19725a197359e4f278efc5adc4edd890fc5390913

File details

Details for the file xllamacpp-0.2.14-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for xllamacpp-0.2.14-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 39ed0846bb99fd6c023fa207486749b7428ebcc7b53b28cf9ad93767d4e3dff5
MD5 972e8b6ea89cff88f37e551a6040b39c
BLAKE2b-256 2182b2d1e60dbaf29fe02eb6a4165b47914cadca4f925da39825e82878078f92

File details

Details for the file xllamacpp-0.2.14-cp310-abi3-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for xllamacpp-0.2.14-cp310-abi3-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ce5d2a433d94027468ca8ca8930ad76fe302406dc4bf95e34ef0dc07f7e144ef
MD5 4b703383d5ad6eda87515f4ffd9352bc
BLAKE2b-256 948c9fd46b447c057afc4b915789b35ae248c29151678a868aa4b479d1651f86
