A Python wrapper of llama.cpp

xllamacpp - a Python wrapper of llama.cpp

This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine.

Comparison with llama-cpp-python

The following table provides an overview of the current implementations and features:

Implementation / feature	xllamacpp	llama-cpp-python
Wrapper type	Cython	ctypes
API	Server & Params API	Llama API
Server implementation	C++	Python, through the wrapped Llama API
Continuous batching	yes	no
Thread safe	yes	no
Release package	prebuilt	built during installation

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Wrapping Guidelines

The intent is to provide a very thin wrapping layer that plays to the strengths of both the original C++ library and Python. To that end, the wrapping follows these guidelines:

- In general, key structs are implemented as Cython extension classes, with related functions implemented as methods of those classes.
- Be as consistent as possible with llama.cpp's naming of its API elements, except when it makes sense to shorten the names of functions that are used as methods.
- Minimize non-wrapper Python code.

Usage

Here is a simple example of how to use xllamacpp to get embeddings for a list of texts. For this example, you'll need an embedding model like Qwen3-Embedding-0.6B-Q8_0.gguf.

import xllamacpp as xlc

params = xlc.CommonParams()

params.model.path = "Qwen3-Embedding-0.6B-Q8_0.gguf"
params.embedding = True
params.pooling_type = xlc.llama_pooling_type.LLAMA_POOLING_TYPE_LAST

server = xlc.Server(params)

embedding_input = {
    "input": [
        "I believe the meaning of life is",
        "This is a test",
    ],
    "model": "My Qwen3 Model",
}

result = server.handle_embeddings(embedding_input)

print(result)

Output:

{'data': [{'embedding': [-0.006413215305656195,
                         -0.05906733125448227,
                         ...
                         -0.05887744203209877],
           'index': 0,
           'object': 'embedding'},
          {'embedding': [0.041170503944158554,
                         -0.004472420550882816,
                         ...
                         0.008314250037074089],
           'index': 1,
           'object': 'embedding'}],
 'model': 'My Qwen3 Model',
 'object': 'list',
 'usage': {'prompt_tokens': 11, 'total_tokens': 11}}
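
The returned vectors can be compared directly. The following is a minimal sketch, assuming the response has the shape shown above (a `data` list whose items carry an `embedding` list and an `index`). The helpers `cosine_similarity` and `extract_embeddings`, and the tiny `sample` dict, are illustrative and not part of xllamacpp.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def extract_embeddings(result):
    """Pull the vectors out of an embeddings response, ordered by index."""
    items = sorted(result["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Hypothetical stand-in for the real `result`; real vectors are much longer.
sample = {
    "data": [
        {"embedding": [1.0, 0.0], "index": 0, "object": "embedding"},
        {"embedding": [1.0, 1.0], "index": 1, "object": "embedding"},
    ]
}

first, second = extract_embeddings(sample)
similarity = cosine_similarity(first, second)
print(round(similarity, 4))
```

With a real response, pass `result` from `server.handle_embeddings(...)` to `extract_embeddings` in place of `sample`.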

OpenAI API Compatible HTTP Server

The server provides OpenAI API compatible endpoints. For a complete list of available API endpoints, see the llama.cpp server documentation. You can use the OpenAI Python client:

import xllamacpp as xlc
from openai import OpenAI

# Start server
params = xlc.CommonParams()
params.model.path = "Llama-3.2-1B-Instruct-Q8_0.gguf"
server = xlc.Server(params)

# Connect using OpenAI client
client = OpenAI(
    base_url=server.listening_address + "/v1",
    api_key="not-required"  # No API key needed for local server
)

# Make chat completion request
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=10
)

print(response.choices[0].message.content)
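
If the `openai` package is not available, the same request can be made with the standard library alone. This is a sketch under assumptions: it targets `/v1/chat/completions`, the path the OpenAI client uses for the call above, and the address `http://127.0.0.1:8080` is only a placeholder for `server.listening_address`. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=10, model="local-model"):
    """Build (but do not send) a POST request for a chat completion."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Placeholder address for illustration; in practice use server.listening_address.
request = build_chat_request(
    "http://127.0.0.1:8080",
    [{"role": "user", "content": "What is the capital of France?"}],
)
print(request.full_url)
```

Calling `send_chat_request(request)` needs a running server, so it is left out of the snippet.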

Prerequisites for Prebuilt Wheels

Before pip installing xllamacpp, please ensure your system meets the following requirements based on your build type:

CPU (aarch64):
- Requires ARMv8-A or later architecture
- For best performance, build from source if your CPU supports advanced instruction sets
CUDA (Linux):
- Requires glibc 2.35 or later
- Compatible NVIDIA GPU with appropriate drivers (CUDA 12.4 or 12.8)
ROCm (Linux):
- Requires glibc 2.35 or later
- Requires gcc 10 or later (ROCm libraries have this dependency)
- Compatible AMD GPU with ROCm support (ROCm 6.3.4 or 6.4.1)
Vulkan (Linux/Windows, Intel/AMD/NVIDIA where supported):
- Install the Vulkan SDK and GPU drivers with Vulkan support
- Linux users may need distro packages and the LunarG SDK
- macOS Intel is supported via Vulkan; Apple Silicon Vulkan is not supported in this project
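
The glibc requirement above can be checked from Python before installing a CUDA or ROCm wheel. This is a rough sketch that relies only on the standard `platform` module. The helper names are hypothetical, and the check does not cover drivers, gcc, or GPU support.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.35' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(minimum, libc=None):
    """Return True if the detected C library is glibc >= minimum.

    `libc` is a (name, version) pair; by default it comes from
    platform.libc_ver(), which yields empty strings where glibc is absent.
    """
    name, version = libc if libc is not None else platform.libc_ver()
    if name != "glibc" or not version:
        return False
    return parse_version(version) >= parse_version(minimum)


print("machine:", platform.machine())
print("glibc >= 2.35:", glibc_at_least("2.35"))
```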

Install

Note on Performance and Compatibility

For maximum performance, you can build xllamacpp from source to optimize for your specific native CPU architecture. The pre-built wheels are designed for broad compatibility.

Specifically, the aarch64 wheels are built for the armv8-a architecture. This ensures they run on a wide range of ARM64 devices, but it means that more advanced CPU instruction sets (like SVE) are not enabled. If your CPU supports these advanced features, building from source will provide better performance.

From PyPI for CPU or Mac:

pip install -U xllamacpp

From the GitHub-hosted package index for CUDA (use --force-reinstall to replace the installed CPU version):

CUDA 12.4

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124

CUDA 12.8

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu128

From the GitHub-hosted package index for HIP on AMD GPUs (use --force-reinstall to replace the installed CPU version):

ROCm 6.3.4

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.3.4

ROCm 6.4.1

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.4.1

From the GitHub-hosted package index for Vulkan (use --force-reinstall to replace the installed CPU version):

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/vulkan
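
For scripted setups, the commands above can be generated from a backend name. The sketch below only assembles the argument list; the helper `pip_install_command` is hypothetical, and the index names are copied from the commands in this section.

```python
import sys

BASE_INDEX = "https://xorbitsai.github.io/xllamacpp/whl"

# Index names taken from the install commands in this section.
GPU_INDEXES = {"cu124", "cu128", "rocm-6.3.4", "rocm-6.4.1", "vulkan"}


def pip_install_command(backend="cpu"):
    """Return the pip invocation for a backend as a list of arguments."""
    pip = [sys.executable, "-m", "pip", "install"]
    if backend == "cpu":
        return pip + ["-U", "xllamacpp"]
    if backend not in GPU_INDEXES:
        raise ValueError(f"unknown backend: {backend!r}")
    return pip + [
        "xllamacpp",
        "--force-reinstall",
        "--index-url",
        f"{BASE_INDEX}/{backend}",
    ]


print(" ".join(pip_install_command("cu128")[1:]))
```

The list can be passed to `subprocess.run(..., check=True)` to perform the installation.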

Build from Source

(Optional) Preparation

CUDA

This provides GPU acceleration using an NVIDIA GPU. Make sure to have the CUDA toolkit installed.

Download directly from NVIDIA

You may find the official downloads here: NVIDIA developer site.

Compile and run inside a Fedora Toolbox Container

We also have a guide for setting up CUDA toolkit in a Fedora toolbox container.

Recommended for:
- Necessary for users of Fedora Atomic Desktops, such as Silverblue and Kinoite.
  - (there are no supported CUDA packages for these systems)
- Necessary for users whose host is not a supported NVIDIA CUDA release platform.
  - (for example, you may have Fedora 42 Beta as your host operating system)
- Convenient for those running Fedora Workstation or Fedora KDE Plasma Desktop who want to keep their host system clean.
- Optionally, toolbox packages are available for Arch Linux, Red Hat Enterprise Linux >= 8.5, or Ubuntu.

HIP

This provides GPU acceleration on HIP-supported AMD GPUs. Make sure to have ROCm installed. You can download it from your Linux distro's package manager or from here: ROCm Quick Start (Linux).

Alternatively, you can build inside the ROCm Docker container.

Vulkan

Install the Vulkan SDK and drivers for your platform.
- Linux: use your distro packages and/or the LunarG Vulkan SDK.
- Windows: install LunarG Vulkan SDK and vendor GPU drivers.
- macOS: Intel only; Apple Silicon is not supported for Vulkan in this project.

Build `xllamacpp`

A recent version of Python 3 is required (tested on Python 3.12).

Install the Rust toolchain (required for building):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

For more installation options, see the rustup installation guide.

Git clone the latest version of xllamacpp:

git clone git@github.com:xorbitsai/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update

Install the build and test dependencies (cython, setuptools, and pytest):

pip install -r requirements.txt

Select a backend via environment variables, then build. Examples:

CPU (default):
```
make
```
CUDA:
```
export XLLAMACPP_BUILD_CUDA=1
make
```
HIP (AMD):
```
export XLLAMACPP_BUILD_HIP=1
make
```
Vulkan:
```
export XLLAMACPP_BUILD_VULKAN=1
make
```

Enable BLAS (optional):

export CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
make
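
The same backend selection can be driven from a Python script, for example in CI. This is a minimal sketch, assuming the environment variable names and CMake arguments shown above; `build_env` and `run_build` are hypothetical helpers, not part of the project.

```python
import os
import subprocess

# Environment variables taken from the build examples above.
BACKEND_FLAGS = {
    "cpu": None,
    "cuda": "XLLAMACPP_BUILD_CUDA",
    "hip": "XLLAMACPP_BUILD_HIP",
    "vulkan": "XLLAMACPP_BUILD_VULKAN",
}


def build_env(backend="cpu", use_openblas=False, base_env=None):
    """Return a copy of the environment with the backend switches set."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    flag = BACKEND_FLAGS[backend]
    if flag is not None:
        env[flag] = "1"
    if use_openblas:
        # Note: this overwrites any CMAKE_ARGS already present.
        env["CMAKE_ARGS"] = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
    return env


def run_build(backend="cpu", use_openblas=False):
    """Run `make` in the current directory with the chosen backend."""
    subprocess.run(["make"], env=build_env(backend, use_openblas), check=True)


print(build_env("cuda", use_openblas=True, base_env={}))
```

`run_build("cuda")` would then be called from the root of the cloned repository.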

Testing

The tests directory in this repo provides extensive examples of using xllamacpp.

However, as a first step, you should download a smallish LLM in .gguf format from Hugging Face. A good model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. xllamacpp expects models to be stored in a models folder in the cloned xllamacpp directory. To create the models directory if it doesn't exist and download this model, run:

make download

This basically just does:

cd xllamacpp
mkdir -p models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
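
Where wget is not available, the download can be sketched in Python with the standard library. The URL is the one shown above; `model_path` and `download_model` are hypothetical helper names, and the snippet only prints the target path rather than starting the large download.

```python
import urllib.request
from pathlib import Path

MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/"
    "resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf"
)


def model_path(models_dir="models", url=MODEL_URL):
    """Return where the model file would be stored locally."""
    return Path(models_dir) / url.rsplit("/", 1)[-1]


def download_model(models_dir="models", url=MODEL_URL):
    """Create the models directory if needed and fetch the file once."""
    target = model_path(models_dir, url)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target


print(model_path())
```

Run `download_model()` from the cloned xllamacpp directory to place the file where the tests expect it.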

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

You can also run the test suite with pytest by typing pytest or:

make test

Project details

Release history

2026.5.9093

May 10, 2026

2026.4.8929

Apr 27, 2026

2026.4.8672.1

Apr 9, 2026

This version

2026.4.8672

Apr 9, 2026

0.2.14

Mar 20, 2026

0.2.13

Mar 17, 2026

0.2.12

Feb 27, 2026

0.2.11

Feb 7, 2026

0.2.10

Jan 27, 2026

0.2.9

Jan 13, 2026

0.2.8

Dec 27, 2025

0.2.7

Dec 20, 2025

0.2.6

Nov 29, 2025

0.2.5

Nov 19, 2025

0.2.4

Nov 1, 2025

0.2.3

Oct 14, 2025

0.2.2

Sep 29, 2025

0.2.1

Sep 13, 2025

0.2.0

Aug 28, 2025

0.1.26

Aug 15, 2025

0.1.25

Aug 10, 2025

0.1.24

Jul 20, 2025

0.1.23

Jul 7, 2025

0.1.22

Jun 28, 2025

0.1.21

Jun 19, 2025

0.1.20

Jun 15, 2025

0.1.19

Jun 4, 2025

0.1.18

May 24, 2025

0.1.17

May 23, 2025

0.1.16

May 12, 2025

0.1.15

Apr 29, 2025

0.1.14

Apr 14, 2025

0.1.13

Apr 8, 2025

0.1.12

Apr 3, 2025

0.1.11

Mar 21, 2025

0.1.10

Mar 14, 2025

0.1.9

Mar 8, 2025

0.1.8

Mar 8, 2025

0.1.7

Mar 7, 2025

0.1.6

Mar 6, 2025

0.1.5

Mar 6, 2025

0.1.4

Mar 4, 2025

0.1.3

Mar 4, 2025

0.1.2

Mar 4, 2025

0.1.1

Mar 3, 2025

0.1.0

Mar 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xllamacpp-2026.4.8672.tar.gz (29.7 MB)

Uploaded Apr 9, 2026. Tags: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

xllamacpp-2026.4.8672-cp310-abi3-win_amd64.whl (4.6 MB)

Uploaded Apr 9, 2026. Tags: CPython 3.10+, Windows x86-64

xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (27.4 MB)

Uploaded Apr 9, 2026. Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64

xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl (26.6 MB)

Uploaded Apr 9, 2026. Tags: CPython 3.10+, manylinux: glibc 2.17+ ARM64

xllamacpp-2026.4.8672-cp310-abi3-macosx_11_0_arm64.whl (7.1 MB)

Uploaded Apr 9, 2026. Tags: CPython 3.10+, macOS 11.0+ ARM64

xllamacpp-2026.4.8672-cp310-abi3-macosx_10_9_x86_64.whl (7.2 MB)

Uploaded Apr 9, 2026. Tags: CPython 3.10+, macOS 10.9+ x86-64

File details

Details for the file xllamacpp-2026.4.8672.tar.gz.

File metadata

Download URL: xllamacpp-2026.4.8672.tar.gz
Upload date: Apr 9, 2026
Size: 29.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-2026.4.8672.tar.gz
Algorithm	Hash digest
SHA256	`aa412528d71ad73905b5111a26bec4c2d228c2e4d6220d20dff229934b2e93a3`
MD5	`f0e51a48ac46872c1f3206114fb9350a`
BLAKE2b-256	`98fb1125f8ff7e38f53a894f4409c5e7e844774c31a7c410a1fd481e318f4e6b`

See more details on using hashes here.
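
A downloaded file can be checked against the SHA256 digest listed above. This is a minimal sketch using only `hashlib`; the helper names are hypothetical, and the self-check at the end hashes a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Self-check on a temporary file, so the snippet runs without a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name
try:
    demo_ok = matches_sha256(tmp_path, hashlib.sha256(b"hello").hexdigest())
finally:
    os.remove(tmp_path)
print(demo_ok)
```

For the source distribution, the call would be `matches_sha256("xllamacpp-2026.4.8672.tar.gz", "aa412528d71ad73905b5111a26bec4c2d228c2e4d6220d20dff229934b2e93a3")`.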

File details

Details for the file xllamacpp-2026.4.8672-cp310-abi3-win_amd64.whl.

File metadata

Download URL: xllamacpp-2026.4.8672-cp310-abi3-win_amd64.whl
Upload date: Apr 9, 2026
Size: 4.6 MB
Tags: CPython 3.10+, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-2026.4.8672-cp310-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`e7304b9feb711c9cc90921df63d8b72011fc785afb8f0b36e94bd53942d16bcd`
MD5	`e4e6cfc183e7f0a85743a8a13ed897e0`
BLAKE2b-256	`f55fd54fb0de050507f00ec4c8d23424d68c1de246f8327b98ebee237328d388`

See more details on using hashes here.

File details

Details for the file xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

Download URL: xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Upload date: Apr 9, 2026
Size: 27.4 MB
Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm	Hash digest
SHA256	`1b1e4a3e7c61e6219fb072cf024551c35f80bcb4d6ab5f684b408a9f01b0d245`
MD5	`7198f1f97148f9dfe939ae9fa911ea8a`
BLAKE2b-256	`03d29424896ef79b75047950c491b0695dc1adaefa7e9951534ed2a847d3ce9c`

See more details on using hashes here.

File details

Details for the file xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl.

File metadata

Download URL: xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl
Upload date: Apr 9, 2026
Size: 26.6 MB
Tags: CPython 3.10+, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-2026.4.8672-cp310-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl
Algorithm	Hash digest
SHA256	`6bc76bef49bb5c1b5ec1f90038dcb35f177a26d6d5197379aee998f2b354bc3b`
MD5	`6950f8ac5b40d05799b3d6b75ab21893`
BLAKE2b-256	`1c9228e0a7917b6a9fbec711d11354fefff56d2b3191131ab2b23dca6495a32d`

See more details on using hashes here.

File details

Details for the file xllamacpp-2026.4.8672-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: xllamacpp-2026.4.8672-cp310-abi3-macosx_11_0_arm64.whl
Upload date: Apr 9, 2026
Size: 7.1 MB
Tags: CPython 3.10+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-2026.4.8672-cp310-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`05f821448e06500ac317135a352600ace904754a72029b4f3680f04fa4121b90`
MD5	`59f2f6655947578db41a9e0817218e0a`
BLAKE2b-256	`3f51011a4dbc36545d169f83f96891ca6ca84ce48fa2c2e283f3cef89999e8d0`

See more details on using hashes here.

File details

Details for the file xllamacpp-2026.4.8672-cp310-abi3-macosx_10_9_x86_64.whl.

File metadata

Download URL: xllamacpp-2026.4.8672-cp310-abi3-macosx_10_9_x86_64.whl
Upload date: Apr 9, 2026
Size: 7.2 MB
Tags: CPython 3.10+, macOS 10.9+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for xllamacpp-2026.4.8672-cp310-abi3-macosx_10_9_x86_64.whl
Algorithm	Hash digest
SHA256	`38e96526e7f919bf8cb4ee14d982911a6763faca0833dcc25dd7e6333030f7cb`
MD5	`69345803d1fd6e29decae449d8351e26`
BLAKE2b-256	`93c0a64c6a29fcce56a9f15ec95e0f37fba2c80ad3d245d0c668adcbc56c1e64`

See more details on using hashes here.
