Auto-synchronized Python bindings for llama.cpp using CFFI ABI mode

llama-cpp-py-sync

Auto-synchronized Python bindings for llama.cpp

Overview

llama-cpp-py-sync provides Python bindings for llama.cpp that are kept up to date automatically. It generates the bindings from upstream headers using CFFI ABI mode and ships prebuilt wheels.

Key Features

Automatic upstream sync and binding regeneration
Prebuilt wheels built by CI
CPU wheels published to PyPI
Backend-specific wheels published to GitHub Releases: Linux CUDA (12.2) and Vulkan, Windows CUDA (12.4) and Vulkan, macOS Apple Silicon Metal
CI checks that the generated CFFI surface matches the upstream C API (functions, structs, enums, and signatures)
A small, explicit Python API (Llama.generate, tokenize, get_embeddings, etc.)

What You Get (and What You Don’t)

This project binds to the public C API that llama.cpp exposes in llama.h.
It does not attempt to bind llama.cpp’s internal C++ implementation such as private headers, C++ classes/templates, or functions that never appear in llama.h.
We use CFFI ABI mode: Python loads a prebuilt shared library at runtime (no compiled Python extension module for the bindings).
Because of that, you still need a compatible llama.cpp shared library available, either bundled in the wheel or via LLAMA_CPP_LIB.
You get a small high-level API (llama_cpp_py_sync.Llama) for common tasks, and an “escape hatch” to call the low-level C functions directly via CFFI when needed.
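
As a rough illustration of pointing the bindings at an external library, the sketch below searches a directory for a llama.cpp shared library and exports its path through LLAMA_CPP_LIB. Only the LLAMA_CPP_LIB variable comes from this README; the helper name export_llama_lib and the candidate file names are assumptions made for the example.

```python
import os
from pathlib import Path
from typing import Optional

# Assumed file names; the real name depends on how llama.cpp was built.
CANDIDATE_NAMES = ("libllama.so", "libllama.dylib", "llama.dll")


def export_llama_lib(search_dir: str) -> Optional[str]:
    """Set LLAMA_CPP_LIB to the first candidate library found in search_dir.

    Returns the chosen path, or None if nothing matched. The environment
    is left untouched when nothing is found.
    """
    for name in CANDIDATE_NAMES:
        candidate = Path(search_dir) / name
        if candidate.is_file():
            os.environ["LLAMA_CPP_LIB"] = str(candidate)
            return str(candidate)
    return None
```

Calling such a helper before importing llama_cpp_py_sync is the safest ordering, on the assumption that the variable is read when the library is first loaded.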

High-level vs Low-level APIs

High-level API: llama_cpp_py_sync.Llama is the recommended entry point for typical usage such as generation, tokenization, and embeddings.

import llama_cpp_py_sync as llama

with llama.Llama("path/to/model.gguf", n_ctx=2048, n_gpu_layers=0) as llm:
    print(llm.generate("Hello", max_tokens=64))

Low-level API: llama_cpp_py_sync._cffi_bindings exposes CFFI access to the underlying llama.cpp C API for advanced use.

from llama_cpp_py_sync._cffi_bindings import get_ffi, get_lib

ffi = get_ffi()
lib = get_lib()

print(ffi.string(lib.llama_print_system_info()).decode("utf-8", errors="replace"))

Installation

This project supports Python 3.8 through 3.14. CI builds wheels with Python 3.13.13 for reproducibility; the published wheels are intended to work across supported Python versions.

From PyPI (Recommended)

pip install llama-cpp-py-sync

This installs the CPU wheel.

Note: depending on CI configuration and platform support, additional wheels may also be published to PyPI.

Quick Chat (Recommended)

After installing from PyPI, you can start an interactive chat session with:

python -m llama_cpp_py_sync chat

If you do not pass --model (and LLAMA_MODEL is not set), the CLI will prompt before downloading a default GGUF model and cache it locally for future runs.

To auto-download without prompting, pass --yes.
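
The lookup order described above (explicit --model, then LLAMA_MODEL, then a cached default) can be sketched as a small function. This is an illustration of the documented order only, not the CLI's actual code; resolve_model and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_model(
    cli_model: Optional[str],
    env: Mapping[str, str],
    cached_default: Path,
) -> Optional[Path]:
    """Return the model file to load, or None if a download would be needed."""
    if cli_model:                    # 1. an explicit --model value wins
        return Path(cli_model)
    if env.get("LLAMA_MODEL"):       # 2. then the LLAMA_MODEL variable
        return Path(env["LLAMA_MODEL"])
    if cached_default.is_file():     # 3. then a previously cached default model
        return cached_default
    return None                      # caller prompts (or honors --yes) and downloads
```

In real use the env argument would typically be os.environ.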

One-shot prompt:

python -m llama_cpp_py_sync chat --prompt "Say 'ok'." --max-tokens 32

Use a specific local model:

python -m llama_cpp_py_sync chat --model path/to/model.gguf

From GitHub Releases (Wheel)

Download the wheel for your platform/backend from GitHub Releases and install the .whl:

pip install path/to/llama_cpp_py_sync-*.whl

From Source

git clone https://github.com/FarisZahrani/llama-cpp-py-sync.git
cd llama-cpp-py-sync

# Sync upstream llama.cpp
python scripts/sync_upstream.py

# Regenerate CFFI bindings from the synced llama.cpp headers
# (Optional) record the exact llama.cpp commit SHA in the generated file.
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"

# Build the shared library
python scripts/build_llama_cpp.py

# Install the package
pip install -e .

vendor/llama.cpp is cloned locally by scripts/sync_upstream.py (and in CI during builds) and is not committed to this repository.

Quick Start

import llama_cpp_py_sync as llama

# Load a model
llm = llama.Llama("path/to/model.gguf", n_ctx=2048, n_gpu_layers=35)

# Generate text
response = llm.generate("Hello, world!", max_tokens=100)
print(response)

# Streaming generation
for token in llm.generate("Write a poem:", max_tokens=100, stream=True):
    print(token, end="", flush=True)

# Clean up
llm.close()
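
When streaming, it is often useful to keep the full text and a rough timing while still printing chunks as they arrive. The helper below is a sketch that works with any iterator of strings, which is what generate(..., stream=True) is documented to return; consume_stream is a hypothetical name, not part of the package.

```python
import time
from typing import Iterable, Tuple


def consume_stream(chunks: Iterable[str]) -> Tuple[str, int, float]:
    """Print chunks as they arrive; return (full_text, chunk_count, seconds)."""
    parts = []
    start = time.perf_counter()
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    elapsed = time.perf_counter() - start
    return "".join(parts), len(parts), elapsed
```

With a loaded model this could be called as consume_stream(llm.generate("Write a poem:", max_tokens=100, stream=True)). Whether each chunk is exactly one token is an assumption, so the count is only an approximate throughput measure.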

Using Context Manager

with llama.Llama("model.gguf", n_gpu_layers=35) as llm:
    print(llm.generate("Once upon a time"))

Embeddings

# Load an embedding model
with llama.Llama("embed-model.gguf", embedding=True) as llm:
    emb = llm.get_embeddings("Hello, world!")
    print(f"Embedding dimension: {len(emb)}")
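
Embedding vectors are usually compared with cosine similarity. The following is a minimal pure-Python sketch for two vectors such as those returned by get_embeddings; the function name cosine_similarity is an illustration and is not claimed to exist in the package.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

Values near 1.0 indicate vectors pointing in nearly the same direction; values near 0.0 indicate little similarity.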

Check Available Backends

from llama_cpp_py_sync import get_available_backends, get_backend_info

print(get_available_backends())  # ['cuda', 'blas'] or similar

info = get_backend_info()
print(f"CUDA available: {info.cuda}")
print(f"Metal available: {info.metal}")

Full API

import llama_cpp_py_sync as llama

# Versions
llama.__version__
llama.__llama_cpp_commit__

# Main class
llm = llama.Llama(
    model_path="path/to/model.gguf",
    n_ctx=512,
    n_batch=512,
    n_threads=None,
    n_gpu_layers=0,
    n_ubatch=None,
    n_threads_batch=None,
    seed=-1,
    use_mmap=True,
    use_mlock=False,
    verbose=False,
    embedding=False,
    flash_attn_type=None,
)

text = llm.generate(
    "Hello",
    max_tokens=256,
    temperature=0.8,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    repeat_penalty=1.1,
    repeat_last_n=64,
    stop_sequences=None,
    stream=False,
    seed=None,
)

stream = llm.generate(
    "Hello",
    max_tokens=256,
    stream=True,
)

tokens = llm.tokenize("Hello", add_special=True, parse_special=False)
text = llm.detokenize(tokens, remove_special=False, unparse_special=True)
piece = llm.token_to_piece(tokens[0])

llm.get_model_desc()
llm.get_model_size()
llm.get_model_n_params()

# Properties
llm.n_vocab
llm.n_ctx
llm.n_embd
llm.n_layer
llm.bos_token
llm.eos_token

# Embeddings (requires embedding=True)
emb = llm.get_embeddings("Hello")

llm.close()

# Module-level embeddings helpers
llama.get_embeddings("path/to/model.gguf", "Hello")
llama.get_embeddings_batch("path/to/model.gguf", ["Hello", "World"])

# Backend helpers
llama.get_available_backends()
llama.get_backend_info()
llama.is_cuda_available()
llama.is_metal_available()
llama.is_vulkan_available()
llama.is_rocm_available()
llama.is_blas_available()

How It Works

Automatic Synchronization

Scheduled Checks: GitHub Actions checks upstream llama.cpp on a schedule
Tag Mirroring: When an upstream tag exists, the workflow can mirror it into this repository
Wheel Building: CI builds wheels for all supported platforms/backends
Release Publishing: GitHub Releases are created only for tags that exist upstream
PyPI Publishing: CPU-only wheels are published to PyPI for upstream tags (if configured)
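
The rule that releases are created only for tags that exist upstream amounts to a simple filter. The sketch below shows that logic under the assumption that both tag lists are already available as strings; tags_to_release is a hypothetical helper, not the workflow's actual implementation.

```python
from typing import Iterable, List


def tags_to_release(
    upstream_tags: Iterable[str],
    released_tags: Iterable[str],
) -> List[str]:
    """Upstream tags that have no release here yet, kept in upstream order."""
    done = set(released_tags)
    return [tag for tag in upstream_tags if tag not in done]
```

Because the result is drawn only from upstream_tags, a tag that exists locally but not upstream can never be selected.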

Bindings Validation (API Surface)

To keep the Python bindings aligned with upstream, CI runs a validation step that compares upstream llama.h to the generated CFFI cdef.

It checks:

Public function coverage (missing/extra)
Struct and enum coverage (missing fields/members)
Function signatures (return + parameter types)

Local run (after syncing upstream headers):

python scripts/sync_upstream.py
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
python scripts/validate_cffi_surface.py --check-structs --check-enums --check-signatures
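
To give a feel for the function-coverage check, here is a deliberately crude sketch that extracts llama_-prefixed function names from two pieces of C text and reports the difference. It is not the logic of validate_cffi_surface.py; the regular expression and the names function_names and coverage_diff are assumptions for illustration, and the sketch ignores structs, enums, and signatures.

```python
import re
from typing import Dict, Set

# An identifier starting with "llama_" that is directly followed by "(".
_FUNC_NAME = re.compile(r"\b(llama_\w+)\s*\(")


def function_names(c_source: str) -> Set[str]:
    """Collect llama_* function names, ignoring anything inside comments."""
    no_block = re.sub(r"/\*.*?\*/", "", c_source, flags=re.S)
    no_line = re.sub(r"//[^\n]*", "", no_block)
    return set(_FUNC_NAME.findall(no_line))


def coverage_diff(header: str, cdef: str) -> Dict[str, Set[str]]:
    """Names present upstream but not bound ("missing") and the reverse ("extra")."""
    upstream, bound = function_names(header), function_names(cdef)
    return {"missing": upstream - bound, "extra": bound - upstream}
```

A real validator would parse declarations properly rather than rely on a regular expression.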

CFFI ABI Mode

Unlike pybind11 or manual ctypes, CFFI ABI mode:

Reads C declarations as text (no compilation needed for the bindings)
Loads the shared library at runtime via ffi.dlopen()
Converts between Python values and basic C types automatically
Uses the same Python code on every platform; only the shared library differs
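
The same mechanism can be tried without llama.cpp at all. The sketch below uses CFFI ABI mode to call strlen from the C standard library; it assumes the cffi package is installed and a POSIX system, where ffi.dlopen(None) loads the standard C library. The wrapper name c_strlen is made up for the example.

```python
def c_strlen(data: bytes) -> int:
    """Call the C library's strlen() through CFFI ABI mode (POSIX only)."""
    from cffi import FFI  # third-party package: pip install cffi

    ffi = FFI()
    # The declaration is parsed as text; nothing is compiled.
    ffi.cdef("size_t strlen(const char *s);")
    # Passing None opens the standard C library on POSIX systems.
    libc = ffi.dlopen(None)
    # A bytes object is passed as a const char * argument.
    return libc.strlen(data)
```

The pattern is the same for llama.cpp: declarations go through ffi.cdef() and the prebuilt shared library is opened with ffi.dlopen().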

Version Tracking

Check which llama.cpp version you're running:

import llama_cpp_py_sync as llama

print(f"Package version: {llama.__version__}")
print(f"llama.cpp commit: {llama.__llama_cpp_commit__}")
print(f"llama.cpp tag: {getattr(llama, '__llama_cpp_tag__', '')}")

GPU Backend Selection

Build-time Detection

The build system automatically detects available backends:

Backend	Platform	Detection
CUDA	Linux, Windows	`CUDA_HOME` or `/usr/local/cuda`
ROCm	Linux	`ROCM_PATH` or `/opt/rocm`
Metal	macOS	Xcode SDK
Vulkan	All	`VULKAN_SDK` environment variable
BLAS	All	OpenBLAS, MKL, or Accelerate
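
The environment-variable and path checks in the table can be sketched as follows. This is an approximation built from the table alone, not the code of build_llama_cpp.py; it skips Metal and BLAS, which need SDK or library probing, and the name detect_backends is hypothetical.

```python
import os
from typing import Callable, List, Mapping


def detect_backends(
    env: Mapping[str, str],
    path_exists: Callable[[str], bool] = os.path.isdir,
) -> List[str]:
    """Guess buildable backends from environment variables and default paths."""
    found = []
    if env.get("CUDA_HOME") or path_exists("/usr/local/cuda"):
        found.append("cuda")
    if env.get("ROCM_PATH") or path_exists("/opt/rocm"):
        found.append("rocm")
    if env.get("VULKAN_SDK"):
        found.append("vulkan")
    return found
```

Passing the environment and the path check as arguments keeps the function easy to test; in normal use it would be called as detect_backends(os.environ).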

Runtime Configuration

# Use GPU acceleration
llm = llama.Llama("model.gguf", n_gpu_layers=35)

# CPU only (no GPU offload)
llm = llama.Llama("model.gguf", n_gpu_layers=0)

# Full GPU offload (all layers)
llm = llama.Llama("model.gguf", n_gpu_layers=-1)
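
One possible way to choose n_gpu_layers at runtime is to offload everything when a GPU backend is reported and fall back to CPU otherwise. The sketch below takes a list of backend names such as the one returned by get_available_backends(); the exact strings in GPU_BACKENDS are an assumption, since this README only shows 'cuda' and 'blas' as sample output.

```python
from typing import Iterable

# Assumed backend names; adjust to what get_available_backends() really returns.
GPU_BACKENDS = {"cuda", "metal", "vulkan", "rocm"}


def pick_n_gpu_layers(available_backends: Iterable[str]) -> int:
    """Return -1 (offload all layers) if any GPU backend is present, else 0."""
    return -1 if GPU_BACKENDS.intersection(available_backends) else 0
```

Usage might look like llama.Llama("model.gguf", n_gpu_layers=pick_n_gpu_layers(llama.get_available_backends())).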

API Reference

Llama Class

class Llama:
    def __init__(
        self,
        model_path: str,
        n_ctx: int = 512,                   # Context window size
        n_batch: int = 512,                 # Logical max batch size for prompt processing
        n_threads: int = None,              # CPU threads (auto-detect if None)
        n_gpu_layers: int = 0,              # Layers to offload to GPU
        n_ubatch: int = None,               # Physical microbatch size (defaults to n_batch)
        n_threads_batch: int = None,        # Threads for batch processing (defaults to n_threads)
        seed: int = -1,                     # Random seed (-1 for random)
        use_mmap: bool = True,              # Memory map model file
        use_mlock: bool = False,            # Lock model in RAM
        verbose: bool = False,              # Print loading info
        embedding: bool = False,            # Enable embedding mode
        flash_attn_type: int = None,        # Flash attention type (None = use env var)
    ): ...

    def generate(
        self,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.8,
        top_k: int = 40,
        top_p: float = 0.95,
        min_p: float = 0.05,
        repeat_penalty: float = 1.1,
        repeat_last_n: int = 64,
        stop_sequences: List[str] = None,
        stream: bool = False,
        seed: int = None,
    ) -> Union[str, Iterator[str]]: ...

    def tokenize(self, text: str, add_special: bool = True, parse_special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], remove_special: bool = False, unparse_special: bool = True) -> str: ...
    def token_to_piece(self, token: int) -> str: ...
    def get_embeddings(self, text: str) -> List[float]: ...
    def get_model_desc(self) -> str: ...
    def get_model_size(self) -> int: ...
    def get_model_n_params(self) -> int: ...
    def close(self): ...

    # Properties
    n_vocab: int
    n_ctx: int
    n_embd: int
    n_layer: int
    bos_token: int
    eos_token: int
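
The stop_sequences parameter is typically understood as "end the output at the first matching string". The sketch below shows that idea as post-processing on a finished string. How Llama.generate applies stop sequences internally is not stated here, so treat this as an illustration of the concept with a hypothetical helper name.

```python
from typing import Optional, Sequence


def truncate_at_stop(text: str, stop_sequences: Optional[Sequence[str]]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    if not stop_sequences:
        return text
    cut = len(text)
    for stop in stop_sequences:
        # Empty strings are skipped; they would match at position 0.
        idx = text.find(stop) if stop else -1
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```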

Backend Functions

def get_available_backends() -> List[str]: ...
def get_backend_info() -> BackendInfo: ...
def is_cuda_available() -> bool: ...
def is_metal_available() -> bool: ...
def is_vulkan_available() -> bool: ...
def is_rocm_available() -> bool: ...
def is_blas_available() -> bool: ...

Embedding Functions

def get_embeddings(model: Union[str, Llama], text: str) -> List[float]: ...
def get_embeddings_batch(model: Union[str, Llama], texts: List[str]) -> List[List[float]]: ...

Examples

See the examples/ directory:

basic_generation.py - Simple text generation
streaming_generation.py - Real-time token streaming
embeddings_example.py - Generate and compare embeddings
backend_info.py - Check available GPU backends
benchmark.py - Measure token throughput

Smoke Test / Chat CLI

This repository includes an interactive smoke test that can run either as a one-shot prompt (CI-friendly) or as a back-and-forth chat.

# Interactive chat (Ctrl+C or blank line to exit)
python -m llama_cpp_py_sync chat

# One-shot prompt
python -m llama_cpp_py_sync chat --prompt "Say 'ok'." --max-tokens 16

# Use a specific model
python -m llama_cpp_py_sync chat --model path/to/model.gguf

By default it uses LLAMA_MODEL if set. Otherwise it uses a default GGUF model, which is downloaded on first use and cached locally.

If the default model is not yet cached, the CLI prompts before downloading it. To download without prompting, pass --yes.

Model cache location:

Windows: %LOCALAPPDATA%\llama-cpp-py-sync\models\
Linux/macOS: ~/.cache/llama-cpp-py-sync/models/
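
The cache locations above can be computed in a few lines, which helps when scripting cleanup or pre-seeding models. This sketch mirrors the two documented paths; the function name model_cache_dir and the fallback used when LOCALAPPDATA is unset are assumptions.

```python
import os
import sys
from pathlib import Path


def model_cache_dir(platform: str = sys.platform) -> Path:
    """Return the documented model cache directory for the given platform."""
    if platform == "win32":
        base = os.environ.get("LOCALAPPDATA")
        if base:
            return Path(base) / "llama-cpp-py-sync" / "models"
        # Assumed fallback; the README does not say what happens here.
        return Path.home() / "AppData" / "Local" / "llama-cpp-py-sync" / "models"
    return Path.home() / ".cache" / "llama-cpp-py-sync" / "models"
```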

Building from Source

Prerequisites

Python 3.8+
Ninja
CMake (configure step)
C/C++ compiler (GCC, Clang, MSVC)
Git

Build Commands

# Clone repository
git clone https://github.com/FarisZahrani/llama-cpp-py-sync.git
cd llama-cpp-py-sync

# Sync upstream llama.cpp
python scripts/sync_upstream.py

# Regenerate bindings from the synced llama.cpp headers
# (Optional) record the exact llama.cpp commit SHA in the generated file.
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"

# Build with auto-detected backends
python scripts/build_llama_cpp.py

# Build a specific backend
python scripts/build_llama_cpp.py --backend cuda
python scripts/build_llama_cpp.py --backend vulkan
python scripts/build_llama_cpp.py --backend cpu

# On Windows, the build script bundles required runtime DLLs (MSVC/OpenMP and backend runtimes)
# next to the built library by default. You can disable this behavior with:
python scripts/build_llama_cpp.py --no-bundle-runtime-dlls

# Detect available backends without building
python scripts/build_llama_cpp.py --detect-only

# Build wheel
pip install build
python -m build --wheel

Low-level C API access (advanced)

If you need direct access to the underlying C API (beyond the high-level Llama wrapper), you can use the generated CFFI bindings:

from llama_cpp_py_sync._cffi_bindings import get_ffi, get_lib

ffi = get_ffi()
lib = get_lib()

print(ffi.string(lib.llama_print_system_info()).decode("utf-8", errors="replace"))

Project Structure

llama-cpp-py-sync/
├── src/llama_cpp_py_sync/      # Python package
│   ├── __init__.py             # Public API
│   ├── _cffi_bindings.py       # Auto-generated CFFI bindings
│   ├── _version.py             # Version info
│   ├── llama.py                # High-level Llama class
│   ├── embeddings.py           # Embedding utilities
│   └── backends.py             # Backend detection
├── scripts/                     # Build and sync scripts
│   ├── sync_upstream.py        # Sync upstream llama.cpp
│   ├── gen_bindings.py         # Generate CFFI bindings
│   ├── build_llama_cpp.py      # Build shared library
│   └── auto_version.py         # Version generation
├── examples/                    # Example scripts
├── vendor/llama.cpp/           # Upstream source (cloned at build time)
├── .github/workflows/          # CI/CD pipelines
├── pyproject.toml              # Package metadata
└── README.md                   # This file

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Run checks:

python scripts/run_tests.py

Optionally also verify wheel packaging locally:

python scripts/run_tests.py

Submit a pull request

License

MIT License - see LICENSE for details.

This project uses llama.cpp, which is also MIT-licensed.

Third-party license notices are included in THIRD_PARTY_NOTICES.txt.

Acknowledgments

ggml-org/llama.cpp - The upstream C/C++ implementation
CFFI - C Foreign Function Interface for Python
