llama-cpp-py-sync
Auto-synchronized Python bindings for llama.cpp
Overview
llama-cpp-py-sync provides Python bindings for llama.cpp that are kept up-to-date automatically. It generates bindings from upstream headers using CFFI ABI mode, and ships prebuilt wheels.
Key Features
- Automatic upstream sync and binding regeneration
- Prebuilt wheels built by CI
- CPU wheels published to PyPI
- Backend-specific wheels (CUDA / Vulkan / Metal) published to GitHub Releases
- CI checks that the generated CFFI surface matches the upstream C API (functions, structs, enums, and signatures)
- A small, explicit Python API (Llama.generate, tokenize, get_embeddings, etc.)
What You Get (and What You Don’t)
- This project binds to the public C API that llama.cpp exposes in llama.h.
- It does not attempt to bind llama.cpp’s internal C++ implementation: private headers, C++ classes/templates, or functions that never appear in llama.h.
- We use CFFI ABI mode: Python loads a prebuilt shared library at runtime (no compiled Python extension module for the bindings).
- Because of that, you still need a compatible llama.cpp shared library available, either bundled in the wheel or via LLAMA_CPP_LIB.
- You get a small high-level API (llama_cpp_py_sync.Llama) for common tasks, and an “escape hatch” to call the low-level C functions directly via CFFI when needed.
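To illustrate the LLAMA_CPP_LIB override described above, here is a hypothetical helper sketching the lookup order (an explicit path wins over the wheel's bundled library); the actual resolution logic lives inside the package and may differ:

```python
from typing import Optional


def resolve_llama_lib(env: dict, bundled: Optional[str]) -> str:
    """Pick the llama.cpp shared library to load at runtime.

    Sketch of the documented behavior: an explicit LLAMA_CPP_LIB path
    takes priority; otherwise fall back to the library bundled in the
    wheel; fail loudly if neither is available.
    """
    override = env.get("LLAMA_CPP_LIB")
    if override:
        return override
    if bundled:
        return bundled
    raise RuntimeError(
        "No llama.cpp shared library found; set LLAMA_CPP_LIB or install "
        "a wheel that bundles one"
    )
```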
High-level vs Low-level APIs
- High-level API: llama_cpp_py_sync.Llama is the recommended entry point for typical usage such as generation, tokenization, and embeddings.
import llama_cpp_py_sync as llama
with llama.Llama("path/to/model.gguf", n_ctx=2048, n_gpu_layers=0) as llm:
print(llm.generate("Hello", max_tokens=64))
- Low-level API: llama_cpp_py_sync._cffi_bindings exposes CFFI access to the underlying llama.cpp C API for advanced use.
from llama_cpp_py_sync._cffi_bindings import get_ffi, get_lib
ffi = get_ffi()
lib = get_lib()
print(ffi.string(lib.llama_print_system_info()).decode("utf-8", errors="replace"))
Installation
This project supports Python 3.8+. During the current testing phase, CI builds are pinned to Python 3.11.9 for reproducibility, but the published wheels are intended to work across supported Python versions.
From PyPI (Recommended)
pip install llama-cpp-py-sync
This installs the CPU wheel.
Note: depending on CI configuration and platform support, additional wheels may also be published to PyPI.
Quick Chat (Recommended)
After installing from PyPI, you can start an interactive chat session with:
python -m llama_cpp_py_sync chat
If you do not pass --model (and LLAMA_MODEL is not set), the CLI will prompt before downloading a default GGUF model and cache it locally for future runs.
To auto-download without prompting, pass --yes.
One-shot prompt:
python -m llama_cpp_py_sync chat --prompt "Say 'ok'." --max-tokens 32
Use a specific local model:
python -m llama_cpp_py_sync chat --model path/to/model.gguf
From GitHub Releases (Wheel)
Download the wheel for your platform/backend from GitHub Releases and install the .whl:
pip install path/to/llama_cpp_py_sync-*.whl
From Source
git clone https://github.com/FarisZahrani/llama-cpp-py-sync.git
cd llama-cpp-py-sync
# Sync upstream llama.cpp
python scripts/sync_upstream.py
# Regenerate CFFI bindings from the synced llama.cpp headers
# (Optional) record the exact llama.cpp commit SHA in the generated file.
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
# Build the shared library
python scripts/build_llama_cpp.py
# Install the package
pip install -e .
vendor/llama.cpp is cloned locally by scripts/sync_upstream.py (and in CI during builds) and is not committed to this repository.
Quick Start
import llama_cpp_py_sync as llama
# Load a model
llm = llama.Llama("path/to/model.gguf", n_ctx=2048, n_gpu_layers=35)
# Generate text
response = llm.generate("Hello, world!", max_tokens=100)
print(response)
# Streaming generation
for token in llm.generate("Write a poem:", max_tokens=100, stream=True):
print(token, end="", flush=True)
# Clean up
llm.close()
Using Context Manager
with llama.Llama("model.gguf", n_gpu_layers=35) as llm:
print(llm.generate("Once upon a time"))
Embeddings
# Load an embedding model
with llama.Llama("embed-model.gguf", embedding=True) as llm:
emb = llm.get_embeddings("Hello, world!")
print(f"Embedding dimension: {len(emb)}")
Check Available Backends
from llama_cpp_py_sync import get_available_backends, get_backend_info
print(get_available_backends()) # ['cuda', 'blas'] or similar
info = get_backend_info()
print(f"CUDA available: {info.cuda}")
print(f"Metal available: {info.metal}")
Full API (click to expand)
import llama_cpp_py_sync as llama
# Versions
llama.__version__
llama.__llama_cpp_commit__
# Main class
llm = llama.Llama(
model_path="path/to/model.gguf",
n_ctx=512,
n_batch=512,
n_threads=None,
n_gpu_layers=0,
seed=-1,
use_mmap=True,
use_mlock=False,
verbose=False,
embedding=False,
)
text = llm.generate(
"Hello",
max_tokens=256,
temperature=0.8,
top_k=40,
top_p=0.95,
min_p=0.05,
repeat_penalty=1.1,
stop_sequences=None,
stream=False,
)
stream = llm.generate(
"Hello",
max_tokens=256,
stream=True,
)
tokens = llm.tokenize("Hello")
text = llm.detokenize(tokens)
piece = llm.token_to_piece(tokens[0])
llm.get_model_desc()
llm.get_model_size()
llm.get_model_n_params()
# Embeddings (requires embedding=True)
emb = llm.get_embeddings("Hello")
llm.close()
# Module-level embeddings helpers
llama.get_embeddings("path/to/model.gguf", "Hello")
llama.get_embeddings_batch("path/to/model.gguf", ["Hello", "World"])
# Backend helpers
llama.get_available_backends()
llama.get_backend_info()
llama.is_cuda_available()
llama.is_metal_available()
llama.is_vulkan_available()
llama.is_rocm_available()
llama.is_blas_available()
How It Works
Automatic Synchronization
- Scheduled Checks: GitHub Actions checks upstream llama.cpp on a schedule
- Tag Mirroring: When an upstream tag exists, the workflow can mirror it into this repository
- Wheel Building: CI builds wheels for all platforms/backends
- Release Publishing: GitHub Releases are created only for tags that exist upstream
- PyPI Publishing: CPU-only wheels are published to PyPI for upstream tags (if configured)
Bindings Validation (API Surface)
To keep the Python bindings aligned with upstream, CI runs a validation step that compares upstream llama.h to the generated CFFI cdef.
It checks:
- Public function coverage (missing/extra)
- Struct and enum coverage (missing fields/members)
- Function signatures (return + parameter types)
Local run (after syncing upstream headers):
python scripts/sync_upstream.py
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
python scripts/validate_cffi_surface.py --check-structs --check-enums --check-signatures
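As an illustrative sketch of the kind of comparison the validation step performs (not the actual validate_cffi_surface.py logic), one can extract public function names from a header with a regex and diff them against the generated cdef; the header and cdef strings below are toy stand-ins:

```python
import re

# Toy stand-in for upstream llama.h (hypothetical declarations).
HEADER = """
LLAMA_API struct llama_model * llama_model_load(const char * path);
LLAMA_API void llama_model_free(struct llama_model * model);
"""

# Toy stand-in for the generated CFFI cdef text.
CDEF = """
struct llama_model * llama_model_load(const char * path);
"""

def public_functions(text: str) -> set:
    # Collect identifiers that look like llama_* function declarations.
    return set(re.findall(r"\b(llama_\w+)\s*\(", text))

missing = public_functions(HEADER) - public_functions(CDEF)
extra = public_functions(CDEF) - public_functions(HEADER)
print(sorted(missing))  # → ['llama_model_free']
```

The real check goes further (struct fields, enum members, full signatures), but the missing/extra set difference is the core idea.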
CFFI ABI Mode
Unlike pybind11 or manual ctypes, CFFI ABI mode:
- Reads C declarations directly (no compilation needed for bindings)
- Loads the shared library at runtime via ffi.dlopen()
- Automatically handles type conversions
- Works across platforms without modification
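A minimal stand-alone illustration of the same mechanism, binding a libc function instead of llama.cpp (POSIX-only, since ffi.dlopen(None) resolves against the process's C library):

```python
from cffi import FFI

ffi = FFI()

# Declare the C signature; no compiler is invoked in ABI mode.
ffi.cdef("int abs(int x);")

# Load symbols at runtime via dlopen; on POSIX, None means libc.
lib = ffi.dlopen(None)

print(lib.abs(-42))  # → 42
```

The package does exactly this, except the cdef text is generated from llama.h and the dlopen target is the prebuilt llama.cpp shared library.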
Version Tracking
Check which llama.cpp version you're running:
import llama_cpp_py_sync as llama
print(f"Package version: {llama.__version__}")
print(f"llama.cpp commit: {llama.__llama_cpp_commit__}")
print(f"llama.cpp tag: {getattr(llama, '__llama_cpp_tag__', '')}")
GPU Backend Selection
Build-time Detection
The build system automatically detects available backends:
| Backend | Platform | Detection |
|---|---|---|
| CUDA | Linux, Windows | CUDA_HOME or /usr/local/cuda |
| ROCm | Linux | ROCM_PATH or /opt/rocm |
| Metal | macOS | Xcode SDK |
| Vulkan | All | VULKAN_SDK environment variable |
| BLAS | All | OpenBLAS, MKL, or Accelerate |
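The detection rules in the table can be approximated as environment-variable and path probes along these lines (a simplified sketch; the real build script may differ):

```python
from pathlib import Path

def detect_cuda(env: dict, exists=Path.exists) -> bool:
    """CUDA counts as available if CUDA_HOME is set or /usr/local/cuda exists."""
    return bool(env.get("CUDA_HOME")) or exists(Path("/usr/local/cuda"))

def detect_rocm(env: dict, exists=Path.exists) -> bool:
    """ROCm counts as available if ROCM_PATH is set or /opt/rocm exists."""
    return bool(env.get("ROCM_PATH")) or exists(Path("/opt/rocm"))

def detect_vulkan(env: dict) -> bool:
    """Vulkan relies purely on the VULKAN_SDK environment variable."""
    return bool(env.get("VULKAN_SDK"))
```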
Runtime Configuration
# Use GPU acceleration
llm = llama.Llama("model.gguf", n_gpu_layers=35)
# CPU only (no GPU offload)
llm = llama.Llama("model.gguf", n_gpu_layers=0)
# Full GPU offload (all layers)
llm = llama.Llama("model.gguf", n_gpu_layers=-1)
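One way to choose n_gpu_layers from the detected backends (a sketch; at runtime the booleans would come from the package's backend helpers such as get_backend_info()):

```python
def choose_gpu_layers(cuda: bool, metal: bool) -> int:
    # Offload all layers (-1) when a GPU backend is present,
    # otherwise stay entirely on the CPU (0).
    return -1 if (cuda or metal) else 0

# Hypothetical wiring against the package API:
#   info = llama.get_backend_info()
#   llm = llama.Llama("model.gguf",
#                     n_gpu_layers=choose_gpu_layers(info.cuda, info.metal))
```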
API Reference
Llama Class
class Llama:
def __init__(
self,
model_path: str,
n_ctx: int = 512, # Context window size
n_batch: int = 512, # Batch size for prompt processing
n_threads: int = None, # CPU threads (auto-detect if None)
n_gpu_layers: int = 0, # Layers to offload to GPU
seed: int = -1, # Random seed (-1 for random)
use_mmap: bool = True, # Memory map model file
use_mlock: bool = False, # Lock model in RAM
verbose: bool = False, # Print loading info
embedding: bool = False, # Enable embedding mode
): ...
def generate(
self,
prompt: str,
max_tokens: int = 256,
temperature: float = 0.8,
top_k: int = 40,
top_p: float = 0.95,
min_p: float = 0.05,
repeat_penalty: float = 1.1,
stop_sequences: List[str] = None,
stream: bool = False,
) -> Union[str, Iterator[str]]: ...
def tokenize(self, text: str, add_special: bool = True) -> List[int]: ...
def detokenize(self, tokens: List[int]) -> str: ...
def get_embeddings(self, text: str) -> List[float]: ...
def close(self): ...
Backend Functions
def get_available_backends() -> List[str]: ...
def get_backend_info() -> BackendInfo: ...
def is_cuda_available() -> bool: ...
def is_metal_available() -> bool: ...
def is_vulkan_available() -> bool: ...
def is_rocm_available() -> bool: ...
def is_blas_available() -> bool: ...
Embedding Functions
def get_embeddings(model: Union[str, Llama], text: str) -> List[float]: ...
def get_embeddings_batch(model: Union[str, Llama], texts: List[str]) -> List[List[float]]: ...
def cosine_similarity(a: List[float], b: List[float]) -> float: ...
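For reference, cosine_similarity computes the standard dot-product-over-norms formula; a pure-Python equivalent looks like this (a sketch, not necessarily the packaged implementation):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```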
Examples
See the examples/ directory:
- basic_generation.py - Simple text generation
- streaming_generation.py - Real-time token streaming
- embeddings_example.py - Generate and compare embeddings
- backend_info.py - Check available GPU backends
- benchmark.py - Measure token throughput
Smoke Test / Chat CLI
This repository includes an interactive smoke test that can run either as a one-shot prompt (CI-friendly) or as a back-and-forth chat.
# Interactive chat (Ctrl+C or blank line to exit)
python -m llama_cpp_py_sync chat
# One-shot prompt
python -m llama_cpp_py_sync chat --prompt "Say 'ok'." --max-tokens 16
# Use a specific model
python -m llama_cpp_py_sync chat --model path/to/model.gguf
By default it uses LLAMA_MODEL if set. Otherwise it downloads a default GGUF model and caches it locally.
If the default model is missing, the CLI will prompt before downloading it. To auto-download without prompting, pass --yes.
Model cache location:
- Windows: %LOCALAPPDATA%\llama-cpp-py-sync\models\
- Linux/macOS: ~/.cache/llama-cpp-py-sync/models/
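These locations can be resolved with logic along these lines (a sketch assuming the directory names shown above; the packaged code may differ):

```python
import os
import sys
from pathlib import Path

def model_cache_dir() -> Path:
    if sys.platform == "win32":
        # %LOCALAPPDATA%\llama-cpp-py-sync\models\
        base = Path(os.environ.get("LOCALAPPDATA",
                                   Path.home() / "AppData" / "Local"))
        return base / "llama-cpp-py-sync" / "models"
    # ~/.cache/llama-cpp-py-sync/models/ on Linux and macOS
    return Path.home() / ".cache" / "llama-cpp-py-sync" / "models"
```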
Building from Source
Prerequisites
- Python 3.8+
- Ninja
- CMake (configure step)
- C/C++ compiler (GCC, Clang, MSVC)
- Git
Build Commands
# Clone repository
git clone https://github.com/FarisZahrani/llama-cpp-py-sync.git
cd llama-cpp-py-sync
# Sync upstream llama.cpp
python scripts/sync_upstream.py
# Regenerate bindings from the synced llama.cpp headers
# (Optional) record the exact llama.cpp commit SHA in the generated file.
python scripts/gen_bindings.py --commit-sha "$(python scripts/sync_upstream.py --sha)"
# Build with auto-detected backends
python scripts/build_llama_cpp.py
# Build a specific backend
python scripts/build_llama_cpp.py --backend cuda
python scripts/build_llama_cpp.py --backend vulkan
python scripts/build_llama_cpp.py --backend cpu
# On Windows, the build script bundles required runtime DLLs (MSVC/OpenMP and backend runtimes)
# next to the built library by default. You can disable this behavior with:
python scripts/build_llama_cpp.py --no-bundle-runtime-dlls
# Detect available backends without building
python scripts/build_llama_cpp.py --detect-only
# Build wheel
pip install build
python -m build --wheel
Low-level C API access (advanced)
If you need direct access to the underlying C API (beyond the high-level Llama wrapper), you can use the generated CFFI bindings:
from llama_cpp_py_sync._cffi_bindings import get_ffi, get_lib
ffi = get_ffi()
lib = get_lib()
print(ffi.string(lib.llama_print_system_info()).decode("utf-8", errors="replace"))
Project Structure
llama-cpp-py-sync/
├── src/llama_cpp_py_sync/ # Python package
│ ├── __init__.py # Public API
│ ├── _cffi_bindings.py # Auto-generated CFFI bindings
│ ├── _version.py # Version info
│ ├── llama.py # High-level Llama class
│ ├── embeddings.py # Embedding utilities
│ └── backends.py # Backend detection
├── scripts/ # Build and sync scripts
│ ├── sync_upstream.py # Sync upstream llama.cpp
│ ├── gen_bindings.py # Generate CFFI bindings
│ ├── build_llama_cpp.py # Build shared library
│ └── auto_version.py # Version generation
├── examples/ # Example scripts
├── vendor/llama.cpp/ # Upstream source (cloned at build time)
├── .github/workflows/ # CI/CD pipelines
├── pyproject.toml # Package metadata
└── README.md # This file
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run checks (the same script also verifies wheel packaging locally):
python scripts/run_tests.py
- Submit a pull request
License
MIT License - see LICENSE for details.
This project uses llama.cpp, which is also MIT licensed.
Third-party license notices are included in THIRD_PARTY_NOTICES.txt.
Acknowledgments
- ggml-org/llama.cpp - The upstream C/C++ implementation
- CFFI - C Foreign Function Interface for Python