llama-cpp-cffi
Python 3.10+ binding for llama.cpp using cffi. Supports CPU, Vulkan 1.x (AMD, Intel and Nvidia GPUs) and CUDA 12.8 (Nvidia GPUs) runtimes on x86_64 (and soon aarch64) platforms.
NOTE: The currently supported operating system is Linux (manylinux_2_28 and musllinux_1_2); Windows and macOS versions are in the works.
News
- Mar 04 2025, v0.4.38: Conditional Structured Output using CompletionsOptions.grammar_ignore_until
- Feb 28 2025, v0.4.36: CUDA 12.8.0 for x86_64; CUDA architectures: 50, 61, 70, 75, 80, 86, 89, 90, 100, 101, 120
- Feb 17 2025, v0.4.21: CUDA 12.8.0 for x86_64; CUDA architectures: 61, 70, 75, 80, 86, 89, 90, 100, 101, 120
- Jan 15 2025, v0.4.15: Dynamically load/unload models while executing prompts in parallel.
- Jan 14 2025, v0.4.14: Modular llama.cpp build using the cmake build system. Deprecated the make build system.
- Jan 01 2025, v0.3.1: OpenAI-compatible API, text and vision models. Added support for Qwen2-VL models. Hot-swap of models on demand in server/API.
- Dec 09 2024, v0.2.0: Low-level and high-level APIs: llama, llava, clip and ggml APIs.
- Nov 27 2024, v0.1.22: Support for multimodal models such as llava and minicpmv.
Install
Basic library install:
pip install llama-cpp-cffi
In case you want the OpenAI-compatible Chat Completions API:
pip install llama-cpp-cffi[openai]
IMPORTANT: To take advantage of Nvidia GPU acceleration, make sure you have CUDA 12 installed. If you don't have CUDA 12.X.Y installed, follow the instructions at https://developer.nvidia.com/cuda-downloads.
GPU compute capabilities: 50, 61, 70, 75, 80, 86, 89, 90, 100, 101, 120, covering most GPUs from the GeForce GTX 1050 up to the Nvidia H100 and the Blackwell generation.
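A quick way to check whether your GPU falls in the supported range is to query its compute capability with `nvidia-smi` and compare it against the list above. This is an illustrative sketch, not part of the library; it assumes `nvidia-smi` is on your PATH and supports the `compute_cap` query field:

```python
import subprocess

# Compute capabilities the CUDA wheels are built for (from the list above).
SUPPORTED_CAPS = {50, 61, 70, 75, 80, 86, 89, 90, 100, 101, 120}

def cap_to_arch(cap: str) -> int:
    """Convert an nvidia-smi compute capability string like '8.6' to 86."""
    major, minor = cap.strip().split('.')
    return int(major) * 10 + int(minor)

def gpu_is_supported() -> bool:
    """Query each local GPU's compute capability and check it against the list."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=compute_cap', '--format=csv,noheader'],
        text=True,
    )
    caps = [cap_to_arch(line) for line in out.splitlines() if line.strip()]
    return bool(caps) and all(c in SUPPORTED_CAPS for c in caps)
```

For example, a GeForce RTX 3060 reports compute capability `8.6`, which maps to architecture 86 in the supported set.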
LLM Example
from llama import Model

#
# first define and load/init the model
#
model = Model(
    creator_hf_repo='HuggingFaceTB/SmolLM2-1.7B-Instruct',
    hf_repo='bartowski/SmolLM2-1.7B-Instruct-GGUF',
    hf_file='SmolLM2-1.7B-Instruct-Q4_K_M.gguf',
)

model.init(n_ctx=8 * 1024, gpu_layers=99)

#
# messages
#
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},
]

completions = model.completions(
    messages=messages,
    predict=1 * 1024,
    temp=0.7,
    top_p=0.8,
    top_k=100,
)

for chunk in completions:
    print(chunk, flush=True, end='')

#
# prompt
#
prompt = 'Evaluate 1 + 2 in Python. Result in Python is'

completions = model.completions(
    prompt=prompt,
    predict=1 * 1024,
    temp=0.7,
    top_p=0.8,
    top_k=100,
)

for chunk in completions:
    print(chunk, flush=True, end='')
References
- examples/llm.py
- examples/demo_text.py
VLM Example
from llama import Model

#
# first define and load/init the model
#
model = Model( # 1.87B
    creator_hf_repo='vikhyatk/moondream2',
    hf_repo='vikhyatk/moondream2',
    hf_file='moondream2-text-model-f16.gguf',
    mmproj_hf_file='moondream2-mmproj-f16.gguf',
)

model.init(n_ctx=8 * 1024, gpu_layers=99)

#
# prompt
#
prompt = 'Describe this image.'
image = 'examples/llama-1.png'

completions = model.completions(
    prompt=prompt,
    image=image,
    predict=1 * 1024,
)

for chunk in completions:
    print(chunk, flush=True, end='')
References
- examples/vlm.py
- examples/demo_llava.py
- examples/demo_minicpmv.py
- examples/demo_qwen2vl.py
API
Server - llama-cpp-cffi + OpenAI API
Run server first:
python -B -u -m llama.server
# or
python -B -u -m gunicorn --bind '0.0.0.0:11434' --timeout 900 --workers 1 --worker-class aiohttp.GunicornWebWorker 'llama.server:build_app()'
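Once the server is running, the `/api/1.0/completions` endpoint used in the curl examples that follow can also be called from Python with just the standard library. A minimal sketch; the payload fields mirror the curl examples, and the `complete` helper name is illustrative, not part of the library:

```python
import json
import urllib.request

# Request body matching the curl examples; the Hugging Face fields select
# which model the server should load before answering the prompt.
payload = {
    'creator_hf_repo': 'HuggingFaceTB/SmolLM2-1.7B-Instruct',
    'hf_repo': 'bartowski/SmolLM2-1.7B-Instruct-GGUF',
    'hf_file': 'SmolLM2-1.7B-Instruct-Q4_K_M.gguf',
    'gpu_layers': 99,
    'prompt': 'Evaluate 1 + 2 in Python.',
}

def complete(url: str = 'http://localhost:11434/api/1.0/completions') -> str:
    """POST the payload as JSON and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8')

if __name__ == '__main__':
    print(complete())
```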
Client - llama-cpp-cffi API / curl
#
# llm
#
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
"gpu_layers": 99,
"prompt": "Evaluate 1 + 2 in Python."
}'
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
"creator_hf_repo": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"hf_repo": "bartowski/SmolLM2-1.7B-Instruct-GGUF",
"hf_file": "SmolLM2-1.7B-Instruct-Q4_K_M.gguf",
"gpu_layers": 99,
"prompt": "Evaluate 1 + 2 in Python."
}'
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
"creator_hf_repo": "Qwen/Qwen2.5-0.5B-Instruct",
"hf_repo": "Qwen/Qwen2.5-0.5B-Instruct-GGUF",
"hf_file": "qwen2.5-0.5b-instruct-q4_k_m.gguf",
"gpu_layers": 99,
"prompt": "Evaluate 1 + 2 in Python."
}'
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
"creator_hf_repo": "Qwen/Qwen2.5-7B-Instruct",
"hf_repo": "bartowski/Qwen2.5-7B-Instruct-GGUF",
"hf_file": "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
"gpu_layers": 99,
"prompt": "Evaluate 1 + 2 in Python."
}'
#
# vlm - example 1
#
image_path="examples/llama-1.jpg"
mime_type=$(file -b --mime-type "$image_path")
base64_data=$(base64 -w 0 "$image_path")
cat << EOF > /tmp/temp.json
{
"creator_hf_repo": "Qwen/Qwen2-VL-2B-Instruct",
"hf_repo": "bartowski/Qwen2-VL-2B-Instruct-GGUF",
"hf_file": "Qwen2-VL-2B-Instruct-Q4_K_M.gguf",
"mmproj_hf_file": "mmproj-Qwen2-VL-2B-Instruct-f16.gguf",
"gpu_layers": 99,
"prompt": "Describe this image.",
"image": "data:$mime_type;base64,$base64_data"
}
EOF
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
--data-binary "@/tmp/temp.json"
#
# vlm - example 2
#
image_path="examples/llama-1.jpg"
mime_type=$(file -b --mime-type "$image_path")
base64_data=$(base64 -w 0 "$image_path")
cat << EOF > /tmp/temp.json
{
"creator_hf_repo": "Qwen/Qwen2-VL-2B-Instruct",
"hf_repo": "bartowski/Qwen2-VL-2B-Instruct-GGUF",
"hf_file": "Qwen2-VL-2B-Instruct-Q4_K_M.gguf",
"mmproj_hf_file": "mmproj-Qwen2-VL-2B-Instruct-f16.gguf",
"gpu_layers": 99,
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "Describe this image."},
{
"type": "image_url",
"image_url": {"url": "data:$mime_type;base64,$base64_data"}
}
]}
]
}
EOF
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
--data-binary "@/tmp/temp.json"
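The shell steps above (detect the MIME type, base64-encode the image, embed it as a data URL) can be sketched in Python as well. `build_data_url` is an illustrative helper, not part of the library:

```python
import base64
import mimetypes

def build_data_url(image_path: str) -> str:
    """Read an image file and return it as a 'data:<mime>;base64,<data>' URL."""
    mime_type, _ = mimetypes.guess_type(image_path)
    with open(image_path, 'rb') as f:
        b64 = base64.b64encode(f.read()).decode('ascii')
    return f'data:{mime_type};base64,{b64}'
```

The returned string can be placed in the `"image"` field (or the `"image_url"` entry of a message) exactly as the shell examples do.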
Client - OpenAI-compatible Chat Completions API
#
# text
#
curl -XPOST 'http://localhost:11434/v1/chat/completions' \
-H "Content-Type: application/json" \
-d '{
"model": "HuggingFaceTB/SmolLM2-1.7B-Instruct:bartowski/SmolLM2-1.7B-Instruct-GGUF:SmolLM2-1.7B-Instruct-Q4_K_M.gguf",
"messages": [
{
"role": "user",
"content": "Evaluate 1 + 2 in Python."
}
],
"n_ctx": 8192,
"gpu_layers": 99
}'
#
# image
#
image_path="examples/llama-1.jpg"
mime_type=$(file -b --mime-type "$image_path")
base64_data=$(base64 -w 0 "$image_path")
cat << EOF > /tmp/temp.json
{
"model": "Qwen/Qwen2-VL-2B-Instruct:bartowski/Qwen2-VL-2B-Instruct-GGUF:Qwen2-VL-2B-Instruct-Q4_K_M.gguf:mmproj-Qwen2-VL-2B-Instruct-f16.gguf",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "Describe this image."},
{
"type": "image_url",
"image_url": {"url": "data:$mime_type;base64,$base64_data"}
}
]}
],
"n_ctx": 8192,
"gpu_layers": 99
}
EOF
curl -XPOST 'http://localhost:11434/v1/chat/completions' \
-H "Content-Type: application/json" \
--data-binary "@/tmp/temp.json"
#
# Client Python API for OpenAI
#
python -B examples/demo_openai.py
References
examples/demo_openai.py
File details
Details for the file llama_cpp_cffi-0.4.44-cp312-cp312-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: llama_cpp_cffi-0.4.44-cp312-cp312-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 292.0 MB
- Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.12.9 Linux/6.13.8-arch1-1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2bb61a5d2d8c9a758f4b863743d9d08a53417933824fa61b2681cbde97e8b555 |
| MD5 | 949d606ba87fd7eb0dec9a59c6d942cc |
| BLAKE2b-256 | c5be682330f5e929e8e989113c022218364d9fcd444ad35b536e1e029e4e1a20 |