Python binding for llama.cpp using cffi

llama-cpp-cffi

Python 3.10+ binding for llama.cpp using cffi. It supports CPU, Vulkan 1.x (AMD, Intel and Nvidia GPUs) and CUDA 12.8 (Nvidia GPUs) runtimes on x86_64 (and soon aarch64) platforms.

NOTE: Linux (manylinux_2_28 and musllinux_1_2) is currently the only supported operating system; Windows and macOS versions are in progress.

News

Mar 04 2025, v0.4.38: Conditional Structured Output using CompletionsOptions.grammar_ignore_until
Feb 28 2025, v0.4.36: CUDA 12.8.0 for x86_64; CUDA ARCHITECTURES: 50, 61, 70, 75, 80, 86, 89, 90, 100, 101, 120
Feb 17 2025, v0.4.21: CUDA 12.8.0 for x86_64; CUDA ARCHITECTURES: 61, 70, 75, 80, 86, 89, 90, 100, 101, 120
Jan 15 2025, v0.4.15: Dynamically load/unload models while executing prompts in parallel.
Jan 14 2025, v0.4.14: Modular llama.cpp build using cmake build system. Deprecated make build system.
Jan 01 2025, v0.3.1: OpenAI compatible API, text and vision models. Added support for Qwen2-VL models. Hot-swap of models on demand in server/API.
Dec 09 2024, v0.2.0: Low-level and high-level APIs: llama, llava, clip and ggml API.
Nov 27 2024, v0.1.22: Support for Multimodal models such as llava and minicpmv.

Install

Basic library install:

pip install llama-cpp-cffi

To add the OpenAI © compatible Chat Completions API, install the openai extra:

pip install llama-cpp-cffi[openai]

IMPORTANT: To use Nvidia GPU acceleration, make sure CUDA 12 is installed. If you don't have CUDA 12.X.Y installed, follow the instructions at https://developer.nvidia.com/cuda-downloads .

GPU Compute Capability: 50;61;70;75;80;86;89;90;100;101;120, covering most GPUs from the GeForce GTX 1050 to the Nvidia H100 and Nvidia Blackwell.
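As a rough illustration of how to read the architecture list above, the sketch below converts a "major.minor" compute capability (for example 8.6) into the two or three digit form used in the list and checks whether it appears there. The function names are hypothetical and not part of llama-cpp-cffi. The check is literal membership only; whether an unlisted architecture still works depends on how the wheel was built.

```python
# Architecture list copied from the paragraph above.
SUPPORTED_CUDA_ARCHS = '50;61;70;75;80;86;89;90;100;101;120'


def parse_archs(archs: str) -> set[int]:
    """Turn a string such as '50;61;70' into the set {50, 61, 70}."""
    return {int(item) for item in archs.split(';') if item.strip()}


def capability_to_arch(capability: str) -> int:
    """Convert a 'major.minor' capability such as '8.6' into the number 86."""
    major, minor = capability.strip().split('.')
    return int(major) * 10 + int(minor)


def is_listed(capability: str, archs: str = SUPPORTED_CUDA_ARCHS) -> bool:
    """Return True if the capability appears verbatim in the architecture list."""
    return capability_to_arch(capability) in parse_archs(archs)


if __name__ == '__main__':
    # Hypothetical capabilities chosen for illustration.
    for cap in ('6.1', '8.6', '12.0', '3.5'):
        print(cap, is_listed(cap))
```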

LLM Example

from llama import Model

#
# first define and load/init model
#
model = Model(
    creator_hf_repo='HuggingFaceTB/SmolLM2-1.7B-Instruct',
    hf_repo='bartowski/SmolLM2-1.7B-Instruct-GGUF',
    hf_file='SmolLM2-1.7B-Instruct-Q4_K_M.gguf',
)

model.init(n_ctx=8 * 1024, gpu_layers=99)

#
# messages
#
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},
]

completions = model.completions(
    messages=messages,
    predict=1 * 1024,
    temp=0.7,
    top_p=0.8,
    top_k=100,
)

for chunk in completions:
    print(chunk, flush=True, end='')

#
# prompt
#
prompt = 'Evaluate 1 + 2 in Python. Result in Python is'

completions = model.completions(
    prompt=prompt,
    predict=1 * 1024,
    temp=0.7,
    top_p=0.8,
    top_k=100,
)

for chunk in completions:
    print(chunk, flush=True, end='')
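
The loops above print each chunk as it arrives. If the full response is also needed as a single string, a small helper can accumulate the chunks while streaming. This is a minimal sketch: collect_completion is a hypothetical name, and it assumes each chunk is a text fragment (as the print loop suggests). The fake generator only stands in for model.completions(...) so the snippet runs without a model.

```python
from typing import Iterable, Iterator


def collect_completion(chunks: Iterable[object], echo: bool = False) -> str:
    """Join streamed chunks into one string, optionally printing them as they arrive."""
    parts = []
    for chunk in chunks:
        text = str(chunk)  # assumption: chunks are printable text fragments
        if echo:
            print(text, flush=True, end='')
        parts.append(text)
    return ''.join(parts)


def fake_completions() -> Iterator[str]:
    """Stand-in for model.completions(...), used here only for illustration."""
    yield from ['1 + 2', ' = ', '3']


if __name__ == '__main__':
    text = collect_completion(fake_completions(), echo=True)
    print()
    print(len(text))
```

With a loaded model, the same helper would be called as collect_completion(model.completions(prompt=prompt, predict=1024), echo=True).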

References

examples/llm.py
examples/demo_text.py

VLM Example

from llama import Model

#
# first define and load/init model
#
model = Model( # 1.87B
    creator_hf_repo='vikhyatk/moondream2',
    hf_repo='vikhyatk/moondream2',
    hf_file='moondream2-text-model-f16.gguf',
    mmproj_hf_file='moondream2-mmproj-f16.gguf',
)

model.init(n_ctx=8 * 1024, gpu_layers=99)

#
# prompt
#
prompt = 'Describe this image.'
image = 'examples/llama-1.png'

completions = model.completions(
    prompt=prompt,
    image=image,
    predict=1 * 1024,
)

for chunk in completions:
    print(chunk, flush=True, end='')

References

examples/vlm.py
examples/demo_llava.py
examples/demo_minicpmv.py
examples/demo_qwen2vl.py

API

Server - llama-cpp-cffi + OpenAI API

Run server first:

python -B -u -m llama.server
# or
python -B -u -m gunicorn --bind '0.0.0.0:11434' --timeout 900 --workers 1 --worker-class aiohttp.GunicornWebWorker 'llama.server:build_app()'
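
When the server is started from a script, the client should not send requests before the port accepts connections. The sketch below polls the TCP port using only the standard library. wait_for_port is a hypothetical helper, and port 11434 is taken from the gunicorn command above. It only confirms that something is listening, not that a model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the server socket is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (assumes the server from this section is starting locally):
# if wait_for_port('localhost', 11434):
#     print('server is accepting connections')
```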

Client - llama-cpp-cffi API / curl

#
# llm
#
curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
    "gpu_layers": 99,
    "prompt": "Evaluate 1 + 2 in Python."
}'

curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
    "creator_hf_repo": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    "hf_repo": "bartowski/SmolLM2-1.7B-Instruct-GGUF",
    "hf_file": "SmolLM2-1.7B-Instruct-Q4_K_M.gguf",
    "gpu_layers": 99,
    "prompt": "Evaluate 1 + 2 in Python."
}'

curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
    "creator_hf_repo": "Qwen/Qwen2.5-0.5B-Instruct",
    "hf_repo": "Qwen/Qwen2.5-0.5B-Instruct-GGUF",
    "hf_file": "qwen2.5-0.5b-instruct-q4_k_m.gguf",
    "gpu_layers": 99,
    "prompt": "Evaluate 1 + 2 in Python."
}'

curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
-d '{
    "creator_hf_repo": "Qwen/Qwen2.5-7B-Instruct",
    "hf_repo": "bartowski/Qwen2.5-7B-Instruct-GGUF",
    "hf_file": "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "gpu_layers": 99,
    "prompt": "Evaluate 1 + 2 in Python."
}'

#
# vlm - example 1
#
image_path="examples/llama-1.jpg"
mime_type=$(file -b --mime-type "$image_path")
base64_data=$(base64 -w 0 "$image_path")

cat << EOF > /tmp/temp.json
{
    "creator_hf_repo": "Qwen/Qwen2-VL-2B-Instruct",
    "hf_repo": "bartowski/Qwen2-VL-2B-Instruct-GGUF",
    "hf_file": "Qwen2-VL-2B-Instruct-Q4_K_M.gguf",
    "mmproj_hf_file": "mmproj-Qwen2-VL-2B-Instruct-f16.gguf",
    "gpu_layers": 99,
    "prompt": "Describe this image.",
    "image": "data:$mime_type;base64,$base64_data"
}
EOF

curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
--data-binary "@/tmp/temp.json"

#
# vlm - example 2
#
image_path="examples/llama-1.jpg"
mime_type=$(file -b --mime-type "$image_path")
base64_data=$(base64 -w 0 "$image_path")

cat << EOF > /tmp/temp.json
{
    "creator_hf_repo": "Qwen/Qwen2-VL-2B-Instruct",
    "hf_repo": "bartowski/Qwen2-VL-2B-Instruct-GGUF",
    "hf_file": "Qwen2-VL-2B-Instruct-Q4_K_M.gguf",
    "mmproj_hf_file": "mmproj-Qwen2-VL-2B-Instruct-f16.gguf",
    "gpu_layers": 99,
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {"url": "data:$mime_type;base64,$base64_data"}
            }
        ]}
    ]
}
EOF

curl -XPOST 'http://localhost:11434/api/1.0/completions' \
-H "Content-Type: application/json" \
--data-binary "@/tmp/temp.json"
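
The same requests can be prepared from Python with the standard library. The sketch below mirrors the curl examples: it builds the data: URI for an image and a JSON POST request for the /api/1.0/completions endpoint. The helper names are hypothetical, and the response is returned as raw bytes because its format is not described here. Note that mimetypes guesses the type from the file extension, whereas file -b --mime-type inspects the file content.

```python
import base64
import json
import mimetypes
import urllib.request
from pathlib import Path

# Endpoint taken from the curl examples above.
API_URL = 'http://localhost:11434/api/1.0/completions'


def image_to_data_uri(path: str) -> str:
    """Encode a local image as a data: URI, like the base64 shell snippet above."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # fallback when the extension is unknown
    data = base64.b64encode(Path(path).read_bytes()).decode('ascii')
    return f'data:{mime_type};base64,{data}'


def build_request(payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request equivalent to curl -XPOST with a JSON body."""
    body = json.dumps(payload).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(request: urllib.request.Request, timeout: float = 900.0) -> bytes:
    """Send the request and return the raw response body (timeout borrowed from the gunicorn command)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Example usage (requires a running server and the example image):
# payload = {
#     'creator_hf_repo': 'Qwen/Qwen2-VL-2B-Instruct',
#     'hf_repo': 'bartowski/Qwen2-VL-2B-Instruct-GGUF',
#     'hf_file': 'Qwen2-VL-2B-Instruct-Q4_K_M.gguf',
#     'mmproj_hf_file': 'mmproj-Qwen2-VL-2B-Instruct-f16.gguf',
#     'gpu_layers': 99,
#     'prompt': 'Describe this image.',
#     'image': image_to_data_uri('examples/llama-1.jpg'),
# }
# print(send(build_request(payload)).decode('utf-8'))
```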

Client - OpenAI © compatible Chat Completions API

#
# text
#
curl -XPOST 'http://localhost:11434/v1/chat/completions' \
-H "Content-Type: application/json" \
-d '{
    "model": "HuggingFaceTB/SmolLM2-1.7B-Instruct:bartowski/SmolLM2-1.7B-Instruct-GGUF:SmolLM2-1.7B-Instruct-Q4_K_M.gguf",
    "messages": [
        {
            "role": "user",
            "content": "Evaluate 1 + 2 in Python."
        }
    ],
    "n_ctx": 8192,
    "gpu_layers": 99
}'

#
# image
#
image_path="examples/llama-1.jpg"
mime_type=$(file -b --mime-type "$image_path")
base64_data=$(base64 -w 0 "$image_path")

cat << EOF > /tmp/temp.json
{
    "model": "Qwen/Qwen2-VL-2B-Instruct:bartowski/Qwen2-VL-2B-Instruct-GGUF:Qwen2-VL-2B-Instruct-Q4_K_M.gguf:mmproj-Qwen2-VL-2B-Instruct-f16.gguf",
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {"url": "data:$mime_type;base64,$base64_data"}
            }
        ]}
    ],
    "n_ctx": 8192,
    "gpu_layers": 99
}
EOF

curl -XPOST 'http://localhost:11434/v1/chat/completions' \
-H "Content-Type: application/json" \
--data-binary "@/tmp/temp.json"

#
# Client Python API for OpenAI
#
python -B examples/demo_openai.py
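
In the two requests above, the "model" value appears to be the colon-separated concatenation of creator_hf_repo, hf_repo, hf_file and, for vision models, mmproj_hf_file. That format is inferred from these examples only, not from a documented contract, so the helpers below are a hedged sketch with hypothetical names.

```python
from typing import Optional

MODEL_ID_KEYS = ['creator_hf_repo', 'hf_repo', 'hf_file', 'mmproj_hf_file']


def compose_model_id(
    creator_hf_repo: str,
    hf_repo: str,
    hf_file: str,
    mmproj_hf_file: Optional[str] = None,
) -> str:
    """Join the model fields with ':' in the order seen in the curl examples."""
    parts = [creator_hf_repo, hf_repo, hf_file]
    if mmproj_hf_file:
        parts.append(mmproj_hf_file)
    return ':'.join(parts)


def split_model_id(model_id: str) -> dict:
    """Split a model identifier back into named fields (3 or 4 parts expected)."""
    parts = model_id.split(':')
    if len(parts) not in (3, 4):
        raise ValueError(f'expected 3 or 4 colon-separated parts, got {len(parts)}')
    return dict(zip(MODEL_ID_KEYS, parts))


if __name__ == '__main__':
    print(compose_model_id(
        'HuggingFaceTB/SmolLM2-1.7B-Instruct',
        'bartowski/SmolLM2-1.7B-Instruct-GGUF',
        'SmolLM2-1.7B-Instruct-Q4_K_M.gguf',
    ))
```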

References

examples/demo_openai.py

Release history

0.4.44: Apr 3, 2025
0.4.43: Mar 31, 2025
0.4.42: Mar 26, 2025
0.4.40: Mar 7, 2025
0.4.39: Mar 4, 2025
0.4.38: Mar 4, 2025
0.4.37: Mar 3, 2025
0.4.36: Mar 1, 2025
0.4.35: Feb 24, 2025
0.4.34: Feb 19, 2025
0.4.33: Feb 19, 2025
0.4.32: Feb 19, 2025
0.4.31: Feb 19, 2025
0.4.30: Feb 18, 2025
0.4.29: Feb 18, 2025
0.4.28: Feb 18, 2025
0.4.27: Feb 18, 2025
0.4.26: Feb 18, 2025
0.4.25: Feb 18, 2025
0.4.24: Feb 18, 2025
0.4.23: Feb 18, 2025
0.4.22: Feb 18, 2025
0.4.21: Feb 18, 2025
0.4.20: Feb 16, 2025
0.4.19: Feb 6, 2025
0.4.18: Jan 29, 2025
0.4.17: Jan 22, 2025
0.4.16: Jan 15, 2025
0.4.15: Jan 15, 2025
0.4.14: Jan 15, 2025
0.4.13: Jan 14, 2025
0.4.12: Jan 14, 2025
0.4.11: Jan 14, 2025
0.4.10: Jan 14, 2025
0.4.9: Jan 13, 2025
0.4.8: Jan 13, 2025
0.4.7: Jan 13, 2025
0.4.6: Jan 13, 2025
0.4.5: Jan 13, 2025
0.4.4: Jan 13, 2025
0.4.3: Jan 13, 2025
0.4.2: Jan 13, 2025
0.4.1: Jan 13, 2025
0.4.0: Jan 12, 2025
0.3.3: Jan 11, 2025
0.3.2: Jan 9, 2025
0.3.1: Jan 2, 2025
0.3.0: Dec 31, 2024
0.2.7: Dec 18, 2024
0.2.6: Dec 17, 2024
0.2.5: Dec 17, 2024
0.2.4: Dec 17, 2024
0.2.3: Dec 17, 2024
0.2.2: Dec 14, 2024
0.2.1: Dec 13, 2024
0.2.0: Dec 11, 2024
0.1.22: Nov 27, 2024
0.1.21: Sep 17, 2024
0.1.20: Sep 14, 2024
0.1.19: Sep 13, 2024
0.1.18: Sep 9, 2024
0.1.17: Sep 4, 2024
0.1.16: Sep 2, 2024
0.1.15: Aug 20, 2024
0.1.14: Aug 17, 2024
0.1.13: Aug 16, 2024
0.1.12: Aug 16, 2024
0.1.11: Aug 13, 2024
0.1.10: Aug 13, 2024
0.1.9: Aug 13, 2024
0.1.8: Aug 13, 2024
0.1.7: Aug 13, 2024
0.1.6: Aug 13, 2024
0.1.5: Jul 24, 2024
0.1.4: Jul 23, 2024
0.1.3: Jul 22, 2024
0.1.2: Jul 19, 2024
0.1.1: Jul 19, 2024
0.1.0: Jul 18, 2024
0.0.4: Jul 18, 2024
0.0.3: Jul 14, 2024
0.0.2: Jul 9, 2024
0.0.1: Jul 5, 2024

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

llama_cpp_cffi-0.4.44-cp312-cp312-manylinux_2_34_x86_64.whl (292.0 MB view details)

Uploaded Apr 3, 2025. CPython 3.12, manylinux: glibc 2.34+ x86-64

File details

Details for the file llama_cpp_cffi-0.4.44-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

Download URL: llama_cpp_cffi-0.4.44-cp312-cp312-manylinux_2_34_x86_64.whl
Upload date: Apr 3, 2025
Size: 292.0 MB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.12.9 Linux/6.13.8-arch1-1

File hashes

Hashes for llama_cpp_cffi-0.4.44-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`2bb61a5d2d8c9a758f4b863743d9d08a53417933824fa61b2681cbde97e8b555`
MD5	`949d606ba87fd7eb0dec9a59c6d942cc`
BLAKE2b-256	`c5be682330f5e929e8e989113c022218364d9fcd444ad35b536e1e029e4e1a20`
