Python binding for llama.cpp using cffi

Project description

llama-cpp-cffi

Python binding for llama.cpp using cffi and ctypes. Supports CPU and CUDA 12.5 execution.

Install

Basic library install:

pip install llama-cpp-cffi

If you want the OpenAI © compatible Chat Completions API, install the optional extra:

pip install llama-cpp-cffi[openai]
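
A quick way to confirm the install is to import the names the examples below use (a minimal sanity check, nothing more):

python -c 'from llama import llama_generate, get_config, Model, Options'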

IMPORTANT: If you want to take advantage of NVIDIA GPU acceleration, make sure you have CUDA 12.5 installed. If you don't have CUDA 12.5, follow the instructions here: https://developer.nvidia.com/cuda-downloads
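
You can check which toolkit version is visible before installing (this assumes the CUDA toolkit's nvcc is on your PATH):

nvcc --version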

Example

Library Usage

examples/demo_0.py

from llama import llama_generate, get_config, Model, Options

model = Model(
    creator_hf_repo='TinyLlama/TinyLlama-1.1B-Chat-v1.0',  # original model repo; its config is fetched below
    hf_repo='TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF',      # repo holding the GGUF conversions
    hf_file='tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',        # Q4_K_M quantized weights
)

config = get_config(model.creator_hf_repo)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},
]

options = Options(
    ctx_size=config.max_position_embeddings,  # use the model's full context window
    predict=-2,  # llama.cpp convention: -1 = infinite, -2 = until the context is filled
    model=model,
    prompt=messages,
)

for chunk in llama_generate(options):
    print(chunk, flush=True, end='')

# newline
print()
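
Since llama_generate yields plain text chunks, the same stream can just as easily be collected instead of printed; a minimal sketch reusing the options object above:

completion = ''.join(llama_generate(options))
print(completion)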

OpenAI © compatible Chat Completions API (TBD)

Run the OpenAI © compatible server:

python -B llama/openai.py
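
Once the server is up, you can probe it before wiring in a client. A minimal sketch, assuming the port used in the example below and the standard /v1/models route of OpenAI compatible servers (this route is an assumption, not confirmed by this project):

import urllib.request

# Hypothetical connectivity check; adjust the URL if the server differs.
with urllib.request.urlopen('http://localhost:11434/v1/models') as resp:
    print(resp.status, resp.read().decode())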

Run the example examples/demo_1.py, which uses the openai module:

from openai import OpenAI
from llama import Model

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='llama-cpp-cffi',
)

model = Model(
    creator_hf_repo='TinyLlama/TinyLlama-1.1B-Chat-v1.0',
    hf_repo='TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF',
    hf_file='tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'}
]


def demo_chat_completions():
    print('demo_chat_completions:')

    response = client.chat.completions.create(
        model=str(model),
        messages=messages,
        temperature=0.0,
    )

    print(response.choices[0].message.content)


def demo_chat_completions_stream():
    print('demo_chat_completions_stream:')

    response = client.chat.completions.create(
        model=str(model),
        messages=messages,
        temperature=0.0,
        stream=True,
    )

    for chunk in response:
        print(chunk.choices[0].delta.content or '', flush=True, end='')  # delta.content may be None on the final chunk

    print()


if __name__ == '__main__':
    demo_chat_completions()
    demo_chat_completions_stream()
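
Chat Completions is stateless, so multi-turn conversations are built by appending each reply to the message list before the next call. A minimal sketch reusing the client, model, and messages from above (the follow-up question is illustrative):

def demo_follow_up():
    first = client.chat.completions.create(
        model=str(model),
        messages=messages,
        temperature=0.0,
    )
    reply = first.choices[0].message.content

    # Append the assistant's reply plus a new user turn, then call again.
    second = client.chat.completions.create(
        model=str(model),
        messages=messages + [
            {'role': 'assistant', 'content': reply},
            {'role': 'user', 'content': 'Now evaluate 2 + 3 in Python.'},
        ],
        temperature=0.0,
    )
    print(second.choices[0].message.content)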

Demos

#
# run demos
#
python -B examples/demo_0.py
python -B examples/demo_cffi_cpu.py
python -B examples/demo_cffi_cuda_12_5.py

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

File                                                                Size      Python        Platform
llama_cpp_cffi-0.1.3-pp310-pypy310_pp73-manylinux_2_28_x86_64.whl   104.7 MB  PyPy 3.10     manylinux: glibc 2.28+ x86-64
llama_cpp_cffi-0.1.3-pp310-pypy310_pp73-manylinux_2_28_aarch64.whl  764.2 kB  PyPy 3.10     manylinux: glibc 2.28+ ARM64
llama_cpp_cffi-0.1.3-cp312-cp312-musllinux_1_2_x86_64.whl           887.2 kB  CPython 3.12  musllinux: musl 1.2+ x86-64
llama_cpp_cffi-0.1.3-cp312-cp312-musllinux_1_2_aarch64.whl          784.8 kB  CPython 3.12  musllinux: musl 1.2+ ARM64
llama_cpp_cffi-0.1.3-cp312-cp312-manylinux_2_28_x86_64.whl          104.7 MB  CPython 3.12  manylinux: glibc 2.28+ x86-64
llama_cpp_cffi-0.1.3-cp312-cp312-manylinux_2_28_aarch64.whl         773.8 kB  CPython 3.12  manylinux: glibc 2.28+ ARM64
llama_cpp_cffi-0.1.3-cp311-cp311-musllinux_1_2_x86_64.whl           887.1 kB  CPython 3.11  musllinux: musl 1.2+ x86-64
llama_cpp_cffi-0.1.3-cp311-cp311-musllinux_1_2_aarch64.whl          785.0 kB  CPython 3.11  musllinux: musl 1.2+ ARM64
llama_cpp_cffi-0.1.3-cp311-cp311-manylinux_2_28_x86_64.whl          104.7 MB  CPython 3.11  manylinux: glibc 2.28+ x86-64
llama_cpp_cffi-0.1.3-cp311-cp311-manylinux_2_28_aarch64.whl         773.7 kB  CPython 3.11  manylinux: glibc 2.28+ ARM64
llama_cpp_cffi-0.1.3-cp310-cp310-musllinux_1_2_x86_64.whl           887.2 kB  CPython 3.10  musllinux: musl 1.2+ x86-64
llama_cpp_cffi-0.1.3-cp310-cp310-musllinux_1_2_aarch64.whl          785.0 kB  CPython 3.10  musllinux: musl 1.2+ ARM64
llama_cpp_cffi-0.1.3-cp310-cp310-manylinux_2_28_x86_64.whl          104.7 MB  CPython 3.10  manylinux: glibc 2.28+ x86-64
llama_cpp_cffi-0.1.3-cp310-cp310-manylinux_2_28_aarch64.whl         773.9 kB  CPython 3.10  manylinux: glibc 2.28+ ARM64
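
pip picks the matching wheel for your interpreter and platform automatically; to pin this exact release:

pip install llama-cpp-cffi==0.1.3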
