
llama-cpp-cffi

Python binding for llama.cpp using cffi. Supports CPU, CUDA 12.5.1 and CUDA 12.6 execution.

NOTE: The currently supported operating system is Linux (manylinux_2_28 and musllinux_1_2); Windows and macOS versions are in the works.

Install

Basic library install:

pip install llama-cpp-cffi
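
To sanity-check the install, import the llama module that the examples below use:

python -c "import llama"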

If you want an OpenAI-compatible Chat Completions API:

pip install llama-cpp-cffi[openai]

IMPORTANT: To take advantage of NVIDIA GPU acceleration, make sure you have CUDA 12 installed. If you don't, follow the instructions at https://developer.nvidia.com/cuda-downloads .

GPU compute capability: the CUDA builds target compute_61, compute_70, compute_75, compute_80, compute_86, and compute_89, covering most GPUs from the GeForce GTX 1050 to the NVIDIA H100.
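
Before installing, you can check your CUDA toolkit version and your GPU's compute capability with the standard NVIDIA tools (the compute_cap query requires a reasonably recent driver):

nvcc --version
nvidia-smi --query-gpu=name,compute_cap --format=csv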

Example

Library Usage

References:

  • examples/demo_tinyllama_chat.py
  • examples/demo_tinyllama_tool.py

from llama import llama_generate, get_config, Model, Options


# The creator repo supplies the model config; the GGUF repo/file is the
# quantized build that actually runs.
model = Model(
    creator_hf_repo='TinyLlama/TinyLlama-1.1B-Chat-v1.0',
    hf_repo='TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF',
    hf_file='tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
)

# Read the config from the creator's repo to size the context window.
config = get_config(model.creator_hf_repo)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},
]

options = Options(
    ctx_size=config.max_position_embeddings,  # use the model's full context window
    predict=-2,  # llama.cpp convention: -2 generates until the context is filled
    model=model,
    prompt=messages,
)

# Text chunks are streamed back as they are generated.
for chunk in llama_generate(options):
    print(chunk, flush=True, end='')

# newline
print()
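
Since llama_generate yields plain text chunks, the whole completion can just as easily be collected into a single string:

output = ''.join(llama_generate(options))
print(output)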

OpenAI-compatible Chat Completions API - Server and Client

Run the OpenAI-compatible server:

python -m llama.openai
# or
python -B -u -m gunicorn --bind '0.0.0.0:11434' --timeout 300 --workers 1 --worker-class aiohttp.GunicornWebWorker 'llama.openai:build_app()'
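
With the server running, you can smoke-test the endpoint with a raw HTTP request before wiring up a client. The port matches the client example below, and the model value here is a placeholder for whatever str(model) yields in the Python examples:

curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer llama-cpp-cffi' \
  -d '{"model": "<str(model)>", "messages": [{"role": "user", "content": "1 + 1 = ?"}]}'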

Run the OpenAI-compatible client examples/demo_openai_0.py:

python -B examples/demo_openai_0.py

from openai import OpenAI
from llama import Model


client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='llama-cpp-cffi',
)

model = Model(
    creator_hf_repo='TinyLlama/TinyLlama-1.1B-Chat-v1.0',
    hf_repo='TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF',
    hf_file='tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf',
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'}
]


def demo_chat_completions():
    print('demo_chat_completions:')

    response = client.chat.completions.create(
        model=str(model),
        messages=messages,
        temperature=0.0,
    )

    print(response.choices[0].message.content)


def demo_chat_completions_stream():
    print('demo_chat_completions_stream:')

    response = client.chat.completions.create(
        model=str(model),
        messages=messages,
        temperature=0.0,
        stream=True,
    )

    for chunk in response:
        # The final streamed chunk can carry a None delta, so guard the print.
        print(chunk.choices[0].delta.content or '', flush=True, end='')

    print()


if __name__ == '__main__':
    demo_chat_completions()
    demo_chat_completions_stream()

Demos

python -B examples/demo_tinyllama_chat.py
python -B examples/demo_tinyllama_tool.py

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • llama_cpp_cffi-0.1.15-pp310-pypy310_pp73-manylinux_2_28_x86_64.whl (145.6 MB) - PyPy, manylinux: glibc 2.28+ x86-64
  • llama_cpp_cffi-0.1.15-pp310-pypy310_pp73-manylinux_2_28_aarch64.whl (12.5 MB) - PyPy, manylinux: glibc 2.28+ ARM64
  • llama_cpp_cffi-0.1.15-cp313-cp313-musllinux_1_2_x86_64.whl (24.3 MB) - CPython 3.13, musllinux: musl 1.2+ x86-64
  • llama_cpp_cffi-0.1.15-cp313-cp313-musllinux_1_2_aarch64.whl (10.4 MB) - CPython 3.13, musllinux: musl 1.2+ ARM64
  • llama_cpp_cffi-0.1.15-cp313-cp313-manylinux_2_28_x86_64.whl (145.7 MB) - CPython 3.13, manylinux: glibc 2.28+ x86-64
  • llama_cpp_cffi-0.1.15-cp313-cp313-manylinux_2_28_aarch64.whl (12.6 MB) - CPython 3.13, manylinux: glibc 2.28+ ARM64
  • llama_cpp_cffi-0.1.15-cp312-cp312-musllinux_1_2_x86_64.whl (24.3 MB) - CPython 3.12, musllinux: musl 1.2+ x86-64
  • llama_cpp_cffi-0.1.15-cp312-cp312-musllinux_1_2_aarch64.whl (10.4 MB) - CPython 3.12, musllinux: musl 1.2+ ARM64
  • llama_cpp_cffi-0.1.15-cp312-cp312-manylinux_2_28_x86_64.whl (145.7 MB) - CPython 3.12, manylinux: glibc 2.28+ x86-64
  • llama_cpp_cffi-0.1.15-cp312-cp312-manylinux_2_28_aarch64.whl (12.6 MB) - CPython 3.12, manylinux: glibc 2.28+ ARM64
  • llama_cpp_cffi-0.1.15-cp311-cp311-musllinux_1_2_x86_64.whl (24.3 MB) - CPython 3.11, musllinux: musl 1.2+ x86-64
  • llama_cpp_cffi-0.1.15-cp311-cp311-musllinux_1_2_aarch64.whl (10.4 MB) - CPython 3.11, musllinux: musl 1.2+ ARM64
  • llama_cpp_cffi-0.1.15-cp311-cp311-manylinux_2_28_x86_64.whl (145.7 MB) - CPython 3.11, manylinux: glibc 2.28+ x86-64
  • llama_cpp_cffi-0.1.15-cp311-cp311-manylinux_2_28_aarch64.whl (12.6 MB) - CPython 3.11, manylinux: glibc 2.28+ ARM64
  • llama_cpp_cffi-0.1.15-cp310-cp310-musllinux_1_2_x86_64.whl (24.3 MB) - CPython 3.10, musllinux: musl 1.2+ x86-64
  • llama_cpp_cffi-0.1.15-cp310-cp310-musllinux_1_2_aarch64.whl (10.4 MB) - CPython 3.10, musllinux: musl 1.2+ ARM64
  • llama_cpp_cffi-0.1.15-cp310-cp310-manylinux_2_28_x86_64.whl (145.7 MB) - CPython 3.10, manylinux: glibc 2.28+ x86-64
  • llama_cpp_cffi-0.1.15-cp310-cp310-manylinux_2_28_aarch64.whl (12.6 MB) - CPython 3.10, manylinux: glibc 2.28+ ARM64
