
Serapeum llama.cpp Provider

Local GGUF model inference for the Serapeum LLM framework

The serapeum-llama-cpp package runs quantised GGUF models locally using the llama-cpp-python backend. It provides:

  • Completion & Chat: Sync and async text generation with streaming
  • Multiple Model Sources: Load from a local path, direct URL, or HuggingFace Hub
  • Prompt Formatters: Ready-made formatters for Llama 2/Mistral and Llama 3 Instruct
  • GPU Offloading: Configurable layer offloading via n_gpu_layers
  • Thread Safety: Per-instance locking for concurrent inference
  • Model Caching: Shared WeakValueDictionary cache avoids duplicate loads
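The last two bullets can be sketched in plain Python. The names below (`_MODEL_CACHE`, `FakeModel`, `Wrapper`) are illustrative stand-ins, not the package's actual internals: a shared `WeakValueDictionary` deduplicates loads while letting unused models be garbage-collected, and a per-instance lock serializes inference.

```python
import threading
import weakref

# Shared cache: an entry disappears automatically once no instance
# holds a strong reference to the underlying model.
_MODEL_CACHE: "weakref.WeakValueDictionary[str, object]" = weakref.WeakValueDictionary()
_CACHE_LOCK = threading.Lock()

class FakeModel:
    """Stand-in for a loaded llama.cpp model."""
    def __init__(self, path: str) -> None:
        self.path = path

def load_model(path: str) -> FakeModel:
    """Return a cached model for `path`, loading it at most once."""
    with _CACHE_LOCK:
        model = _MODEL_CACHE.get(path)
        if model is None:
            model = FakeModel(path)
            _MODEL_CACHE[path] = model
        return model

class Wrapper:
    """Per-instance locking: concurrent calls on one instance serialize."""
    def __init__(self, path: str) -> None:
        self.model = load_model(path)
        self._lock = threading.Lock()

    def complete(self, prompt: str) -> str:
        with self._lock:  # llama.cpp contexts are not re-entrant
            return f"{prompt}..."
```

Two instances created for the same path share one model object, so a large GGUF file is mapped into memory only once.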

This adapter implements the serapeum.core.llms.LLM completion interface with CompletionToChatMixin, making it compatible with all Serapeum orchestrators and tools.
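The mixin idea is simple: a chat call is rendered into a single prompt (via `messages_to_prompt`) and delegated to the completion path. A minimal sketch, with simplified stand-in classes (the real `CompletionToChatMixin` and message types live in `serapeum.core.llms`):

```python
from dataclasses import dataclass

@dataclass
class ChatMessage:
    role: str      # e.g. "system", "user", "assistant"
    content: str

class CompletionOnlyLLM:
    """An LLM that only knows how to complete raw prompts."""
    def complete(self, prompt: str) -> str:
        return f"<completion of {len(prompt)} chars>"

class ChatViaCompletion(CompletionOnlyLLM):
    """Sketch of a completion-to-chat mixin: render messages, then complete."""
    def messages_to_prompt(self, messages: list[ChatMessage]) -> str:
        return "\n".join(f"{m.role}: {m.content}" for m in messages) + "\nassistant:"

    def chat(self, messages: list[ChatMessage]) -> str:
        return self.complete(self.messages_to_prompt(messages))
```

Anything that speaks the chat interface can therefore drive a completion-only backend, which is what makes the adapter usable by Serapeum's orchestrators.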

Installation

From Source

cd libs/providers/llama-cpp
uv sync --active

From PyPI (when published)

pip install serapeum-llama-cpp

Prerequisites

Before running the examples, you need a GGUF model file. For a quick test, download the tiny stories260K model (~500 KB):

# Download the model
curl -L -o stories260K.gguf \
  https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

Alternatively, paste the URL into your browser to start the download.

Set the LLAMA_MODEL_PATH environment variable to the downloaded file:

export LLAMA_MODEL_PATH="/path/to/stories260K.gguf"

Or add it to a .env file in your project root:

# .env
LLAMA_MODEL_PATH=/path/to/stories260K.gguf

and load it in Python before constructing the model:

from dotenv import load_dotenv
load_dotenv()  # loads LLAMA_MODEL_PATH from .env

All examples below read the model path from this environment variable.

Quick Start

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()
model_path = os.environ["LLAMA_MODEL_PATH"]

llm = LlamaCPP(
    model_path=model_path,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Once upon a time")
print(response.text)

Model Sources

Local File

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

Direct URL

The model is downloaded, cached, and reused on subsequent runs:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    model_url="https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
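The caching step can be sketched as: derive a deterministic local path from the URL, download on the first run, and reuse the file afterwards. The hashing scheme and cache directory below are illustrative assumptions, not the package's actual layout:

```python
import hashlib
from pathlib import Path

def cached_model_path(model_url: str, cache_dir: str = "~/.cache/serapeum-models") -> Path:
    """Map a model URL to a stable local path inside the cache directory."""
    digest = hashlib.sha256(model_url.encode()).hexdigest()[:16]
    filename = model_url.rsplit("/", 1)[-1]  # e.g. llama-2-7b.Q4_0.gguf
    return Path(cache_dir).expanduser() / f"{digest}-{filename}"

def ensure_downloaded(model_url: str) -> Path:
    path = cached_model_path(model_url)
    if not path.exists():                # first run only
        path.parent.mkdir(parents=True, exist_ok=True)
        # download(model_url, path)      # actual HTTP fetch elided in this sketch
        path.write_bytes(b"")            # placeholder
    return path                          # later runs hit the cache
```

Because the path is a pure function of the URL, repeated runs with the same `model_url` resolve to the same file.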

HuggingFace Hub

Downloads via huggingface_hub with automatic caching and SHA-256 verification:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    hf_model_id="TheBloke/Llama-2-13B-chat-GGUF",
    hf_filename="llama-2-13b-chat.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
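The SHA-256 verification amounts to streaming the downloaded file through a hash and comparing against the expected digest. A sketch of the check (`huggingface_hub` performs this itself; the function name here is illustrative):

```python
import hashlib
from pathlib import Path

def verify_sha256(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks (constant memory) and compare digests."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```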

Prompt Formatters

GGUF models require a specific chat template. Using the wrong formatter produces garbage output. Choose the formatter that matches your model family:

# Llama 3 Instruct / newer models
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Llama 2 / Mistral
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)
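To see why the choice matters, here is roughly what each family expects for a single user turn. These templates are paraphrased from the publicly documented Llama 2 and Llama 3 chat formats; the package's formatters emit the exact strings, so treat this as an illustration only:

```python
def llama3_prompt(user_msg: str) -> str:
    """Approximate Llama 3 Instruct chat template for one user turn."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def llama2_prompt(user_msg: str, system: str = "You are a helpful assistant.") -> str:
    """Approximate Llama 2-style [INST] template for one user turn."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_msg} [/INST]"
```

A Llama 3 model given `[INST]`-style input (or vice versa) never sees the special tokens it was trained on, which is why the output degrades to nonsense.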

Streaming

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

for chunk in llm.complete("Once upon a time", stream=True):
    print(chunk.delta, end="", flush=True)
print()

Async

import asyncio
import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

async def main():
    response = await llm.acomplete("Once upon a time")
    print(response.text)

asyncio.run(main())

Configuration

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    temperature=0.1,              # Sampling temperature (0.0–1.0)
    max_new_tokens=256,           # Maximum tokens to generate
    context_window=4096,          # Context window size
    n_gpu_layers=-1,              # GPU layers (-1 = all)
    stop=["</s>", "<|eot_id|>"],  # Stop sequences
    verbose=False,                # Suppress llama.cpp output
    generate_kwargs={},           # Extra kwargs for Llama.__call__
    model_kwargs={},              # Extra kwargs for Llama.__init__
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

Testing

# Unit and mock tests (no model required)
python -m pytest libs/providers/llama-cpp/tests -m "not e2e"

# End-to-end tests (requires a GGUF model)
# Set LLAMA_CPP_TEST_MODEL_PATH to your model file
python -m pytest libs/providers/llama-cpp/tests -m e2e
