
Serapeum llama.cpp Provider

Local GGUF model inference for the Serapeum LLM framework

The serapeum-llama-cpp package runs quantised GGUF models locally using the llama-cpp-python backend. It provides:

  • Completion & Chat: Sync and async text generation with streaming
  • Multiple Model Sources: Load from a local path, direct URL, or HuggingFace Hub
  • Prompt Formatters: Ready-made formatters for Llama 2/Mistral and Llama 3 Instruct
  • GPU Offloading: Configurable layer offloading via n_gpu_layers
  • Thread Safety: Per-instance locking for concurrent inference
  • Model Caching: Shared WeakValueDictionary cache avoids duplicate loads
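The last two bullets can be sketched in plain Python. The names below (`_MODEL_CACHE`, `FakeModel`, `Wrapper`) are illustrative stand-ins, not the package's actual internals: a shared `WeakValueDictionary` deduplicates loads while letting unused models be garbage-collected, and a per-instance lock serializes inference.

```python
import threading
import weakref

# Shared cache: an entry disappears automatically once no instance
# holds a strong reference to the underlying model.
_MODEL_CACHE: "weakref.WeakValueDictionary[str, object]" = weakref.WeakValueDictionary()
_CACHE_LOCK = threading.Lock()

class FakeModel:
    """Stand-in for a loaded llama.cpp model."""
    def __init__(self, path: str) -> None:
        self.path = path

def load_model(path: str) -> FakeModel:
    """Return a cached model for `path`, loading it at most once."""
    with _CACHE_LOCK:
        model = _MODEL_CACHE.get(path)
        if model is None:
            model = FakeModel(path)
            _MODEL_CACHE[path] = model
        return model

class Wrapper:
    """Per-instance locking: concurrent calls on one instance serialize."""
    def __init__(self, path: str) -> None:
        self.model = load_model(path)
        self._lock = threading.Lock()

    def complete(self, prompt: str) -> str:
        with self._lock:  # llama.cpp contexts are not re-entrant
            return f"{prompt}..."
```

Two instances created for the same path share one model object, so a large GGUF file is mapped into memory only once.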

This adapter implements the serapeum.core.llms.LLM completion interface with CompletionToChatMixin, making it compatible with all Serapeum orchestrators and tools.
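The mixin idea is simple: a chat call is rendered into a single prompt (via `messages_to_prompt`) and delegated to the completion path. A minimal sketch, with simplified stand-in classes (the real `CompletionToChatMixin` and message types live in `serapeum.core.llms`):

```python
from dataclasses import dataclass

@dataclass
class ChatMessage:
    role: str      # e.g. "system", "user", "assistant"
    content: str

class CompletionOnlyLLM:
    """An LLM that only knows how to complete raw prompts."""
    def complete(self, prompt: str) -> str:
        return f"<completion of {len(prompt)} chars>"

class ChatViaCompletion(CompletionOnlyLLM):
    """Sketch of a completion-to-chat mixin: render messages, then complete."""
    def messages_to_prompt(self, messages: list[ChatMessage]) -> str:
        return "\n".join(f"{m.role}: {m.content}" for m in messages) + "\nassistant:"

    def chat(self, messages: list[ChatMessage]) -> str:
        return self.complete(self.messages_to_prompt(messages))
```

Anything that speaks the chat interface can therefore drive a completion-only backend, which is what makes the adapter usable by Serapeum's orchestrators.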

Installation

From Source

cd libs/providers/llama-cpp
uv sync --active

From PyPI (when published)

pip install serapeum-llama-cpp

Prerequisites

Before running the examples, you need a GGUF model file. For a quick test, download the tiny stories260K model (~500 KB):

# Download the model
curl -L -o stories260K.gguf \
  https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

Alternatively, paste the URL into your browser to start the download.

Set the LLAMA_MODEL_PATH environment variable to the downloaded file:

export LLAMA_MODEL_PATH="/path/to/stories260K.gguf"

Or add it to a .env file in your project root:

# .env
LLAMA_MODEL_PATH=/path/to/stories260K.gguf

and load it in Python before constructing the model:

from dotenv import load_dotenv
load_dotenv()  # loads LLAMA_MODEL_PATH from .env

All examples below read the model path from this environment variable.

Quick Start

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()
model_path = os.environ["LLAMA_MODEL_PATH"]

llm = LlamaCPP(
    model_path=model_path,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Once upon a time")
print(response.text)

Model Sources

Local File

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

Direct URL

The model is downloaded, cached, and reused on subsequent runs:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    model_url="https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
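The caching step can be sketched as: derive a deterministic local path from the URL, download on the first run, and reuse the file afterwards. The hashing scheme and cache directory below are illustrative assumptions, not the package's actual layout:

```python
import hashlib
from pathlib import Path

def cached_model_path(model_url: str, cache_dir: str = "~/.cache/serapeum-models") -> Path:
    """Map a model URL to a stable local path inside the cache directory."""
    digest = hashlib.sha256(model_url.encode()).hexdigest()[:16]
    filename = model_url.rsplit("/", 1)[-1]  # e.g. llama-2-7b.Q4_0.gguf
    return Path(cache_dir).expanduser() / f"{digest}-{filename}"

def ensure_downloaded(model_url: str) -> Path:
    path = cached_model_path(model_url)
    if not path.exists():                # first run only
        path.parent.mkdir(parents=True, exist_ok=True)
        # download(model_url, path)      # actual HTTP fetch elided in this sketch
        path.write_bytes(b"")            # placeholder
    return path                          # later runs hit the cache
```

Because the path is a pure function of the URL, repeated runs with the same `model_url` resolve to the same file.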

HuggingFace Hub

Downloads via huggingface_hub with automatic caching and SHA-256 verification:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    hf_model_id="TheBloke/Llama-2-13B-chat-GGUF",
    hf_filename="llama-2-13b-chat.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
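The SHA-256 verification amounts to streaming the downloaded file through a hash and comparing against the expected digest. A sketch of the check (`huggingface_hub` performs this itself; the function name here is illustrative):

```python
import hashlib
from pathlib import Path

def verify_sha256(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks (constant memory) and compare digests."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```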

Prompt Formatters

GGUF models require a specific chat template. Using the wrong formatter produces garbage output. Choose the formatter that matches your model family:

# Llama 3 Instruct / newer models
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Llama 2 / Mistral
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)
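To see why the choice matters, here is roughly what each family expects for a single user turn. These templates are paraphrased from the publicly documented Llama 2 and Llama 3 chat formats; the package's formatters emit the exact strings, so treat this as an illustration only:

```python
def llama3_prompt(user_msg: str) -> str:
    """Approximate Llama 3 Instruct chat template for one user turn."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def llama2_prompt(user_msg: str, system: str = "You are a helpful assistant.") -> str:
    """Approximate Llama 2-style [INST] template for one user turn."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_msg} [/INST]"
```

A Llama 3 model given `[INST]`-style input (or vice versa) never sees the special tokens it was trained on, which is why the output degrades to nonsense.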

Streaming

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

for chunk in llm.complete("Once upon a time", stream=True):
    print(chunk.delta, end="", flush=True)
print()

Async

import asyncio
import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

async def main():
    response = await llm.acomplete("Once upon a time")
    print(response.text)

asyncio.run(main())

Configuration

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    temperature=0.1,              # Sampling temperature (0.0–1.0)
    max_new_tokens=256,           # Maximum tokens to generate
    context_window=4096,          # Context window size
    n_gpu_layers=-1,              # GPU layers (-1 = all)
    stop=["</s>", "<|eot_id|>"],  # Stop sequences
    verbose=False,                # Suppress llama.cpp output
    generate_kwargs={},           # Extra kwargs for Llama.__call__
    model_kwargs={},              # Extra kwargs for Llama.__init__
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

Testing

# Unit and mock tests (no model required)
python -m pytest libs/providers/llama-cpp/tests -m "not e2e"

# End-to-end tests (requires a GGUF model)
# Set LLAMA_CPP_TEST_MODEL_PATH to your model file
python -m pytest libs/providers/llama-cpp/tests -m e2e
