
Serapeum llama.cpp Provider

Local GGUF model inference for the Serapeum LLM framework

The serapeum-llama-cpp package runs quantised GGUF models locally using the llama-cpp-python backend. It provides:

  • Completion & Chat: Sync and async text generation with streaming
  • Multiple Model Sources: Load from a local path, direct URL, or HuggingFace Hub
  • Prompt Formatters: Ready-made formatters for Llama 2/Mistral and Llama 3 Instruct
  • GPU Offloading: Configurable layer offloading via n_gpu_layers
  • Thread Safety: Per-instance locking for concurrent inference
  • Model Caching: Shared WeakValueDictionary cache avoids duplicate loads

This adapter implements the serapeum.core.llms.LLM completion interface with CompletionToChatMixin, making it compatible with all Serapeum orchestrators and tools.
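
Because chat support comes from CompletionToChatMixin, chat calls are folded into a single prompt and routed through the completion path. A minimal sketch of chat usage, assuming Serapeum exposes a ChatMessage type under serapeum.core.llms (the import path and role literals here are assumptions, not confirmed API):

import os
from serapeum.llama_cpp import LlamaCPP
from serapeum.core.llms import ChatMessage  # assumed import path
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

messages = [
    ChatMessage(role="system", content="You are a concise storyteller."),
    ChatMessage(role="user", content="Tell me a one-line story."),
]
response = llm.chat(messages)  # the mixin converts this into a completion call
print(response.message.content)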

Installation

From Source

cd libs/providers/llama-cpp
uv sync --active

From PyPI

pip install serapeum-llama-cpp

Prerequisites

Before running the examples, you need a GGUF model file. For a quick test, download the tiny stories260K model (~500 KB):

# Download the model
curl -L -o stories260K.gguf \
  https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

Alternatively, paste the URL into your browser to start the download.

Set the LLAMA_MODEL_PATH environment variable to the downloaded file:

export LLAMA_MODEL_PATH="/path/to/stories260K.gguf"

Or add it to a .env file in your project root and load it before running:

# .env
LLAMA_MODEL_PATH=/path/to/stories260K.gguf

Then load it at startup:

from dotenv import load_dotenv
load_dotenv()  # loads LLAMA_MODEL_PATH from .env

All examples below read the model path from this environment variable.

Quick Start

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()
model_path = os.environ["LLAMA_MODEL_PATH"]

llm = LlamaCPP(
    model_path=model_path,
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

response = llm.complete("Once upon a time")
print(response.text)

Model Sources

Local File

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

Direct URL

The model is downloaded, cached, and reused on subsequent runs:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    model_url="https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

HuggingFace Hub

Downloads via huggingface_hub with automatic caching and SHA-256 verification:

from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    hf_model_id="TheBloke/Llama-2-13B-chat-GGUF",
    hf_filename="llama-2-13b-chat.Q4_0.gguf",
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

Prompt Formatters

Each GGUF model family expects a specific chat template, and using the wrong formatter produces garbage output. Choose the formatter that matches your model family:

# Llama 3 Instruct / newer models
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

# Llama 2 / Mistral
from serapeum.llama_cpp.formatters.llama2 import (
    messages_to_prompt,
    completion_to_prompt,
)
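
If you are unsure which formatter a model expects, render a prompt and inspect the special tokens. A quick check (the token layout in the comment reflects the standard Llama 3 Instruct template and may differ slightly from this formatter's exact output):

from serapeum.llama_cpp.formatters.llama3 import completion_to_prompt_v3_instruct

prompt = completion_to_prompt_v3_instruct("Once upon a time")
print(prompt)
# Llama 3 Instruct templates wrap text in header/EOT special tokens, e.g.:
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>
#
# Once upon a time<|eot_id|><|start_header_id|>assistant<|end_header_id|>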

Streaming

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

for chunk in llm.complete("Once upon a time", stream=True):
    print(chunk.delta, end="", flush=True)
print()

Async

import asyncio
import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)

async def main():
    response = await llm.acomplete("Once upon a time")
    print(response.text)

asyncio.run(main())
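
Async streaming is not shown above; if acomplete mirrors complete's stream=True flag and yields chunks asynchronously, a sketch reusing the llm from the previous example would look like this (an assumption to verify against your installed version):

import asyncio

async def stream_story():
    # Assumption: acomplete(..., stream=True) returns an async iterator of
    # chunks with a .delta attribute, mirroring the sync streaming API.
    async for chunk in await llm.acomplete("Once upon a time", stream=True):
        print(chunk.delta, end="", flush=True)
    print()

asyncio.run(stream_story())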

Configuration

import os
from dotenv import load_dotenv
from serapeum.llama_cpp import LlamaCPP
from serapeum.llama_cpp.formatters.llama3 import (
    messages_to_prompt_v3_instruct,
    completion_to_prompt_v3_instruct,
)

load_dotenv()

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    temperature=0.1,              # Sampling temperature (0.0–1.0)
    max_new_tokens=256,           # Maximum tokens to generate
    context_window=4096,          # Context window size
    n_gpu_layers=-1,              # GPU layers (-1 = all)
    stop=["</s>", "<|eot_id|>"], # Stop sequences
    verbose=False,                # Suppress llama.cpp output
    generate_kwargs={},           # Extra kwargs for Llama.__call__
    model_kwargs={},              # Extra kwargs for Llama.__init__
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)
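
generate_kwargs and model_kwargs are passed straight through to llama-cpp-python, so any option the underlying Llama class accepts can be set here. For example (top_p, top_k, n_batch, and n_threads are llama-cpp-python parameters, not Serapeum ones):

llm = LlamaCPP(
    model_path=os.environ["LLAMA_MODEL_PATH"],
    generate_kwargs={"top_p": 0.9, "top_k": 40},    # forwarded to Llama.__call__
    model_kwargs={"n_batch": 512, "n_threads": 8},  # forwarded to Llama.__init__
    messages_to_prompt=messages_to_prompt_v3_instruct,
    completion_to_prompt=completion_to_prompt_v3_instruct,
)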

Testing

# Unit and mock tests (no model required)
python -m pytest libs/providers/llama-cpp/tests -m "not e2e"

# End-to-end tests (requires a GGUF model)
# Set LLAMA_CPP_TEST_MODEL_PATH to your model file
python -m pytest libs/providers/llama-cpp/tests -m e2e
