Sinapsis templates for LLM text completion with LLaMA-CPP
Sinapsis LLaMA CPP
🐍 Installation • 🚀 Features • 📚 Usage example • 🌐 Webapps • 📙 Documentation • 🔍 License
The sinapsis-llama-cpp module provides a suite of templates to run LLMs with llama-cpp.
[!IMPORTANT] We now include support for Llama4 models!
To use them, install the dependency (if you have not installed sinapsis-llama-cpp[all]):
uv pip install sinapsis-llama-cpp[llama-four] --extra-index-url https://pypi.sinapsis.tech
You need a HuggingFace token. See the official instructions and set it using:
export HF_TOKEN=<token-provided-by-hf>
Then test it through the CLI or the webapp by changing the AGENT_CONFIG_PATH environment variable.
[!NOTE] Llama 4 models require large GPUs. Running on smaller consumer-grade GPUs is possible, but a single inference may take hours.
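To give a rough sense of why large GPUs are needed, the sketch below estimates the memory required just to hold the model weights at different precisions. The parameter count is an assumption (roughly 109 billion total parameters for Llama-4-Scout-17B-16E; check the model card), and the estimate ignores activations and the KV cache, so treat it as a lower bound rather than a sizing guide.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: total parameter count taken from the public model card,
# not from this package. Adjust it for the model you actually use.
SCOUT_TOTAL_PARAMS = 109e9

# Bytes per parameter for a few common precisions.
PRECISIONS = {"float32": 4, "bfloat16": 2, "int4": 0.5}

for dtype, nbytes in PRECISIONS.items():
    gib = weight_memory_gib(SCOUT_TOTAL_PARAMS, nbytes)
    print(f"{dtype:>8}: ~{gib:.0f} GiB for weights only")
```

In a mixture-of-experts model every expert generally has to be resident in memory even though only a subset is active per token, which is why the total parameter count is used here.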
🐍 Installation
Install using your package manager of choice. We encourage the use of uv.
Example with uv:
uv pip install sinapsis-llama-cpp --extra-index-url https://pypi.sinapsis.tech
or with raw pip:
pip install sinapsis-llama-cpp --extra-index-url https://pypi.sinapsis.tech
[!IMPORTANT] Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:
with uv:
uv pip install sinapsis-llama-cpp[all] --extra-index-url https://pypi.sinapsis.tech
or with raw pip:
pip install sinapsis-llama-cpp[all] --extra-index-url https://pypi.sinapsis.tech
🚀 Features
Templates Supported
- LLaMATextCompletion: Template for text completion using LLaMA CPP.
  Attributes
  - init_args (LLaMAInitArgs, required): LLaMA model arguments.
    - llm_model_name (str, required): The name or path of the LLM model to use (e.g., 'TheBloke/Llama-2-7B-GGUF').
    - llm_model_file (str, required): The specific GGUF model file (e.g., 'llama-2-7b.Q2_K.gguf').
    - n_gpu_layers (int, optional): Number of layers to offload to the GPU (-1 for all). Defaults to 0.
    - use_mmap (bool, optional): Use memory-mapping to load the model. Defaults to True.
    - use_mlock (bool, optional): Force the model to be kept in RAM. Defaults to False.
    - seed (int, optional): RNG seed for model initialization. Defaults to LLAMA_DEFAULT_SEED.
    - n_ctx (int, optional): The context window size. Defaults to 512.
    - n_batch (int, optional): The batch size for prompt processing. Defaults to 512.
    - n_ubatch (int, optional): The batch size for token generation. Defaults to 512.
    - n_threads (int, optional): CPU threads for generation. Defaults to None.
    - n_threads_batch (int, optional): CPU threads for batch processing. Defaults to None.
    - flash_attn (bool, optional): Enable Flash Attention if supported by the GPU. Defaults to False.
    - chat_format (str, optional): Chat template format (e.g., 'chatml'). Defaults to None.
    - verbose (bool, optional): Enable verbose logging from llama.cpp. Defaults to True.
  - completion_args (LLaMACompletionArgs, required): Generation arguments to pass to the selected model.
    - temperature (float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
    - top_p (float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
    - top_k (int, optional): Top-k sampling. Considers the 'k' most probable tokens. Defaults to 40.
    - max_tokens (int, required): The maximum number of new tokens to generate.
    - min_p (float, optional): Min-p sampling. Filters tokens whose probability is below min_p times the probability of the most likely token. Defaults to 0.05.
    - stop (str | list[str], optional): Stop sequences to halt generation. Defaults to None.
    - seed (int, optional): Overrides the model's seed just for this call. Defaults to None.
    - repeat_penalty (float, optional): Penalty for repeating tokens (1.0 = no penalty). Defaults to 1.0.
    - presence_penalty (float, optional): Penalty for tokens that have already appeared, which encourages new tokens (0.0 = no penalty). Defaults to 0.0.
    - frequency_penalty (float, optional): Penalty that grows with how often a token has already appeared (0.0 = no penalty). Defaults to 0.0.
    - logit_bias (dict[int, float], optional): Applies a bias to specific tokens. Defaults to None.
  - chat_history_key (str, optional): Key in the packet's generic_data to find the conversation history.
  - rag_context_key (str, optional): Key in the packet's generic_data to find RAG context to inject.
  - system_prompt (str | Path, optional): The system prompt (or path to one) to instruct the model.
  - pattern (dict, optional): A regex pattern used to post-process the model's response.
  - keep_before (bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
- LLaMATextCompletionWithMCP: Template for text completion with MCP tool integration using LLaMA CPP.
  Attributes
  - init_args (LLaMAInitArgs, required): LLaMA model arguments.
    - llm_model_name (str, required): The name or path of the LLM model to use (e.g., 'TheBloke/Llama-2-7B-GGUF').
    - llm_model_file (str, required): The specific GGUF model file (e.g., 'llama-2-7b.Q2_K.gguf').
    - n_gpu_layers (int, optional): Number of layers to offload to the GPU (-1 for all). Defaults to 0.
    - use_mmap (bool, optional): Use memory-mapping to load the model. Defaults to True.
    - use_mlock (bool, optional): Force the model to be kept in RAM. Defaults to False.
    - seed (int, optional): RNG seed for model initialization. Defaults to LLAMA_DEFAULT_SEED.
    - n_ctx (int, optional): The context window size. Defaults to 512.
    - n_batch (int, optional): The batch size for prompt processing. Defaults to 512.
    - n_ubatch (int, optional): The batch size for token generation. Defaults to 512.
    - n_threads (int, optional): CPU threads for generation. Defaults to None.
    - n_threads_batch (int, optional): CPU threads for batch processing. Defaults to None.
    - flash_attn (bool, optional): Enable Flash Attention if supported by the GPU. Defaults to False.
    - chat_format (str, optional): Chat template format (e.g., 'chatml'). Defaults to None.
    - verbose (bool, optional): Enable verbose logging from llama.cpp. Defaults to True.
  - completion_args (LLaMACompletionArgs, required): Generation arguments to pass to the selected model.
    - temperature (float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
    - top_p (float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
    - top_k (int, optional): Top-k sampling. Considers the 'k' most probable tokens. Defaults to 40.
    - max_tokens (int, required): The maximum number of new tokens to generate.
    - min_p (float, optional): Min-p sampling. Filters tokens whose probability is below min_p times the probability of the most likely token. Defaults to 0.05.
    - stop (str | list[str], optional): Stop sequences to halt generation. Defaults to None.
    - seed (int, optional): Overrides the model's seed just for this call. Defaults to None.
    - repeat_penalty (float, optional): Penalty for repeating tokens (1.0 = no penalty). Defaults to 1.0.
    - presence_penalty (float, optional): Penalty for tokens that have already appeared, which encourages new tokens (0.0 = no penalty). Defaults to 0.0.
    - frequency_penalty (float, optional): Penalty that grows with how often a token has already appeared (0.0 = no penalty). Defaults to 0.0.
    - logit_bias (dict[int, float], optional): Applies a bias to specific tokens. Defaults to None.
  - chat_history_key (str, optional): Key in the packet's generic_data to find the conversation history.
  - rag_context_key (str, optional): Key in the packet's generic_data to find RAG context to inject.
  - system_prompt (str | Path, optional): The system prompt (or path to one) to instruct the model.
  - pattern (dict, optional): A regex pattern used to post-process the model's response.
  - keep_before (bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
  - tools_key (str, optional): Key used to extract the raw tools from the data container. Defaults to "".
  - max_tool_retries (int, optional): Maximum consecutive tool execution failures before stopping. Defaults to 3.
  - add_tool_to_prompt (bool, optional): Whether to automatically append tool descriptions to the system prompt. Defaults to True.
- LLama4TextToText: Template for text-to-text chat processing using the LLama 4 model.
  Attributes
  - init_args (LLaMA4InitArgs, required): LLaMA4 model arguments.
    - llm_model_name (str, required): The name or path of the LLM model to use (e.g., 'meta-llama/Llama-4-Scout-17B-16E-Instruct').
    - cache_dir (str, optional): Path to use for the model cache and download.
    - device_map (str, optional): Device mapping for from_pretrained. Defaults to auto.
    - torch_dtype (str, optional): Model tensor precision (e.g., 'auto', 'float16'). Defaults to auto.
    - max_memory (dict, optional): Max memory allocation per device. Defaults to None.
  - completion_args (LLMCompletionArgs, required): Generation arguments to pass to the selected model.
    - temperature (float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
    - top_p (float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
    - top_k (int, optional): Top-k sampling. Considers the 'k' most probable tokens. Defaults to 40.
    - max_length (int, optional): The maximum length of the sequence (prompt + generation). Defaults to 20.
    - max_new_tokens (int, optional): The maximum number of new tokens to generate. Defaults to None.
    - do_sample (bool, optional): Whether to use sampling (True) or greedy decoding (False). Defaults to True.
    - min_p (float, optional): Min-p sampling. Filters tokens whose probability is below min_p times the probability of the most likely token. Defaults to None.
    - repetition_penalty (float, optional): Penalty applied to repeated tokens (1.0 = no penalty). Defaults to 1.0.
  - chat_history_key (str, optional): Key in the packet's generic_data to find the conversation history.
  - rag_context_key (str, optional): Key in the packet's generic_data to find RAG context to inject.
  - system_prompt (str | Path, optional): The system prompt (or path to one) to instruct the model.
  - pattern (dict, optional): A regex pattern used to post-process the model's response.
  - keep_before (bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
- LLama4MultiModal: Template for multimodal chat processing using the LLama 4 model.
  Attributes
  - init_args (LLaMA4InitArgs, required): LLaMA4 model arguments.
    - llm_model_name (str, required): The name or path of the LLM model to use (e.g., 'meta-llama/Llama-4-Scout-17B-16E-Instruct').
    - cache_dir (str, optional): Path to use for the model cache and download.
    - device_map (str, optional): Device mapping for from_pretrained. Defaults to auto.
    - torch_dtype (str, optional): Model tensor precision (e.g., 'auto', 'float16'). Defaults to auto.
    - max_memory (dict, optional): Max memory allocation per device. Defaults to None.
  - completion_args (LLMCompletionArgs, required): Generation arguments to pass to the selected model.
    - temperature (float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
    - top_p (float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
    - top_k (int, optional): Top-k sampling. Considers the 'k' most probable tokens. Defaults to 40.
    - max_length (int, optional): The maximum length of the sequence (prompt + generation). Defaults to 20.
    - max_new_tokens (int, optional): The maximum number of new tokens to generate. Defaults to None.
    - do_sample (bool, optional): Whether to use sampling (True) or greedy decoding (False). Defaults to True.
    - min_p (float, optional): Min-p sampling. Filters tokens whose probability is below min_p times the probability of the most likely token. Defaults to None.
    - repetition_penalty (float, optional): Penalty applied to repeated tokens (1.0 = no penalty). Defaults to 1.0.
  - chat_history_key (str, optional): Key in the packet's generic_data to find the conversation history.
  - rag_context_key (str, optional): Key in the packet's generic_data to find RAG context to inject.
  - system_prompt (str | Path, optional): The system prompt (or path to one) to instruct the model.
  - pattern (dict, optional): A regex pattern used to post-process the model's response.
  - keep_before (bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
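The interaction between the top_k, top_p and min_p completion arguments listed above can be illustrated with a small sketch. This is not the sampler used by llama.cpp or by these templates; it is a simplified illustration that applies the three filters in sequence to a toy distribution and skips the renormalisation that real implementations may perform between steps. The function name and the example probabilities are hypothetical.

```python
def filter_candidates(probs, top_k=40, top_p=0.95, min_p=0.05):
    """Return the token ids that survive top-k, top-p and min-p filtering.

    `probs` maps a token id to its probability. Simplified sketch: no
    renormalisation is done between the filtering steps.
    """
    # Sort candidates from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []

    # top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # min-p: drop tokens below min_p times the probability of the best token.
    threshold = min_p * kept[0][1]
    return [token for token, p in kept if p >= threshold]


# Toy distribution over five hypothetical tokens.
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.04, "e": 0.01}
print(filter_candidates(toy, top_k=4, top_p=0.9, min_p=0.1))
```

Lower values of top_k and top_p, or a higher min_p, shrink the candidate set and make the output more conservative; temperature then controls how the remaining candidates are weighted.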
[!TIP] Use the CLI command
sinapsis info --all-template-names to show a list of all the available Template names installed with Sinapsis LLaMA CPP.
[!TIP] Use the CLI command
sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.
For example, for LLaMATextCompletion use sinapsis info --example-template-config LLaMATextCompletion to produce the following example config:
agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: LLaMATextCompletion
  class_name: LLaMATextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      llm_model_file: '`replace_me:<class ''str''>`'
      n_gpu_layers: 0
      use_mmap: true
      use_mlock: false
      seed: 4294967295
      n_ctx: 512
      n_batch: 512
      n_ubatch: 512
      n_threads: null
      n_threads_batch: null
      flash_attn: false
      chat_format: null
      verbose: true
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      max_tokens: '`replace_me:<class ''int''>`'
      min_p: 0.05
      stop: null
      seed: null
      repeat_penalty: 1.0
      presence_penalty: 0.0
      frequency_penalty: 0.0
      logit_bias: null
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true
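The generated config marks required attributes with replace_me placeholders that must be filled in before the agent can run. As a convenience, a small helper along the following lines could list the placeholders that are still present. It is a hypothetical sketch, not part of Sinapsis, and it works on an already-parsed dictionary so that it needs no YAML library.

```python
def find_placeholders(node, path=""):
    """Recursively collect dotted paths of values that still contain 'replace_me'."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += find_placeholders(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and "replace_me" in node:
        found.append(path)
    return found


# A trimmed copy of the attributes from the example config above.
attributes = {
    "init_args": {
        "llm_model_name": "`replace_me:<class 'str'>`",
        "llm_model_file": "`replace_me:<class 'str'>`",
        "n_ctx": 512,
    },
    "completion_args": {
        "temperature": 0.2,
        "max_tokens": "`replace_me:<class 'int'>`",
    },
}

for missing in find_placeholders(attributes):
    print(f"still needs a value: {missing}")
```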
📚 Usage example
The following agent passes a text message through a TextPacket and retrieves a response from an LLM.
agent:
  name: chat_completion
  description: Chatbot agent using DeepSeek-R1
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?
- template_name: LLaMATextCompletion
  class_name: LLaMATextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF
      llm_model_file: DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf
      n_ctx: 8192
      n_threads: 8
      n_gpu_layers: -1
      chat_format: chatml
      flash_attn: true
      seed: 10
    completion_args:
      max_tokens: 4096
      temperature: 0.2
      seed: 10
    system_prompt: 'You are a helpful assistant'
    pattern: "</think>"
    keep_before: False
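In the config above, pattern and keep_before are used to discard the reasoning block that DeepSeek-R1 emits before its final answer. The sketch below shows one way such post-processing could behave, based only on the attribute descriptions in this README. The function name, the whitespace stripping, and the behaviour when the pattern is not found are assumptions, not the template's actual implementation.

```python
import re


def apply_pattern(response: str, pattern: str, keep_before: bool) -> str:
    """Keep the text before or after the first match of `pattern`.

    Assumption: if the pattern does not match, the response is returned unchanged.
    """
    match = re.search(pattern, response)
    if match is None:
        return response
    kept = response[: match.start()] if keep_before else response[match.end():]
    return kept.strip()


# Hypothetical raw model output with a reasoning block.
raw = "<think>The user wants a short definition.</think>\nAI is the study of intelligent machines."

# keep_before=False keeps only what follows the closing tag.
print(apply_pattern(raw, "</think>", keep_before=False))
```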
🌐 Webapps
This module includes a webapp to interact with the model.
[!IMPORTANT] To run the app you first need to clone this repository:
git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots
[!NOTE] If you'd like to enable external app sharing in Gradio, export the following variable:
export GRADIO_SHARE_APP=True
[!IMPORTANT] If you hit an Out of Memory (OOM) error, you can change the model name and the number of GPU layers (n_gpu_layers) used by the model.
🐳 Docker
[!IMPORTANT] This Docker image depends on the sinapsis-nvidia:base image. Please refer to the official sinapsis instructions to Build with Docker.
- Build the sinapsis-chatbots image:
docker compose -f docker/compose.yaml build
- Start the container:
docker compose -f docker/compose_apps.yaml up sinapsis-simple-chatbot -d
- Check the status:
docker logs -f sinapsis-simple-chatbot
- The logs will display the URL to access the webapp, e.g.:
Running on local URL: http://127.0.0.1:7860
NOTE: The URL may be different; check the logs.
- To stop the app:
docker compose -f docker/compose_apps.yaml down
To use a different chatbot configuration (e.g., OpenAI-based chat), update the AGENT_CONFIG_PATH environment variable to point to the desired YAML file.
For example, to use OpenAI chat:
environment:
AGENT_CONFIG_PATH: webapps/configs/openai_simple_chat.yaml
OPENAI_API_KEY: your_api_key
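As an illustration of how an environment variable such as AGENT_CONFIG_PATH can select the agent config, the sketch below reads the variable and falls back to a default when it is unset. This is not the webapp's actual code; the function name and the fallback path are hypothetical placeholders.

```python
import os
from pathlib import Path


def resolve_agent_config(env=None, default="path/to/default_config.yaml") -> Path:
    """Return the config path from AGENT_CONFIG_PATH, or `default` if unset or empty.

    `default` is a hypothetical placeholder; the real webapp may use another path.
    """
    env = os.environ if env is None else env
    return Path(env.get("AGENT_CONFIG_PATH") or default)


# With the variable set as in the compose snippet above:
print(resolve_agent_config({"AGENT_CONFIG_PATH": "webapps/configs/openai_simple_chat.yaml"}))

# Without it, the placeholder default is used:
print(resolve_agent_config({}))
```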
💻 UV
- Export the environment variables needed to install the Python bindings for llama-cpp:
export CMAKE_ARGS="-DGGML_CUDA=on"
export FORCE_CMAKE="1"
- Export CUDACXX:
export CUDACXX=$(command -v nvcc)
- Create the virtual environment and sync dependencies:
uv sync --frozen
- Install the wheel:
uv pip install sinapsis-chatbots[all] --extra-index-url https://pypi.sinapsis.tech
- Run the webapp:
uv run webapps/llama_cpp_simple_chatbot.py
NOTE: To use OpenAI for the simple chatbot, set your API key and specify the correct configuration file:
export AGENT_CONFIG_PATH=webapps/configs/openai_simple_chat.yaml
export OPENAI_API_KEY=your_api_key
and then run the webapp again.
- The terminal will display the URL to access the webapp, e.g.:
Running on local URL: http://127.0.0.1:7860
NOTE: The URL may be different; check the output of the terminal.
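The build-related variables exported in the first two UV steps above (CMAKE_ARGS, FORCE_CMAKE and CUDACXX) can also be assembled and checked from Python before starting a long build. The helper below is a hypothetical sketch: it only mirrors the values shown in this README and fails early when nvcc cannot be found, since CUDACXX would otherwise be empty.

```python
import shutil


def cuda_build_env(nvcc_path=None):
    """Return the build variables from the steps above, or raise if nvcc is missing.

    `nvcc_path` lets the caller pass an explicit compiler path; otherwise
    the PATH is searched, like `command -v nvcc` does in the shell.
    """
    nvcc = nvcc_path or shutil.which("nvcc")
    if nvcc is None:
        raise RuntimeError("nvcc not found on PATH; install the CUDA toolkit first")
    return {
        "CMAKE_ARGS": "-DGGML_CUDA=on",
        "FORCE_CMAKE": "1",
        "CUDACXX": nvcc,
    }


# Example with an explicit, illustrative compiler location.
for name, value in cuda_build_env("/usr/local/cuda/bin/nvcc").items():
    print(f'export {name}="{value}"')
```

The resulting dictionary could be merged into os.environ or passed as the env argument of subprocess.run when invoking the install command.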
📙 Documentation
Documentation for this and other sinapsis packages is available on the sinapsis website.
Tutorials for different projects within sinapsis are available on the sinapsis tutorials page.
🔍 License
This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.
For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.
The LLama4TextToText template is licensed under the official Llama4 license.