Sinapsis templates for LLM text completion with LLaMA-CPP

Sinapsis LLaMA CPP

Sinapsis Templates for LLM text completion with LLaMA-CPP

🐍 Installation • 🚀 Features • 📚 Usage example • 🌐 Webapps • 📙 Documentation • 🔍 License

The sinapsis-llama-cpp module provides a suite of templates to run LLMs with llama-cpp.

[!IMPORTANT] We now include support for Llama4 models!

To use them, install the dependency (if you have not installed sinapsis-llama-cpp[all]):

  uv pip install sinapsis-llama-cpp[llama-four] --extra-index-url https://pypi.sinapsis.tech

You need a Hugging Face token. See the official instructions and set it using:

  export HF_TOKEN=<token-provided-by-hf>

Then test it through the CLI or the webapp by changing AGENT_CONFIG_PATH.

[!NOTE] Llama 4 models require large GPUs. Running on smaller consumer-grade GPUs is possible, but a single inference may take hours.
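Because the Llama 4 templates depend on the HF_TOKEN variable exported above, it can help to fail early when the token is missing. The following is a minimal sketch, not part of sinapsis-llama-cpp: the helper name get_hf_token is hypothetical, and it only checks that the variable is set, not that the token is valid.

```python
import os
from collections.abc import Mapping
from typing import Optional


def get_hf_token(env: Optional[Mapping] = None) -> str:
    """Return the Hugging Face token from HF_TOKEN or raise a clear error.

    Hypothetical helper for illustration; it does not contact Hugging Face.
    """
    source = os.environ if env is None else env
    token = source.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before loading Llama 4 models.")
    return token
```

Calling get_hf_token() before starting the agent turns a late download failure into an immediate, readable error.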

🐍 Installation

Install using your package manager of choice. We encourage the use of uv.

Example with uv:

  uv pip install sinapsis-llama-cpp --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-llama-cpp --extra-index-url https://pypi.sinapsis.tech

[!IMPORTANT] Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:

with uv:

  uv pip install sinapsis-llama-cpp[all] --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-llama-cpp[all] --extra-index-url https://pypi.sinapsis.tech

🚀 Features

Templates Supported

LLaMATextCompletion: Template for text completion using LLaMA CPP.
Attributes
- init_args(LLaMAInitArgs, required): LLaMA model arguments.
  - llm_model_name(str, required): The name or path of the LLM model to use (e.g. 'TheBloke/Llama-2-7B-GGUF').
  - llm_model_file(str, required): The specific GGUF model file (e.g., 'llama-2-7b.Q2_K.gguf').
  - n_gpu_layers(int, optional): Number of layers to offload to the GPU (-1 for all). Defaults to 0.
  - use_mmap(bool, optional): Use 'memory-mapping' to load the model. Defaults to True.
  - use_mlock(bool, optional): Force the model to be kept in RAM. Defaults to False.
  - seed(int, optional): RNG seed for model initialization. Defaults to LLAMA_DEFAULT_SEED.
  - n_ctx(int, optional): The context window size. Defaults to 512.
  - n_batch(int, optional): The batch size for prompt processing. Defaults to 512.
  - n_ubatch(int, optional): The physical (micro) batch size used by llama.cpp. Defaults to 512.
  - n_threads(int, optional): CPU threads for generation. Defaults to None.
  - n_threads_batch(int, optional): CPU threads for batch processing. Defaults to None.
  - flash_attn(bool, optional): Enable Flash Attention if supported by the GPU. Defaults to False.
  - chat_format(str, optional): Chat template format (e.g., 'chatml'). Defaults to None.
  - verbose(bool, optional): Enable verbose logging from llama.cpp. Defaults to True.
- completion_args(LLaMACompletionArgs, required): Generation arguments to pass to the selected model.
  - temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
  - top_p(float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
  - top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
  - max_tokens(int, required): The maximum number of new tokens to generate.
  - min_p(float, optional): Min-p sampling. Filters tokens whose probability is below min_p relative to the most probable token. Defaults to 0.05.
  - stop(str | list[str], optional): Stop sequences to halt generation. Defaults to None.
  - seed(int, optional): Overrides the model's seed just for this call. Defaults to None.
  - repeat_penalty(float, optional): Penalty for repeating tokens (1.0 = no penalty). Defaults to 1.0.
  - presence_penalty(float, optional): Penalty applied to tokens that have already appeared, regardless of how often (0.0 = no penalty). Defaults to 0.0.
  - frequency_penalty(float, optional): Penalty that grows with how often a token has already appeared (0.0 = no penalty). Defaults to 0.0.
  - logit_bias(dict[int, float], optional): Applies a bias to specific tokens. Defaults to None.
- response_format(ResponseFormat, optional): Constrains the model output to a specific format. Use with type 'json_object' to enforce valid JSON output, optionally with a JSON Schema.
  - type(str, optional): The output format type ('text' or 'json_object'). Defaults to 'text'.
  - schema(SchemaDefinition, optional): Schema defining the expected JSON structure when type is 'json_object'.
    - properties(dict, optional): Mapping of field names to type strings or PropertyDefinition objects.
    - required(list[str], optional): List of required field names.
- chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
- rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
- system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
- pattern(dict, optional): A regex pattern used to post-process the model's response.
- keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
- structure_output_key(str, optional): Key used to store parsed JSON structured output in the packet's generic_data when response_format type is 'json_object'. Defaults to 'structured_output'.
LLaMATextCompletionWithMCP: Template for text completion with MCP tool integration using LLaMA CPP.
Attributes
- init_args(LLaMAInitArgs, required): LLaMA model arguments.
  - llm_model_name(str, required): The name or path of the LLM model to use (e.g. 'TheBloke/Llama-2-7B-GGUF').
  - llm_model_file(str, required): The specific GGUF model file (e.g., 'llama-2-7b.Q2_K.gguf').
  - n_gpu_layers(int, optional): Number of layers to offload to the GPU (-1 for all). Defaults to 0.
  - use_mmap(bool, optional): Use 'memory-mapping' to load the model. Defaults to True.
  - use_mlock(bool, optional): Force the model to be kept in RAM. Defaults to False.
  - seed(int, optional): RNG seed for model initialization. Defaults to LLAMA_DEFAULT_SEED.
  - n_ctx(int, optional): The context window size. Defaults to 512.
  - n_batch(int, optional): The batch size for prompt processing. Defaults to 512.
  - n_ubatch(int, optional): The physical (micro) batch size used by llama.cpp. Defaults to 512.
  - n_threads(int, optional): CPU threads for generation. Defaults to None.
  - n_threads_batch(int, optional): CPU threads for batch processing. Defaults to None.
  - flash_attn(bool, optional): Enable Flash Attention if supported by the GPU. Defaults to False.
  - chat_format(str, optional): Chat template format (e.g., 'chatml'). Defaults to None.
  - verbose(bool, optional): Enable verbose logging from llama.cpp. Defaults to True.
- completion_args(LLaMACompletionArgs, required): Generation arguments to pass to the selected model.
  - temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
  - top_p(float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
  - top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
  - max_tokens(int, required): The maximum number of new tokens to generate.
  - min_p(float, optional): Min-p sampling. Filters tokens whose probability is below min_p relative to the most probable token. Defaults to 0.05.
  - stop(str | list[str], optional): Stop sequences to halt generation. Defaults to None.
  - seed(int, optional): Overrides the model's seed just for this call. Defaults to None.
  - repeat_penalty(float, optional): Penalty for repeating tokens (1.0 = no penalty). Defaults to 1.0.
  - presence_penalty(float, optional): Penalty applied to tokens that have already appeared, regardless of how often (0.0 = no penalty). Defaults to 0.0.
  - frequency_penalty(float, optional): Penalty that grows with how often a token has already appeared (0.0 = no penalty). Defaults to 0.0.
  - logit_bias(dict[int, float], optional): Applies a bias to specific tokens. Defaults to None.
- response_format(ResponseFormat, optional): Constrains the model output to a specific format. Use with type 'json_object' to enforce valid JSON output, optionally with a JSON Schema.
  - type(str, optional): The output format type ('text' or 'json_object'). Defaults to 'text'.
  - schema(SchemaDefinition, optional): Schema defining the expected JSON structure when type is 'json_object'.
    - properties(dict, optional): Mapping of field names to type strings or PropertyDefinition objects.
    - required(list[str], optional): List of required field names.
- chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
- rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
- system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
- pattern(dict, optional): A regex pattern used to post-process the model's response.
- keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
- structure_output_key(str, optional): Key used to store parsed JSON structured output in the packet's generic_data when response_format type is 'json_object'. Defaults to 'structured_output'.
- tools_key(str, optional): Key used to extract the raw tools from the data container. Defaults to "".
- max_tool_retries(int, optional): Maximum consecutive tool execution failures before stopping. Defaults to 3.
- add_tool_to_prompt(bool, optional): Whether to automatically append tool descriptions to the system prompt. Defaults to True.
StreamingLLaMATextCompletion: Streaming version of LLaMATextCompletion for real-time response generation.
Attributes
Inherits all attributes from LLaMATextCompletion. The template yields response chunks as they are generated rather than waiting for the complete response.
LLama4TextToText: Template for text-to-text chat processing using the Llama 4 model.
Attributes
- init_args(LLaMA4InitArgs, required): LLaMA4 model arguments.
  - llm_model_name(str, required): The name or path of the LLM model to use (e.g., 'meta-llama/Llama-4-Scout-17B-16E-Instruct').
  - cache_dir(str, optional): Path to use for the model cache and download.
  - device_map(str, optional): Device mapping for from_pretrained. Defaults to auto.
  - torch_dtype(str, optional): Model tensor precision (e.g., 'auto', 'float16'). Defaults to auto.
  - max_memory(dict, optional): Max memory allocation per device. Defaults to None.
- completion_args(LLaMA4CompletionArgs, required): Generation arguments to pass to the selected model.
  - temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
  - top_p(float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
  - top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
  - max_length(int, optional): The maximum length of the sequence (prompt + generation). Defaults to 20.
  - max_new_tokens(int, optional): The maximum number of new tokens to generate. Defaults to None.
  - do_sample(bool, optional): Whether to use sampling (True) or greedy decoding (False). Defaults to True.
  - min_p(float, optional): Min-p sampling. Filters tokens whose probability is below min_p relative to the most probable token. Defaults to None.
  - repetition_penalty(float, optional): Penalty applied to repeated tokens (1.0 = no penalty). Defaults to 1.0.
- chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
- rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
- system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
- pattern(dict, optional): A regex pattern used to post-process the model's response.
- keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
LLama4MultiModal: Template for multimodal chat processing using the Llama 4 model.
Attributes
- init_args(LLaMA4InitArgs, required): LLaMA4 model arguments.
  - llm_model_name(str, required): The name or path of the LLM model to use (e.g., 'meta-llama/Llama-4-Scout-17B-16E-Instruct').
  - cache_dir(str, optional): Path to use for the model cache and download.
  - device_map(str, optional): Device mapping for from_pretrained. Defaults to auto.
  - torch_dtype(str, optional): Model tensor precision (e.g., 'auto', 'float16'). Defaults to auto.
  - max_memory(dict, optional): Max memory allocation per device. Defaults to None.
- completion_args(LLaMA4CompletionArgs, required): Generation arguments to pass to the selected model.
  - temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
  - top_p(float, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to 0.95.
  - top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
  - max_length(int, optional): The maximum length of the sequence (prompt + generation). Defaults to 20.
  - max_new_tokens(int, optional): The maximum number of new tokens to generate. Defaults to None.
  - do_sample(bool, optional): Whether to use sampling (True) or greedy decoding (False). Defaults to True.
  - min_p(float, optional): Min-p sampling. Filters tokens whose probability is below min_p relative to the most probable token. Defaults to None.
  - repetition_penalty(float, optional): Penalty applied to repeated tokens (1.0 = no penalty). Defaults to 1.0.
- chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
- rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
- system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
- pattern(dict, optional): A regex pattern used to post-process the model's response.
- keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
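
The response_format attribute of LLaMATextCompletion is described above only in terms of its fields. As a rough illustration, the sketch below writes such a value as a Python dict and checks a JSON reply against the required list. The field names (answer, confidence) and the type strings ("string", "number") are assumptions made for this example and are not taken from the package; the helper missing_required_fields is hypothetical as well.

```python
import json

# Hypothetical response_format value mirroring the attribute description.
# Field names and type strings are illustrative assumptions.
response_format = {
    "type": "json_object",
    "schema": {
        "properties": {"answer": "string", "confidence": "number"},
        "required": ["answer"],
    },
}


def missing_required_fields(raw_reply: str, fmt: dict) -> list:
    """Return the required field names that are absent from a JSON reply."""
    parsed = json.loads(raw_reply)
    required = fmt.get("schema", {}).get("required", [])
    return [name for name in required if name not in parsed]
```

A check like this can be applied to the value stored under structure_output_key before downstream templates rely on it.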

[!TIP] Use CLI command sinapsis info --all-template-names to show a list with all the available Template names installed with Sinapsis LLaMA CPP.

[!TIP] Use CLI command sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.

For example, for LLaMATextCompletion use sinapsis info --example-template-config LLaMATextCompletion to produce the following example config:

agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: LLaMATextCompletion
  class_name: LLaMATextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      llm_model_file: '`replace_me:<class ''str''>`'
      n_gpu_layers: 0
      use_mmap: true
      use_mlock: false
      seed: 4294967295
      n_ctx: 512
      n_batch: 512
      n_ubatch: 512
      n_threads: null
      n_threads_batch: null
      flash_attn: false
      chat_format: null
      verbose: true
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      max_tokens: '`replace_me:<class ''int''>`'
      min_p: 0.05
      stop: null
      seed: null
      repeat_penalty: 1.0
      presence_penalty: 0.0
      frequency_penalty: 0.0
      logit_bias: null
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true
    structure_output_key: structured_output
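
The sampling defaults in the config above (top_k: 40, top_p: 0.95, min_p: 0.05) can be read as successive filters over the next-token distribution. The following toy sketch illustrates the general techniques on a small probability table. It is not llama.cpp's sampler: it does not renormalize between steps, and the function name filter_candidates is hypothetical. Min-p is treated as relative to the most probable token, which is how llama.cpp defines it.

```python
def filter_candidates(probs, top_k=40, top_p=0.95, min_p=0.05):
    """Return the tokens that survive top-k, top-p and min-p filtering.

    probs maps token -> probability. Toy illustration only.
    """
    # Sort tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    if not ranked:
        return []

    # Top-p (nucleus): keep the smallest prefix whose cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Min-p: drop tokens whose probability is below min_p times the
    # probability of the most likely token.
    threshold = min_p * kept[0][1]
    return [token for token, p in kept if p >= threshold]
```

Temperature is applied separately: it rescales the distribution before a token is drawn from the surviving candidates.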

📚 Usage example

The following agent passes a text message through a TextPacket and retrieves a response from an LLM.

Config

agent:
  name: chat_completion
  description: Chatbot agent using DeepSeek-R1

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?

- template_name: LLaMATextCompletion
  class_name: LLaMATextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF
      llm_model_file: DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf
      n_ctx: 8192
      n_threads: 8
      n_gpu_layers: -1
      chat_format: chatml
      flash_attn: true
      seed: 10
    completion_args:
      max_tokens: 4096
      temperature: 0.2
      seed: 10
    system_prompt: 'You are a helpful assistant'
    pattern: "</think>"
    keep_before: false
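
In the config above, pattern and keep_before are set so that only the text after the closing think tag of DeepSeek-R1 is kept. The sketch below shows one way such post-processing could work, assuming a plain regex search on the first match. The function name postprocess is hypothetical and the actual implementation inside the template may differ.

```python
import re


def postprocess(response: str, pattern: str, keep_before: bool) -> str:
    """Keep the text before or after the first match of pattern.

    Assumed behaviour for illustration; the response is returned
    unchanged when the pattern does not match.
    """
    match = re.search(pattern, response)
    if match is None:
        return response
    if keep_before:
        return response[: match.start()].strip()
    return response[match.end():].strip()
```

For example, postprocess("<think>reasoning</think>\nAI is a field of computer science.", "</think>", keep_before=False) drops the reasoning block and returns only the final answer.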

🌐 Webapps

This module includes a webapp to interact with the model.

[!IMPORTANT] To run the app you first need to clone this repository:

git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots

[!NOTE] If you'd like to enable external app sharing in Gradio, export GRADIO_SHARE_APP=True

[!IMPORTANT] If you run into an Out of Memory (OOM) error, you can change the model name and the number of GPU layers (n_gpu_layers) used by the model.

🐳 Docker

[!IMPORTANT] This Docker image depends on the sinapsis-nvidia:base image. Please refer to the official sinapsis instructions to Build with Docker.

1. Build the sinapsis-chatbots image:

docker compose -f docker/compose.yaml build

2. Start the container:

docker compose -f docker/compose_apps.yaml up sinapsis-simple-chatbot -d

3. Check the status:

docker logs -f sinapsis-simple-chatbot

The logs will display the URL to access the webapp, e.g.:

Running on local URL:  http://127.0.0.1:7860

NOTE: The URL may be different; check the logs.

4. To stop the app:

docker compose -f docker/compose_apps.yaml down

To use a different chatbot configuration (e.g. OpenAI-based chat), update the AGENT_CONFIG_PATH environment variable to point to the desired YAML file.

For example, to use OpenAI chat:

environment:
 AGENT_CONFIG_PATH: webapps/configs/llama_cpp_simple_chatbot/openai_simple_chat.yaml
 OPENAI_API_KEY: your_api_key

💻 UV

1. Export the environment variables required to build the Python bindings for llama-cpp with CUDA support:

export CMAKE_ARGS="-DGGML_CUDA=on"
export FORCE_CMAKE="1"

2. Export CUDACXX:

export CUDACXX=$(command -v nvcc)

3. Create the virtual environment and sync dependencies:

uv sync --frozen

4. Install the wheel:

uv pip install sinapsis-chatbots[all] --extra-index-url https://pypi.sinapsis.tech

5. Run the webapp:

uv run webapps/llama_cpp_simple_chatbot.py

NOTE: To use OpenAI for the simple chatbot, set your API key and specify the correct configuration file:

export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/openai_simple_chat.yaml
export OPENAI_API_KEY=your_api_key

and run step 5 again.

The terminal will display the URL to access the webapp, e.g.:

Running on local URL:  http://127.0.0.1:7860

NOTE: The URL can be different; check the output of the terminal.

📙 Documentation

Documentation for this and other sinapsis packages is available on the sinapsis website.

Tutorials for different projects within sinapsis are available on the sinapsis tutorials page.

🔍 License

This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.

For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.

The LLama4TextToText template is licensed under the official Llama4 license.

Release history

- 0.5.0: Mar 25, 2026
- 0.4.4: Mar 3, 2026
- 0.4.3: Feb 27, 2026 (this version)
- 0.4.2: Feb 26, 2026
- 0.4.1: Feb 25, 2026
- 0.4.0: Feb 19, 2026
- 0.3.14: Jan 15, 2026
- 0.3.13: Dec 9, 2025
- 0.3.12: Nov 10, 2025
- 0.3.11: Nov 3, 2025
- 0.3.10: Sep 8, 2025
- 0.3.9: Aug 29, 2025
- 0.3.8: Aug 19, 2025
- 0.3.7: Aug 5, 2025
- 0.3.6: Jul 28, 2025
- 0.3.5: Jun 3, 2025
- 0.3.4: May 2, 2025
- 0.3.3: Apr 30, 2025
- 0.3.2: Apr 30, 2025
- 0.3.1: Apr 29, 2025
- 0.3.0: Apr 9, 2025
- 0.2.0: Apr 1, 2025
- 0.1.0: Mar 26, 2025

Download files

Source Distribution

sinapsis_llama_cpp-0.4.3.tar.gz (41.7 kB)

Uploaded Feb 27, 2026 Source

Built Distribution

sinapsis_llama_cpp-0.4.3-py3-none-any.whl (43.6 kB)

Uploaded Feb 27, 2026 Python 3

File details

Details for the file sinapsis_llama_cpp-0.4.3.tar.gz.

File metadata

Download URL: sinapsis_llama_cpp-0.4.3.tar.gz
Upload date: Feb 27, 2026
Size: 41.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.17

File hashes

Hashes for sinapsis_llama_cpp-0.4.3.tar.gz
Algorithm	Hash digest
SHA256	`117f41b6ee8e8099170a21c9284bd8e3170f9be06079d556a829e28286cc6d55`
MD5	`1287b9ea3d64dce917f9efb045afcf96`
BLAKE2b-256	`1107840d5e8ea34b8046ad71b146d0174ba0ca7bfceef4394da5df917321e91f`
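
The digests listed for each file can be used to confirm that a download is intact. Here is a minimal sketch using Python's standard hashlib module; the helper names sha256_of and matches_published_hash are hypothetical, and the file path is assumed to point at a copy you downloaded yourself.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, this would be called as matches_published_hash(Path("sinapsis_llama_cpp-0.4.3.tar.gz"), "117f41b6ee8e8099170a21c9284bd8e3170f9be06079d556a829e28286cc6d55").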


File details

Details for the file sinapsis_llama_cpp-0.4.3-py3-none-any.whl.

File metadata

Download URL: sinapsis_llama_cpp-0.4.3-py3-none-any.whl
Upload date: Feb 27, 2026
Size: 43.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.17

File hashes

Hashes for sinapsis_llama_cpp-0.4.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1acb59d35f5c8e340221e7140b6b65c27c041a879c276cf4dfcf48c4c771878`
MD5	`42160e85b04c26e6128afb0074dfb43e`
BLAKE2b-256	`05700d0e321d58fb7d2805dee8c930d19cb7283eea4e01e72f3fdb3fba5885c4`


sinapsis-llama-cpp 0.4.3
