Skip to main content

Sinapsis templates for LLM text completion with LLaMA-CPP

Project description



Sinapsis LLaMA CPP

Sinapsis Templates for LLM text completion with LLaMA-CPP

🐍 Installation🚀 Features📚 Usage example🌐 Webapps 📙 Documentation🔍 License

The sinapsis-llama-cpp module provides a suite of templates to run LLMs with llama-cpp.

[!IMPORTANT] We now include support for Llama4 models!

To use them, install the dependency (if you have not installed sinapsis-llama-cpp[all]):

  uv pip install sinapsis-llama-cpp[llama-four] --extra-index-url https://pypi.sinapsis.tech

You need a HuggingFace token. See the official instructions and set it using:

  export HF_TOKEN=<token-provided-by-hf>

And test it through the cli or the webapp by changing the AGENT_CONFIG_PATH

[!NOTE] Llama 4 requires large GPUs to run the models. Nonetheless, running on smaller consumer-grade GPUs is possible, although a single inference may take hours

🐍 Installation

Install using your package manager of choice. We encourage the use of uv

Example with uv:

  uv pip install sinapsis-llama-cpp --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-llama-cpp --extra-index-url https://pypi.sinapsis.tech

[!IMPORTANT] Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:

with uv:

  uv pip install sinapsis-llama-cpp[all] --extra-index-url https://pypi.sinapsis.tech

or with raw pip:

  pip install sinapsis-llama-cpp[all] --extra-index-url https://pypi.sinapsis.tech

🚀 Features

Templates Supported

  • LLaMATextCompletion: Template for text completion using LLaMA CPP.

    Attributes
    • init_args(LLaMAInitArgs, required): LLaMA model arguments.
      • llm_model_name(str, required): The name or path of the LLM model to use (e.g. 'TheBloke/Llama-2-7B-GGUF').
      • llm_model_file(str, required): The specific GGUF model file (e.g., 'llama-2-7b.Q2_K.gguf').
      • n_gpu_layers(int, optional): Number of layers to offload to the GPU (-1 for all). Defaults to 0.
      • use_mmap(bool, optional): Use 'memory-mapping' to load the model. Defaults to True.
      • use_mlock(bool, optional): Force the model to be kept in RAM. Defaults to False.
      • seed(int, optional): RNG seed for model initialization. Defaults to LLAMA_DEFAULT_SEED.
      • n_ctx(int, optional): The context window size. Defaults to 512.
      • n_batch(int, optional): The batch size for prompt processing. Defaults to 512.
      • n_ubatch(int, optional): The batch size for token generation. Defaults to 512.
      • n_threads(int, optional): CPU threads for generation. Defaults to None.
      • n_threads_batch(int, optional): CPU threads for batch processing. Defaults to None.
      • flash_attn(bool, optional): Enable Flash Attention if supported by the GPU. Defaults to False.
      • chat_format(str, optional): Chat template format (e.g., 'chatml'). Defaults to None.
      • verbose(bool, optional): Enable verbose logging from llama.cpp. Defaults to True.
    • completion_args(LLaMACompletionArgs, required): Generation arguments to pass to the selected model.
      • temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
      • top_p(float, optional): Nucleus sampling. Considers tokens with cumulative probability >= top_p. Defaults to 0.95.
      • top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
      • max_tokens(int, required): The maximum number of new tokens to generate.
      • min_p(float, optional): Min-p sampling, filters tokens below this probability. Defaults to 0.05.
      • stop(str | list[str], optional): Stop sequences to halt generation. Defaults to None.
      • seed(int, optional): Overrides the model's seed just for this call. Defaults to None.
      • repeat_penalty(float, optional): Penalty for repeating tokens (1.0 = no penalty). Defaults to 1.0.
      • presence_penalty(float, optional): Penalty for new tokens (0.0 = no penalty). Defaults to 0.0.
      • frequency_penalty(float, optional): Penalty for frequent tokens (0.0 = no penalty). Defaults to 0.0.
      • logit_bias(dict[int, float], optional): Applies a bias to specific tokens. Defaults to None.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(dict, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
  • LLaMATextCompletionWithMCP: Template for text completion with MCP tool integration using LLaMA CPP.

    Attributes
    • init_args(LLaMAInitArgs, required): LLaMA model arguments.
      • llm_model_name(str, required): The name or path of the LLM model to use (e.g. 'TheBloke/Llama-2-7B-GGUF').
      • llm_model_file(str, required): The specific GGUF model file (e.g., 'llama-2-7b.Q2_K.gguf').
      • n_gpu_layers(int, optional): Number of layers to offload to the GPU (-1 for all). Defaults to 0.
      • use_mmap(bool, optional): Use 'memory-mapping' to load the model. Defaults to True.
      • use_mlock(bool, optional): Force the model to be kept in RAM. Defaults to False.
      • seed(int, optional): RNG seed for model initialization. Defaults to LLAMA_DEFAULT_SEED.
      • n_ctx(int, optional): The context window size. Defaults to 512.
      • n_batch(int, optional): The batch size for prompt processing. Defaults to 512.
      • n_ubatch(int, optional): The batch size for token generation. Defaults to 512.
      • n_threads(int, optional): CPU threads for generation. Defaults to None.
      • n_threads_batch(int, optional): CPU threads for batch processing. Defaults to None.
      • flash_attn(bool, optional): Enable Flash Attention if supported by the GPU. Defaults to False.
      • chat_format(str, optional): Chat template format (e.g., 'chatml'). Defaults to None.
      • verbose(bool, optional): Enable verbose logging from llama.cpp. Defaults to True.
    • completion_args(LLaMACompletionArgs, required): Generation arguments to pass to the selected model.
      • temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
      • top_p(float, optional): Nucleus sampling. Considers tokens with cumulative probability >= top_p. Defaults to 0.95.
      • top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
      • max_tokens(int, required): The maximum number of new tokens to generate.
      • min_p(float, optional): Min-p sampling, filters tokens below this probability. Defaults to 0.05.
      • stop(str | list[str], optional): Stop sequences to halt generation. Defaults to None.
      • seed(int, optional): Overrides the model's seed just for this call. Defaults to None.
      • repeat_penalty(float, optional): Penalty for repeating tokens (1.0 = no penalty). Defaults to 1.0.
      • presence_penalty(float, optional): Penalty for new tokens (0.0 = no penalty). Defaults to 0.0.
      • frequency_penalty(float, optional): Penalty for frequent tokens (0.0 = no penalty). Defaults to 0.0.
      • logit_bias(dict[int, float], optional): Applies a bias to specific tokens. Defaults to None.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(dict, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    • tools_key(str, optional): Key used to extract the raw tools from the data container. Defaults to "".
    • max_tool_retries(int, optional): Maximum consecutive tool execution failures before stopping. Defaults to 3.
    • add_tool_to_prompt(bool, optional): Whether to automatically append tool descriptions to the system prompt. Defaults to True.
  • LLama4TextToText: Template for text-to-text chat processing using the LLama 4 model.

    Attributes
    • init_args(LLaMA4InitArgs, required): LLaMA4 model arguments.
      • llm_model_name(str, required): The name or path of the LLM model to use (e.g., 'meta-llama/Llama-4-Scout-17B-16E-Instruct').
      • cache_dir(str, optional): Path to use for the model cache and download.
      • device_map(str, optional): Device mapping for from_pretrained. Defaults to auto.
      • torch_dtype(str, optional): Model tensor precision (e.g., 'auto', 'float16'). Defaults to auto.
      • max_memory(dict, optional): Max memory allocation per device. Defaults to None.
    • completion_args(LLMCompletionArgs, required): Generation arguments to pass to the selected model.
      • temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
      • top_p(float, optional): Nucleus sampling. Considers tokens with cumulative probability >= top_p. Defaults to 0.95.
      • top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
      • max_length(int, optional): The maximum length of the sequence (prompt + generation). Defaults to 20.
      • max_new_tokens(int, optional): The maximum number of new tokens to generate. Defaults to None.
      • do_sample(bool, optional): Whether to use sampling (True) or greedy decoding (False). Defaults to True.
      • min_p(float, optional): Min-p sampling, filters tokens below this probability. Defaults to None.
      • repetition_penalty(float, optional): Penalty applied to repeated tokens (1.0 = no penalty). Defaults to 1.0.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(dict, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
  • LLama4MultiModal: Template for multi modal chat processing using the LLama 4 model.

    Attributes
    • init_args(LLaMA4InitArgs, required): LLaMA4 model arguments.
      • llm_model_name(str, required): The name or path of the LLM model to use (e.g., 'meta-llama/Llama-4-Scout-17B-16E-Instruct').
      • cache_dir(str, optional): Path to use for the model cache and download.
      • device_map(str, optional): Device mapping for from_pretrained. Defaults to auto.
      • torch_dtype(str, optional): Model tensor precision (e.g., 'auto', 'float16'). Defaults to auto.
      • max_memory(dict, optional): Max memory allocation per device. Defaults to None.
    • completion_args(LLMCompletionArgs, required): Generation arguments to pass to the selected model.
      • temperature(float, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to 0.2.
      • top_p(float, optional): Nucleus sampling. Considers tokens with cumulative probability >= top_p. Defaults to 0.95.
      • top_k(int, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to 40.
      • max_length(int, optional): The maximum length of the sequence (prompt + generation). Defaults to 20.
      • max_new_tokens(int, optional): The maximum number of new tokens to generate. Defaults to None.
      • do_sample(bool, optional): Whether to use sampling (True) or greedy decoding (False). Defaults to True.
      • min_p(float, optional): Min-p sampling, filters tokens below this probability. Defaults to None.
      • repetition_penalty(float, optional): Penalty applied to repeated tokens (1.0 = no penalty). Defaults to 1.0.
    • chat_history_key(str, optional): Key in the packet's generic_data to find the conversation history.
    • rag_context_key(str, optional): Key in the packet's generic_data to find RAG context to inject.
    • system_prompt(str | Path, optional): The system prompt (or path to one) to instruct the model.
    • pattern(dict, optional): A regex pattern used to post-process the model's response.
    • keep_before(bool, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.

[!TIP] Use CLI command sinapsis info --all-template-names to show a list with all the available Template names installed with Sinapsis Data Tools.

[!TIP] Use CLI command sinapsis info --example-template-config TEMPLATE_NAME to produce an example Agent config for the Template specified in TEMPLATE_NAME.

For example, for LLaMATextCompletion use sinapsis info --example-template-config LLaMATextCompletion to produce the following example config:

agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: LLaMATextCompletion
  class_name: LLaMATextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      llm_model_file: '`replace_me:<class ''str''>`'
      n_gpu_layers: 0
      use_mmap: true
      use_mlock: false
      seed: 4294967295
      n_ctx: 512
      n_batch: 512
      n_ubatch: 512
      n_threads: null
      n_threads_batch: null
      flash_attn: false
      chat_format: null
      verbose: true
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      max_tokens: '`replace_me:<class ''int''>`'
      min_p: 0.05
      stop: null
      seed: null
      repeat_penalty: 1.0
      presence_penalty: 0.0
      frequency_penalty: 0.0
      logit_bias: null
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true

📚 Usage example

The following agent passes a text message through a TextPacket and retrieves a response from a LLM
Config
agent:
  name: chat_completion
  description: Chatbot agent using DeepSeek-R1

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?

- template_name: LLaMATextCompletion
  class_name: LLaMATextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF
      llm_model_file: DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf
      n_ctx: 8192
      n_threads: 8
      n_gpu_layers: -1
      chat_format: chatml
      flash_attn: true
      seed: 10
    completion_args:
      max_tokens: 4096
      temperature: 0.2
      seed: 10
    system_prompt : 'You are a helpful assistant'
    pattern: "</think>"
    keep_before: False

🌐 Webapps

This module includes a webapp to interact with the model

[!IMPORTANT] To run the app you first need to clone this repository:

git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots

[!NOTE] If you'd like to enable external app sharing in Gradio, export GRADIO_SHARE_APP=True

[!IMPORTANT] You can change the model name and the number of gpu_layers used by the model in case you have an Out of Memory (OOM) error

🐳 Docker

IMPORTANT This docker image depends on the sinapsis-nvidia:base image. Please refer to the official sinapsis instructions to Build with Docker.

  1. Build the sinapsis-chatbots image:
docker compose -f docker/compose.yaml build
  1. Start the container
docker compose -f docker/compose_apps.yaml up sinapsis-simple-chatbot -d
  1. Check the status:
docker logs -f sinapsis-simple-chatbot
  1. The logs will display the URL to access the webapp, e.g.,:
Running on local URL:  http://127.0.0.1:7860

NOTE: The url may be different, check the logs 4. To stop the app:

docker compose -f docker/compose_apps.yaml down

To use a different chatbot configuration (e.g. OpenAI-based chat), update the AGENT_CONFIG_PATH environmental variable to point to the desired YAML file.

For example, to use OpenAI chat:

environment:
 AGENT_CONFIG_PATH: webapps/configs/openai_simple_chat.yaml
 OPENAI_API_KEY: your_api_key
💻 UV
  1. Export the environment variable to install the python bindings for llama-cpp
export CMAKE_ARGS="-DGGML_CUDA=on"
export FORCE_CMAKE="1"
  1. export CUDACXX:
export CUDACXX=$(command -v nvcc)
  1. Create the virtual environment and sync dependencies:
uv sync --frozen
  1. Install the wheel:
uv pip install sinapsis-chatbots[all] --extra-index-url https://pypi.sinapsis.tech
  1. Run the webapp:
uv run webapps/llama_cpp_simple_chatbot.py

NOTE: To use OpenAI for the simple chatbot, set your API key and specify the correct configuration file

export AGENT_CONFIG_PATH=webapps/configs/openai_simple_chat.yaml
export OPENAI_API_KEY=your_api_key

and run step 5 again

  1. The terminal will display the URL to access the webapp, e.g.:

NOTE: The url can be different, check the output of the terminal

Running on local URL:  http://127.0.0.1:7860

📙 Documentation

Documentation for this and other sinapsis packages is available on the sinapsis website

Tutorials for different projects within sinapsis are available at sinapsis tutorials page

🔍 License

This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the LICENSE file.

For commercial use, please refer to our official Sinapsis website for information on obtaining a commercial license.

The LLama4TextToText template is licensed under the official Llama4 license

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sinapsis_llama_cpp-0.3.14.tar.gz (37.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sinapsis_llama_cpp-0.3.14-py3-none-any.whl (39.3 kB view details)

Uploaded Python 3

File details

Details for the file sinapsis_llama_cpp-0.3.14.tar.gz.

File metadata

  • Download URL: sinapsis_llama_cpp-0.3.14.tar.gz
  • Upload date:
  • Size: 37.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.17

File hashes

Hashes for sinapsis_llama_cpp-0.3.14.tar.gz
Algorithm Hash digest
SHA256 3dff22c4b18d91944fdf12e3e05c80fad7ae1ef2599ede0fa80ed32b40ebfa7b
MD5 e180480ca5ed18446e00d762755d6cb9
BLAKE2b-256 5fd940266acb6689551ec7567c4f66582904d4d53c2b39d051a86f2fe035aa70

See more details on using hashes here.

File details

Details for the file sinapsis_llama_cpp-0.3.14-py3-none-any.whl.

File metadata

File hashes

Hashes for sinapsis_llama_cpp-0.3.14-py3-none-any.whl
Algorithm Hash digest
SHA256 8f12f2eb9c8396f089bfeb3ca4a2014784357448906851e674258fb96ae651d8
MD5 92df842c5cab5f794451b91d497883ee
BLAKE2b-256 eaf550682012b50aecbafe24b0be6d456843258f152a4d9745ca8b70b66e24c7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page