An integration between the llama.cpp LLM framework and Haystack

llama-cpp-haystack

A custom component for Haystack (2.x) that runs LLMs with the llama.cpp framework. It is built on the Python bindings for llama.cpp (llama-cpp-python).

Installation

pip install llama-cpp-haystack

By default, the install builds llama.cpp for CPU only on Linux and Windows, and uses Metal on macOS.

To use a different backend, first install llama-cpp-python by following its installation documentation, then install llama-cpp-haystack.

For example, to use llama-cpp-haystack with the cuBLAS backend:

export LLAMA_CUBLAS=1
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
pip install llama-cpp-haystack
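
After installing, a quick sanity check can confirm that both packages are present. The snippet below is a minimal sketch that uses only the standard library; `installed_version` is a helper name made up for this illustration. It reports installed versions only and does not tell you which backend llama.cpp was built with.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two distributions this integration depends on
for name in ("llama-cpp-python", "llama-cpp-haystack"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```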

Usage

You can use the LlamaCppGenerator to load models quantized with llama.cpp (GGUF format) for text generation.

Information about the supported models and model parameters can be found in the llama.cpp documentation.

The GGUF versions of popular models can be downloaded from HuggingFace.

Passing additional model parameters

The model_path, n_ctx, and n_batch arguments are exposed for convenience and can be passed directly to the Generator as keyword arguments during initialization.

The model_kwargs parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override model_path, n_ctx, and n_batch init parameters.

See Llama.cpp's model documentation for more information on the available model arguments.

For example, to offload the model to GPU during initialization:

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model_path="/content/openchat-3.5-1210.Q3_K_S.gguf", 
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1}
)
generator.warm_up()

# Named "question" rather than "input" to avoid shadowing the Python built-in
question = "Who is the best American actor?"
prompt = f"GPT4 Correct User: {question} <|end_of_turn|> GPT4 Correct Assistant:"

result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]

print(generated_text)
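
The override rule for model_kwargs described in this section can be illustrated with plain dictionaries. This is a sketch of the documented behaviour only, not the component's actual code: `resolve_model_kwargs` is a hypothetical helper, and the file name `model.gguf` is a placeholder.

```python
def resolve_model_kwargs(model_path, n_ctx, n_batch, model_kwargs=None):
    """Hypothetical helper: merge the convenience arguments into model_kwargs.

    Keys already present in model_kwargs win over the convenience arguments,
    mirroring the precedence described in the text.
    """
    resolved = dict(model_kwargs or {})
    # setdefault only fills in keys that model_kwargs did not set
    resolved.setdefault("model_path", model_path)
    resolved.setdefault("n_ctx", n_ctx)
    resolved.setdefault("n_batch", n_batch)
    return resolved


resolved = resolve_model_kwargs(
    "model.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_ctx": 2048, "n_gpu_layers": -1},
)
# n_ctx comes from model_kwargs; n_batch and model_path come from the arguments
print(resolved)
```

If the same setting appears in both places, it is probably clearer to keep it in only one of them so the effective value is obvious to readers of the code.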

Passing generation parameters

The generation_kwargs parameter can be used to pass additional generation arguments, such as max_tokens, temperature, top_k, and top_p, to the model during inference.

See Llama.cpp's create_completion documentation for more information on the available generation arguments.

For example, to set the max_tokens and temperature:

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model_path="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()

# Named "question" rather than "input" to avoid shadowing the Python built-in
question = "Who is the best American actor?"
prompt = f"GPT4 Correct User: {question} <|end_of_turn|> GPT4 Correct Assistant:"

result = generator.run(prompt)
generated_text = result["replies"][0]

print(generated_text)

The generation_kwargs can also be passed to the run method of the generator directly:

from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model_path="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()

# Named "question" rather than "input" to avoid shadowing the Python built-in
question = "Who is the best American actor?"
prompt = f"GPT4 Correct User: {question} <|end_of_turn|> GPT4 Correct Assistant:"

result = generator.run(
    prompt,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generated_text = result["replies"][0]

print(generated_text)
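
The usage examples in this section all build the same prompt string by hand. A small helper can keep that format in one place. This is a minimal sketch: `build_openchat_prompt` is a name made up for this illustration, and the format is copied from the examples above, so it only applies to models that expect this prompt layout.

```python
def build_openchat_prompt(user_message):
    """Wrap a user message in the prompt format used in the examples above."""
    return f"GPT4 Correct User: {user_message} <|end_of_turn|> GPT4 Correct Assistant:"


prompt = build_openchat_prompt("Who is the best American actor?")
print(prompt)
```

The resulting string can be passed to generator.run in place of the hand-written f-string.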

Example

Below is an example Retrieval-Augmented Generation (RAG) pipeline that uses the Simple Wikipedia Dataset from HuggingFace. You can find more examples in the examples folder.

Load the dataset:

# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]

Index the documents into the InMemoryDocumentStore using the SentenceTransformersDocumentEmbedder and DocumentWriter:

doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
# Positional arguments: sender first, then receiver
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

Create the Retrieval Augmented Generation (RAG) pipeline and add the LlamaCppGenerator to it:

# Prompt Template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model_path=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

Run the pipeline:

question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    }
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.
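
Because the retriever is also connected to answer_builder.documents, the generated answer should carry the retrieved documents along with the text. The sketch below shows one way to list their sources. `format_sources` is a hypothetical helper, and the stand-in objects and their URL are made up so that the snippet runs without the pipeline.

```python
from types import SimpleNamespace


def format_sources(documents):
    """Return one "title (url)" line per document, using its meta dictionary."""
    lines = []
    for doc in documents:
        meta = getattr(doc, "meta", None) or {}
        title = meta.get("title", "untitled")
        url = meta.get("url", "no url")
        lines.append(f"{title} ({url})")
    return lines


# Stand-in objects with made-up metadata, shaped like the Documents built above
stand_in_docs = [
    SimpleNamespace(meta={"title": "Joker (2019 movie)", "url": "https://example.org/joker"}),
    SimpleNamespace(meta={}),
]
for line in format_sources(stand_in_docs):
    print(line)

# With the pipeline above, the call would look like:
# for line in format_sources(generated_answer.documents):
#     print(line)
```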

License

llama-cpp-haystack is distributed under the terms of the Apache-2.0 license.
