LlamaIndex Llms Integration: Llama Cpp

Installation

To get the best performance out of LlamaCPP, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is here.

Full macOS instructions are also here.

In general:

  • Use cuBLAS if you have CUDA and an NVIDIA GPU
  • Use Metal if you are running on an M1/M2 MacBook
  • Use CLBlast if you are running on an AMD/Intel GPU

Then, install the required llama-index packages:

pip install llama-index-embeddings-huggingface
pip install llama-index-llms-llama-cpp
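For the GPU builds mentioned above, `llama-cpp-python` is compiled at install time, so the backend is selected with CMake flags. A sketch of the common cases (the exact flag name depends on your `llama-cpp-python` version; recent releases use `GGML_*` flags, older ones used `LLAMA_CUBLAS`):

```shell
# NVIDIA GPU (CUDA): build with the CUDA backend
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

# Apple silicon: build with the Metal backend instead
# CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

The `--force-reinstall --no-cache-dir` flags ensure pip rebuilds from source instead of reusing a cached CPU-only wheel.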

Basic Usage

Initialize LlamaCPP

Set up the model URL and initialize the LlamaCPP LLM:

from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer

model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt
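Both helpers defer to the tokenizer's chat template. For Qwen models that template is ChatML; a minimal plain-Python sketch of roughly what `apply_chat_template` produces (a hypothetical simplification — the real template also handles system prompts and other special cases):

```python
def chatml_prompt(messages):
    """Render a list of {role, content} dicts in ChatML, the chat format
    used by Qwen models, ending with the assistant header so the model
    continues from there."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = chatml_prompt([{"role": "user", "content": "Hi"}])
# "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
```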


llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5 supports a 32K context window; we use 16384 to limit memory use
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU; -1 offloads all layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model's chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Generate Completions

Use the complete method to generate a response:

response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

Stream Completions

You can also stream completions for a prompt:

response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
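Each item yielded by `stream_complete` carries the incremental text in `.delta` (and the text accumulated so far in `.text`), so concatenating the deltas reconstructs the full response. A toy illustration using a stand-in iterator (`fake_stream` is hypothetical; the real objects are LlamaIndex `CompletionResponse` instances):

```python
from types import SimpleNamespace


def fake_stream(chunks):
    # Stand-in for llm.stream_complete(): yields objects exposing
    # .delta (this chunk) and .text (everything streamed so far).
    text = ""
    for c in chunks:
        text += c
        yield SimpleNamespace(delta=c, text=text)


full = ""
for response in fake_stream(["Fast ", "cars ", "go!"]):
    full += response.delta
# full == "Fast cars go!"
```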

Set Up Query Engine with LlamaCPP

Change the global tokenizer to match the LLM:

from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
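`set_global_tokenizer` accepts any callable mapping a string to a list of token ids; LlamaIndex uses it for token counting, which is why passing the bound `encode` method is sufficient. A toy stand-in showing the expected interface (`toy_encode` is a hypothetical whitespace tokenizer, not a real model tokenizer):

```python
def toy_encode(text):
    # Mimics tokenizer.encode's shape: str -> list of token ids.
    # Here each whitespace-separated word becomes one fake id.
    return [hash(w) for w in text.split()]


# LlamaIndex would call the global tokenizer like this to count tokens:
n_tokens = len(toy_encode("What did the author do growing up?"))
```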

Use Hugging Face Embeddings

Set up the embedding model and load documents:

from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()

Create Vector Store Index

Create a vector store index from the loaded documents:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

Set Up Query Engine

Set up the query engine with the LlamaCPP LLM:

query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)

LLM Implementation example

https://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/
