LlamaIndex Llms Integration: Llama Cpp

Installation

To get the best performance out of LlamaCPP, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is here.

Full macOS instructions are also here.

In general:

  • Use cuBLAS if you have CUDA and an NVIDIA GPU
  • Use Metal if you are running on an M1/M2 MacBook
  • Use CLBlast if you are running on an AMD/Intel GPU

Then, install the required llama-index packages:

pip install llama-index-embeddings-huggingface
pip install llama-index-llms-llama-cpp
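For the GPU builds mentioned above, `llama-cpp-python` is compiled at install time, so the backend is selected with CMake flags. A sketch of the common cases (the exact flag name depends on your `llama-cpp-python` version; recent releases use `GGML_*` flags, older ones used `LLAMA_CUBLAS`):

```shell
# NVIDIA GPU (CUDA): build with the CUDA backend
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

# Apple silicon: build with the Metal backend instead
# CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

The `--force-reinstall --no-cache-dir` flags ensure pip rebuilds from source instead of reusing a cached CPU-only wheel.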

Basic Usage

Initialize LlamaCPP

Set up the model URL and initialize the LlamaCPP LLM:

from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer

model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt
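Both helpers defer to the tokenizer's chat template. For Qwen models that template is ChatML; a minimal plain-Python sketch of roughly what `apply_chat_template` produces (a hypothetical simplification — the real template also handles system prompts and other special cases):

```python
def chatml_prompt(messages):
    """Render a list of {role, content} dicts in ChatML, the chat format
    used by Qwen models, ending with the assistant header so the model
    continues from there."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = chatml_prompt([{"role": "user", "content": "Hi"}])
# "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
```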


llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5 supports a 32K context window; we use 16384 to limit memory use
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU; -1 offloads all layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model's chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Generate Completions

Use the complete method to generate a response:

response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

Stream Completions

You can also stream completions for a prompt:

response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
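Each item yielded by `stream_complete` carries the incremental text in `.delta` (and the text accumulated so far in `.text`), so concatenating the deltas reconstructs the full response. A toy illustration using a stand-in iterator (`fake_stream` is hypothetical; the real objects are LlamaIndex `CompletionResponse` instances):

```python
from types import SimpleNamespace


def fake_stream(chunks):
    # Stand-in for llm.stream_complete(): yields objects exposing
    # .delta (this chunk) and .text (everything streamed so far).
    text = ""
    for c in chunks:
        text += c
        yield SimpleNamespace(delta=c, text=text)


full = ""
for response in fake_stream(["Fast ", "cars ", "go!"]):
    full += response.delta
# full == "Fast cars go!"
```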

Set Up Query Engine with LlamaCPP

Change the global tokenizer to match the LLM:

from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
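`set_global_tokenizer` accepts any callable mapping a string to a list of token ids; LlamaIndex uses it for token counting, which is why passing the bound `encode` method is sufficient. A toy stand-in showing the expected interface (`toy_encode` is a hypothetical whitespace tokenizer, not a real model tokenizer):

```python
def toy_encode(text):
    # Mimics tokenizer.encode's shape: str -> list of token ids.
    # Here each whitespace-separated word becomes one fake id.
    return [hash(w) for w in text.split()]


# LlamaIndex would call the global tokenizer like this to count tokens:
n_tokens = len(toy_encode("What did the author do growing up?"))
```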

Use Hugging Face Embeddings

Set up the embedding model and load documents:

from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()

Create Vector Store Index

Create a vector store index from the loaded documents:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

Set Up Query Engine

Set up the query engine with the LlamaCPP LLM:

query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)

LLM Implementation example

https://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/
