LlamaIndex Llms Integration: Llama Cpp
Installation
To get the best performance out of LlamaCPP, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way, including full macOS instructions, is available in the llama-cpp-python installation documentation.
In general:
- Use `CuBLAS` if you have CUDA and an NVidia GPU
- Use `METAL` if you are running on an M1/M2 MacBook
- Use `CLBLAST` if you are running on an AMD/Intel GPU
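The backend is selected at build time via CMake flags passed through pip. A hedged sketch of what this looks like (the exact flag names have changed across llama-cpp-python versions, so check its installation guide for your version):

```shell
# NVIDIA GPU (CUDA); older versions used -DLLAMA_CUBLAS=on instead
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Apple Silicon (Metal); older versions used -DLLAMA_METAL=on instead
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

`--force-reinstall --no-cache-dir` ensures pip rebuilds the wheel with the new flags rather than reusing a cached CPU-only build.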
Then, install the required llama-index packages:

```shell
pip install llama-index-embeddings-huggingface
pip install llama-index-llms-llama-cpp
```
Basic Usage
Initialize LlamaCPP
Set up the model URL and initialize the LlamaCPP LLM:
```python
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer

model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt
```
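Qwen2.5's chat template is ChatML-style: each message is wrapped in `<|im_start|>`/`<|im_end|>` markers, and `add_generation_prompt=True` appends an open assistant turn for the model to complete. A hand-rolled sketch of roughly what `apply_chat_template` produces (for illustration only; always use the tokenizer's real template in practice):

```python
def chatml_prompt(messages):
    # Hypothetical ChatML-style formatter, illustrating the shape of the
    # prompt; the authoritative template lives in the tokenizer config.
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    # Equivalent of add_generation_prompt=True: open the assistant turn
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(chatml_prompt([{"role": "user", "content": "Hello!"}]))
```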
```python
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5 supports up to a 32K-token context window; we set it lower
    # to leave wiggle room for the prompt and generated tokens
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # -1 offloads all model layers to the GPU; 0 runs on CPU only
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model's chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```
Generate Completions
Use the complete method to generate a response:
```python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```
Stream Completions
You can also stream completions for a prompt:
```python
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
```
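Each streamed chunk carries only the newly generated text in its `delta` field, so concatenating the deltas reconstructs the full response. A stand-in sketch of that accumulation pattern, using a fake stream so no model is needed:

```python
class Chunk:
    # Hypothetical stand-in for a streamed response chunk with a .delta field
    def __init__(self, delta):
        self.delta = delta


def fake_stream():
    # Stand-in for llm.stream_complete(...): yields incremental pieces
    for piece in ["Fast ", "cars ", "hum."]:
        yield Chunk(piece)


text = ""
for chunk in fake_stream():
    text += chunk.delta  # same accumulation the print loop performs
print(text)  # -> Fast cars hum.
```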
Set Up Query Engine with LlamaCPP
Change the global tokenizer to match the LLM:
```python
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
```
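The global tokenizer is used by LlamaIndex for token counting (e.g. when budgeting the context window), so it just needs to be a callable mapping a string to a list of tokens. A toy illustration with a hypothetical whitespace tokenizer standing in for `AutoTokenizer(...).encode`:

```python
def whitespace_tokenizer(text):
    # Hypothetical stand-in tokenizer: the real callable is
    # AutoTokenizer.from_pretrained(...).encode; only len() of the
    # returned list matters for token counting.
    return text.split()


n_tokens = len(whitespace_tokenizer("What did the author do growing up?"))
print(n_tokens)  # -> 7
```

Using a tokenizer that matches the LLM keeps these counts accurate, which is why the global tokenizer is changed here.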
Use Hugging Face Embeddings
Set up the embedding model and load documents:
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()
```
Create Vector Store Index
Create a vector store index from the loaded documents:
```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
```
Set Up Query Engine
Set up the query engine with the LlamaCPP LLM:
```python
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)
```
LLM Implementation example
https://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/
File details
Details for the file llama_index_llms_llama_cpp-0.5.1.tar.gz.
File metadata
- Download URL: llama_index_llms_llama_cpp-0.5.1.tar.gz
- Upload date:
- Size: 8.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.13
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f8d952622da20f1817775f0b40db0575397e5ba83e1103a769a211f105c41ce4` |
| MD5 | `997f5fbdc7e41e1a428d1747d950e4db` |
| BLAKE2b-256 | `b4660f60c34b6004852bb65dcb300f1bb0805f7a88e51099b424d878223c02cc` |
File details
Details for the file llama_index_llms_llama_cpp-0.5.1-py3-none-any.whl.
File metadata
- Download URL: llama_index_llms_llama_cpp-0.5.1-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.13
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `2721423a41eee6fa706bb3f85acb6f5feadc948265dcc724d105a00198bbf5e4` |
| MD5 | `18b2d65aa28e176669706dd8e74cc1f2` |
| BLAKE2b-256 | `c98986e350ccde0f383a543c02bbf75b708a2acd616468a1002ee2e1b9d51911` |