# LlamaIndex LLMs Integration: OpenVINO
## Installation

To install the required packages, run:

```bash
%pip install llama-index-llms-openvino transformers huggingface_hub
!pip install llama-index
```
## Setup

### Define Functions for Prompt Handling

You will need functions to convert messages and completions into prompts:
```python
from llama_index.llms.openvino import OpenVINOLLM


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add final assistant prompt
    prompt = prompt + "<|assistant|>\n"
    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
```
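As a quick sanity check, the prompt helper can be exercised on its own, without loading a model. The `SimpleNamespace` stand-in below is an assumption for illustration only; in practice you would pass LlamaIndex `ChatMessage` objects, which also expose `.role` and `.content`:

```python
from types import SimpleNamespace


def messages_to_prompt(messages):
    # Same logic as above, repeated so this snippet runs standalone.
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    return prompt + "<|assistant|>\n"


# A blank system turn is prepended when none is supplied,
# and the trailing assistant tag cues the model to respond.
demo = messages_to_prompt([SimpleNamespace(role="user", content="Hello")])
print(demo)
```

Note how the template always ends with `<|assistant|>\n`, which is what prompts the model to generate the next turn.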
## Model Loading

Models can be loaded by passing parameters to the `OpenVINOLLM` constructor. If you have an Intel GPU, specify `device_map="gpu"` to run inference on it:
```python
ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

response = ov_llm.complete("What is the meaning of life?")
print(str(response))
```
## Inference with Local OpenVINO Model

Export your model to the OpenVINO IR format using the CLI and load it from a local folder. It is recommended to apply 8-bit or 4-bit weight quantization to reduce inference latency and model footprint:

```bash
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir
```
You can then load the model from the specified directory:

```python
ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)
```
## Additional Optimization

You can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. Enable these options with `ov_config` as follows:

```python
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
```
## Streaming Responses

To use the streaming capabilities, you can use the `stream_complete` and `stream_chat` methods.

### Using `stream_complete`

```python
response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")
```
### Using `stream_chat`

```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
```
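The delta pattern used above can be illustrated without loading a model: each streamed chunk carries only the newly generated text in its `.delta` attribute, and the full response is the concatenation of all deltas. The `fake_stream` generator below is a hypothetical stand-in for `stream_complete`, not part of the library:

```python
from types import SimpleNamespace


def fake_stream(text):
    # Hypothetical stand-in for ov_llm.stream_complete(): yields
    # chunk objects whose .delta holds only the new piece of text.
    for word in text.split(" "):
        yield SimpleNamespace(delta=word + " ")


# Accumulate deltas instead of printing them, e.g. to build the
# final string for downstream use.
full = ""
for r in fake_stream("Paul Graham is a programmer and essayist."):
    full += r.delta

print(full.strip())
```

Printing each `r.delta` with `end=""`, as in the snippets above, is equivalent to printing this accumulated string once the stream is exhausted.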
## LLM Implementation example