
LlamaIndex LLMs Integration: OpenVINO

Installation

To install the required packages, run:

%pip install llama-index-llms-openvino transformers huggingface_hub
!pip install llama-index

Setup

Define Functions for Prompt Handling

You will need helper functions that convert chat messages and completion strings into the Zephyr prompt format expected by the model:

from llama_index.llms.openvino import OpenVINOLLM


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

Model Loading

Models can be loaded by passing the desired parameters to the OpenVINOLLM constructor. If you have an Intel GPU, you can specify device_map="gpu" to run inference on it:

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

response = ov_llm.complete("What is the meaning of life?")
print(str(response))

Inference with Local OpenVINO Model

Export your model to the OpenVINO IR format using the optimum-cli tool and load it from a local folder. It is recommended to apply 8-bit or 4-bit weight quantization to reduce inference latency and model footprint. The commands below are alternatives that export the same model to ov_model_dir; run whichever matches the weight precision you want:

!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir

You can then load the model from the specified directory:

ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)

Additional Optimization

You can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. Enable these options with ov_config as follows:

ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
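
For example (a sketch reusing the ov_model_dir export and the prompt helpers from the previous sections), the updated ov_config is passed at construction time through model_kwargs, so a new OpenVINOLLM instance is needed to pick up these settings:

ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},  # KV-cache and dynamic quantization settings above
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)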

Streaming Responses

To stream responses, use the stream_complete and stream_chat methods:

Using stream_complete

response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")

Using stream_chat

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
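
For a non-streaming chat response, the standard LlamaIndex chat method can be used with the same messages (a small supplementary example):

# Returns a single ChatResponse instead of a token stream
resp = ov_llm.chat(messages)
print(resp)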

LLM Implementation Example

https://docs.llamaindex.ai/en/stable/examples/llm/openvino/
