Skip to main content

llama-index llms openvino integration

Project description

LlamaIndex Llms Integration: Openvino

Installation

To install the required packages, run:

%pip install llama-index-llms-openvino transformers huggingface_hub
!pip install llama-index

Setup

Define Functions for Prompt Handling

You will need functions to convert messages and completions into prompts:

from llama_index.llms.openvino import OpenVINOLLM


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

Model Loading

Models can be loaded by specifying parameters using the OpenVINOLLM method. If you have an Intel GPU, specify device_map="gpu" to run inference on it:

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

response = ov_llm.complete("What is the meaning of life?")
print(str(response))

Inference with Local OpenVINO Model

Export your model to the OpenVINO IR format using the CLI and load it from a local folder. It’s recommended to apply 8 or 4-bit weight quantization to reduce inference latency and model footprint:

!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir

You can then load the model from the specified directory:

ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)

Additional Optimization

You can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. Enable these options with ov_config as follows:

ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

Streaming Responses

To use the streaming capabilities, you can use the stream_complete and stream_chat methods:

Using stream_complete

response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")

Using stream_chat

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")

LLM Implementation example

https://docs.llamaindex.ai/en/stable/examples/llm/openvino/

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llama_index_llms_openvino-0.6.0.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llama_index_llms_openvino-0.6.0-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file llama_index_llms_openvino-0.6.0.tar.gz.

File metadata

  • Download URL: llama_index_llms_openvino-0.6.0.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for llama_index_llms_openvino-0.6.0.tar.gz
Algorithm Hash digest
SHA256 a8738d34d614291351eb7d5289402563a124da948a449e99bf42671fe2c6f8ad
MD5 ca05c446a468928198c5a4aa4f0f958d
BLAKE2b-256 63141200e5ef6801b4b440303abf54d2ddf1aca8d34702c2dac4dc0546f728eb

See more details on using hashes here.

File details

Details for the file llama_index_llms_openvino-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: llama_index_llms_openvino-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for llama_index_llms_openvino-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4ad94eb7488a79884d89d3df320a6f90242f97a2cad42d1e3862499b68fc869d
MD5 d29738eab5f00fecf079f959c2560dca
BLAKE2b-256 5f73105b29cc7b61ccc255fd788e34eb104975dc6043cfcee64bbe443e0daa32

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page