
LlamaIndex LLMs Integration: OpenVINO

Installation

To install the required packages, run:

%pip install llama-index-llms-openvino transformers huggingface_hub
!pip install llama-index

Setup

Define Functions for Prompt Handling

You will need helper functions that convert chat messages and completion strings into the Zephyr prompt format expected by the model:

from llama_index.llms.openvino import OpenVINOLLM


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

Model Loading

Models can be loaded by passing the desired parameters to the OpenVINOLLM constructor. If you have an Intel GPU, you can specify device_map="gpu" to run inference on it:

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

response = ov_llm.complete("What is the meaning of life?")
print(str(response))

Inference with Local OpenVINO Model

Export your model to the OpenVINO IR format using the optimum-cli tool and load it from a local folder. It is recommended to apply 8-bit or 4-bit weight quantization to reduce inference latency and model footprint. The commands below are alternatives that export the same model to ov_model_dir; run whichever matches the weight precision you want:

!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir

You can then load the model from the specified directory:

ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)

Additional Optimization

You can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. Enable these options with ov_config as follows:

ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
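
For example (a sketch reusing the ov_model_dir export and the prompt helpers from the previous sections), the updated ov_config is passed at construction time through model_kwargs, so a new OpenVINOLLM instance is needed to pick up these settings:

ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},  # KV-cache and dynamic quantization settings above
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)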

Streaming Responses

To stream responses, use the stream_complete and stream_chat methods:

Using stream_complete

response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")

Using stream_chat

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
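
For a non-streaming chat response, the standard LlamaIndex chat method can be used with the same messages (a small supplementary example):

# Returns a single ChatResponse instead of a token stream
resp = ov_llm.chat(messages)
print(resp)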

LLM Implementation Example

https://docs.llamaindex.ai/en/stable/examples/llm/openvino/
