Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference on CPU   |   😃Inference on GPU   |   💻Examples   |   📖Documentations

🚀Latest News


🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For system requirements and other installation tips, please refer to the Installation Guide.

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the key features and examples described below.

🔓Validated Hardware

| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Data Center GPU Max Series | WIP | WIP | WIP (INT8) | ✔ (INT4) |
| Intel Arc A-Series | - | - | WIP (INT8) | ✔ (INT4) |
| Intel Core Processors | - | - | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |

In the table above, "-" means not applicable or not started yet.

🔓Validated Software

| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) |
| Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu |
| Transformers | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
| intel-level-zero-gpu | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 |

Please refer to the detailed requirements in CPU, Gaudi2, Intel GPU.

🔓Validated OS

Ubuntu 20.04/22.04, CentOS 8.

🌱Getting Started

Chatbot

Below is the sample code to create your chatbot. See more examples.

Serving (OpenAI-compatible RESTful APIs)

NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for the OpenAI APIs. You can start the NeuralChat server using either a shell command or Python code.

# Shell command
neuralchat_server start --config_file ./server/config/neuralchat.yaml

# Python code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

The NeuralChat service can be accessed through the OpenAI client library, curl commands, and the requests library. See more in NeuralChat.
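For example, once the server is running, you can query it with the OpenAI Python client. The sketch below is illustrative only: the base URL, port, and model name are assumptions and must match the settings in your neuralchat.yaml.

from openai import OpenAI

# Point the client at the local NeuralChat server instead of api.openai.com.
# The port and model name below are assumptions; use the values from your config.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # no real key needed locally
response = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",
    messages=[{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
)
print(response.choices[0].message.content)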

Offline

from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
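If you want to control which model backs the chatbot, a pipeline configuration can be passed to build_chatbot. A minimal sketch, assuming the PipelineConfig helper from the neural_chat package (the model name is illustrative, and the exact config fields may vary by version):

from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

# Assumption: PipelineConfig selects the backing model; the model name is illustrative.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")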

Transformers-based extension APIs

Below is the sample code to use the extended Transformers APIs. See more examples.

INT4 Inference (CPU)

We encourage you to install NeuralSpeed to get the latest features (e.g., GGUF support; see the sketch after the INT4 example below) for LLM low-bit inference on CPUs. You may also use v1.3 without NeuralSpeed by following this document.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
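With NeuralSpeed installed, GGUF models can be loaded through the same API. A hedged sketch: the GGUF repository, file name, tokenizer, and the model_file argument follow the NeuralSpeed GGUF loading convention and are assumptions here, not verbatim from this page.

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Assumptions: a GGUF repo/file on the Hugging Face Hub and the base model's tokenizer
# (access to the tokenizer repo may need to be granted on the Hub).
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
model_file = "llama-2-7b-chat.Q4_0.gguf"
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)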

You can also load a PyTorch model from ModelScope.

Note: requires the modelscope package.

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

You can also load low-bit models quantized by the GPTQ/AWQ/RTN/AutoRound algorithms.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig

# Download a Hugging Face GPTQ/AWQ model or use a locally quantized model
model_name = "PATH_TO_MODEL"  # local path to the model
woq_config = GPTQConfig(bits=4)   # use AwqConfig for AWQ models, AutoRoundConfig for AutoRound models
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True) 
outputs = model.generate(inputs)

INT4 Inference (GPU)

import torch  # needed below for torch.float16
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                              device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)

Note: Please refer to the example and script for more details.

Langchain-based extension APIs

Below is the sample code to use the extended Langchain APIs. See more examples.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
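Once the chain is assembled, you can run a query against it. A minimal sketch (the question is illustrative, and the Chroma/HuggingFacePipeline arguments elided above must be filled in first):

# Hypothetical usage of the retrievalQA chain built above.
result = retrievalQA.invoke({"query": "What is Intel Extension for Transformers?"})
print(result["result"])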

🎯Validated Models

You can find the validated models, along with their accuracy and performance, in the Release data or the Medium blog.

📖Documentation

OVERVIEW: NeuralChat | Neural Speed
NEURALCHAT: Chatbot on Intel CPU | Chatbot on Intel GPU | Chatbot on Gaudi | Chatbot on Client | More Notebooks
NEURAL SPEED: Neural Speed | Streaming LLM | Low Precision Kernels | Tensor Parallelism
LLM COMPRESSION: SmoothQuant (INT8) | Weight-only Quantization (INT4/FP4/NF4/INT8) | QLoRA on CPU
GENERAL COMPRESSION: Quantization | Pruning | Distillation | Orchestration | Neural Architecture Search | Export | Metrics | Objectives | Pipeline | Length Adaptive | Early Exit | Data Augmentation
TUTORIALS & RESULTS: Tutorials | LLM List | General Model List | Model Performance

🙌Demo

  • LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

  • LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

View Full Publication List

Additional Content

Acknowledgements

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intel-extension-for-transformers-1.4.tar.gz (106.0 MB)

Uploaded Source

Built Distributions

intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl (10.7 MB)

Uploaded CPython 3.11 Windows x86-64

intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl (44.8 MB)

Uploaded CPython 3.11 manylinux: glibc 2.28+ x86-64

intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl (10.7 MB)

Uploaded CPython 3.10 Windows x86-64

intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl (44.8 MB)

Uploaded CPython 3.10 manylinux: glibc 2.28+ x86-64

intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl (10.7 MB)

Uploaded CPython 3.9 Windows x86-64

intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl (44.8 MB)

Uploaded CPython 3.9 manylinux: glibc 2.28+ x86-64

intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl (10.7 MB)

Uploaded CPython 3.8 Windows x86-64

intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl (44.8 MB)

Uploaded CPython 3.8 manylinux: glibc 2.28+ x86-64

File details

Details for the file intel-extension-for-transformers-1.4.tar.gz.

File metadata

File hashes

Hashes for intel-extension-for-transformers-1.4.tar.gz
Algorithm Hash digest
SHA256 c567ba61f89353b9ef06ab3da8267c0b1bc8e0900cde5979231cfcc14c9e7d90
MD5 5170e1c02114d812913e54476852c6a9
BLAKE2b-256 da5b90bef741b7c3c85285b5ed40b22d61a26358fa09b69efb07cad3e983c4d3

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 0dfc273fa9ca82e8d91de2522c1e804f96c245abeb0da1cc8b781ffd72f31ec5
MD5 d41f394fc6cf9cb879a0f16666b13ca1
BLAKE2b-256 8d1fa20bef47f16b0858ba53518f01b84a0b67f54b296d3b06e10c112b7f8a85

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 4cc5c2bdecb11d210aafa620fda1f89fbf6d3f250f1e302831255c83954fed5d
MD5 2d47933c9577083b91c5cc83d5c01fc7
BLAKE2b-256 bf45bc0a4349e3683496c9b2eec1796cc0bc2997604dcfef67c5c1d7ccf5ae19

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 c79caeab7d6a4f6eac87d3524829bcb2c24808e3c8871443bd67cce9f05ed610
MD5 68e4e282329971e7b65add6b12eb2691
BLAKE2b-256 2bbfed3536938ef98751a2853bb3aeabd1ec68b1d31ce919306993455795d293

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 8ad180a6a248f13f0e5cfe1aeb4361b97aa153191a3e8698e8d65d8c0e951191
MD5 6a74a51028ee144d3d716424eeec289c
BLAKE2b-256 9116195d5779f04607714e0aa8141a6051a85ec067f3866b0e7ebdf2f7482382

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 2a2be4c164940d8655883da851d3112633770ddd5e634d801ef82d9252a411c8
MD5 f54856d3e4355eca6caf57bd6b58f18e
BLAKE2b-256 f69a2dc400f549ad5a80ae668f6e6f6b9a9694b6ab9c43fd3f279e1feaeef463

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 435a7415a891cf1cd3b221210751354d8659134525c3d9f8f44f4aaa769d8011
MD5 2ff87e5fa0e0714e8da5661b769533b1
BLAKE2b-256 c4476ac456cc005f965b0c32a1a1388eb48fe9bfd6b66166049625eb4546ac64

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 40d831f3efb89e1b4277062217ad5583bfa4f55d1c750aa060fb91f5401a30f7
MD5 11dcd8dee4b48c9ca7dc6b2dafb2d016
BLAKE2b-256 2f73c99039a96102e230f5be515570bc67c737742053bbe59dc4a6ff51ecde5b

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 dccf7d9e6cd5f32d1e4ff5fd061f062d940860e002baf6cb689c2b1d04721231
MD5 29281e9459db96eca1a23744e68a9eb6
BLAKE2b-256 adc2d558501e87ed7871ee935814eefec18f7e46f3adce8629e7b6aa3fe30a6f

See more details on using hashes here.
