
Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference on CPU   |   😃Inference on GPU   |   💻Examples   |   📖Documentations

🚀Latest News


🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For system requirements and other installation tips, please refer to the Installation Guide.

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit's key features, such as model compression, low-bit inference, and the NeuralChat chatbot framework, are illustrated in the sections below.

🔓Validated Hardware

Hardware                         | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit)
Intel Gaudi2                     | ✔                  | ✔                  | WIP (FP8)         | -
Intel Xeon Scalable Processors   | ✔                  | ✔                  | ✔ (INT8, FP8)     | ✔ (INT4, FP4, NF4)
Intel Xeon CPU Max Series        | ✔                  | ✔                  | ✔ (INT8, FP8)     | ✔ (INT4, FP4, NF4)
Intel Data Center GPU Max Series | WIP                | WIP                | WIP (INT8)        | ✔ (INT4)
Intel Arc A-Series               | -                  | -                  | WIP (INT8)        | ✔ (INT4)
Intel Core Processors            | -                  | ✔                  | ✔ (INT8, FP8)     | ✔ (INT4, FP4, NF4)

In the table above, "-" means not applicable or not started yet.

🔓Validated Software

Software                     | Fine-Tuning (Full)               | Fine-Tuning (PEFT)               | Inference (8-bit)                | Inference (4-bit)
PyTorch                      | 2.0.1+cpu, 2.0.1a0 (gpu)         | 2.0.1+cpu, 2.0.1a0 (gpu)         | 2.1.0+cpu, 2.0.1a0 (gpu)         | 2.1.0+cpu, 2.0.1a0 (gpu)
Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu           | 2.1.0+cpu, 2.0.110+xpu           | 2.1.0+cpu, 2.0.110+xpu           | 2.1.0+cpu, 2.0.110+xpu
Transformers                 | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU)
Synapse AI                   | 1.13.0                           | 1.13.0                           | 1.13.0                           | 1.13.0
Gaudi2 driver                | 1.13.0-ee32e42                   | 1.13.0-ee32e42                   | 1.13.0-ee32e42                   | 1.13.0-ee32e42
intel-level-zero-gpu         | 1.3.26918.50-736~22.04           | 1.3.26918.50-736~22.04           | 1.3.26918.50-736~22.04           | 1.3.26918.50-736~22.04

Please refer to the detailed requirements for CPU, Gaudi2, and Intel GPU.

🔓Validated OS

Ubuntu 20.04/22.04, CentOS 8.

🌱Getting Started

Chatbot

Below is the sample code to create your chatbot. See more examples.

Serving (OpenAI-compatible RESTful APIs)

NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. You can start the NeuralChat server either from the shell or from Python code.

# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

The NeuralChat service is accessible through the OpenAI client library, curl commands, and the requests library. See more in NeuralChat.
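
As a minimal sketch of querying the running server with the OpenAI Python client: the host, port, and model name below are assumptions, so substitute the values from your neuralchat.yaml config.

from openai import OpenAI

# Hypothetical endpoint; match the host/port in your neuralchat.yaml.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")
resp = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",  # illustrative model name
    messages=[{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
)
print(resp.choices[0].message.content)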

Offline

from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
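
By default, build_chatbot loads a preconfigured model. As a hedged sketch, a specific model can be selected via PipelineConfig, which the NeuralChat examples expose alongside build_chatbot; the model id below is illustrative.

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig

# PipelineConfig is assumed to accept a Hugging Face model id or local path.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
print(chatbot.predict("Tell me about Intel Xeon Scalable Processors."))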

Transformers-based extension APIs

Below is the sample code to use the extended Transformers APIs. See more examples.

INT4 Inference (CPU)

We encourage you to install NeuralSpeed to get the latest features (e.g., GGUF support) for low-bit LLM inference on CPUs. You may also use v1.3 without NeuralSpeed by following this document.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
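
The example above stops at generate; to turn the generated token ids back into text, standard Transformers decoding applies (assuming the tokenizer and outputs from the snippet above):

# Decode the generated token ids back to text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))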

You can also load a GGUF-format model from Hugging Face; only the Q4_0 GGUF format is supported for now.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
gguf_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file=gguf_file)
outputs = model.generate(inputs)

You can also load a PyTorch model from ModelScope.

Note: this requires the modelscope package.

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

You can also load low-bit models quantized with the GPTQ/AWQ/RTN/AutoRound algorithms.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Hugging Face GPTQ/AWQ model, or the path to a locally quantized model
model_name = "MODEL_NAME_OR_PATH"
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs)
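
If you want to produce such a low-bit model yourself, a hedged sketch using the extension's RTN weight-only quantization config follows; RtnConfig is assumed to be available in recent releases, and the bits argument and model id are illustrative.

from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

# 4-bit round-to-nearest (RTN) weight-only quantization, applied at load time.
woq_config = RtnConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(
    "Intel/neural-chat-7b-v3-1",
    quantization_config=woq_config,
)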

INT4 Inference (GPU)

import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch

device_map = "xpu"
model_name ="Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                              device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)

output = model.generate(inputs)
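
To inspect the generated text, a minimal follow-up (assuming the tokenizer defined above) is to move the output back to the host and decode it:

# Move the generated ids off the XPU device and decode them to text.
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))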

Note: Please refer to the example and script for more details.

Langchain-based extension APIs

Below is the sample code to use the extended Langchain APIs. See more examples.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
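
A fuller, hedged sketch of the retrieval pipeline above follows; the embedding model, toy corpus, and LLM choice are illustrative assumptions, and Chroma.from_texts is assumed to be inherited from the LangChain base class.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Build a tiny vector store from an in-memory corpus.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_texts(
    ["Intel Xeon Scalable Processors support AMX instructions."],
    embedding=embeddings,
)
retriever = VectorStoreRetriever(vectorstore=vectorstore)

# Wrap a local Hugging Face pipeline as the LLM for the QA chain.
llm = HuggingFacePipeline.from_model_id(
    model_id="Intel/neural-chat-7b-v3-1",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)
retrievalQA = RetrievalQA.from_llm(llm=llm, retriever=retriever)
print(retrievalQA.invoke({"query": "What do Intel Xeon processors support?"})["result"])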

🎯Validated Models

You can access the validated models, accuracy and performance from Release data or Medium blog.

📖Documentation

OVERVIEW: NeuralChat | Neural Speed
NEURALCHAT: Chatbot on Intel CPU | Chatbot on Intel GPU | Chatbot on Gaudi | Chatbot on Client | More Notebooks
NEURAL SPEED: Neural Speed | Streaming LLM | Low Precision Kernels | Tensor Parallelism
LLM COMPRESSION: SmoothQuant (INT8) | Weight-only Quantization (INT4/FP4/NF4/INT8) | QLoRA on CPU
GENERAL COMPRESSION: Quantization | Pruning | Distillation | Orchestration | Neural Architecture Search | Export | Metrics | Objectives | Pipeline | Length Adaptive | Early Exit | Data Augmentation
TUTORIALS & RESULTS: Tutorials | LLM List | General Model List | Model Performance

🙌Demo

  • LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

  • LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

View Full Publication List

Additional Content

Acknowledgements

💁Collaborations

You are welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intel_extension_for_transformers-1.4.2.tar.gz (106.5 MB, Source)

Built Distributions

intel_extension_for_transformers-1.4.2-cp311-cp311-win_amd64.whl (11.0 MB, CPython 3.11, Windows x86-64)
intel_extension_for_transformers-1.4.2-cp311-cp311-manylinux_2_28_x86_64.whl (45.3 MB, CPython 3.11, manylinux: glibc 2.28+ x86-64)
intel_extension_for_transformers-1.4.2-cp310-cp310-win_amd64.whl (11.0 MB, CPython 3.10, Windows x86-64)
intel_extension_for_transformers-1.4.2-cp310-cp310-manylinux_2_28_x86_64.whl (45.3 MB, CPython 3.10, manylinux: glibc 2.28+ x86-64)
intel_extension_for_transformers-1.4.2-cp39-cp39-win_amd64.whl (11.0 MB, CPython 3.9, Windows x86-64)
intel_extension_for_transformers-1.4.2-cp39-cp39-manylinux_2_28_x86_64.whl (45.3 MB, CPython 3.9, manylinux: glibc 2.28+ x86-64)
intel_extension_for_transformers-1.4.2-cp38-cp38-win_amd64.whl (11.0 MB, CPython 3.8, Windows x86-64)
intel_extension_for_transformers-1.4.2-cp38-cp38-manylinux_2_28_x86_64.whl (45.3 MB, CPython 3.8, manylinux: glibc 2.28+ x86-64)

File details

Hashes for intel_extension_for_transformers-1.4.2.tar.gz
SHA256: 946d74edec0dc55be1aa248f0f64d86aac558f782b5b33b4de47313681b48e0c
MD5: 20fbd4689ec2c7472697f3cf3d6fe470
BLAKE2b-256: 091ddd28044cc9a4fb7d152aef0bbb3d78d631504609f1bfde512557daae54ba

Hashes for intel_extension_for_transformers-1.4.2-cp311-cp311-win_amd64.whl
SHA256: ee72ff99be4528c6e2e600c0b5c770c3986d1d07525d9f1adf25cbd2246b3acd
MD5: 279c18c5466b7f2464b6a3e3216d9bdc
BLAKE2b-256: bf245d0289f3ed91a135af1da1548ca03caadb0b58edf254edf975eb92facc83

Hashes for intel_extension_for_transformers-1.4.2-cp311-cp311-manylinux_2_28_x86_64.whl
SHA256: f9f5d6f1a24a817244a2625c3486f67b181cc6279b14ab9a6a3cfe22d663bc02
MD5: 80a905bf9b29b5c39b702e52dda4baba
BLAKE2b-256: ce738ab583a1dec951684e42b71fd0058c1c9bfc7ae59c42f741d6e698bcf978

Hashes for intel_extension_for_transformers-1.4.2-cp310-cp310-win_amd64.whl
SHA256: 1bf320fd1bc2c1642a19268dcbc4b2517292cfc89285403cf066dfe2a23d64d0
MD5: 7691bd8131cedb3e2a8a3f71557b3406
BLAKE2b-256: 61d023d785db0d59da3c676d16d70f5b97235cdc6d6caac0dbf5efd1ede5baba

Hashes for intel_extension_for_transformers-1.4.2-cp310-cp310-manylinux_2_28_x86_64.whl
SHA256: ef87d2d47be3316aae96a479ad73b3397ae63f38225c728050ac01fab482abc2
MD5: 9641f896b26ca628aa319136bc413bd3
BLAKE2b-256: 78dc1b571b4cf41070708e7aa0b2c9e3054c4c3b480c2f63517a6e9fda42ee57

Hashes for intel_extension_for_transformers-1.4.2-cp39-cp39-win_amd64.whl
SHA256: 0b2437d6d7afb5c46c587410e8dcd31391ffee56aae377c7ad8dd962d4094d3a
MD5: 2f69656f9a157a3fb67df260d5adf074
BLAKE2b-256: b937d65570275174553d0a7a238d3b8f08e9ce26272d534136c762bcc02d4270

Hashes for intel_extension_for_transformers-1.4.2-cp39-cp39-manylinux_2_28_x86_64.whl
SHA256: f91eb6848d6fe6ba6cf0e1232c55022e3fbcc3656b42f490cbdaa6a5791ea2f9
MD5: 4988e63df0c6df212c3ab1116e511dab
BLAKE2b-256: 4c06492809535ee03c8e213c7510fefd8ce75931cd1f0a411cb318c251310d0e

Hashes for intel_extension_for_transformers-1.4.2-cp38-cp38-win_amd64.whl
SHA256: 165c9b4ba577ebc02d7d860f4d071c29e7f4982be53c80ff2fd5ba95ac711aaa
MD5: 5debe603790f179e1515403b5ce88f9e
BLAKE2b-256: 00f6401e202aa9c6df1bce66ef9732f397be6b14f1ff079c48cb64b99b284914

Hashes for intel_extension_for_transformers-1.4.2-cp38-cp38-manylinux_2_28_x86_64.whl
SHA256: 0b56d2a3081acfa65bfba51f65eabd3d8b0c0ba8919363c58200402211903e36
MD5: d73076d7015edacd1fd3ea75bbf4fbf6
BLAKE2b-256: 9d811053e18663de1eca0d3e185524ef197f9a6a91aabc7e0e04383a0910694c
