
Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference on CPU   |   😃Inference on GPU   |   💻Examples   |   📖Documentations

🚀Latest News


🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For system requirements and other installation tips, please refer to the Installation Guide.

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the key features and examples summarized below.

🔓Validated Hardware

| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Data Center GPU Max Series | WIP | WIP | WIP (INT8) | ✔ (INT4) |
| Intel Arc A-Series | - | - | WIP (INT8) | ✔ (INT4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |

In the table above, "-" means not applicable or not started yet.

🔓Validated Software

| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) |
| Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu |
| Transformers | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
| intel-level-zero-gpu | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 |

Please refer to the detailed requirements for CPU, Gaudi2, and Intel GPU.

🔓Validated OS

Ubuntu 20.04/22.04, CentOS 8.

🌱Getting Started

Chatbot

Below is the sample code to create your chatbot. See more examples.

Serving (OpenAI-compatible RESTful APIs)

NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for the OpenAI APIs. You can start the NeuralChat server using either a shell command or Python code.

# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

The NeuralChat service is accessible through the OpenAI client library, curl commands, and the requests library. See more in NeuralChat.
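For example, once the server is running, you can point the official OpenAI Python client at it. This is a minimal sketch: the port and model name below are assumptions for a typical local setup, so match them to your neuralchat.yaml.

from openai import OpenAI

# Point the client at the local NeuralChat server (adjust host/port to your config)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local-server")

response = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",  # assumed model name; use the one your server actually serves
    messages=[{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
)
print(response.choices[0].message.content)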

Offline

from intel_extension_for_transformers.neural_chat import build_chatbot

# Build a chatbot with the default configuration and run a single-turn query
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)

Transformers-based extension APIs

Below is the sample code to use the extended Transformers APIs. See more examples.

INT4 Inference (CPU)

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit=True applies weight-only INT4 quantization when loading the model
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
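Note that generate returns token ids rather than text. Continuing the example above, a minimal sketch to recover the generated text with the same tokenizer:

# Decode the generated ids back into a string
print(tokenizer.decode(outputs[0], skip_special_tokens=True))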

You can also load low-bit models quantized with the GPTQ/AWQ/RTN/AutoRound algorithms.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Download a Hugging Face GPTQ/AWQ model, or point to a locally quantized model
model_name = "PATH_TO_MODEL"  # local path to the model
woq_config = WeightOnlyQuantConfig(use_gptq=True)  # use_awq=True for AWQ; use_autoround=True for AutoRound
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
outputs = model.generate(inputs)

INT4 Inference (GPU)

import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

# Load the model with 4-bit weights on the Intel GPU
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             device_map=device_map, load_in_4bit=True)

# Optimize the loaded model with Intel Extension for PyTorch (weight-only quantization path)
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)

Note: Please refer to the example and script for more details.

Langchain-based extension APIs

Below is the sample code to use the extended Langchain APIs. See more examples.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
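For a more complete picture, below is a minimal end-to-end sketch of a retrieval QA pipeline built around the extended Chroma vectorstore. It is illustrative only: the document path, embedding model, LLM, and pipeline settings are assumptions, and the TextLoader/HuggingFaceEmbeddings helpers come from langchain_community rather than from this toolkit.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Load and index a local document (path and embedding model are placeholders)
docs = TextLoader("sample.txt").load()
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)

# Back the QA chain with a local Hugging Face pipeline as the LLM
llm = HuggingFacePipeline.from_model_id(
    model_id="Intel/neural-chat-7b-v3-1",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 128},
)

retriever = VectorStoreRetriever(vectorstore=vectorstore)
retrievalQA = RetrievalQA.from_llm(llm=llm, retriever=retriever)
print(retrievalQA.run("What does the document say?"))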

🎯Validated Models

You can access the validated models, accuracy, and performance from the Release data or the Medium blog.

📖Documentation

OVERVIEW
NeuralChat   |   Neural Speed

NEURALCHAT
Chatbot on Intel CPU   |   Chatbot on Intel GPU   |   Chatbot on Gaudi   |   Chatbot on Client   |   More Notebooks

NEURAL SPEED
Neural Speed   |   Streaming LLM   |   Low Precision Kernels   |   Tensor Parallelism

LLM COMPRESSION
SmoothQuant (INT8)   |   Weight-only Quantization (INT4/FP4/NF4/INT8)   |   QLoRA on CPU

GENERAL COMPRESSION
Quantization   |   Pruning   |   Distillation   |   Orchestration   |   Neural Architecture Search   |   Export   |   Metrics   |   Objectives   |   Pipeline   |   Length Adaptive   |   Early Exit   |   Data Augmentation

TUTORIALS & RESULTS
Tutorials   |   LLM List   |   General Model List   |   Model Performance

🙌Demo

  • LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

  • LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

View Full Publication List

Additional Content

Acknowledgements

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • intel-extension-for-transformers-1.3.2.tar.gz (97.2 MB, Source)

Built Distributions

  • intel_extension_for_transformers-1.3.2-cp311-cp311-win_amd64.whl (10.6 MB, CPython 3.11, Windows x86-64)
  • intel_extension_for_transformers-1.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB, CPython 3.11, manylinux: glibc 2.17+ x86-64)
  • intel_extension_for_transformers-1.3.2-cp310-cp310-win_amd64.whl (10.6 MB, CPython 3.10, Windows x86-64)
  • intel_extension_for_transformers-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB, CPython 3.10, manylinux: glibc 2.17+ x86-64)
  • intel_extension_for_transformers-1.3.2-cp39-cp39-win_amd64.whl (10.6 MB, CPython 3.9, Windows x86-64)
  • intel_extension_for_transformers-1.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB, CPython 3.9, manylinux: glibc 2.17+ x86-64)
  • intel_extension_for_transformers-1.3.2-cp38-cp38-manylinux_2_28_x86_64.whl (44.7 MB, CPython 3.8, manylinux: glibc 2.28+ x86-64)

File hashes

intel-extension-for-transformers-1.3.2.tar.gz
  SHA256: 556f95b92d96485a1611cda2b728eaaf01f74a2bf9121312cfa6f5e7a56d9812
  MD5: 43a544e2ef556d0e5e4e81e92a058d26
  BLAKE2b-256: 99876b4029f6885fa7b78e592a0c80c0898dcf18fe8b0033b9c09e41435bb38b

intel_extension_for_transformers-1.3.2-cp311-cp311-win_amd64.whl
  SHA256: 8e5b3c37a7fd3ae1513acaebc73e4152fb7d7bc5491b621eea2c8a6e234d0c57
  MD5: b333c8e5b53ad379bac9f7d4076265b1
  BLAKE2b-256: 6765a6a25793266e2acb1d92e411fad986f3e967a1ce921a6a995dda08754154

intel_extension_for_transformers-1.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 7239cba1a3a2ba044aad90c06934811c0c656f452df12e90e1aa0f362391b2db
  MD5: 5e5d8dcc9a0d2266280ece8f143c1a68
  BLAKE2b-256: 52d2e4b9323e4155f38986ccb5bde5f588aed57c1db046fd14e32c16a0aed8fe

intel_extension_for_transformers-1.3.2-cp310-cp310-win_amd64.whl
  SHA256: 97ab3b4ee6a0e168c52c0eaa40046e08c9e9b4b4a509c1b94c00135a7d7e8d5c
  MD5: d2d8caa51ee32cb6cbbedb4b5f2b38ca
  BLAKE2b-256: 06dcb588c18e0f227ab2d510ba4eb4a6ff28d388b55b1b11cebd6270ee2e1974

intel_extension_for_transformers-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 1eca33964423a6cbd91f68eccfdf6c223e57a6e3ed4f71649d5f9f677902f15c
  MD5: b4b885573a6a21c52ea8589bbd8c9650
  BLAKE2b-256: 187f03ffbf4b226ce81004221fa46da47b11e5199119331600445f75ae2ab495

intel_extension_for_transformers-1.3.2-cp39-cp39-win_amd64.whl
  SHA256: ec4fb3ed8e2128cecd6cc06c02adea02723c34918beb882ee4e203114cf1cfff
  MD5: f5c3a3df65f9e26f52d9325109cb8052
  BLAKE2b-256: d7b45edbed9a77d95813bffa183d086588573b81a62953b0c43718b1ce35c5ad

intel_extension_for_transformers-1.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 5e16c2ea9a070bd71d4b158630f395192332b2a00701f634e95f55fdae1a2c0a
  MD5: 7ab29ea1364ae8f8220a334539164cb6
  BLAKE2b-256: dc4e4712821e1689ed417323ce1dc4f01f4e51046a8d915d2e50daef1d4cce8a

intel_extension_for_transformers-1.3.2-cp38-cp38-manylinux_2_28_x86_64.whl
  SHA256: 717c7aff07ade6ec81ceb7f63bc7f2fd94da7399f7fa7f60833f7df174cf0e55
  MD5: 4c32a6432bbd25e5c52ebf56d76ea405
  BLAKE2b-256: 4c9a856ea366b769194ce406d3aeb54baf68b3d4e009047bfd3bc60aeee8e67a

See more details on using hashes here.
