Repository of Intel® Intel Extension for Transformers

These details have not been verified by PyPI

Project links

Homepage

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

🚀Latest News

[2024/01] Supported INT4 inference on Intel GPUs including Intel Data Center GPU Max Series (e.g., PVC) and Intel Arc A-Series (e.g., ARC). Check out the examples and scripts.
[2024/01] Demonstrated Intel Hybrid Copilot in CES 2024 Great Minds Session "Bringing the Limitless Potential of AI Everywhere".
[2023/12] Supported QLoRA on CPUs to make fine-tuning on client CPU possible. Check out the blog and readme for more details.
[2023/11] Released top-1 7B-sized LLM NeuralChat-v3-1 and DPO dataset. Check out the nice video published by WorldofAI.
[2023/11] Published a 4-bit chatbot demo (based on NeuralChat) available on Intel Hugging Face Space. Welcome to have a try! To setup the demo locally, please follow the instructions.

🏃Installation

Quick Install from Pypi

pip install intel-extension-for-transformers

For system requirements and other installation tips, please refer to Installation Guide

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the below key features and examples:

Seamless user experience of model compressions on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper Fast Distilbert on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and NeurIPS 2021's paper Prune Once for All: Sparse Pre-Trained Language Models)
Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, Flan-T5, and end-to-end workflows such as SetFit-based text classification and document level sentiment analysis (DLSA)
NeuralChat, a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins such as Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail. This framework supports Intel Gaudi2/CPU/GPU.
Inference of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B. Support AMX, VNNI, AVX512F and AVX2 instruction set. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed Sapphire Rapids.

🔓Validated Hardware

Hardware	Fine-Tuning		Inference
Hardware	Full	PEFT	8-bit	4-bit
Intel Gaudi2	✔	✔	WIP (FP8)	-
Intel Xeon Scalable Processors	✔	✔	✔ (INT8, FP8)	✔ (INT4, FP4, NF4)
Intel Xeon CPU Max Series	✔	✔	✔ (INT8, FP8)	✔ (INT4, FP4, NF4)
Intel Data Center GPU Max Series	WIP	WIP	WIP (INT8)	✔ (INT4)
Intel Arc A-Series	-	-	WIP (INT8)	✔ (INT4)
Intel Core Processors	-	✔	✔ (INT8, FP8)	✔ (INT4, FP4, NF4)

In the table above, "-" means not applicable or not started yet.

🔓Validated Software

Software	Fine-Tuning		Inference
Software	Full	PEFT	8-bit	4-bit
PyTorch	2.0.1+cpu, 2.0.1a0 (gpu)	2.0.1+cpu, 2.0.1a0 (gpu)	2.1.0+cpu, 2.0.1a0 (gpu)	2.1.0+cpu, 2.0.1a0 (gpu)
Intel® Extension for PyTorch	2.1.0+cpu, 2.0.110+xpu	2.1.0+cpu, 2.0.110+xpu	2.1.0+cpu, 2.0.110+xpu	2.1.0+cpu, 2.0.110+xpu
Transformers	4.35.2(CPU), 4.31.0 (Intel GPU)	4.35.2(CPU), 4.31.0 (Intel GPU)	4.35.2(CPU), 4.31.0 (Intel GPU)	4.35.2(CPU), 4.31.0 (Intel GPU)
Synapse AI	1.13.0	1.13.0	1.13.0	1.13.0
Gaudi2 driver	1.13.0-ee32e42	1.13.0-ee32e42	1.13.0-ee32e42	1.13.0-ee32e42
intel-level-zero-gpu	1.3.26918.50-736~22.04	1.3.26918.50-736~22.04	1.3.26918.50-736~22.04	1.3.26918.50-736~22.04

Please refer to the detailed requirements in CPU, Gaudi2, Intel GPU.

🔓Validated OS

Ubuntu 20.04/22.04, Centos 8.

🌱Getting Started

Chatbot

Below is the sample code to create your chatbot. See more examples.

Serving (OpenAI-compatible RESTful APIs)

NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. You can start NeuralChat server either using the Shell command or Python code.

# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml

# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

NeuralChat service can be accessible through OpenAI client library, curl commands, and requests library. See more in NeuralChat.

Offline

from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")

Transformers-based extension APIs

Below is the sample code to use the extended Transformers APIs. See more examples.

INT4 Inference (CPU)

We encourage you to install NeuralSpeed to get the latest features (e.g., GGUF support) of LLM low-bit inference on CPUs. You may also want to use v1.3 without NeuralSpeed by following the document

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)

You can also load PyTorch Model from Modelscope

Note:require modelscope

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

You can also load the low-bit model quantized by GPTQ/AWQ/RTN/AutoRound algorithm.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig

# Download Hugging Face GPTQ/AWQ model or use local quantize model
model_name = "PATH_TO_MODEL"  # local path to model
woq_config = GPTQConfig(bits=4)   # use AwqConfig for AWQ models, and AutoRoundConfig for AutoRound models
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True) 
outputs = model.generate(inputs)

INT4 Inference (GPU)

import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"
model_name ="Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                              device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)

Note: Please refer to the example and script for more details.

Langchain-based extension APIs

Below is the sample code to use the extended Langchain APIs. See more examples.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)

🎯Validated Models

You can access the validated models, accuracy and performance from Release data or Medium blog.

📖Documentation

OVERVIEW
NeuralChat		Neural Speed
NEURALCHAT
Chatbot on Intel CPU	Chatbot on Intel GPU		Chatbot on Gaudi
Chatbot on Client		More Notebooks
NEURAL SPEED
Neural Speed	Streaming LLM	Low Precision Kernels		Tensor Parallelism
LLM COMPRESSION
SmoothQuant (INT8)	Weight-only Quantization (INT4/FP4/NF4/INT8)		QLoRA on CPU
GENERAL COMPRESSION
Quantization	Pruning	Distillation		Orchestration
Neural Architecture Search	Export	Metrics		Objectives
Pipeline	Length Adaptive	Early Exit		Data Augmentation
TUTORIALS & RESULTS
Tutorials	LLM List	General Model List		Model Performance

🙌Demo

LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

Blog of Intel Developer News: Use the neural-chat-7b Model for Advanced Fraud Detection: An AI-Driven Approach in Cybersecurity (March 2024)
CES 2024: CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo (Jan 2024)
Blog published on Medium: Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling (Dec 2023)
NeurIPS'2023 on Efficient Natural Language and Speech Processing: Efficient LLM Inference on CPUs (Nov 2023)
Blog published on Hugging Face: Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance (Nov 2023)
Blog published on VMware: AI without GPUs: A Technical Brief for VMware Private AI with Intel (Nov 2023)

View Full Publication List

Additional Content

Acknowledgements

Excellent open-source projects: bitsandbytes, FastChat, fastRAG, ggml, gptq, llama.cpp, lm-evauation-harness, peft, trl, streamingllm and many others.
Thanks to all the contributors.

💁Collaborations

Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach us, and we look forward to our collaborations on Intel Extension for Transformers!

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

1.4.2

May 24, 2024

1.4.1

Apr 21, 2024

This version

1.4

Apr 3, 2024

1.3.2

Feb 24, 2024

1.3.1

Jan 19, 2024

1.3

Dec 22, 2023

1.2.2

Dec 6, 2023

1.2.1

Nov 8, 2023

1.2

Sep 26, 2023

1.1.1

Sep 6, 2023

1.1

Jul 14, 2023

1.0.1

Jun 2, 2023

1.0

Apr 4, 2023

1.0b0 pre-release

Dec 11, 2022

1.0a0 pre-release

Nov 23, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intel-extension-for-transformers-1.4.tar.gz (106.0 MB view details)

Uploaded Apr 3, 2024 Source

Built Distributions

intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl (10.7 MB view details)

Uploaded Apr 3, 2024 CPython 3.11 Windows x86-64

intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl (44.8 MB view details)

Uploaded Apr 3, 2024 CPython 3.11 manylinux: glibc 2.28+ x86-64

intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl (10.7 MB view details)

Uploaded Apr 3, 2024 CPython 3.10 Windows x86-64

intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl (44.8 MB view details)

Uploaded Apr 3, 2024 CPython 3.10 manylinux: glibc 2.28+ x86-64

intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl (10.7 MB view details)

Uploaded Apr 3, 2024 CPython 3.9 Windows x86-64

intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl (44.8 MB view details)

Uploaded Apr 3, 2024 CPython 3.9 manylinux: glibc 2.28+ x86-64

intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl (10.7 MB view details)

Uploaded Apr 3, 2024 CPython 3.8 Windows x86-64

intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl (44.8 MB view details)

Uploaded Apr 3, 2024 CPython 3.8 manylinux: glibc 2.28+ x86-64

File details

Details for the file intel-extension-for-transformers-1.4.tar.gz.

File metadata

Download URL: intel-extension-for-transformers-1.4.tar.gz
Upload date: Apr 3, 2024
Size: 106.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel-extension-for-transformers-1.4.tar.gz
Algorithm	Hash digest
SHA256	`c567ba61f89353b9ef06ab3da8267c0b1bc8e0900cde5979231cfcc14c9e7d90`
MD5	`5170e1c02114d812913e54476852c6a9`
BLAKE2b-256	`da5b90bef741b7c3c85285b5ed40b22d61a26358fa09b69efb07cad3e983c4d3`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl
Upload date: Apr 3, 2024
Size: 10.7 MB
Tags: CPython 3.11, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp311-cp311-win_amd64.whl
Algorithm	Hash digest
SHA256	`0dfc273fa9ca82e8d91de2522c1e804f96c245abeb0da1cc8b781ffd72f31ec5`
MD5	`d41f394fc6cf9cb879a0f16666b13ca1`
BLAKE2b-256	`8d1fa20bef47f16b0858ba53518f01b84a0b67f54b296d3b06e10c112b7f8a85`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl
Upload date: Apr 3, 2024
Size: 44.8 MB
Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`4cc5c2bdecb11d210aafa620fda1f89fbf6d3f250f1e302831255c83954fed5d`
MD5	`2d47933c9577083b91c5cc83d5c01fc7`
BLAKE2b-256	`bf45bc0a4349e3683496c9b2eec1796cc0bc2997604dcfef67c5c1d7ccf5ae19`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl
Upload date: Apr 3, 2024
Size: 10.7 MB
Tags: CPython 3.10, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp310-cp310-win_amd64.whl
Algorithm	Hash digest
SHA256	`c79caeab7d6a4f6eac87d3524829bcb2c24808e3c8871443bd67cce9f05ed610`
MD5	`68e4e282329971e7b65add6b12eb2691`
BLAKE2b-256	`2bbfed3536938ef98751a2853bb3aeabd1ec68b1d31ce919306993455795d293`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl
Upload date: Apr 3, 2024
Size: 44.8 MB
Tags: CPython 3.10, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`8ad180a6a248f13f0e5cfe1aeb4361b97aa153191a3e8698e8d65d8c0e951191`
MD5	`6a74a51028ee144d3d716424eeec289c`
BLAKE2b-256	`9116195d5779f04607714e0aa8141a6051a85ec067f3866b0e7ebdf2f7482382`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl
Upload date: Apr 3, 2024
Size: 10.7 MB
Tags: CPython 3.9, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp39-cp39-win_amd64.whl
Algorithm	Hash digest
SHA256	`2a2be4c164940d8655883da851d3112633770ddd5e634d801ef82d9252a411c8`
MD5	`f54856d3e4355eca6caf57bd6b58f18e`
BLAKE2b-256	`f69a2dc400f549ad5a80ae668f6e6f6b9a9694b6ab9c43fd3f279e1feaeef463`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl
Upload date: Apr 3, 2024
Size: 44.8 MB
Tags: CPython 3.9, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`435a7415a891cf1cd3b221210751354d8659134525c3d9f8f44f4aaa769d8011`
MD5	`2ff87e5fa0e0714e8da5661b769533b1`
BLAKE2b-256	`c4476ac456cc005f965b0c32a1a1388eb48fe9bfd6b66166049625eb4546ac64`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl
Upload date: Apr 3, 2024
Size: 10.7 MB
Tags: CPython 3.8, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp38-cp38-win_amd64.whl
Algorithm	Hash digest
SHA256	`40d831f3efb89e1b4277062217ad5583bfa4f55d1c750aa060fb91f5401a30f7`
MD5	`11dcd8dee4b48c9ca7dc6b2dafb2d016`
BLAKE2b-256	`2f73c99039a96102e230f5be515570bc67c737742053bbe59dc4a6ff51ecde5b`

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl.

File metadata

Download URL: intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl
Upload date: Apr 3, 2024
Size: 44.8 MB
Tags: CPython 3.8, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for intel_extension_for_transformers-1.4-cp38-cp38-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`dccf7d9e6cd5f32d1e4ff5fd061f062d940860e002baf6cb689c2b1d04721231`
MD5	`29281e9459db96eca1a23744e68a9eb6`
BLAKE2b-256	`adc2d558501e87ed7871ee935814eefec18f7e46f3adce8629e7b6aa3fe30a6f`

See more details on using hashes here.

intel-extension-for-transformers 1.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

🚀Latest News

🏃Installation

Quick Install from Pypi

🌟Introduction

🔓Validated Hardware

🔓Validated Software

🔓Validated OS

🌱Getting Started

Chatbot

Serving (OpenAI-compatible RESTful APIs)

Offline

Transformers-based extension APIs

INT4 Inference (CPU)

INT4 Inference (GPU)

Langchain-based extension APIs

🎯Validated Models

📖Documentation

🙌Demo

📃Selected Publications/Events

Additional Content

Acknowledgements

💁Collaborations

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes