Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference   |   💻Examples   |   📖Documentations

🚀Latest News

  • [2023/12] Supported QLoRA on CPUs to make fine-tuning on a client CPU possible. Check out the blog and readme for more details.
  • [2023/11] Demonstrated up to 3x LLM inference speedup using Assisted Generation (also called Speculative Decoding) from Hugging Face with Intel optimizations! Check out more details.
  • [2023/11] Refreshed top-1 7B-sized LLM by releasing NeuralChat-v3-1. Check out the nice video published by WorldofAI.
  • [2023/11] Released NeuralChat-v3, new top-1 7B-sized LLM available on Hugging Face. The model is fine-tuned on Intel Gaudi2 with supervised fine-tuning and direct preference optimization. Check out the blog.
  • [2023/11] Published a 4-bit chatbot demo (based on NeuralChat) available on Intel Hugging Face Space. Give it a try! To set up the demo locally, please follow the instructions.
  • [2023/11] Released fast, accurate, and infinite LLM inference with improved StreamingLLM on Intel CPUs!
  • [2023/11] Our paper Efficient LLM Inference on CPUs has been accepted by NeurIPS'23 on Efficient Natural Language and Speech Processing. Thanks to all the collaborators!
  • [2023/10] LLM runtime, an Intel-optimized GGML-compatible runtime, demonstrates up to 15x performance gain in first-token generation and 1.5x in subsequent-token generation over the default llama.cpp.
  • [2023/10] LLM runtime now supports LLM inference with infinite-length inputs up to 4 million tokens, inspired by StreamingLLM.
  • [2023/09] NeuralChat has been showcased in Intel Innovation’23 Keynote and Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
  • [2023/08] NeuralChat supports custom chatbot development and deployment within minutes on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out Notebooks.
  • [2023/07] LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation Page.

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the key features and examples described below.

🔓Validated Hardware

| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |

In the table above, "-" means not applicable or not started yet.

Validated Software

| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu | 2.0.1+cpu | 2.1.0+cpu | 2.1.0+cpu |
| Intel® Extension for PyTorch | 2.1.0+cpu | 2.1.0+cpu | 2.1.0+cpu | 2.1.0+cpu |
| Transformers | 4.35.2 | 4.35.2 | 4.35.2 | 4.35.2 |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |

Please refer to the detailed requirements for CPU and Gaudi2.

🌱Getting Started

Below is the sample code to create your chatbot. See more examples.

Chatbot

OpenAI-Compatible RESTful APIs

NeuralChat provides OpenAI-compatible RESTful APIs for LLM inference, so you can use NeuralChat as a drop-in replacement for the OpenAI APIs. You can start the NeuralChat server with either a shell command or Python code.

Using Shell Command:

neuralchat_server start --config_file ./server/config/neuralchat.yaml

Using Python Code:

from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

The NeuralChat service is also accessible through the OpenAI client library, curl commands, and the requests library. See more in NeuralChat.
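
As an example, below is a minimal sketch that queries the running server with the requests library; the host, port, and model id are assumptions for a typical local setup (the actual values come from your neuralchat.yaml):

import requests

# Minimal chat-completion request against the local NeuralChat server.
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",   # host/port are set in neuralchat.yaml
    json={
        "model": "Intel/neural-chat-7b-v3-1",      # model id is illustrative
        "messages": [
            {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])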

NeuralChat Python API

# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
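
By default, build_chatbot loads the default NeuralChat model. To pick a specific model, you can pass a pipeline configuration instead; a minimal sketch, assuming the PipelineConfig class and its model_name_or_path parameter (check the NeuralChat documentation for your installed version):

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig

# Build the chatbot against an explicit model instead of the default one.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")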

Transformers-based extension APIs

ITREX enhances the user experience for compressing models by extending the capabilities of Hugging Face transformers APIs. Below is the sample code to enable weight-only INT4/INT8 inference. See more examples.

INT4 Inference (CPU only)

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

INT8 Inference (CPU only)

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
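
Beyond the load_in_4bit/load_in_8bit shortcuts, the weight data type can be chosen explicitly. Below is a minimal sketch, assuming the WeightOnlyQuantConfig class exported by intel_extension_for_transformers.transformers and its weight_dtype parameter; "nf4" is an assumed valid option, mirroring the NF4 support listed under Validated Hardware:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

# Assumed API: select the weight-only quantization data type (NF4 here)
# instead of relying on the load_in_4bit/load_in_8bit shortcuts.
woq_config = WeightOnlyQuantConfig(weight_dtype="nf4")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)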

Langchain-based extension APIs

ITREX provides a comprehensive suite of LangChain-based extension APIs, including advanced retrievers, embedding models, and vector stores. These enhancements are carefully crafted to expand the capabilities of the original LangChain API, ultimately boosting overall performance. This extension is specifically tailored to enhance the functionality and performance of RAG (Retrieval-Augmented Generation).

Below is the sample code to enable enhanced Chroma API. See more examples.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
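
Once the chain is assembled with a real vector store and pipeline, querying it is a one-liner; the question text below is illustrative:

# Query the chain over the indexed documents.
answer = retrievalQA.run("What does Intel Extension for Transformers optimize?")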

🎯Validated Models

You can access the latest INT4 performance and accuracy data in the int4 blog.

Additionally, we are preparing to introduce Baichuan, Mistral, and other models into Neural Speed (Intel-optimized llama.cpp). For comprehensive, though not necessarily up-to-date, accuracy and performance data, please refer to the Release data.

📖Documentation

OVERVIEW: NeuralChat | Neural Speed
NEURALCHAT: Chatbot on Intel CPU | Chatbot on Intel GPU | Chatbot on Gaudi | Chatbot on Client | More Notebooks
NEURAL SPEED: Neural Speed | Streaming LLM | Low Precision Kernels | Tensor Parallelism
LLM COMPRESSION: SmoothQuant (INT8) | Weight-only Quantization (INT4/FP4/NF4/INT8) | QLoRA on CPU
GENERAL COMPRESSION: Quantization | Pruning | Distillation | Orchestration | Neural Architecture Search | Export | Metrics | Objectives | Pipeline | Length Adaptive | Early Exit | Data Augmentation
TUTORIALS & RESULTS: Tutorials | LLM List | General Model List | Model Performance

🙌Demo

  • LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

  • LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

View Full Publication List.

Additional Content

Acknowledgements

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • intel-extension-for-transformers-1.3.1.tar.gz (96.8 MB, source)

Built Distributions

  • intel_extension_for_transformers-1.3.1-cp311-cp311-win_amd64.whl (10.6 MB, CPython 3.11, Windows x86-64)
  • intel_extension_for_transformers-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB, CPython 3.11, manylinux: glibc 2.17+ x86-64)
  • intel_extension_for_transformers-1.3.1-cp310-cp310-win_amd64.whl (10.6 MB, CPython 3.10, Windows x86-64)
  • intel_extension_for_transformers-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB, CPython 3.10, manylinux: glibc 2.17+ x86-64)
  • intel_extension_for_transformers-1.3.1-cp39-cp39-win_amd64.whl (10.6 MB, CPython 3.9, Windows x86-64)
  • intel_extension_for_transformers-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.2 MB, CPython 3.9, manylinux: glibc 2.17+ x86-64)

File details

Hashes for intel-extension-for-transformers-1.3.1.tar.gz:

SHA256: 7bd1eacc11ee09d0f38e571f0fd18b5734aeb4319cb73ffc7bd1b651d3f2a8cf
MD5: 1e66f84f01ef5351a2cd03187405d7c7
BLAKE2b-256: 0f1efa99f297ca10f217ae14013ba4d8f27d08b68780c269941a5a1ce9ae58f4

Hashes for intel_extension_for_transformers-1.3.1-cp311-cp311-win_amd64.whl:

SHA256: 43f4522f7b09b0656717f352ab7db9f47a90a04572ded6981966295c19037c70
MD5: b5ba4ece7aeecdd9376e4d3b6d324770
BLAKE2b-256: c80cd89954632b183f0c102c4f5ca3634e1c1e56b0faf0cffdd897af07f43f22

Hashes for intel_extension_for_transformers-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

SHA256: b86849e4038bb89c2a6f6965939f1f8ee0f3433a904816bf38bcb713ae22321a
MD5: 7751a08d839feea20c454778f1cf3d84
BLAKE2b-256: 0344bd8a455a8b6abbd137bf36188655894aeb81b45ecc012d430c94d1d0dc01

Hashes for intel_extension_for_transformers-1.3.1-cp310-cp310-win_amd64.whl:

SHA256: 993e7262ab606a2fbd032aa36d0603c61aba00381f3bfe90761982d5cb967718
MD5: ca6eac78af2965abba51dc08f1cafd64
BLAKE2b-256: 64d1ec15c19ac5e8121c6e04b6ec86918868f533dc4dcda35dc6c39b54472ad1

Hashes for intel_extension_for_transformers-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

SHA256: e2d80688350071652871fbaa2a49226b91c4eeefda08bb851f4a1ba0dea1b878
MD5: 386650a4cd3336325004f90b23f6252a
BLAKE2b-256: ef8a14f0529db86959b3ec00eac943325685ec6a7347bfcb881e839bba0b8037

Hashes for intel_extension_for_transformers-1.3.1-cp39-cp39-win_amd64.whl:

SHA256: 760e53772dfb003d6200d5a15168c63f66a9ba4390a19da61b8fb93173ff5570
MD5: 443b0b66a058f577c29bb16512c55be6
BLAKE2b-256: 909e9d41b109a053d7a8678c79593a0438f0714868a7b200937e73aee0753802

Hashes for intel_extension_for_transformers-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

SHA256: a64dc1e0b4623bdb2a07f1de1408de64ee5db45b622382d008b15df571121fdb
MD5: db4eb7e18177ea281f9bfe070126ba82
BLAKE2b-256: 6bf084210d4cc9b93b1975dd857785bef8f8b8f4ed940fb4bfaa7d99803c61d1

See more details on using hashes here.
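
To check a downloaded file against the published digests yourself, below is a minimal sketch using Python's standard hashlib module; the local filename is an assumption about where you saved the archive:

import hashlib

# Stream the downloaded archive in 1 MiB chunks and compute its SHA256.
sha256 = hashlib.sha256()
with open("intel-extension-for-transformers-1.3.1.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

# Compare the result against the SHA256 digest published above.
print(sha256.hexdigest())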
