Repository of Intel® Extension for Transformers
Project description
Intel® Extension for Transformers
An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere
🏭Architecture | 💬NeuralChat | 😃Inference | 💻Examples | 📖Documentations
🚀Latest News
- [2023/12] Supported QLoRA on CPUs to make fine-tuning on client CPUs possible. Check out the blog and readme for more details.
- [2023/11] Demonstrated up to 3x LLM inference speedup using Assisted Generation (also called Speculative Decoding) from Hugging Face with Intel optimizations! Check out more details.
- [2023/11] Refreshed top-1 7B-sized LLM by releasing NeuralChat-v3-1. Check out the nice video published by WorldofAI.
- [2023/11] Released NeuralChat-v3, new top-1 7B-sized LLM available on Hugging Face. The model is fine-tuned on Intel Gaudi2 with supervised fine-tuning and direct preference optimization. Check out the blog.
- [2023/11] Published a 4-bit chatbot demo (based on NeuralChat) available on Intel Hugging Face Space. Welcome to have a try! To setup the demo locally, please follow the instructions.
- [2023/11] Released fast, accurate, and infinite LLM inference with improved StreamingLLM on Intel CPUs!
- [2023/11] Our paper Efficient LLM Inference on CPUs has been accepted by NeurIPS'23 on Efficient Natural Language and Speech Processing. Thanks to all the collaborators!
- [2023/10] LLM runtime, an Intel-optimized GGML-compatible runtime, demonstrates up to 15x performance gain in first-token generation and 1.5x in subsequent-token generation over the default llama.cpp.
- [2023/10] LLM runtime now supports LLM inference with infinite-length inputs of up to 4 million tokens, inspired by StreamingLLM.
- [2023/09] NeuralChat has been showcased in Intel Innovation’23 Keynote and Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
- [2023/08] NeuralChat supports custom chatbot development and deployment within minutes on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out Notebooks.
- [2023/07] LLM runtime extends the Hugging Face Transformers API to provide seamless low-precision inference for popular LLMs, supporting low-precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
🏃Installation
Quick Install from PyPI
```shell
pip install intel-extension-for-transformers
```
For more installation methods, please refer to the Installation Page.
🌟Introduction
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NeoX, BLOOM-176B, T5, and Flan-T5, plus end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework for creating your own chatbot within minutes by leveraging a rich set of plugins (Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail). The framework supports Intel Gaudi2/CPU/GPU.
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting GPT-NeoX, LLaMA, MPT, Falcon, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B. The runtime supports the AMX, VNNI, AVX512F, and AVX2 instruction sets. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable Processor, code-named Sapphire Rapids.
🔓Validated Hardware
| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
In the table above, "-" means not applicable or not started yet.
Validated Software
| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu | 2.0.1+cpu | 2.1.0+cpu | 2.1.0+cpu |
| Intel® Extension for PyTorch | 2.1.0+cpu | 2.1.0+cpu | 2.1.0+cpu | 2.1.0+cpu |
| Transformers | 4.35.2 | 4.35.2 | 4.35.2 | 4.35.2 |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
🌱Getting Started
Below is the sample code to create your chatbot. See more examples.
Chatbot
OpenAI-Compatible RESTful APIs
NeuralChat provides OpenAI-compatible RESTful APIs for LLM inference, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. You can start the NeuralChat server either using the Shell command or Python code.
Using Shell Command:
```shell
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```
Using Python Code:
```python
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
```
The NeuralChat service is also accessible through the OpenAI client library, `curl` commands, and the `requests` library. See more in NeuralChat.
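Since the APIs are OpenAI-compatible, any standard HTTP client works. Below is a minimal sketch using the `requests` library; the host, port, and model name are assumptions that must match your server config, and the response is parsed in the standard OpenAI chat-completions shape.

```python
# A minimal sketch of querying the NeuralChat server via its OpenAI-compatible
# REST API. Assumes the server is listening on localhost:8000 and serving
# "Intel/neural-chat-7b-v3-1" -- adjust both to match your neuralchat.yaml.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [
            {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
        ],
    },
)
# OpenAI-style responses carry the answer in choices[0].message.content.
print(resp.json()["choices"][0]["message"]["content"])
```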
NeuralChat Python API
```python
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```
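By default, `build_chatbot` loads a preset model. To pick a specific model you can pass a `PipelineConfig`; the sketch below assumes the `model_name_or_path` field from the NeuralChat docs, so verify it against your installed version.

```python
# Sketch: selecting a model explicitly via PipelineConfig (field names are
# assumptions to verify against the NeuralChat docs for your version).
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
print(chatbot.predict("Tell me about Intel Xeon Scalable Processors."))
```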
Transformers-based extension APIs
ITREX enhances the user experience for compressing models by extending the capabilities of Hugging Face transformers APIs. Below is the sample code to enable weight-only INT4/INT8 inference. See more examples.
INT4 Inference (CPU only)
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
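Beyond the `load_in_4bit` shortcut, ITREX exposes a quantization config object for finer control. The sketch below assumes the `WeightOnlyQuantConfig` API from this release line; the class name and the accepted `weight_dtype` values may differ between versions, so check the Transformers-extension docs before relying on it.

```python
# Sketch: weight-only quantization with an explicit config instead of the
# load_in_4bit shortcut. WeightOnlyQuantConfig and its accepted weight_dtype
# values are assumptions to verify against your installed ITREX version.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

model_name = "Intel/neural-chat-7b-v3-1"
woq_config = WeightOnlyQuantConfig(weight_dtype="int4")  # e.g. INT4/NF4/FP4 variants
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
model.generate(inputs, streamer=TextStreamer(tokenizer), max_new_tokens=300)
```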
INT8 Inference (CPU only)
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
Langchain-based extension APIs
ITREX provides a comprehensive suite of LangChain-based extension APIs, including advanced retrievers, embedding models, and vector stores. These enhancements are carefully crafted to expand the capabilities of the original LangChain API, ultimately boosting overall performance. This extension is specifically tailored to enhance the functionality and performance of RAG (Retrieval-Augmented Generation).
Below is the sample code to enable enhanced Chroma API. See more examples.
```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```
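For readers who want to run the snippet end to end, here is a fuller sketch with the placeholders filled in. The corpus, embedding model, and generation model below are illustrative assumptions, not part of the original example.

```python
# A runnable sketch of the RAG pipeline above. The documents, embedding model
# (HuggingFaceEmbeddings' default), and LLM choice are illustrative assumptions.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

docs = [
    "4th generation Intel Xeon Scalable Processors support the AMX instruction set.",
    "Intel Extension for Transformers accelerates LLM inference on Intel platforms.",
]
vectorstore = Chroma.from_texts(docs, embedding=HuggingFaceEmbeddings())
retriever = VectorStoreRetriever(vectorstore=vectorstore)
llm = HuggingFacePipeline.from_model_id(
    model_id="Intel/neural-chat-7b-v3-1",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 128},
)
qa = RetrievalQA.from_llm(llm=llm, retriever=retriever)
print(qa.run("Which instruction set do 4th generation Xeon processors support?"))
```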
🎯Validated Models
You can access the latest INT4 performance and accuracy results in the INT4 blog.
Additionally, we are preparing to introduce Baichuan, Mistral, and other models into Neural Speed (Intel-optimized llama.cpp). For comprehensive accuracy and performance data, though not the most up-to-date, please refer to the Release data.
📖Documentation
| Section | Documents |
|---|---|
| OVERVIEW | NeuralChat · Neural Speed |
| NEURALCHAT | Chatbot on Intel CPU · Chatbot on Intel GPU · Chatbot on Gaudi · Chatbot on Client · More Notebooks |
| NEURAL SPEED | Neural Speed · Streaming LLM · Low Precision Kernels · Tensor Parallelism |
| LLM COMPRESSION | SmoothQuant (INT8) · Weight-only Quantization (INT4/FP4/NF4/INT8) · QLoRA on CPU |
| GENERAL COMPRESSION | Quantization · Pruning · Distillation · Orchestration · Neural Architecture Search · Export · Metrics · Objectives · Pipeline · Length Adaptive · Early Exit · Data Augmentation |
| TUTORIALS & RESULTS | Tutorials · LLM List · General Model List · Model Performance |
🙌Demo
- LLM Infinite Inference (up to 4M tokens)
- LLM QLoRA on Client CPU
📃Selected Publications/Events
- Video on YouTube: CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo (Jan 2024)
- Blog published on Medium: Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling (Dec 2023)
- Blog published on 360 EEA (A News Platform about AI and LLMs): Intel neural-chat-7b-v3-1 (Dec 2023)
- Apple Podcasts from Papers Read on AI: Efficient LLM Inference on CPUs (Dec 2023)
- NeurIPS'2023 on Efficient Natural Language and Speech Processing: Efficient LLM Inference on CPUs (Nov 2023)
- NeurIPS'2023 on Diffusion Models: Effective Quantization for Diffusion Models on CPUs (Nov 2023)
- Blog published on datalearner: Analysis of the top ten popular open source LLM of HuggingFace in the fourth week of November 2023 - the explosion of multi-modal large models and small-scale models (Nov 2023)
- Blog published on zaker: With this toolkit, the inference performance of large models can be accelerated by 40 times (Nov 2023)
- Blog published on geeky-gadgets: [New Intel Neural-Chat 7B LLM tops Hugging Face leaderboard beating original Mistral 7B](https://www.geeky-gadgets.com/intel-neural-chat-7b-llm/) (Nov 2023)
- Blog published on Huggingface: Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance (Nov 2023)
- Video on YouTube: Neural Chat 7B v3-1 Installation on Windows - Step by Step (Nov 2023)
- Video on YouTube: Intel's Neural-Chat 7b: Most Powerful 7B Model! Beats GPT-4!? (Nov 2023)
- Blog published on marktechpost: Intel Researchers Propose a New Artificial Intelligence Approach to Deploy LLMs on CPUs More Efficiently (Nov 2023)
- Blog published on VMware: AI without GPUs: A Technical Brief for VMware Private AI with Intel (Nov 2023)
- News releases on VMware: VMware Collaborates with Intel to Unlock Private AI Everywhere (Nov 2023)
- Video on YouTube: Build Your Own ChatBot with Neural Chat | Intel Software (Oct 2023)
- Blog published on Medium: Layer-wise Low-bit Weight Only Quantization on a Laptop (Oct 2023)
- Blog published on Medium: Intel-Optimized Llama.CPP in Intel Extension for Transformers (Oct 2023)
- Blog published on Medium: Reduce the Carbon Footprint of Large Language Models (Oct 2023)
- Blog on GOVINDH Tech: Neural Chat vs. Competitors: A Detailed Guide (Sep 2023)
View Full Publication List.
Additional Content
Acknowledgements
- Excellent open-source projects: bitsandbytes, FastChat, fastRAG, ggml, gptq, llama.cpp, lm-evaluation-harness, peft, trl, streamingllm and many others.
- Thanks to all the contributors.
💁Collaborations
Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file intel-extension-for-transformers-1.3.1.tar.gz.
File metadata
- Download URL: intel-extension-for-transformers-1.3.1.tar.gz
- Upload date:
- Size: 96.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7bd1eacc11ee09d0f38e571f0fd18b5734aeb4319cb73ffc7bd1b651d3f2a8cf |
| MD5 | 1e66f84f01ef5351a2cd03187405d7c7 |
| BLAKE2b-256 | 0f1efa99f297ca10f217ae14013ba4d8f27d08b68780c269941a5a1ce9ae58f4 |
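To confirm a download is intact, you can recompute the digest locally and compare it against the table above; a minimal check in Python (assuming the tarball sits in the current directory):

```python
# Recompute the SHA256 of the downloaded sdist and compare it with the
# digest published above.
import hashlib

expected = "7bd1eacc11ee09d0f38e571f0fd18b5734aeb4319cb73ffc7bd1b651d3f2a8cf"
with open("intel-extension-for-transformers-1.3.1.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "hash mismatch")
```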
File details
Details for the file intel_extension_for_transformers-1.3.1-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.3.1-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 10.6 MB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43f4522f7b09b0656717f352ab7db9f47a90a04572ded6981966295c19037c70 |
| MD5 | b5ba4ece7aeecdd9376e4d3b6d324770 |
| BLAKE2b-256 | c80cd89954632b183f0c102c4f5ca3634e1c1e56b0faf0cffdd897af07f43f22 |
File details
Details for the file intel_extension_for_transformers-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 44.2 MB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b86849e4038bb89c2a6f6965939f1f8ee0f3433a904816bf38bcb713ae22321a |
| MD5 | 7751a08d839feea20c454778f1cf3d84 |
| BLAKE2b-256 | 0344bd8a455a8b6abbd137bf36188655894aeb81b45ecc012d430c94d1d0dc01 |
File details
Details for the file intel_extension_for_transformers-1.3.1-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.3.1-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 10.6 MB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 993e7262ab606a2fbd032aa36d0603c61aba00381f3bfe90761982d5cb967718 |
| MD5 | ca6eac78af2965abba51dc08f1cafd64 |
| BLAKE2b-256 | 64d1ec15c19ac5e8121c6e04b6ec86918868f533dc4dcda35dc6c39b54472ad1 |
File details
Details for the file intel_extension_for_transformers-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 44.2 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e2d80688350071652871fbaa2a49226b91c4eeefda08bb851f4a1ba0dea1b878 |
| MD5 | 386650a4cd3336325004f90b23f6252a |
| BLAKE2b-256 | ef8a14f0529db86959b3ec00eac943325685ec6a7347bfcb881e839bba0b8037 |
File details
Details for the file intel_extension_for_transformers-1.3.1-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.3.1-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 10.6 MB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 760e53772dfb003d6200d5a15168c63f66a9ba4390a19da61b8fb93173ff5570 |
| MD5 | 443b0b66a058f577c29bb16512c55be6 |
| BLAKE2b-256 | 909e9d41b109a053d7a8678c79593a0438f0714868a7b200937e73aee0753802 |
File details
Details for the file intel_extension_for_transformers-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 44.2 MB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a64dc1e0b4623bdb2a07f1de1408de64ee5db45b622382d008b15df571121fdb |
| MD5 | db4eb7e18177ea281f9bfe070126ba82 |
| BLAKE2b-256 | 6bf084210d4cc9b93b1975dd857785bef8f8b8f4ed940fb4bfaa7d99803c61d1 |