Repository of Intel® Extension for Transformers
Intel® Extension for Transformers
An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere
🏭Architecture | 💬NeuralChat | 😃Inference | 💻Examples | 📖Documentations
🚀Latest News
- [2023/10] LLM runtime, an Intel-optimized, GGML-compatible runtime, demonstrates up to 15x performance gain in first-token generation and 1.5x in subsequent-token generation over the default llama.cpp.
- [2023/10] LLM runtime now supports LLM inference with infinite-length inputs (up to 4 million tokens), inspired by StreamingLLM.
- [2023/09] NeuralChat was showcased at the Intel Innovation'23 Keynote and Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
- [2023/08] NeuralChat supports custom chatbot development and deployment within minutes on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out Notebooks.
- [2023/07] LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
🏃Installation
Quick Install from PyPI
pip install intel-extension-for-transformers
For more installation methods, please refer to the Installation Page.
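After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of the toolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("intel-extension-for-transformers")
    print(found if found else "intel-extension-for-transformers is not installed")
```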
🌟Introduction
Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and it is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:
- A seamless user experience for model compression on Transformer-based models, achieved by extending the Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast Distilbert on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, and Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework for creating your own chatbot within minutes by leveraging a rich set of plugins such as Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels, supporting GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B
🌱Getting Started
Below is the sample code to enable the chatbot. See more examples.
Chatbot
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
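To send several prompts in a row, the `predict` call shown above can be wrapped in a small helper. The following is a sketch only: `ask_all` and `EchoBot` are hypothetical names introduced for illustration, and it assumes `predict` accepts a single string and returns the reply as text. `EchoBot` stands in for a real chatbot so the snippet runs without downloading a model.

```python
from typing import Iterable, List, Protocol


class SupportsPredict(Protocol):
    """Anything exposing predict(text) -> text, such as the object from build_chatbot()."""

    def predict(self, query: str) -> str: ...


def ask_all(chatbot: SupportsPredict, prompts: Iterable[str]) -> List[str]:
    """Send each prompt to chatbot.predict and collect the replies in order."""
    return [chatbot.predict(prompt) for prompt in prompts]


class EchoBot:
    """Stand-in used only for demonstration; it is not part of NeuralChat."""

    def predict(self, query: str) -> str:
        return f"echo: {query}"


if __name__ == "__main__":
    # With the toolkit installed, EchoBot() could be replaced by build_chatbot().
    for reply in ask_all(EchoBot(), ["Hello", "What is INT4?"]):
        print(reply)
```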
Below is the sample code to enable weight-only INT4/INT8 inference. See more examples.
INT4 Inference
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
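For readers new to weight-only quantization, the idea behind `weight_dtype="int4"` can be illustrated with a toy example: weights are stored as small integers plus a scale, and converted back to floating point for computation. The sketch below uses symmetric per-group quantization in pure Python. It is a conceptual illustration under that assumption, not the toolkit's kernels; the function names are hypothetical, and the library's actual grouping, rounding, and storage layout may differ.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric quantization of one weight group to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale per group maps the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    group = [0.12, -0.70, 0.33, 0.05]
    codes, scale = quantize_group(group)
    print("codes:", codes, "scale:", scale)
    print("restored:", dequantize_group(codes, scale))
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for the smaller storage.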
INT8 Inference
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
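A back-of-the-envelope calculation shows why the weight data type matters for a model of this size. The sketch below assumes a round 7 billion parameters for a "7B" model and counts weight storage only; quantized checkpoints also carry scales and may keep some layers in higher precision, so real sizes will be somewhat larger. The helper name `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights alone, ignoring scales and other metadata."""
    return num_params * bits_per_weight // 8


if __name__ == "__main__":
    params = 7_000_000_000  # assumed round parameter count for a "7B" model
    for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {weight_bytes(params, bits) / 1e9:.1f} GB")
```

Under these assumptions the weights take about 28.0 GB in fp32, 14.0 GB in bf16, 7.0 GB in int8, and 3.5 GB in int4.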
🎯Validated Models
The latest INT4 performance and accuracy data are available in the INT4 blog.
Additionally, we are preparing to introduce Baichuan, Mistral, and other models into LLM Runtime (Intel-optimized llama.cpp). For comprehensive accuracy and performance data, though not the most up-to-date, please refer to the Release data.
📖Documentation
Category | Documents |
---|---|
Overview | NeuralChat, LLM Runtime |
NeuralChat | Chatbot on Intel CPU, Chatbot on Intel GPU, Chatbot on Gaudi, Chatbot on Client, More Notebooks |
LLM Runtime | LLM Runtime, Streaming LLM, Low Precision Kernels, Tensor Parallelism |
LLM Compression | SmoothQuant (INT8), Weight-only Quantization (INT4/FP4/NF4/INT8), QLoRA on CPU |
General Compression | Quantization, Pruning, Distillation, Orchestration, Neural Architecture Search, Export, Metrics, Objectives, Pipeline, Length Adaptive, Early Exit, Data Augmentation |
Tutorials & Results | Tutorials, LLM List, General Model List, Model Performance |
🙌Demo
- Infinite inference (up to 4M tokens)
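The general idea usually associated with StreamingLLM is to bound the key/value cache by keeping a few initial tokens (often called attention sinks) together with a sliding window of the most recent tokens, and evicting everything in between. The sketch below illustrates only that retention policy. The function name and the default sizes are assumptions chosen for illustration, and it does not describe how LLM Runtime implements the feature.

```python
from typing import List


def positions_to_keep(cache_len: int, num_sink: int = 4, window: int = 8) -> List[int]:
    """Indices of cached tokens retained under a sink-plus-recent-window policy."""
    if cache_len <= num_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(cache_len))
    sinks = list(range(num_sink))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent


if __name__ == "__main__":
    # With 20 cached tokens, 2 sinks, and a window of 3, five positions remain.
    print(positions_to_keep(20, num_sink=2, window=3))
```

Because the retained set never exceeds `num_sink + window` entries, memory use stays constant regardless of how long the input stream grows.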
📃Selected Publications/Events
- Blog published on Medium: NeuralChat: Simplifying Supervised Instruction Fine-tuning and Reinforcement Aligning for Chatbots (Sep 2023)
- Intel Innovation'23 Keynote: Intel Innovation 2023 Keynote by Greg Lavender (Sep 2023)
- Blog on Intel Community: NeuralChat: A Customizable Chatbot Framework (Sep 2023)
- Blog published on Medium: NeuralChat: A Customizable Chatbot Framework (Sep 2023)
- Blog published on Medium: Faster Stable Diffusion Inference with Intel Extension for Transformers (July 2023)
- Blog of Intel Developer News: The Moat Is Trust, Or Maybe Just Responsible AI (July 2023)
- Blog of Intel Developer News: Create Your Own Custom Chatbot (July 2023)
- Blog of Intel Developer News: Accelerate Llama 2 with Intel AI Hardware and Software Optimizations (July 2023)
- arXiv: An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (June 2023)
- Blog published on Medium: Simplify Your Custom Chatbot Deployment (June 2023)
View Full Publication List.
Additional Content
Acknowledgements
- Excellent open-source projects: bitsandbytes, FastChat, fastRAG, ggml, gptq, llama.cpp, lm-evaluation-harness, peft, trl, streamingllm, and many others.
- Thanks to all the contributors, including Ikko Eltociear Ashimine, Hardik Kamboj, Sangjune Park, Kevin Ta, Huiyan Cao, Xigui Wang, Jiafu Zhang, Tyler Titsworth, Yi Wang, Samanway Sadhu, Jiqing Feng, Jonathan Mamou, and Niroop Ammbashankar.
💁Collaborations
We welcome any interesting ideas on model compression techniques and LLM-based chatbot development. Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!
Download files
File details
Details for the file intel-extension-for-transformers-1.2.1.tar.gz.
File metadata
- Download URL: intel-extension-for-transformers-1.2.1.tar.gz
- Upload date:
- Size: 88.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | b86b4dbd91f419cc5186929b2083822508340b1d058407dcf3568f72adc44aec |
MD5 | e80319d1d4f51e9c2660f29e1a2e80a6 |
BLAKE2b-256 | 32a9243bf5b9ff825ec566ee153931135b5bcb3179f87ea4cd055500ee40f497 |
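A downloaded file can be checked against a published digest with the standard library. The following is a minimal sketch; `sha256_of` and `matches` are hypothetical helper names, and the file path passed in is whatever location the archive was saved to locally.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring letter case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, calling `matches` with the local path of the source archive and the SHA256 value listed above returns `True` only if the download is intact.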
File details
Details for the file intel_extension_for_transformers-1.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 81.4 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 933ba6db3ee056eacee6c48443e280e0a1824615dce104ad12ac65ea79f18cda |
MD5 | fd9bf925829a40b5f46f31cfd644e100 |
BLAKE2b-256 | a960092af8234a5fdd535dd5a260d6ec0766b31d6ef0b053ce69935c8eb223ee |
File details
Details for the file intel_extension_for_transformers-1.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 81.4 MB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 1385396526d07665bb750086804a4aea7aaceb3667b3b4032976e92674f56a35 |
MD5 | eca2fd65f5d462cfe4571042a8f600de |
BLAKE2b-256 | 801df78cd72f79da52f9896a01daeed9a6c1577739e91efba61a5b39b63ff9e5 |
File details
Details for the file intel_extension_for_transformers-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: intel_extension_for_transformers-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 81.4 MB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | f3d9696c36ec6aef00f92df0edb9dc51b3fed59dbd1260d2d7fa3302821c0e76 |
MD5 | d152f3e7e9109e63c5c212d72894f5a8 |
BLAKE2b-256 | 654aed36b2f169ebb6df3d2f57b9fb794e93c98f3b32a981d0d97ec2d76abb51 |