Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference   |   💻Examples   |   📖Documentations

🚀Latest News

  • [2023/12] Supported QLoRA on CPUs to make fine-tuning on client CPU possible. Check out the blog and readme for more details.
  • [2023/11] Demonstrated up to 3x LLM inference speedup using Assisted Generation (also called Speculative Decoding) from Hugging Face with Intel optimizations! Check out more details.
  • [2023/11] Refreshed top-1 7B-sized LLM by releasing NeuralChat-v3-1. Check out the nice video published by WorldofAI.
  • [2023/11] Released NeuralChat-v3, new top-1 7B-sized LLM available on Hugging Face. The model is fine-tuned on Intel Gaudi2 with supervised fine-tuning and direct preference optimization. Check out the blog.
  • [2023/11] Published a 4-bit chatbot demo (based on NeuralChat) available on Intel Hugging Face Space. Give it a try! To set up the demo locally, please follow the instructions.
  • [2023/11] Released fast, accurate, and infinite LLM inference with improved StreamingLLM on Intel CPUs!
  • [2023/11] Our paper Efficient LLM Inference on CPUs has been accepted by the NeurIPS'23 Workshop on Efficient Natural Language and Speech Processing. Thanks to all the collaborators!
  • [2023/10] LLM runtime, an Intel-optimized GGML-compatible runtime, demonstrates up to a 15x performance gain in first-token generation and 1.5x in subsequent-token generation over the default llama.cpp.
  • [2023/10] LLM runtime now supports LLM inference with infinite-length inputs of up to 4 million tokens, inspired by StreamingLLM.
  • [2023/09] NeuralChat has been showcased in Intel Innovation’23 Keynote and Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
  • [2023/08] NeuralChat supports custom chatbot development and deployment within minutes on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out the Notebooks.
  • [2023/07] LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
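
The StreamingLLM support mentioned above keeps a handful of initial "attention sink" tokens plus a sliding window of recent tokens in the KV cache, which is what bounds memory for effectively infinite inputs. A minimal sketch of that eviction policy, using hypothetical names (this is not the LLM runtime's actual API):

```python
def evict_kv_cache(cache, n_sink=4, window=2048):
    """StreamingLLM-style eviction: keep the first n_sink cache entries
    (the "attention sinks") plus the most recent `window` entries, so
    the cache stays bounded no matter how long generation runs."""
    if len(cache) <= n_sink + window:
        return list(cache)
    return cache[:n_sink] + cache[-window:]

# Toy cache of 10 token positions, keeping 2 sinks and a window of 4:
kept = evict_kv_cache(list(range(10)), n_sink=2, window=4)
print(kept)  # [0, 1, 6, 7, 8, 9]
```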

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation page.

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM workloads everywhere with optimal performance for Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPUs, and Intel GPUs. The toolkit's key features and examples are described in the sections below.

🔓Validated Hardware

Hardware                         Fine-Tuning          Inference
                                 Full    PEFT         8-bit            4-bit
Intel Gaudi2                     ✔       ✔            WIP (FP8)        -
Intel Xeon Scalable Processors   ✔       ✔            ✔ (INT8, FP8)    ✔ (INT4, FP4, NF4)
Intel Xeon CPU Max Series        ✔       ✔            ✔ (INT8, FP8)    ✔ (INT4, FP4, NF4)
Intel Core Processors            -       ✔            ✔ (INT8, FP8)    ✔ (INT4, FP4, NF4)

In the table above, "-" means not applicable or not started yet.

Validated Software

Software                       Fine-Tuning                       Inference
                               Full            PEFT              8-bit           4-bit
PyTorch                        2.0.1+cpu       2.0.1+cpu         2.1.0+cpu       2.1.0+cpu
Intel® Extension for PyTorch   2.1.0+cpu       2.1.0+cpu         2.1.0+cpu       2.1.0+cpu
Transformers                   4.35.2          4.35.2            4.35.2          4.35.2
Synapse AI                     1.13.0          1.13.0            1.13.0          1.13.0
Gaudi2 driver                  1.13.0-ee32e42  1.13.0-ee32e42    1.13.0-ee32e42  1.13.0-ee32e42

Please refer to the detailed requirements for CPU and Gaudi2.

🌱Getting Started

Below is the sample code to create your chatbot. See more examples.

Chatbot

# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")

Below is the sample code to enable weight-only INT4/INT8 inference. See more examples.

INT4 Inference (CPU only)

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

INT8 Inference (CPU only)

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
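
For intuition about the weight-only low-bit loading shown above, here is a minimal, self-contained sketch of symmetric per-tensor quantization (illustrative only; the library's actual kernels use finer-grained scales and formats such as NF4):

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q8, s8 = quantize_symmetric(weights, 8)          # ~4x smaller than FP32
q4, s4 = quantize_symmetric(weights, 4)          # ~8x smaller than FP32
err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))
# INT4 halves the weight storage of INT8 again, at the cost of a
# larger round-trip error (err4 > err8).
```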

🎯Validated Models

You can find the latest INT4 performance and accuracy data in the INT4 blog.

Additionally, we are preparing to introduce Baichuan, Mistral, and other models into LLM Runtime (Intel-optimized llama.cpp). For comprehensive accuracy and performance data, though not the most up to date, please refer to the Release data.

📖Documentation

OVERVIEW
NeuralChat   |   LLM Runtime
NEURALCHAT
Chatbot on Intel CPU   |   Chatbot on Intel GPU   |   Chatbot on Gaudi   |   Chatbot on Client   |   More Notebooks
LLM RUNTIME
LLM Runtime   |   Streaming LLM   |   Low Precision Kernels   |   Tensor Parallelism
LLM COMPRESSION
SmoothQuant (INT8)   |   Weight-only Quantization (INT4/FP4/NF4/INT8)   |   QLoRA on CPU
GENERAL COMPRESSION
Quantization   |   Pruning   |   Distillation   |   Orchestration   |   Neural Architecture Search   |   Export   |   Metrics   |   Objectives   |   Pipeline   |   Length Adaptive   |   Early Exit   |   Data Augmentation
TUTORIALS & RESULTS
Tutorials   |   LLM List   |   General Model List   |   Model Performance

🙌Demo

  • LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

  • LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

View Full Publication List.

Additional Content

Acknowledgements

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intel-extension-for-transformers-1.3.tar.gz (100.0 MB)

Uploaded Source

Built Distributions

intel_extension_for_transformers-1.3-cp311-cp311-win_amd64.whl (17.6 MB)

Uploaded CPython 3.11 Windows x86-64

intel_extension_for_transformers-1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (59.9 MB)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

intel_extension_for_transformers-1.3-cp310-cp310-win_amd64.whl (17.6 MB)

Uploaded CPython 3.10 Windows x86-64

intel_extension_for_transformers-1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (59.9 MB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

intel_extension_for_transformers-1.3-cp39-cp39-win_amd64.whl (17.6 MB)

Uploaded CPython 3.9 Windows x86-64

intel_extension_for_transformers-1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (59.9 MB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

File details

Details for the file intel-extension-for-transformers-1.3.tar.gz.

Hashes for intel-extension-for-transformers-1.3.tar.gz
Algorithm Hash digest
SHA256 7d4126ebf5da2deda9ae0bc60476e52ab1bf37629c346a683476d2ed7ba9b71d
MD5 7704b1f62fd124d314502cce510a6e21
BLAKE2b-256 59b797f720d8e88933e1d2f4f6fc58d395657278687c6e000ac23ae175676704

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.3-cp311-cp311-win_amd64.whl.

Hashes for intel_extension_for_transformers-1.3-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 87056cb0edc07a3dee45253ac6cbe16aab4cb68a29ee2110098be10687df054c
MD5 20e0c192f7206d9b92a5581c94508921
BLAKE2b-256 41c09bce09759433d3988707c6c448ea350bd6fa1f010a7d36eea9aa63e9d8c6

File details

Details for the file intel_extension_for_transformers-1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

Hashes for intel_extension_for_transformers-1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2abfc7406b00a8c030cf00eea6d5d4c3364ecb9926dc6171af3a1447a1b67b0e
MD5 e753331c4cb1d423c1a0df458913e6d0
BLAKE2b-256 a535d331f2afa03d7878def2b0491f5aa1bdda238719f538c00c682c23f2a6ec

File details

Details for the file intel_extension_for_transformers-1.3-cp310-cp310-win_amd64.whl.

Hashes for intel_extension_for_transformers-1.3-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 bf7271eceac6a167860f680993b66ded7c87995c5534e8e0b397def9265c454d
MD5 cfb0984ac2f5b960d4e3838bfe7cfc6c
BLAKE2b-256 b37dd5c99ffd780b5c6cd75b789e0f1c0afb012fc8aa163ca501e882b0d8c7da

File details

Details for the file intel_extension_for_transformers-1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

Hashes for intel_extension_for_transformers-1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6cd404b1f62100e033840347bcc0fb05e4243c0a207bddf2906a41e0afeb8314
MD5 5fe260bbe29cc420d17afe44663647e7
BLAKE2b-256 515395f8e1ec6556b1ee7ded74e87d4510db03ba10582ae41000152c331b7667

File details

Details for the file intel_extension_for_transformers-1.3-cp39-cp39-win_amd64.whl.

Hashes for intel_extension_for_transformers-1.3-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 cc434d3787d6a127fc95fc381b01672e1afa0db1a954583195af62b0d467abe7
MD5 eef59cd61a24914ba18e02cb385675c9
BLAKE2b-256 18d334efac35069cd600d5b0365ad52407e451e26f34f514c3c1796dd3bf8555

File details

Details for the file intel_extension_for_transformers-1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

Hashes for intel_extension_for_transformers-1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d95c2d224a33514bbf7d01e4090adedd8a9a2e98aac1bacd0d1dfab55590b406
MD5 ca0098937e3cb3e98e7e6016dbc8a137
BLAKE2b-256 cc9cc3082f030e66655249fd693f5c43d26d857abf7a9f1d7fcd5f4b5af75301
