Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference   |   💻Examples   |   📖Documentation

🚀Latest News

  • NeuralChat was showcased in the Intel Innovation'23 keynote and at Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable processors.
  • NeuralChat supports custom chatbot development and deployment on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out the Notebooks and the sample code below (a customization sketch follows it).
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot

# Build a chatbot with the default configuration and ask it a question.
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
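
To build a custom chatbot on a different base model, the NeuralChat documentation describes a PipelineConfig helper. A minimal sketch, assuming PipelineConfig and its model_name_or_path parameter are available in this release:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig

# Assumption: PipelineConfig(model_name_or_path=...) selects the base model,
# as described in the NeuralChat documentation; the model name is illustrative.
config = PipelineConfig(model_name_or_path="meta-llama/Llama-2-7b-chat-hf")
chatbot = build_chatbot(config)
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")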
  • LLM Runtime extends the Hugging Face Transformers API to provide seamless low-precision inference for popular LLMs, supporting mainstream low-precision data types such as INT8/FP8/INT4/FP4/NF4.
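
To target one of the other data types listed above, the same WeightOnlyQuantConfig used in the Getting Started section can take a different weight_dtype. A minimal sketch, assuming "nf4" is an accepted weight_dtype string and "fp32" an accepted compute_dtype in this release:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model_name = "EleutherAI/gpt-j-6B"
# Assumption: "nf4" (4-bit NormalFloat weights) and "fp32" compute are accepted
# values; consult the release documentation for the exact supported strings.
config = WeightOnlyQuantConfig(compute_dtype="fp32", weight_dtype="nf4")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, a little girl", return_tensors="pt").input_ids

model = AutoModel.from_pretrained(model_name, quantization_config=config)
print(tokenizer.batch_decode(model.generate(inputs, max_new_tokens=32))[0])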

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation Page.
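
For example, installing from source is one of the options covered there. A minimal sketch, using the project's public GitHub repository (build steps may vary by release; see the Installation Page for the authoritative instructions):

git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers
pip install -v .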

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is especially effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the key features and examples described in the sections below.

🌱Getting Started

Below is sample code to enable weight-only low-precision inference. See more examples.

INT4 Inference

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model_name = "EleutherAI/gpt-j-6B"
# Quantize weights to INT4 and run the compute in INT8.
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# Weight-only quantization is applied while loading the pretrained model.
model = AutoModel.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
gen_text = tokenizer.batch_decode(gen_tokens)

INT8 Inference

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model_name = "EleutherAI/gpt-j-6B"
# Quantize weights to INT8 and run the compute in BF16.
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# Weight-only quantization is applied while loading the pretrained model.
model = AutoModel.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
gen_text = tokenizer.batch_decode(gen_tokens)

🎯Validated Models

Here is the average accuracy of the validated models on LAMBADA (OpenAI), HellaSwag, Winogrande, PIQA, and WikiText. Next-token latency is measured with 32 input tokens and greedy search on Intel's 4th Gen Xeon Scalable (Sapphire Rapids) processor.

Model                          FP32    INT4 (Group size 32)   INT4 (Group size 128)   Next Token Latency
EleutherAI/gpt-j-6B            0.643   0.644                  0.640                   21.98 ms
meta-llama/Llama-2-7b-hf       0.690   0.690                  0.685                   24.55 ms
decapoda-research/llama-7b-hf  0.689   0.682                  0.680                   24.84 ms
EleutherAI/gpt-neox-20b        0.674   0.672                  0.669                   80.16 ms
mosaicml/mpt-7b-chat           0.672   0.670                  0.666                   35.84 ms
tiiuae/falcon-7b               0.698   0.694                  0.693                   36.10 ms
baichuan-inc/baichuan-7B       0.474   0.471                  0.470                   Coming Soon
facebook/opt-6.7b              0.650   0.647                  0.643                   Coming Soon
databricks/dolly-v2-3b         0.613   0.609                  0.609                   22.02 ms
tiiuae/falcon-40b-instruct     0.756   0.757                  0.755                   Coming Soon
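
As a rough illustration of how a next-token latency figure like those above can be obtained, here is a minimal sketch using 32 input tokens and greedy search with wall-clock timing. The timing loop is an illustrative assumption, not the project's benchmark harness, and the simple average below also includes the slower first token.

import time
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model_name = "EleutherAI/gpt-j-6B"
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, quantization_config=config)

# Repeat a short prompt to reach at least 32 tokens, then truncate to exactly 32.
prompt = "Once upon a time, a little girl. " * 4
inputs = tokenizer(prompt, return_tensors="pt").input_ids[:, :32]

new_tokens = 32
start = time.perf_counter()
model.generate(inputs, max_new_tokens=new_tokens, do_sample=False)  # greedy search
elapsed = time.perf_counter() - start

# Rough average wall-clock time per generated token.
print(f"next-token latency: {elapsed / new_tokens * 1000:.2f} ms")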

Find other models, such as ChatGLM, ChatGLM2, and StarCoder, in LLM Runtime.

📖Documentation

OVERVIEW: Model Compression, NeuralChat, Neural Engine, Kernel Libraries
MODEL COMPRESSION: Quantization, Pruning, Distillation, Orchestration, Neural Architecture Search, Export, Metrics/Objectives, Pipeline
NEURAL ENGINE: Model Compilation, Custom Pattern, Deployment, Profiling
KERNEL LIBRARIES: Sparse GEMM Kernels, Custom INT8 Kernels, Profiling, Benchmark
ALGORITHMS: Length Adaptive, Data Augmentation
TUTORIALS AND RESULTS: Tutorials, Supported Models, Model Performance, Kernel Performance

📃Selected Publications/Events

View Full Publication List.

Additional Content

Acknowledgements

💁Collaborations

We welcome interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intel-extension-for-transformers-1.2.tar.gz (73.3 MB): Source

Built Distributions

intel_extension_for_transformers-1.2-cp310-cp310-win_amd64.whl (21.5 MB): CPython 3.10, Windows x86-64
intel_extension_for_transformers-1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (74.0 MB): CPython 3.10, manylinux (glibc 2.17+), x86-64
intel_extension_for_transformers-1.2-cp39-cp39-win_amd64.whl (21.5 MB): CPython 3.9, Windows x86-64
intel_extension_for_transformers-1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (74.0 MB): CPython 3.9, manylinux (glibc 2.17+), x86-64
intel_extension_for_transformers-1.2-cp38-cp38-win_amd64.whl (21.5 MB): CPython 3.8, Windows x86-64
intel_extension_for_transformers-1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (74.0 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64

File details

Hashes for each distribution are listed below. See more details on using hashes here.

intel-extension-for-transformers-1.2.tar.gz
  SHA256: 5ab3589039733492c65427bab9bf08dc0bd0e5915a81d83a848d4b89ed6ecbb4
  MD5: b2f1da1156846b971ba560cd5496b813
  BLAKE2b-256: 828478eab558cbeba5e5490867e38e267a154592bd435ab8bd967323a7b60f13

intel_extension_for_transformers-1.2-cp310-cp310-win_amd64.whl
  SHA256: 0d74aff0a8d1f90ee61ff808409f541c8b9880b41ad7d1df31075570502ca34a
  MD5: 55596f962aaa2d460e8f99577a2373cb
  BLAKE2b-256: ac55f68a7ea18d74e3a162b5d172d5c92d56b587d9b32c543e8ac8b6091afd6a

intel_extension_for_transformers-1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: e8dd9772665696072b78b829848649bc3728e7d6c95411ea535b6dea5630db0b
  MD5: 61204234a88c518ea6aaa08218139648
  BLAKE2b-256: 1615c46218743d794604670263360c99a82ecca2b34542e8646378c29b4df799

intel_extension_for_transformers-1.2-cp39-cp39-win_amd64.whl
  SHA256: 5f169308921397bf2a614ffc491324ce12db397f7b5d2185d39fefce2f3add53
  MD5: 124bbdc49b15865cae9b5d5d319cd6b4
  BLAKE2b-256: d4c2f340f0b2f22df011c864509e1cd4183c1fb154a2167824cdd34fd39a2ec3

intel_extension_for_transformers-1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: ea595ebebb72e48944e6a5ccad419efdbc6652bbb5f8d281cf719724b1d94c81
  MD5: 69a56194d2b5286b17d335f6eea13475
  BLAKE2b-256: 449e643d532c6c2277eddb765f32d5da5a3622ff16077339577e7704a08e8119

intel_extension_for_transformers-1.2-cp38-cp38-win_amd64.whl
  SHA256: 28d0ff9814d4a0a432ee2cc08b777d7d5d8cf84fb3b664f3b67408ad4ef591ad
  MD5: 26c065d5e836e51b52d81df9559f9a0e
  BLAKE2b-256: c332a85e0608e7cb59d532990771d482d3190686d15a10c92819725fe09811b9

intel_extension_for_transformers-1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256: 3e60f07424b079d01aad818ff1238d547613fbdd9e7298a5ac3aa252a2fd81bb
  MD5: a2e6ad7a72ac1eadeecad4513d98dbac
  BLAKE2b-256: fe80e01bcfd7ce587122223bcfc4e912505d38c39ec7d1372d388b6c7e81046f
