
Repository of Intel® Extension for Transformers

Project description

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference   |   💻Examples   |   📖Documentation

🚀Latest News

  • [2023/10] LLM runtime, an Intel-optimized, GGML-compatible runtime, demonstrates up to a 15x performance gain in first-token generation and a 1.5x gain in subsequent-token generation over default llama.cpp.
  • [2023/10] LLM runtime now supports LLM inference with infinite-length inputs of up to 4 million tokens, inspired by StreamingLLM (see the sketch after this list).
  • [2023/09] NeuralChat was showcased in the Intel Innovation'23 Keynote and at Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
  • [2023/08] NeuralChat supports custom chatbot development and deployment within minutes on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out the Notebooks.
  • [2023/07] LLM runtime extends the Hugging Face Transformers API to provide seamless low-precision inference for popular LLMs, supporting low-precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
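
As a hedged illustration of the StreamingLLM-style infinite-length inference mentioned above, the sketch below reuses the LLM runtime API shown later on this page. The KV-cache parameters ctx_size, n_keep, and n_discard are assumptions modeled on llama.cpp-style cache management, not confirmed API; check the Streaming LLM documentation before relying on them.

# Sketch only: StreamingLLM-style generation beyond the context window.
# ctx_size / n_keep / n_discard are ASSUMED parameter names.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
# Keep the first n_keep "attention sink" tokens; once the ctx_size-token KV
# cache fills, evict older tokens (n_discard=-1: drop half the cache at a time).
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=2048,
                         ctx_size=512, n_keep=4, n_discard=-1)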

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation Page.
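
A quick way to confirm the install succeeded is to import the package and print its version. This assumes the package exposes a __version__ attribute, which is a common convention but not confirmed by this page.

# Sanity check: import the freshly installed package.
# __version__ is an assumed attribute (common convention).
import intel_extension_for_transformers
print(intel_extension_for_transformers.__version__)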

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms; it is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the key features and examples described in the sections below.

🌱Getting Started

Below is sample code to build and query a chatbot. See more examples.

Chatbot

# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot

# Build a chatbot with the default configuration, then run one query
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
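
To run the chatbot with a specific model instead of the default, build_chatbot can be given a configuration object. The sketch below assumes a PipelineConfig class with a model_name_or_path parameter, following NeuralChat's documented pattern; treat both names as assumptions and verify against the NeuralChat docs linked above.

# Sketch: pointing the chatbot at an explicit model.
# PipelineConfig / model_name_or_path are assumptions; verify in NeuralChat docs.
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig

config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v1-1")
chatbot = build_chatbot(config)
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)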

Below is sample code for weight-only INT4 and INT8 inference. See more examples.

INT4 Inference

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model path
# Store weights in INT4; run computation in INT8
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # print tokens to stdout as they are generated

# Quantize the model on load, then generate with streaming output
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
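
In this weight-only scheme, only the weights are quantized (to INT4, roughly a 4x reduction in weight memory versus FP16) while activations stay in higher precision and matrix multiplications execute through INT8 kernels; this is generally why weight-only quantization preserves accuracy better than quantizing weights and activations together.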

INT8 Inference

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model path
# Store weights in INT8; run computation in BF16
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # print tokens to stdout as they are generated

# Quantize the model on load, then generate with streaming output
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
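
The INT8 variant differs from the INT4 example only in its quantization config: weights are stored in INT8 and computation runs in BF16, trading part of the memory savings for accuracy that is generally closer to the full-precision baseline.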

🎯Validated Models

You can find the latest INT4 performance and accuracy data in the int4 blog.

Additionally, we are preparing to introduce Baichuan, Mistral, and other models into LLM Runtime (Intel-optimized llama.cpp). For comprehensive accuracy and performance data, though not the most up-to-date, please refer to the Release data.

📖Documentation

OVERVIEW: NeuralChat | LLM Runtime
NEURALCHAT: Chatbot on Intel CPU | Chatbot on Intel GPU | Chatbot on Gaudi | Chatbot on Client | More Notebooks
LLM RUNTIME: LLM Runtime | Streaming LLM | Low Precision Kernels | Tensor Parallelism
LLM COMPRESSION: SmoothQuant (INT8) | Weight-only Quantization (INT4/FP4/NF4/INT8) | QLoRA on CPU
GENERAL COMPRESSION: Quantization | Pruning | Distillation | Orchestration | Neural Architecture Search | Export | Metrics | Objectives | Pipeline | Length Adaptive | Early Exit | Data Augmentation
TUTORIALS & RESULTS: Tutorials | LLM List | General Model List | Model Performance
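
For the SmoothQuant (INT8) path listed under LLM COMPRESSION, the sketch below assumes a SmoothQuantConfig that plugs into the same quantization_config hook used by the weight-only examples above; the class name, its defaults, and any calibration requirements are assumptions to verify against the linked documentation.

# Hedged sketch: SmoothQuant INT8 via the quantization_config hook.
# SmoothQuantConfig and its defaults are ASSUMPTIONS; real use may require
# a calibration dataset. Consult the SmoothQuant docs linked above.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, SmoothQuantConfig

sq_config = SmoothQuantConfig()
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v1-1",
                                             quantization_config=sq_config)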

🙌Demo

  • Infinite inference (up to 4M tokens): https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

📃Selected Publications/Events

View Full Publication List.

Additional Content

Acknowledgements

💁Collaborations

We welcome ideas on model compression techniques and LLM-based chatbot development. Feel free to reach out; we look forward to collaborating with you on Intel Extension for Transformers!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Built Distributions

intel_extension_for_transformers-1.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (81.4 MB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

intel_extension_for_transformers-1.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (81.4 MB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

intel_extension_for_transformers-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (81.4 MB)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

File details

Details for the file intel-extension-for-transformers-1.2.1.tar.gz.

File metadata

File hashes

Hashes for intel-extension-for-transformers-1.2.1.tar.gz
Algorithm Hash digest
SHA256 b86b4dbd91f419cc5186929b2083822508340b1d058407dcf3568f72adc44aec
MD5 e80319d1d4f51e9c2660f29e1a2e80a6
BLAKE2b-256 32a9243bf5b9ff825ec566ee153931135b5bcb3179f87ea4cd055500ee40f497

See more details on using hashes here.
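
To check a downloaded file against the SHA256 digest published above, a few lines of standard-library Python are enough; the path assumes the sdist was downloaded to the current directory.

import hashlib

# Compare the local sdist against the published SHA256 digest.
path = "intel-extension-for-transformers-1.2.1.tar.gz"  # assumes current directory
expected = "b86b4dbd91f419cc5186929b2083822508340b1d058407dcf3568f72adc44aec"
with open(path, "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "MISMATCH")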

File details

Details for the file intel_extension_for_transformers-1.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 933ba6db3ee056eacee6c48443e280e0a1824615dce104ad12ac65ea79f18cda
MD5 fd9bf925829a40b5f46f31cfd644e100
BLAKE2b-256 a960092af8234a5fdd535dd5a260d6ec0766b31d6ef0b053ce69935c8eb223ee

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1385396526d07665bb750086804a4aea7aaceb3667b3b4032976e92674f56a35
MD5 eca2fd65f5d462cfe4571042a8f600de
BLAKE2b-256 801df78cd72f79da52f9896a01daeed9a6c1577739e91efba61a5b39b63ff9e5

See more details on using hashes here.

File details

Details for the file intel_extension_for_transformers-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for intel_extension_for_transformers-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f3d9696c36ec6aef00f92df0edb9dc51b3fed59dbd1260d2d7fa3302821c0e76
MD5 d152f3e7e9109e63c5c212d72894f5a8
BLAKE2b-256 654aed36b2f169ebb6df3d2f57b9fb794e93c98f3b32a981d0d97ec2d76abb51

See more details on using hashes here.
