
Intel® Extension for Transformers: Accelerating Transformer-based Models on Intel Platforms

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms. The toolkit helps developers improve productivity through easy-to-use model compression APIs that extend the Hugging Face transformers APIs. The compression infrastructure leverages Intel® Neural Compressor, which provides a rich set of model compression techniques: quantization, pruning, distillation, and so on. The toolkit also provides Transformers-accelerated Libraries and a Neural Engine to demonstrate the performance of extremely compressed models, significantly improving inference efficiency on Intel platforms. Some of the key features have been published in NeurIPS 2021 and 2022.

What does Intel® Extension for Transformers offer?

This toolkit helps developers improve the productivity of inference deployment by extending the Hugging Face transformers APIs for Transformer-based models in the natural language processing (NLP) domain. With extremely compressed models, the toolkit can greatly improve inference efficiency on Intel platforms.

  • Model Compression

    | Framework  | Quantization | Pruning/Sparsity | Distillation | Neural Architecture Search |
    |------------|--------------|------------------|--------------|----------------------------|
    | PyTorch    | ✔            | ✔                | ✔            | ✔                          |
    | TensorFlow | ✔            | ✔                | ✔            | Stay tuned ⭐               |
  • Data Augmentation for NLP Datasets

  • Transformers-accelerated Neural Engine

  • Transformers-accelerated Libraries

  • Domain Algorithms

    |         | Length Adaptive Transformer |
    |---------|-----------------------------|
    | PyTorch | ✔                           |
  • Architecture of Intel® Extension for Transformers


Documentation

OVERVIEW
  • Model Compression · Transformers-accelerated Neural Engine · Transformers-accelerated Libraries · Examples

BASIC API
  • Export · Metric · Pipeline · Objective · Data Augmentation
  • Compile (ONNX/TensorFlow) · Deploy and Integration · Add Customized Pattern

DEEP DIVE
  • Quantization · Pruning · Distillation · Orchestration
  • Transformers-accelerated Neural Engine · Kernels (AMX/AVX/VNNI)

ADVANCED ALGORITHM
  • Length Adaptive · NAS (Auto Distillation) · SetFit

PROFILING AND BENCHMARK
  • Model Compression · Transformers-accelerated Neural Engine · Transformers-accelerated Libraries · Profiling/Benchmark

VALIDATED MODELS AND DATA
  • Supported Models · Sparse Aware Inference · Sparse Kernel Data

TUTORIALS

Installation

Release Binary Install

pip install intel-extension-for-transformers

Install From Source

Install Intel® Extension for Transformers

git clone https://github.com/intel/intel-extension-for-transformers.git intel_extension_for_transformers
cd intel_extension_for_transformers
# Install dependencies
pip install -r requirements.txt
git submodule update --init --recursive
# Install intel_extension_for_transformers
python setup.py install
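
Either installation path can be sanity-checked with a quick import (a minimal smoke test that assumes nothing beyond a successful install):

python -c "import intel_extension_for_transformers as itrex; print(itrex.__file__)"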

Getting Started

Quantization

from intel_extension_for_transformers import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_f1", is_relative=True, criterion=0.01)
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metric],
    objectives=[objectives.performance]
)
model = trainer.quantize(quant_config=q_config)

Please refer to the quantization document for more details.
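
For context, here is a minimal sketch of how the trainer in the snippet above might be constructed. The SST-2 task, the checkpoint name, and the compute_metrics helper are illustrative assumptions, not part of the toolkit's API; NLPTrainer is documented as a drop-in replacement for transformers.Trainer, so it accepts the same constructor arguments.

import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

raw = load_dataset("glue", "sst2")
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

def compute_metrics(p):
    # Binary F1; the Trainer reports it as "eval_f1", matching metrics.Metric(name="eval_f1").
    preds = np.argmax(p.predictions, axis=-1)
    labels = p.label_ids
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    return {"f1": 2 * tp / max(2 * tp + fp + fn, 1)}

# NLPTrainer takes the same arguments as transformers.Trainer.
trainer = NLPTrainer(
    model=model,
    args=TrainingArguments(output_dir="./saved_results"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)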

Pruning

from intel_extension_for_transformers import metrics, PrunerConfig, PruningConfig
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_accuracy")
pruner_config = PrunerConfig(prune_type='BasicMagnitude', target_sparsity_ratio=0.9)
p_conf = PruningConfig(pruner_config=[pruner_config], metrics=metric)
model = trainer.prune(pruning_config=p_conf)

Please refer to the pruning document for more details.
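
As a sanity check after pruning, you may want to measure the sparsity the model actually reached. The helper below is an illustrative sketch, not part of the toolkit: it counts exactly-zero elements in the weight tensors of a PyTorch module (trainer.model is used here since Trainer subclasses always hold a torch module).

import torch

def weight_sparsity(model: torch.nn.Module) -> float:
    # Fraction of exactly-zero elements across all weight tensors.
    zeros, total = 0, 0
    for name, param in model.named_parameters():
        if "weight" in name:
            zeros += int((param == 0).sum())
            total += param.numel()
    return zeros / max(total, 1)

print(f"achieved sparsity: {weight_sparsity(trainer.model):.2%}")  # expect roughly 0.9 for target_sparsity_ratio=0.9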

Distillation

from intel_extension_for_transformers import metrics, Criterion, DistillationConfig
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
teacher_model = ... # an existing fine-tuned model to act as the teacher
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_accuracy")
d_conf = DistillationConfig(metrics=metric)
model = trainer.distill(distillation_config=d_conf, teacher_model=teacher_model)

Please refer to the distillation document for more details.
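
Conceptually, distillation trains the student to match the teacher's temperature-softened outputs while still fitting the ground-truth labels. The sketch below shows the standard knowledge-distillation loss for intuition only; it is not the toolkit's internal implementation (the Criterion import above is how the toolkit configures its own), and the temperature T and weighting alpha are illustrative defaults.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard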

Data Augmentation

Data augmentation provides facilities to generate synthesized NLP datasets for further model optimization. It supports text generation with popular fine-tuned models such as GPT and GPT-2, as well as other text synthesis approaches from nlpaug.

import os
from datasets import load_dataset
from intel_extension_for_transformers.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "original_dataset.csv" # example: https://huggingface.co/datasets/glue/viewer/sst2/train
aug.column_names = "sentence"
aug.output_path = os.path.join(".", "augmented_dataset.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
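
A quick way to inspect the result (purely illustrative; raw_datasets is the datasets.Dataset loaded above):

print(len(raw_datasets))  # number of rows after augmentation
print(raw_datasets[0])    # first augmented example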

Please refer to the data augmentation document for more details.

Quantized Length Adaptive Transformer

Quantized Length Adaptive Transformer leverages sequence-length reduction and low-bit representation techniques to further enhance model inference performance, enabling adaptive sequence lengths that accommodate different computational budgets with an optimal accuracy-efficiency tradeoff.

from intel_extension_for_transformers import QuantizationConfig, DynamicLengthConfig, metrics, objectives
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(...)
metric = metrics.Metric(name="eval_f1", is_relative=True, criterion=0.01)
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metric],
    objectives=[objectives.performance]
)
# Apply the length config produced by Length Adaptive training/search
length_config = ... # per-layer token-length schedule
dynamic_length_config = DynamicLengthConfig(length_config=length_config)
trainer.set_dynamic_config(dynamic_config=dynamic_length_config)
# Quantization
model = trainer.quantize(quant_config=q_config)

Please refer to the paper QuaLA-MiniLM and the code for more details.
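
For reference, a length configuration in the Length Adaptive Transformer is a per-layer schedule of how many tokens each transformer layer keeps. The values below are purely illustrative for a hypothetical 6-layer model; real schedules come from the paper's evolutionary search.

# Hypothetical per-layer token budget; later layers keep fewer tokens.
length_config = (256, 192, 160, 128, 96, 64)
dynamic_length_config = DynamicLengthConfig(length_config=length_config)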

Transformers-accelerated Neural Engine

Transformers-accelerated Neural Engine is one of the reference deployments that Intel® Extension for Transformers provides. It aims to demonstrate the optimal performance of extremely compressed NLP models by exploring optimization opportunities in both hardware and software.

from intel_extension_for_transformers.backends.neural_engine.compile import compile
# /path/to/your/model is a TensorFlow pb model or ONNX model
model = compile('/path/to/your/model')
inputs = ... # [input_ids, segment_ids, input_mask]
model.inference(inputs)

Please refer to the example in Transformers-accelerated Neural Engine and the paper Fast DistilBERT on CPUs for more details.
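
A hedged sketch of preparing the three inputs for a BERT-style model follows; the int32 dtype and [batch, seq_len] shapes are typical assumptions, and in practice the ids would come from a tokenizer rather than zero-filled arrays.

import numpy as np

batch_size, seq_len = 1, 128
input_ids = np.zeros((batch_size, seq_len), dtype=np.int32)    # token ids (from a tokenizer in practice)
segment_ids = np.zeros((batch_size, seq_len), dtype=np.int32)  # sentence A/B segment ids
input_mask = np.ones((batch_size, seq_len), dtype=np.int32)    # attention mask (1 = real token)
result = model.inference([input_ids, segment_ids, input_mask])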

Transformers-accelerated Libraries

Transformers-accelerated Libraries is a high-performance operator computing library implemented in assembly. It contains a JIT domain, a kernel domain, and a scheduling proxy framework.

#include "interface.hpp"
  ...
  operator_desc op_desc(ker_kind, ker_prop, eng_kind, ts_descs, op_attrs);
  sparse_matmul_desc spmm_desc(op_desc);
  sparse_matmul spmm_kern(spmm_desc);
  std::vector<const void*> rt_data = {data0, data1, data2, data3, data4};
  spmm_kern.execute(rt_data);

Please refer to Transformers-accelerated Libraries for more details.

System Requirements

Validated Hardware Environment

Intel® Extension for Transformers supports systems based on Intel 64 architecture or compatible processors, and is specifically optimized for the following CPUs:

  • Intel Xeon Scalable processors (code name Cascade Lake, Ice Lake)
  • Future Intel Xeon Scalable processors (code name Sapphire Rapids)

Validated Software Environment

  • OS version: CentOS 8.4, Ubuntu 20.04
  • Python version: 3.7, 3.8, 3.9, 3.10

| Framework | TensorFlow     | Intel TensorFlow | PyTorch                | IPEX           |
|-----------|----------------|------------------|------------------------|----------------|
| Version   | 2.10.0, 2.9.1  | 2.10.0, 2.9.1    | 1.13.0+cpu, 1.12.0+cpu | 1.13.0, 1.12.0 |

Selected Publications/Events
