Skip to main content

Optimum Library is an extension of the Hugging Face Transformers library, providing a framework to integrate third-party libraries from Hardware Partners and interface with their specific functionality.

Project description

ONNX Runtime

Hugging Face Optimum

🤗 Optimum is an extension of 🤗 Transformers and Diffusers, providing a set of optimization tools enabling maximum efficiency to train and run models on targeted hardware, while keeping things easy to use.

Installation

🤗 Optimum can be installed using pip as follows:

python -m pip install optimum

If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:

Accelerator Installation
ONNX Runtime pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]
Intel Neural Compressor pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
OpenVINO pip install --upgrade --upgrade-strategy eager optimum[openvino]
NVIDIA TensorRT-LLM docker run -it --gpus all --ipc host huggingface/optimum-nvidia
AMD Instinct GPUs and Ryzen AI NPU pip install --upgrade --upgrade-strategy eager optimum[amd]
AWS Trainum & Inferentia pip install --upgrade --upgrade-strategy eager optimum[neuronx]
Habana Gaudi Processor (HPU) pip install --upgrade --upgrade-strategy eager optimum[habana]
FuriosaAI pip install --upgrade --upgrade-strategy eager optimum[furiosa]

The --upgrade --upgrade-strategy eager option is needed to ensure the different packages are upgraded to the latest possible version.

To install from source:

python -m pip install git+https://github.com/huggingface/optimum.git

For the accelerator-specific features, append optimum[accelerator_type] to the above command:

python -m pip install optimum[onnxruntime]@git+https://github.com/huggingface/optimum.git

Accelerated Inference

🤗 Optimum provides multiple tools to export and run optimized models on various ecosystems:

  • ONNX / ONNX Runtime
  • TensorFlow Lite
  • OpenVINO
  • Habana first-gen Gaudi / Gaudi2, more details here
  • AWS Inferentia 2 / Inferentia 1, more details here
  • NVIDIA TensorRT-LLM , more details here

The export and optimizations can be done both programmatically and with a command line.

Features summary

Features ONNX Runtime Neural Compressor OpenVINO TensorFlow Lite
Graph optimization :heavy_check_mark: N/A :heavy_check_mark: N/A
Post-training dynamic quantization :heavy_check_mark: :heavy_check_mark: N/A :heavy_check_mark:
Post-training static quantization :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Quantization Aware Training (QAT) N/A :heavy_check_mark: :heavy_check_mark: N/A
FP16 (half precision) :heavy_check_mark: N/A :heavy_check_mark: :heavy_check_mark:
Pruning N/A :heavy_check_mark: :heavy_check_mark: N/A
Knowledge Distillation N/A :heavy_check_mark: :heavy_check_mark: N/A

OpenVINO

Before you begin, make sure you have all the necessary libraries installed :

pip install --upgrade --upgrade-strategy eager optimum[openvino]

It is possible to export 🤗 Transformers and Diffusers models to the OpenVINO format easily:

optimum-cli export openvino --model distilbert-base-uncased-finetuned-sst-2-english distilbert_sst2_ov

If you add --weight-format int8, the weights will be quantized to int8, check out our documentation for more detail on weight only quantization. To apply quantization on both weights and activations, you can find more information here.

To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. To load a PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, you can set export=True when loading your model.

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained("distilbert_sst2_ov")

  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")

You can find more examples in the documentation and in the examples.

Neural Compressor

Before you begin, make sure you have all the necessary libraries installed :

pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]

Dynamic quantization can be applied on your model:

optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output ./quantized_distilbert

To load a model quantized with Intel Neural Compressor, hosted locally or on the 🤗 hub, you can do as follows :

from optimum.intel import INCModelForSequenceClassification

model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_id)

You can find more examples in the documentation and in the examples.

ONNX + ONNX Runtime

Before you begin, make sure you have all the necessary libraries installed :

pip install optimum[exporters,onnxruntime]

It is possible to export 🤗 Transformers and Diffusers models to the ONNX format and perform graph optimization as well as quantization easily:

optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx

The model can then be quantized using onnxruntime:

optimum-cli onnxruntime quantize \
  --avx512 \
  --onnx_model roberta_base_qa_onnx \
  -o quantized_roberta_base_qa_onnx

These commands will export deepset/roberta-base-squad2 and perform O2 graph optimization on the exported model, and finally quantize it with the avx512 configuration.

For more information on the ONNX export, please check the documentation.

Run the exported model using ONNX Runtime

Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seemless manner using ONNX Runtime in the backend:

- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
  from transformers import AutoTokenizer, pipeline

  model_id = "deepset/roberta-base-squad2"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
  qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
  question = "What's Optimum?"
  context = "Optimum is an awesome library everyone should use!"
  results = qa_pipe(question=question, context=context)

More details on how to run ONNX models with ORTModelForXXX classes here.

TensorFlow Lite

Before you begin, make sure you have all the necessary libraries installed :

pip install optimum[exporters-tf]

Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them:

optimum-cli export tflite \
  -m deepset/roberta-base-squad2 \
  --sequence_length 384  \
  --quantize int8-dynamic roberta_tflite_model

Accelerated training

🤗 Optimum provides wrappers around the original 🤗 Transformers Trainer to enable training on powerful hardware easily. We support many providers:

  • Habana's Gaudi processors
  • AWS Trainium instances, check here
  • ONNX Runtime (optimized for GPUs)

Habana

Before you begin, make sure you have all the necessary libraries installed :

pip install --upgrade --upgrade-strategy eager optimum[habana]
- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForXxx.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
      output_dir="path/to/save/folder/",
+     use_habana=True,
+     use_lazy_mode=True,
+     gaudi_config_name="Habana/bert-base-uncased",
      ...
  )

  # Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use Habana Gaudi processor for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

ONNX Runtime

- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
      output_dir="path/to/save/folder/",
      optim="adamw_ort_fused",
      ...
  )

  # Create a ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use ONNX Runtime for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

optimum-1.20.0.tar.gz (324.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

optimum-1.20.0-py3-none-any.whl (418.4 kB view details)

Uploaded Python 3

File details

Details for the file optimum-1.20.0.tar.gz.

File metadata

  • Download URL: optimum-1.20.0.tar.gz
  • Upload date:
  • Size: 324.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for optimum-1.20.0.tar.gz
Algorithm Hash digest
SHA256 b64c7536fe738db9b56605105efe72006401ad2aa00cb499ae407f2e06f3043b
MD5 4fc088dcf991f4270a120865d42f4096
BLAKE2b-256 430213375a95e8a52f46dddd69ab5a9a4e8f933670c56212b83171a5d482f76c

See more details on using hashes here.

File details

Details for the file optimum-1.20.0-py3-none-any.whl.

File metadata

  • Download URL: optimum-1.20.0-py3-none-any.whl
  • Upload date:
  • Size: 418.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for optimum-1.20.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0c0d0746043c95e22cf3586946d7408d353f10c0486f1c7d2d11084a5cfc0ede
MD5 031e22261f84b9060ec2a9f7a8de9356
BLAKE2b-256 0dcdabee9bc3790118d198339ab63a7a93b172a907777a067183628de0071e42

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page