Optimum Habana is the interface between the Hugging Face Transformers and Diffusers libraries and Habana's Gaudi processor (HPU). It provides a set of tools enabling easy model loading, training and inference on single- and multi-HPU settings for different downstream tasks.


Optimum for Intel® Gaudi® Accelerators

Optimum for Intel Gaudi - a.k.a. optimum-habana - is the interface between the Transformers and Diffusers libraries and Intel Gaudi AI Accelerators (HPU). It provides a set of tools enabling easy model loading, training and inference on single- and multi-HPU settings for different downstream tasks. The list of officially validated models and tasks is available here. Users can also try the thousands of other Hugging Face models and tasks on Intel Gaudi accelerators with only a few changes.

What are Intel Gaudi AI Accelerators (HPUs)?

HPUs offer fast model training and inference as well as a great price-performance ratio. Check out this blog post about BLOOM inference and this post benchmarking Intel Gaudi 2 and NVIDIA A100 GPUs for BridgeTower training for concrete examples.

Gaudi Setup

Please refer to the Intel Gaudi AI Accelerator official installation guide.

[!NOTE] Tests should be run in a Docker container based on Intel Gaudi's official images. Instructions to obtain the latest containers from the Intel Gaudi Vault are available here. The current release of Optimum for Intel Gaudi has been validated with the Intel Gaudi v1.24 stack.

Install the library and get example scripts

Option 1: Use the latest stable release

To install the latest stable release of this package, run:

pip install --upgrade-strategy eager optimum[habana]

The --upgrade-strategy eager option is needed to ensure optimum-habana is upgraded to the latest stable release.

To use the examples associated with the latest stable release, run:

git clone https://github.com/huggingface/optimum-habana
cd optimum-habana && git checkout v1.21.0

with v1.21.0 being the latest Optimum for Intel Gaudi release version.
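Since the example scripts are meant to match the installed library version, it can help to derive the tag to check out from the installed package. The following is a minimal sketch using only the Python standard library. The helper names `release_tag` and `installed_release_tag` are hypothetical, and the `v<version>` tag convention is an assumption taken from the command above.

```python
from importlib import metadata
from typing import Optional


def release_tag(version: str) -> str:
    """Map a package version such as "1.21.0" to the assumed git tag "v1.21.0"."""
    return f"v{version}"


def installed_release_tag(package: str = "optimum-habana") -> Optional[str]:
    """Return the tag matching the installed package, or None if it is not installed."""
    try:
        return release_tag(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    tag = installed_release_tag()
    if tag is None:
        print("optimum-habana is not installed")
    else:
        print(f"git checkout {tag}")
```

This only makes sense for stable releases: a development install from the main branch reports a version for which no tag is expected to exist.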

Option 2: Use the latest main branch under development

Optimum for Intel Gaudi is a fast-moving project, and you may want to install it from source and get the latest scripts:

pip install git+https://github.com/huggingface/optimum-habana.git
git clone https://github.com/huggingface/optimum-habana

Option 3: Use the `transformers_future` branch to have the latest changes from Transformers

The transformers_future branch is regularly updated with the latest changes from the main branches of Optimum for Intel Gaudi and Transformers. This enables you to try out new Transformers features that have not been merged into the main branch yet.

[!WARNING] The transformers_future branch may have some regressions or bugs and may be less stable than the main branch.

pip install git+https://github.com/huggingface/optimum-habana.git@transformers_future
git clone -b transformers_future https://github.com/huggingface/optimum-habana

Install Dependencies

To use DeepSpeed on HPUs, you also need to run the following command:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.23.0

To install the requirements of a given example, run the following from its folder:

cd <example-folder>
pip install -r requirements.txt
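When several examples are needed, the per-folder step can be scripted. The sketch below only lists the `pip` commands rather than running them. It assumes that the repository was cloned into `./optimum-habana` and that the examples live under an `examples/` directory, and both helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def requirement_files(examples_root: Path) -> List[Path]:
    """Return every requirements.txt found below examples_root, sorted by path."""
    return sorted(examples_root.rglob("requirements.txt"))


def pip_commands(examples_root: Path) -> List[str]:
    """Build one `pip install -r` command per requirements file."""
    return [f"pip install -r {path}" for path in requirement_files(examples_root)]


if __name__ == "__main__":
    # Assumed clone location; adjust to wherever the repository actually lives.
    root = Path("optimum-habana") / "examples"
    if root.is_dir():
        for command in pip_commands(root):
            print(command)
```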

How to use it?

Optimum for Intel Gaudi was designed with one goal in mind: to make training and inference straightforward for Transformers and Diffusers users, while fully leveraging the power of Intel Gaudi AI Accelerators.

Transformers Interface

There are two main classes one needs to know:

GaudiTrainer: the trainer class that takes care of compiling and distributing the model to run on HPUs, and performing training and evaluation.
GaudiConfig: the class that lets you configure Gaudi mixed precision and decide whether optimized operators and optimizers should be used.

The GaudiTrainer is very similar to the Transformers Trainer, so adapting a script that uses the Trainer to run on Intel Gaudi accelerators mostly consists of swapping the Trainer class for the GaudiTrainer one.

That's how most of the example scripts were adapted from their original counterparts.

Here is an example:

- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
  # training arguments...
+ use_habana=True,
+ use_lazy_mode=True,  # whether to use lazy or eager mode
+ gaudi_config_name=path_to_gaudi_config,
)

# A lot of code here

# Initialize our Trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
    model=model,
    args=training_args,  # Original training arguments.
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

where gaudi_config_name is the name of a model from the Hub (Intel Gaudi configurations are stored in model repositories) or a path to a local Intel Gaudi configuration file (you can see here how to write your own).
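For the local-file case, here is a minimal sketch of writing a configuration file and building the Gaudi-specific keyword arguments shown in the diff above. The helper names are hypothetical, and the JSON keys (`use_torch_autocast`, `use_fused_adam`, `use_fused_clip_norm`) are assumptions about the fields GaudiConfig reads, so check the documentation for the authoritative list.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_gaudi_config(path: Path) -> Path:
    """Write a small Gaudi configuration file and return its path.

    The keys below are assumed field names, not a verified schema.
    """
    config = {
        "use_torch_autocast": True,   # mixed precision through torch autocast
        "use_fused_adam": True,       # optimized optimizer
        "use_fused_clip_norm": True,  # optimized gradient-clipping operator
    }
    path.write_text(json.dumps(config, indent=2))
    return path


def gaudi_overrides(gaudi_config_name: str, lazy_mode: bool = True) -> Dict[str, Any]:
    """Keyword arguments added on top of the usual training arguments."""
    return {
        "use_habana": True,
        "use_lazy_mode": lazy_mode,
        "gaudi_config_name": gaudi_config_name,
    }
```

The resulting dictionary can then be unpacked into the arguments class, for instance `GaudiTrainingArguments(output_dir="out", **gaudi_overrides(str(config_path)))`. A Hub repository name can be passed in place of the local path.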

Diffusers Interface

You can generate images from prompts using Stable Diffusion on Intel Gaudi with the GaudiStableDiffusionPipeline and GaudiDDIMScheduler classes, both of which have been optimized for HPUs. Here is how to use them and how they differ from the Diffusers library:

- from diffusers import DDIMScheduler, StableDiffusionPipeline
+ from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline


model_name = "CompVis/stable-diffusion-v1-4"

- scheduler = DDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
+ scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

- pipeline = StableDiffusionPipeline.from_pretrained(
+ pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
+   use_habana=True,
+   use_hpu_graphs=True,
+   gaudi_config="Habana/stable-diffusion",
)

outputs = pipeline(
    ["An image of a squirrel in Picasso style"],
    num_images_per_prompt=16,
+   batch_size=4,
)
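The relationship between `num_images_per_prompt` and `batch_size` in the call above can be sketched as follows. This assumes the pipeline generates the requested images in chunks of `batch_size`, which is an interpretation of the parameters rather than a documented guarantee. The helper name is hypothetical.

```python
from math import ceil
from typing import Sequence


def number_of_batches(prompts: Sequence[str], num_images_per_prompt: int, batch_size: int) -> int:
    """Number of forward batches needed to produce all requested images."""
    total_images = len(prompts) * num_images_per_prompt
    return ceil(total_images / batch_size)


# One prompt, 16 images, 4 images per batch, as in the example above.
print(number_of_batches(["An image of a squirrel in Picasso style"], 16, 4))  # → 4
```

Under that assumption, a larger `batch_size` means fewer passes but more device memory per pass.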

Important Note on PyTorch 2.5 Performance Degradation

With the upgrade to PyTorch 2.5, users may experience some performance degradation due to changes in the handling of FP16/BF16 inputs. The note from PyTorch 2.5 states:

"A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcasted back to FP16/BF16. This will improve numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usages and may cause the performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/Matmul."

For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with the following setting:

torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
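In scripts that may run against several PyTorch versions, the call can be guarded so that it is skipped when the setting is unavailable. This is a minimal sketch, with a hypothetical helper name, which assumes the setting is absent on versions older than 2.5:

```python
def enable_reduced_precision_sdpa() -> bool:
    """Enable reduced-precision reductions for the SDPA math backend if possible.

    Returns True when the setting was applied, and False when PyTorch is not
    installed or does not expose the setting.
    """
    try:
        import torch
    except ImportError:
        return False
    setter = getattr(torch.backends.cuda, "allow_fp16_bf16_reduction_math_sdp", None)
    if setter is None:  # assumed to mean a PyTorch version older than 2.5
        return False
    setter(True)
    return True
```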

Additionally, the next release of Optimum Habana will include a Gaudi-specific safe_softmax implementation that will also improve performance.

More info:

https://pytorch.org/docs/stable/notes/numerical_accuracy.html
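The accuracy trade-off described in the PyTorch note can be illustrated with a toy summation. This is not the SDPA kernel: it only emulates a half-precision accumulation buffer with the standard library, and Python's double-precision floats stand in for the FP32 upcast path.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def sum_low_precision(values):
    """Accumulate in a half-precision buffer: round after every addition."""
    total = 0.0
    for value in values:
        total = to_fp16(total + to_fp16(value))
    return total


def sum_upcast(values):
    """Upcast the inputs, accumulate in higher precision, downcast once at the end."""
    total = 0.0
    for value in values:
        total += to_fp16(value)
    return to_fp16(total)


values = [0.1] * 4096
exact = sum(to_fp16(v) for v in values)  # reference sum of the FP16 inputs
low_error = abs(sum_low_precision(values) - exact)
upcast_error = abs(sum_upcast(values) - exact)
print(f"low-precision error: {low_error}, upcast error: {upcast_error}")
```

The low-precision accumulator stops growing once each addend falls below half the spacing between representable values, which is the kind of error the upcast avoids at the cost of extra memory and compute.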

Documentation

Check out the documentation of Optimum for Intel Gaudi for more advanced usage.

Validated Models

The following model architectures, tasks and device distributions have been validated for Optimum for Intel Gaudi:

[!NOTE] In the tables below, :heavy_check_mark: means single-card, multi-card and DeepSpeed have all been validated.

Transformers:

Architecture	Training	Inference	Tasks
BERT	:heavy_check_mark:	:heavy_check_mark:	text classification, question answering, language modeling, text feature extraction
RoBERTa	:heavy_check_mark:	:heavy_check_mark:	question answering, language modeling
ALBERT	:heavy_check_mark:	:heavy_check_mark:	question answering, language modeling
DistilBERT	:heavy_check_mark:	:heavy_check_mark:	question answering, language modeling
GPT2	:heavy_check_mark:	:heavy_check_mark:	language modeling, text generation
BLOOM(Z)		DeepSpeed	text generation
StarCoder / StarCoder2	:heavy_check_mark:	Single card	language modeling, text generation
GPT-J	DeepSpeed	Single card, DeepSpeed	language modeling, text generation
GPT-Neo		Single card	text generation
GPT-NeoX	DeepSpeed	DeepSpeed	language modeling, text generation
OPT		DeepSpeed	text generation
Llama 2 / CodeLlama / Llama 3 / Llama Guard / Granite	:heavy_check_mark:	:heavy_check_mark:	language modeling, text generation, question answering, text classification (Llama Guard)
StableLM		Single card	text generation
Falcon	LoRA	:heavy_check_mark:	language modeling, text generation
CodeGen		Single card	text generation
MPT		Single card	text generation
Mistral		Single card	text generation
Phi	:heavy_check_mark:	Single card	language modeling, text generation
Mixtral		Single card	text generation
Persimmon		Single card	text generation
Qwen2 / Qwen3	Single card	Single card	language modeling, text generation
Qwen2-MoE		Single card	text generation
Gemma	:heavy_check_mark:	Single card	language modeling, text generation
Gemma2		:heavy_check_mark:	text generation
Gemma3		:heavy_check_mark:	text generation
XGLM		Single card	text generation
Cohere		Single card	text generation
T5 / Flan T5	:heavy_check_mark:	:heavy_check_mark:	summarization, translation, question answering
BART		Single card	summarization, translation, question answering
ViT	:heavy_check_mark:	:heavy_check_mark:	image classification
Swin	:heavy_check_mark:	:heavy_check_mark:	image classification
Wav2Vec2	:heavy_check_mark:	:heavy_check_mark:	audio classification, speech recognition
Whisper	:heavy_check_mark:	:heavy_check_mark:	speech recognition
SpeechT5		Single card	text to speech
CLIP	:heavy_check_mark:	:heavy_check_mark:	contrastive image-text training
BridgeTower	:heavy_check_mark:	:heavy_check_mark:	contrastive image-text training
ESMFold		Single card	protein folding
Blip		Single card	visual question answering, image to text
OWLViT		Single card	zero shot object detection
ClipSeg		Single card	object segmentation
Llava / Llava-next / Llava-onevision		Single card	image to text
idefics2	LoRA	Single card	image to text
Paligemma		Single card	image to text
Segment Anything Model		Single card	object segmentation
VideoMAE		Single card	video classification
TableTransformer		Single card	table object detection
DETR		Single card	object detection
Mllama	LoRA	:heavy_check_mark:	image to text
MiniCPM3		Single card	text generation
Baichuan2	DeepSpeed	Single card	language modeling, text generation
DeepSeek-V2	:heavy_check_mark:	:heavy_check_mark:	text generation
DeepSeek-V3 / Moonlight		:heavy_check_mark:	text generation
ChatGLM	DeepSpeed	Single card	language modeling, text generation
Qwen2-VL		Single card	image to text
Qwen2.5-VL		Single card	image to text
VideoLLaVA		Single card	video comprehension
GLM-4V		Single card	image to text
Arctic		DeepSpeed	text generation
GPT-OSS		DeepSpeed	text generation

Diffusers:

Architecture	Training	Inference	Tasks
Stable Diffusion	:heavy_check_mark:	:heavy_check_mark:	text-to-image generation, image-to-image generation
Stable Diffusion XL	:heavy_check_mark:	:heavy_check_mark:	text-to-image generation, image-to-image generation
Stable Diffusion Depth2img		Single card	depth-to-image generation
Stable Diffusion 3	:heavy_check_mark:	:heavy_check_mark:	text-to-image generation
LDM3D		Single card	text-to-image generation
FLUX.1	LoRA	Single card	text-to-image generation, image-to-image generation
Qwen Image		Single card	text-to-image generation
Text to Video		Single card	text-to-video generation
Image to Video		Single card	image-to-video generation
i2vgen-xl		Single card	image-to-video generation
Wan		:heavy_check_mark:	text-to-video generation, image-to-video generation

PyTorch Image Models/TIMM:

Architecture	Training	Inference	Tasks
FastViT		Single card	image classification

TRL:

Architecture	Training	Tasks
Llama 2	:heavy_check_mark:	DPO Pipeline
Llama 2	:heavy_check_mark:	PPO Pipeline
Stable Diffusion	:heavy_check_mark:	DDPO Pipeline

Other models and tasks supported by the Transformers and Diffusers libraries may also work. You can refer to this section for using them with Optimum for Intel Gaudi. In addition, this page explains how to modify any example from the Transformers library to make it work with Optimum for Intel Gaudi.

If you find any issues while using those, please open an issue or a pull request.

After training your model, feel free to submit it to the Intel leaderboard, which is designed to evaluate, score, and rank open-source LLMs that have been pre-trained or fine-tuned on Intel hardware. Models submitted to the leaderboard will be evaluated on the Intel Developer Cloud. The evaluation platform consists of Gaudi Accelerators and Xeon CPUs running benchmarks from the Eleuther AI Language Model Evaluation Harness.

The list of models validated through continuous integration tests is posted here.

Development

Check the contributor guide for instructions.

Known Issues

bitsandbytes compatibility issues with PyTorch >= 2.10

Users running PyTorch>=2.10 with bitsandbytes may encounter issues depending on the bitsandbytes version in use. This is an upstream problem tracked at https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1904 and is not specific to Intel Gaudi or Optimum for Intel Gaudi.

bitsandbytes >= 0.50: degraded performance with `torch.compile`

When running quantized workloads with bitsandbytes>=0.50 and torch.compile on PyTorch>=2.10, performance may regress.

Workarounds:
- Run without torch.compile.
- Pin bitsandbytes==0.49.2 and reduce batch size (see caveat below).

bitsandbytes < 0.50: increased memory usage due to graph breaks

When running quantized workloads with bitsandbytes<0.50 (e.g. bitsandbytes==0.49.2) on PyTorch>=2.10, graph breaks may occur and cause increased memory consumption.

Workarounds:
- Run without torch.compile.
- Reduce batch size.
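The version conditions above can be checked programmatically. The sketch below uses hypothetical helper names and compares versions as integer tuples, because a plain string comparison would wrongly rank "2.9" above "2.10". It ignores pre-release and local suffixes such as "+cpu".

```python
import re
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn "2.10.0+cpu" into (2, 10, 0), ignoring any suffix after the numeric part."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def known_bitsandbytes_issue(torch_version: str, bnb_version: str) -> Optional[str]:
    """Return the known issue that applies to this version pair, or None."""
    if release_tuple(torch_version) < (2, 10):
        return None
    if release_tuple(bnb_version) >= (0, 50):
        return "degraded performance with torch.compile"
    return "increased memory usage due to graph breaks"
```

The installed versions can be obtained with `importlib.metadata.version("torch")` and `importlib.metadata.version("bitsandbytes")` and passed to `known_bitsandbytes_issue`.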

Release history

Version	Release date
1.21.0 (this version)	Apr 16, 2026
1.20.0	Jan 14, 2026
1.19.1	Oct 21, 2025
1.19.0	Sep 5, 2025
1.18.1	Jul 24, 2025
1.18.0	Jun 12, 2025
1.17.0	Apr 11, 2025
1.16.0	Mar 10, 2025
1.15.0	Dec 25, 2024
1.14.1	Oct 29, 2024
1.14.0	Oct 22, 2024
1.13.2	Sep 6, 2024
1.13.1	Aug 25, 2024
1.13.0	Aug 16, 2024
1.12.1	Jul 11, 2024
1.12.0	Jun 22, 2024
1.11.1	Apr 20, 2024
1.11.0	Apr 4, 2024
1.10.4	Feb 23, 2024
1.10.2	Feb 18, 2024
1.10.0	Jan 30, 2024
1.9.0	Nov 30, 2023
1.8.2	Nov 24, 2023
1.8.1	Nov 2, 2023
1.8.0	Oct 19, 2023
1.7.5	Sep 14, 2023
1.7.4	Sep 12, 2023
1.7.3	Sep 8, 2023
1.7.2	Aug 24, 2023
1.7.1	Aug 23, 2023
1.7.0	Aug 17, 2023
1.6.1	Jul 7, 2023
1.6.0	Jun 26, 2023
1.5.1	May 11, 2023
1.5.0	Apr 17, 2023
1.4.2	Mar 16, 2023
1.4.1	Feb 13, 2023
1.4.0	Feb 12, 2023
1.3.3	Jan 30, 2023
1.3.2	Jan 24, 2023
1.3.1	Dec 2, 2022
1.3.0	Dec 1, 2022
1.2.3	Oct 13, 2022
1.2.2	Oct 2, 2022
1.2.1	Sep 12, 2022
1.2.0	Sep 12, 2022
1.1.2	Aug 12, 2022
1.1.1	Aug 2, 2022
1.1.0	Jul 15, 2022
1.0.1	Apr 26, 2022

Download files

Source Distribution

optimum_habana-1.21.0.tar.gz (969.6 kB)

Uploaded Apr 16, 2026 Source

Built Distribution

optimum_habana-1.21.0-py3-none-any.whl (1.1 MB)

Uploaded Apr 16, 2026 Python 3

File details

Details for the file optimum_habana-1.21.0.tar.gz.

File metadata

Download URL: optimum_habana-1.21.0.tar.gz
Upload date: Apr 16, 2026
Size: 969.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for optimum_habana-1.21.0.tar.gz
Algorithm	Hash digest
SHA256	`97233611f5b876b8ffd77348922070d7f4ef05506006fb435311d64e9bf76c89`
MD5	`5de69ec8e078ddfe452877245ecffe29`
BLAKE2b-256	`fdeb23fd1d4ca0cedf5cf946d2b2ccac2beaeeac9db1d0cfc758c02690828077`

File details

Details for the file optimum_habana-1.21.0-py3-none-any.whl.

File metadata

Download URL: optimum_habana-1.21.0-py3-none-any.whl
Upload date: Apr 16, 2026
Size: 1.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for optimum_habana-1.21.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`796caff4c28a282dd197a8a210f92b4a4a34423d02aead0911b256ecda70b1a9`
MD5	`eb4b9cfea9799c7d732ea00c4540e448`
BLAKE2b-256	`0e611e726bcc3c511c516b1698ddd1a0f253718f1a7faa44249358df333f2d87`
