
Temporarily remove unused tokens during training to save RAM and speed up training.


transformer-smaller-training-vocab


Docs are available here

Motivation

Have you ever trained a transformer model and noticed that most tokens in the vocab are not used? Logically, the token embeddings of those unused tokens won't change during training; however, they still take up memory and compute resources on your GPU. One could assume that the embeddings are just a small part of the model and therefore don't matter, but in models like xlm-roberta-large, 45.72% of the parameters are "word_embeddings". Besides that, the gradient computation is done for the whole embedding weight, leading to gradient updates that consist largely of zeros and eat a lot of memory, especially with stateful optimizers such as Adam.
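
As a rough illustration of that 45.72% figure, the share of embedding parameters can be checked directly with transformers (a minimal sketch; the exact value depends on the checkpoint and on how shared weights are counted):

  from transformers import AutoModel

  # Sketch: measure what fraction of xlm-roberta-large's parameters are word embeddings.
  # Note that downloading the checkpoint requires a couple of GB.
  model = AutoModel.from_pretrained("xlm-roberta-large")

  embedding_params = model.get_input_embeddings().weight.numel()
  total_params = sum(p.numel() for p in model.parameters())
  print(f"word_embeddings share: {embedding_params / total_params:.2%}")  # roughly 45-46%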

To reduce these inconveniences, this package provides a simple and easy-to-use way to

  • gather usage statistics of the vocabulary (sketched below)
  • temporarily reduce the vocabulary so that it contains no tokens that won't be used during training
  • fit the removed tokens back in after training is finished, so the full model can be saved.
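
As a rough illustration of the first step, the usage statistics boil down to counting which token ids actually appear after tokenizing the training texts (a conceptual sketch with a plain tokenizer; the library does this internally):

  from transformers import AutoTokenizer

  # Conceptual sketch of step 1: collect the set of token ids that actually
  # occur in the training texts.
  tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
  texts = ["only a few sentences", "so only a few token ids are used"]

  used_ids = set()
  for text in texts:
      used_ids.update(tokenizer(text)["input_ids"])

  print(f"{len(used_ids)} of {tokenizer.vocab_size} tokens are actually used")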

Limitations

This library works fine if you use any fast tokenizer. However, if you want to use a slow tokenizer, it gets more tricky, as huggingface-transformers currently provides no interface for overwriting the vocabulary of a slow tokenizer. Those therefore require a custom implementation; currently the following tokenizers are supported:

  • XLMRobertaTokenizer
  • RobertaTokenizer
  • BertTokenizer

If you want to use a tokenizer that is not on the list, please create an issue for it.
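
If you are unsure which kind of tokenizer you have, you can check its is_fast attribute (a quick sketch; AutoTokenizer loads the fast variant by default where one exists):

  from transformers import AutoTokenizer

  # Fast tokenizers work out of the box; slow ones must be on the list above.
  tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
  print(type(tokenizer).__name__, "is fast:", tokenizer.is_fast)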

Quick Start

Requirements and Installation

The project requires Transformers 4.1.0+, PyTorch 1.8+ and Python 3.8+. Then, in your favorite virtual environment, simply run:

pip install transformer-smaller-training-vocab

Example Usage

To benefit from more efficient training, it is enough to make the following changes to an arbitrary training script:

+ from transformer_smaller_training_vocab import get_texts_from_dataset, reduce_train_vocab

  model = ...
  tokenizer = ...
  raw_datasets = ...
  ...

+ with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=get_texts_from_dataset(raw_datasets, key="text")):
      def preprocess_function(examples):
          result = tokenizer(examples["text"], padding=padding, max_length=max_seq_length, truncation=True)
          result["label"] = [(label_to_id[l] if l != -1 else -1) for l in examples["label"]]
          return result
    
      raw_datasets = raw_datasets.map(
          preprocess_function,
          batched=True,
      )
    
      trainer = Trainer(
          model=model,
          train_dataset=raw_datasets["train"],
          eval_dataset=raw_datasets["validation"],
          tokenizer=tokenizer,
          ...
      )
    
      trainer.train()

+ trainer.save_model()  # save the model at the end so it contains the full vocab again.

Done! The model will now be trained using only the necessary parts of the token embeddings.
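
To see the effect, you can inspect the size of the input embedding matrix before, inside and after the context manager (a minimal sanity-check sketch, assuming reduce_train_vocab also accepts a plain list of strings as texts):

  from transformers import AutoModel, AutoTokenizer
  from transformer_smaller_training_vocab import reduce_train_vocab

  model = AutoModel.from_pretrained("xlm-roberta-base")
  tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
  texts = ["a handful of training sentences", "covering only a few distinct tokens"]

  print("before:", model.get_input_embeddings().weight.shape[0])      # full vocab
  with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=texts):
      print("inside:", model.get_input_embeddings().weight.shape[0])  # only the used tokens
  print("after:", model.get_input_embeddings().weight.shape[0])       # full vocab restored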

Impact

Here is a table to document how much impact this technique has on training:

Model              | Dataset       | Vocab reduction | Model size reduction
xlm-roberta-large  | CONLL 03 (en) | 93.13%          | 42.58%
xlm-roberta-base   | CONLL 03 (en) | 93.13%          | 64.31%
bert-base-cased    | CONLL 03 (en) | 43.64%          | 8.97%
bert-base-uncased  | CONLL 03 (en) | 47.62%          | 10.19%
bert-large-uncased | CONLL 03 (en) | 47.62%          | 4.44%
roberta-base       | CONLL 03 (en) | 58.39%          | 18.08%
roberta-large      | CONLL 03 (en) | 58.39%          | 8.45%
bert-base-cased    | cola          | 77.67%          | 15.97%
roberta-base       | cola          | 86.08%          | 26.66%
xlm-roberta-base   | cola          | 97.79%          | 67.52%
bert-base-cased    | mnli          | 10.94%          | 2.25%
roberta-base       | mnli          | 14.78%          | 4.58%
xlm-roberta-base   | mnli          | 88.83%          | 61.34%
bert-base-cased    | mrpc          | 49.93%          | 10.27%
roberta-base       | mrpc          | 64.02%          | 19.83%
xlm-roberta-base   | mrpc          | 94.88%          | 65.52%
bert-base-cased    | qnli          | 8.62%           | 1.77%
roberta-base       | qnli          | 17.64%          | 5.46%
xlm-roberta-base   | qnli          | 87.57%          | 60.47%
bert-base-cased    | qqp           | 7.69%           | 1.58%
roberta-base       | qqp           | 5.91%           | 1.83%
xlm-roberta-base   | qqp           | 85.40%          | 58.98%
bert-base-cased    | rte           | 34.68%          | 7.13%
roberta-base       | rte           | 50.49%          | 15.64%
xlm-roberta-base   | rte           | 93.10%          | 64.29%
bert-base-cased    | sst2          | 62.39%          | 12.83%
roberta-base       | sst2          | 68.60%          | 21.25%
xlm-roberta-base   | sst2          | 96.25%          | 66.47%
bert-base-cased    | stsb          | 51.35%          | 10.56%
roberta-base       | stsb          | 64.37%          | 19.93%
xlm-roberta-base   | stsb          | 94.88%          | 65.52%
bert-base-cased    | wnli          | 93.66%          | 19.26%
roberta-base       | wnli          | 96.03%          | 29.74%
xlm-roberta-base   | wnli          | 99.25%          | 68.54%

Notice that while the reduced embeddings imply slightly less computational effort, those gains are negligible, as the gradient computation for the parameters of the transformer layers is dominant.
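
The two percentage columns are roughly linked through the embedding parameter share from the Motivation section: the model size reduction is approximately the vocab reduction scaled by the fraction of parameters that sit in the word embeddings (a back-of-the-envelope check, not an exact formula):

  # Rough sanity check for the xlm-roberta-large / CONLL 03 (en) row above.
  embedding_share = 0.4572  # word_embeddings share of xlm-roberta-large (see Motivation)
  vocab_reduction = 0.9313  # vocab reduction reported in the table
  print(f"expected model size reduction: {embedding_share * vocab_reduction:.2%}")  # ~42.58%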

