A set of tools to compress gensim fasttext models

# Compress-fastText

This Python 3 package lets you compress fastText word embedding models (from the gensim package) by orders of magnitude without significantly affecting their quality.

Note: gensim==4.0.0 has introduced some backward-incompatible changes:

• With gensim<4.0.0, please use compress-fasttext<=0.0.7 (and optionally Russian models from our first release).
• With gensim>=4.0.0, please use compress-fasttext>=0.1.0 (and optionally Russian or English models from our 0.1.0 release).
• Some models are no longer supported in the new version of gensim+compress-fasttext (for example, multiple models from RusVectores that use compatible_hash=False).
• For any particular model, compatibility should be determined experimentally. If you notice any strange behaviour, please report it in the GitHub issues.
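
The two version rules above can be written down as a small helper. This is only a sketch: the function name `recommended_requirement` is hypothetical and not part of compress-fasttext, and it looks only at the major version number of gensim.

```python
def recommended_requirement(gensim_version: str) -> str:
    """Return the compress-fasttext pip requirement for a gensim version.

    Hypothetical helper, not part of compress-fasttext. It encodes only the
    two version rules listed above and inspects just the major version.
    """
    major = int(gensim_version.split(".")[0])
    if major < 4:
        return "compress-fasttext<=0.0.7"
    return "compress-fasttext>=0.1.0"


print(recommended_requirement("3.8.3"))  # compress-fasttext<=0.0.7
print(recommended_requirement("4.0.0"))  # compress-fasttext>=0.1.0
```

The helper does not check whether a particular model is supported; as noted above, that still has to be verified experimentally.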

The package can be installed with pip:

pip install compress-fasttext[full]


If you are not going to perform matrix decomposition or quantization, you can install a variant with fewer dependencies:

pip install compress-fasttext


This blogpost (in Russian) gives more details about the motivation and methods for compressing fastText models.

### Model compression

You can use this package to compress your own fastText model (or one downloaded, e.g., from RusVectores).

Compress a model in Gensim format:

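import gensim
import compress_fasttext
# Load the original (uncompressed) model; replace the placeholder path with your own
big_model = gensim.models.fasttext.FastTextKeyedVectors.load('path-to-original-model')
# Feature selection combined with product quantization (pq=True)
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')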


Import a model in Facebook original format and compress it:

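from gensim.models.fasttext import load_facebook_model
import compress_fasttext
# Load the Facebook-format binary and keep only its word vectors (.wv)
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')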


To perform this compression, you will need to pip install gensim==3.8.3 pqkmeans beforehand.

Different compression methods include:

• matrix decomposition (svd_ft)
• product quantization (quantize_ft)
• optimization of feature hashing (prune_ft)
• feature selection (prune_ft_freq)
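
For context on the last two methods in the list above: a fastText model builds word vectors from character n-grams, and a hash function maps each n-gram to a row of a fixed-size matrix. The sketch below illustrates that mapping in plain Python. It is a simplified illustration, not the code used by fastText, gensim, or this package: the function names are hypothetical, the n-gram lengths of 3 to 6 are assumed defaults, and the plain FNV-1a hash may differ in details from the real implementations.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def fnv1a_32(text: str) -> int:
    """Plain 32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % 2**32  # FNV prime, wrapped to 32 bits
    return h


def bucket_ids(word: str, num_buckets: int) -> List[int]:
    """Rows of the n-gram matrix that contribute to the vector of `word`."""
    return [fnv1a_32(gram) % num_buckets for gram in char_ngrams(word)]


print(char_ngrams("cat"))
# ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
print(bucket_ids("cat", num_buckets=1000))  # every index is below 1000
```

With fewer buckets the n-gram matrix has fewer rows and takes less space, but more n-grams collide and share a row. Feature selection instead keeps only a subset of the n-grams, so the remaining rows stay dedicated to the n-grams that matter most.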

The recommended approach is a combination of feature selection and quantization (prune_ft_freq with pq=True).
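
To give an intuition for the quantization half of this recommendation: product quantization splits each vector into sub-vectors and stores, for each sub-vector, only the index of its nearest centroid in a small codebook. The sketch below shows the encode and decode steps in plain Python. It is not the implementation used by this package: the function names and the hand-written toy codebooks are assumptions for illustration, and in practice the codebooks are learned from the data, typically with k-means.

```python
from typing import List, Sequence

Vector = Sequence[float]


def split(vector: Vector, num_parts: int) -> List[List[float]]:
    """Split a vector into `num_parts` sub-vectors of equal length."""
    size = len(vector) // num_parts
    return [list(vector[i * size:(i + 1) * size]) for i in range(num_parts)]


def nearest(sub: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    def squared_distance(centroid: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(sub, centroid))
    return min(range(len(centroids)), key=lambda k: squared_distance(centroids[k]))


def encode(vector: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Replace each sub-vector by the index of its nearest centroid."""
    parts = split(vector, len(codebooks))
    return [nearest(sub, codebook) for sub, codebook in zip(parts, codebooks)]


def decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx: List[float] = []
    for code, codebook in zip(codes, codebooks):
        approx.extend(codebook[code])
    return approx


# Toy codebooks (hypothetical): 2 sub-spaces with 2 centroids each
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 1.1, 0.2, 0.8]
codes = encode(vector, codebooks)
print(codes)                     # [1, 0]
print(decode(codes, codebooks))  # [1.0, 1.0, 0.0, 1.0]
```

Here a 4-dimensional float vector is stored as two small integers plus the shared codebooks, at the cost of a small reconstruction error.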

### Model usage

If you just need a tiny fastText model for Russian, you can download this 21-megabyte model. It is a compressed version of the geowac_tokens_none_fasttextskipgram_300_5_2020 model from RusVectores.

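If compress-fasttext is already installed, you can download and use this tiny model:

import compress_fasttext
# Replace the placeholder with the path or URL of the compressed model file
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'path-or-url-to-compressed-model'
)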
print(small_model['спасибо'])
# [ 0.26762889  0.35489027 ...  -0.06149674] # a 300-dimensional vector
print(small_model.most_similar('котенок'))
# [('кот', 0.7391024827957153), ('пес', 0.7388300895690918), ('малыш', 0.7280327081680298), ... ]


The class CompressedFastTextKeyedVectors inherits from gensim.models.fasttext.FastTextKeyedVectors, but makes a few additional optimizations.
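
The scores returned by most_similar in the example above are cosine similarities between word vectors. The following sketch shows how such a ranking can be computed in plain Python. It is an illustration only, not gensim's implementation: the function names and the 2-dimensional toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query: str, vectors: Dict[str, Sequence[float]], topn: int = 3
) -> List[Tuple[str, float]]:
    """Rank all words except the query by cosine similarity to the query."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-dimensional vectors (hypothetical values, for illustration only)
toy_vectors = {
    "kitten": [1.0, 0.9],
    "cat": [1.0, 1.0],
    "car": [-1.0, 0.2],
}
ranking = rank_by_similarity("kitten", toy_vectors)
print([word for word, _ in ranking])  # ['cat', 'car']
```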

For English, you can use this tiny model, obtained by compressing the model by Facebook.

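import compress_fasttext
# Replace the placeholder with the path or URL of the compressed English model file
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'path-or-url-to-compressed-model'
)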
print(small_model['hello'])
# [ 1.84736611e-01  6.32683930e-03  4.43901886e-03 ... -2.88431027e-02]  # a 300-dimensional vector
print(small_model.most_similar('Python'))
# [('PHP', 0.5252903699874878), ('.NET', 0.5027452707290649), ('Java', 0.4897131323814392),  ... ]


More compressed models for 101 languages can be found at https://zenodo.org/record/4905385.

### Notes

This code is heavily based on the navec package by Alexander Kukushkin and the blogpost by Andrey Vasnetsov about shrinking fastText embeddings.

