subs2vec module

subs2vec

Van Paridon & Thompson (2019) introduces pretrained embeddings and precomputed word/bigram/trigram frequencies in 55 languages. The files can be downloaded from the links in this table. Word vectors trained on subtitles are available, as are vectors trained on Wikipedia and on a combination of subtitles and Wikipedia (for best predictive performance).

This repository contains the subs2vec module, a number of Python 3.7 scripts and command line tools to evaluate a set of word vectors on semantic similarity, semantic and syntactic analogy, and lexical norm prediction tasks. In addition, the subs2vec.py script will take an OpenSubtitles archive or Wikipedia and go through all the steps to train a fastText model and produce word vectors as used in the paper associated with this repository.

Psycholinguists may be especially interested in the norms script, which evaluates the lexical norm prediction performance of a set of word vectors, but can also be used to predict lexical norms for un-normed words. For a more detailed explanation see the how to use -> extending lexical norms section.

The scripts in this repository require Python 3.7 and some additional libraries that are easily installed through pip. (If you want to use the subs2vec.py script to train your own word embeddings, you will also need compiled fastText and word2vec binaries.)

If you use any of the subs2vec code and/or pretrained models, please cite the preprint (Van Paridon & Thompson, 2019).

How to use

subs2vec is available through pip. To install it, run:
python3 -m pip install subs2vec
Any missing dependencies should be installed automatically.

Each submodule of subs2vec can then be run as a command line tool using the -m flag:
python3 -m subs2vec.submodule_name

Evaluating word embeddings

To evaluate word embeddings on analogies, semantic similarity, or lexical norm prediction as in Van Paridon & Thompson (2019), use:
python3 -m subs2vec.analogies fr french_word_vectors.vec
python3 -m subs2vec.similarities fr french_word_vectors.vec
python3 -m subs2vec.norms fr french_word_vectors.vec
subs2vec uses the two-letter ISO language codes, so French in the example is fr, English would be en, German would be de, etc.

All datasets used for evaluation, including the lexical norms, are stored in subs2vec/evaluation/datasets/.
Results from Van Paridon & Thompson (2019) are in subs2vec/evaluation/article_results/.
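
To inspect a vectors file before evaluating it, a minimal sketch like the one below may help. It is not part of subs2vec. It assumes the .vec file follows the fastText text format (a header line with the word count and dimensionality, then one word and its values per line), and the names read_vec and cosine as well as the toy values are hypothetical.

```python
import io
import math


def read_vec(handle):
    """Read fastText-style .vec text: a 'count dim' header, then a word and its values per line."""
    _n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimensionality
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data standing in for a real .vec file (the values are made up).
toy = io.StringIO("3 2\nchat 1.0 0.0\nchien 0.8 0.6\nvoiture 0.0 1.0\n")
vectors = read_vec(toy)
print(cosine(vectors['chat'], vectors['chien']))
```

For a real file, pass an open file handle (for example open('french_word_vectors.vec', encoding='utf-8')) instead of the StringIO object.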

Extending lexical norms

To extend lexical norms (either norms you have collected yourself, or norms provided in this repository) use:
python3 -m subs2vec.norms fr french_word_vectors.vec --extend_norms=french_norms_file.txt

The norms file should be a tab-separated text file. The first line should contain the column names, and the column containing the words should be called word. Unobserved cells should be left empty. If you are unsure how to generate this file, you can create your list in Excel and then use Save as... tab-delimited text.
For an overview of norms that come included in the repo (and their authors), see this list. For the norms datasets themselves, look inside this directory.
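
If you would rather generate the norms file from a script than from Excel, the format described above can be sketched as follows. The column names valence and arousal, the values, and the helper write_norms are hypothetical; only the word column name, the tab separator, and the empty cells for unobserved values come from the description above.

```python
import csv
import io


def write_norms(handle, rows, columns):
    """Write rows as tab-separated text with a header line; missing values become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter='\t',
                            lineterminator='\n', restval='')
    writer.writeheader()
    writer.writerows(rows)


# Made-up example norms; 'chien' has no observed valence, so that cell stays empty.
rows = [
    {'word': 'chat', 'valence': '6.5', 'arousal': '4.1'},
    {'word': 'chien', 'arousal': '4.8'},
]

buffer = io.StringIO()
write_norms(buffer, rows, ['word', 'valence', 'arousal'])
print(buffer.getvalue())
```

To write to disk, open the target file with open('french_norms_file.txt', 'w', newline='', encoding='utf-8') and pass that handle instead of the buffer.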

Extracting word frequencies

The subtitle corpus used to train subs2vec was also used to compile the word frequencies in SUBTLEX. That same corpus can of course be used to compile bigram and trigram frequencies as well.
To extract word, bigram, or trigram frequencies from a text file yourself, fr.txt for instance, use:
python3 -m subs2vec.frequencies fr.txt
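
To illustrate what extracting n-gram frequencies involves, here is a minimal sketch. It is not the subs2vec.frequencies implementation: it assumes whitespace tokenization and n-grams that do not cross line boundaries, and the function name count_ngrams is hypothetical.

```python
from collections import Counter


def count_ngrams(lines, n):
    """Count n-grams of whitespace-separated tokens, one line at a time."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Slide a window of n tokens across the line.
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the lines of fr.txt.
corpus = ["le chat dort", "le chat mange"]
for n in (1, 2, 3):
    print(n, count_ngrams(corpus, n).most_common(2))
```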

In general, however, we recommend downloading the precompiled frequencies files from [language archive] and looking frequencies up in those.
When looking up frequencies for specific words, bigrams, or trigrams, you may find that you cannot open the frequencies files (they can be very large). To retrieve items of interest use:
python3 -m subs2vec.lookup frequencies_file.tsv list_of_items.txt
Your list of items should be a simple text file, with each item you want to look up on its own line.
This lookup script works for looking up frequencies, but it finds lines in any plain text file, so it works for looking up word vectors in .vec files as well.
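
The idea behind such a lookup can be sketched as a single streaming pass over the large file, so it never has to be opened in an editor or loaded whole. This is an illustration only, not the subs2vec.lookup implementation: it assumes the item is the first field on each line, and the function name lookup and its sep parameter are hypothetical.

```python
def lookup(lines, items, sep='\t'):
    """Yield every line whose first sep-delimited field is one of the requested items."""
    wanted = set(items)
    for line in lines:
        line = line.rstrip('\n')
        if line.split(sep, 1)[0] in wanted:
            yield line


# Toy frequency lines (made-up counts); bigrams contain a space, so the tab is the separator.
freq_lines = ["le\t120\n", "le chat\t7\n", "chat\t15\n"]
for hit in lookup(freq_lines, ['le chat', 'chat']):
    print(hit)
```

For a space-separated .vec file, sep=' ' would be the matching choice under the same assumption.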

Removing duplicate lines

subs2vec comes with a module that removes duplicate lines from text files. We used it to remove duplicate lines from training corpora, but it works for any text file.
To remove duplicates from fr.txt for example, use:
python3 -m subs2vec.deduplicate fr.txt
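
Conceptually, deduplicating lines amounts to something like the sketch below. It assumes exact string matching and keeps the first occurrence of each line in its original position; the actual module may behave differently, and the function name deduplicate_lines is hypothetical.

```python
def deduplicate_lines(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Toy input standing in for the lines of fr.txt.
corpus = ["bonjour", "merci", "bonjour", "au revoir", "merci"]
print(list(deduplicate_lines(corpus)))
```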

Training models

If you want to reproduce models as used in Van Paridon & Thompson (2019), you can use the train_model module.
For instance, the steps to create a subtitle corpus and train a model on it are:

  1. Download a corpus:
    python3 -m subs2vec.download fr subs
  2. Clean the corpus:
    python3 -m subs2vec.clean_subs fr --strip --join
  3. Deduplicate the lines in the corpus:
    python3 -m subs2vec.deduplicate fr.txt
  4. Train a fastText model on the subtitle corpus:
    python3 -m subs2vec.train_model fr subs dedup.fr.txt
    This last step requires the binaries for fastText and word2phrase (part of word2vec) to be downloaded, built, and discoverable on your system (i.e., on your PATH).
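
The four steps above can also be driven from a small script. The sketch below only assembles the commands exactly as listed. The helper names pipeline_commands and run_pipeline are hypothetical, and the assumption that the corpus file is named after the language code (fr.txt for fr) is generalized from the French example.

```python
import subprocess


def pipeline_commands(lang):
    """Return the download, clean, deduplicate, and train commands for one language."""
    corpus = f'{lang}.txt'  # assumed naming, generalized from the fr.txt example
    return [
        ['python3', '-m', 'subs2vec.download', lang, 'subs'],
        ['python3', '-m', 'subs2vec.clean_subs', lang, '--strip', '--join'],
        ['python3', '-m', 'subs2vec.deduplicate', corpus],
        ['python3', '-m', 'subs2vec.train_model', lang, 'subs', f'dedup.{corpus}'],
    ]


def run_pipeline(lang):
    """Run the steps in order, stopping at the first command that fails."""
    for command in pipeline_commands(lang):
        subprocess.run(command, check=True)


# Print the commands without executing them; call run_pipeline('fr') to run them.
for command in pipeline_commands('fr'):
    print(' '.join(command))
```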

For more detailed training options:
python3 -m subs2vec.train_model --help

API

For more detailed documentation of the package modules and API, see subs2vec.readthedocs.io

Downloading datasets

This table contains links to the top 1 million word vectors in each language, as well as all vectors, model binaries, and the word, bigram, and trigram frequencies in the subtitle and Wikipedia corpora. If you use these pretrained vectors/models, please cite the preprint (Van Paridon & Thompson, 2019).

language lang corpus vectors corpus word count ngram frequencies
Afrikaans af OpenSubtitles top 1M vectors
all vectors
model binary
323K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
17M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Arabic ar OpenSubtitles top 1M vectors
all vectors
model binary
188M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
119M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Bulgarian bg OpenSubtitles top 1M vectors
all vectors
model binary
246M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
53M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Bengali bn OpenSubtitles top 1M vectors
all vectors
model binary
2227K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
18M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Breton br OpenSubtitles top 1M vectors
all vectors
model binary
110K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
7644K word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Bosnian bs OpenSubtitles top 1M vectors
all vectors
model binary
91M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
13M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Catalan ca OpenSubtitles top 1M vectors
all vectors
model binary
3098K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
175M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Czech cs OpenSubtitles top 1M vectors
all vectors
model binary
249M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
100M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Danish da OpenSubtitles top 1M vectors
all vectors
model binary
87M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
56M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
German de OpenSubtitles top 1M vectors
all vectors
model binary
139M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
976M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Greek el OpenSubtitles top 1M vectors
all vectors
model binary
271M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
58M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
English en OpenSubtitles top 1M vectors
all vectors
model binary
750M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
2477M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Esperanto eo OpenSubtitles top 1M vectors
all vectors
model binary
381K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
37M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Spanish es OpenSubtitles top 1M vectors
all vectors
model binary
514M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
585M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Estonian et OpenSubtitles top 1M vectors
all vectors
model binary
60M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
29M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Basque eu OpenSubtitles top 1M vectors
all vectors
model binary
3400K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
20M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Farsi fa OpenSubtitles top 1M vectors
all vectors
model binary
45M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
86M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Finnish fi OpenSubtitles top 1M vectors
all vectors
model binary
116M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
73M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
French fr OpenSubtitles top 1M vectors
all vectors
model binary
335M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
724M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Galician gl OpenSubtitles top 1M vectors
all vectors
model binary
1666K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
40M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Hebrew he OpenSubtitles top 1M vectors
all vectors
model binary
169M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
132M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Hindi hi OpenSubtitles top 1M vectors
all vectors
model binary
695K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
31M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Croatian hr OpenSubtitles top 1M vectors
all vectors
model binary
241M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
42M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Hungarian hu OpenSubtitles top 1M vectors
all vectors
model binary
227M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
120M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Armenian hy OpenSubtitles top 1M vectors
all vectors
model binary
23K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
38M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Indonesian id OpenSubtitles top 1M vectors
all vectors
model binary
65M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
69M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Icelandic is OpenSubtitles top 1M vectors
all vectors
model binary
7474K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
7196K word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Italian it OpenSubtitles top 1M vectors
all vectors
model binary
277M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
476M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Georgian ka OpenSubtitles top 1M vectors
all vectors
model binary
1108K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
15M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Kazakh kk OpenSubtitles top 1M vectors
all vectors
model binary
13K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
18M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Korean ko OpenSubtitles top 1M vectors
all vectors
model binary
6834K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
62M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Lithuanian lt OpenSubtitles top 1M vectors
all vectors
model binary
6252K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
23M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Latvian lv OpenSubtitles top 1M vectors
all vectors
model binary
2167K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
13M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Macedonian mk OpenSubtitles top 1M vectors
all vectors
model binary
20M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
26M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Malayalam ml OpenSubtitles top 1M vectors
all vectors
model binary
1520K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
10M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Malay ms OpenSubtitles top 1M vectors
all vectors
model binary
12M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
28M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Dutch nl OpenSubtitles top 1M vectors
all vectors
model binary
264M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
248M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Norwegian no OpenSubtitles top 1M vectors
all vectors
model binary
45M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
90M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Polish pl OpenSubtitles top 1M vectors
all vectors
model binary
250M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
232M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Portuguese pt OpenSubtitles top 1M vectors
all vectors
model binary
257M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
238M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Romanian ro OpenSubtitles top 1M vectors
all vectors
model binary
434M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
65M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Russian ru OpenSubtitles top 1M vectors
all vectors
model binary
152M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
390M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Sinhala si OpenSubtitles top 1M vectors
all vectors
model binary
3493K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
5980K word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Slovak sk OpenSubtitles top 1M vectors
all vectors
model binary
47M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
28M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Slovene sl OpenSubtitles top 1M vectors
all vectors
model binary
106M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
31M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Albanian sq OpenSubtitles top 1M vectors
all vectors
model binary
11M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
17M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Serbian sr OpenSubtitles top 1M vectors
all vectors
model binary
343M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
69M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Swedish sv OpenSubtitles top 1M vectors
all vectors
model binary
101M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
143M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Tamil ta OpenSubtitles top 1M vectors
all vectors
model binary
123K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
17M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Telugu te OpenSubtitles top 1M vectors
all vectors
model binary
103K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
15M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Tagalog tl OpenSubtitles top 1M vectors
all vectors
model binary
87K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
6515K word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Turkish tr OpenSubtitles top 1M vectors
all vectors
model binary
239M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
54M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Ukrainian uk OpenSubtitles top 1M vectors
all vectors
model binary
4945K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
162M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Urdu ur OpenSubtitles top 1M vectors
all vectors
model binary
195K word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
15M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary
Vietnamese vi OpenSubtitles top 1M vectors
all vectors
model binary
27M word counts
bigram counts
trigram counts
Wikipedia top 1M vectors
all vectors
model binary
115M word counts
bigram counts
trigram counts
Wikipedia + OpenSubtitles top 1M vectors
all vectors
model binary

Files for subs2vec, version 0.9.3:
subs2vec-0.9.3-py3-none-any.whl (3.4 MB), wheel, Python 3
subs2vec-0.9.3.tar.gz (3.3 MB), source