Tools for downloading, processing, and training word2vec models on Google Ngrams

Project description

hist_w2v: Tools for Training Word2Vec Models on Google Ngrams

Version 0.1.0

I wanted to study the evolution of group stereotypes over time using Google Ngrams corpora, but wasn't satisfied with the existing tools I found online. So, I created a Python package to streamline the process of (1) downloading and pre-processing raw ngrams and (2) training and evaluating word2vec models on the ngrams. After installing, the best way to learn how to use these tools is to work through the provided Jupyter Notebook workflows.

Package Contents

The library consists of the following modules and notebooks:

src/ngram_tools

  1. download_ngrams.py: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
  2. convert_to_jsonl.py: converts the raw-text ngrams from Google into a more flexible JSONL format.
  3. lowercase_ngrams.py: makes the ngrams all lowercase.
  4. lemmatize_ngrams.py: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
  5. filter_ngrams.py: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
  6. sort_ngrams.py: combines multiple ngrams files into a single sorted file.
  7. consolidate_ngrams.py: consolidates duplicate ngrams resulting from the previous steps.
  8. index_and_create_vocabulary.py: numerically indexes a list of unigrams and creates a "vocabulary file" used to screen multigrams.
  9. create_yearly_files.py: splits the master corpus into yearly sub-corpora.
  10. helpers/file_handler.py: helper script to simplify reading and writing files in the other modules.
  11. helpers/print_jsonl_lines.py: helper script to view a snippet of ngrams in a JSONL file.
  12. helpers/verify_sort.py: helper script to confirm whether an ngram file is properly sorted.
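
To make the conversion, lowercasing, and consolidation steps more concrete, here is a minimal sketch of what they do to the data. It is not the package's implementation: the function names (parse_raw_line, lowercase_record, consolidate) and the record layout (an "ngram" string plus a "freq" mapping of year to match count) are hypothetical, and the raw line layout assumed here (the ngram followed by tab-separated year,match_count,volume_count triples) is an assumption about Google's export format.

```python
import json
from collections import defaultdict


def parse_raw_line(line):
    """Parse one raw line (assumed layout: ngram<TAB>year,matches,volumes<TAB>...)."""
    ngram, *triples = line.rstrip("\n").split("\t")
    freq = {}
    for triple in triples:
        year, match_count, _volume_count = triple.split(",")
        freq[year] = int(match_count)
    return {"ngram": ngram, "freq": freq}


def lowercase_record(record):
    """Return a copy of the record with a lowercased ngram."""
    return {"ngram": record["ngram"].lower(), "freq": dict(record["freq"])}


def consolidate(records):
    """Merge records that share an ngram by summing their yearly counts."""
    merged = defaultdict(lambda: defaultdict(int))
    for record in records:
        for year, count in record["freq"].items():
            merged[record["ngram"]][year] += count
    # Sort by ngram, then by year, so the output order is deterministic.
    return [
        {"ngram": ngram, "freq": dict(sorted(years.items()))}
        for ngram, years in sorted(merged.items())
    ]


# Hypothetical raw lines, for illustration only.
raw_lines = [
    "The cat\t1990,3,2\t1991,5,4\n",
    "the cat\t1990,1,1\n",
]

# Lowercasing makes the two lines identical ngrams, so they are merged.
records = consolidate(lowercase_record(parse_raw_line(line)) for line in raw_lines)

# One JSON object per line is the JSONL convention.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Lowercasing (like lemmatizing and filtering) can turn distinct ngrams into identical ones, which is why a consolidation pass comes after those steps.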

src/training_tools

  1. train_ngrams.py: trains word2vec models on pre-processed multigram corpora.
  2. evaluate_models.py: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
  3. plotting.py: plots various types of model results.
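
For readers unfamiliar with intrinsic similarity tests, the idea can be sketched in a few lines: compute the cosine similarity the model assigns to each word pair, then check how well the ranking of those similarities agrees with the ranking of human ratings (Spearman correlation). The sketch below uses only the standard library and is not how evaluate_models.py is implemented; the helper names, the toy 2-dimensional vectors, and the ratings are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical word vectors standing in for a trained model.
vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.3, 0.9],
}

# Hypothetical human similarity ratings (word1, word2, rating).
rated_pairs = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 2.0),
    ("dog", "truck", 3.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A correlation near 1 means the model orders the pairs the way the human raters do. Real benchmarks contain tied ratings, so a production evaluation would use a tie-aware Spearman implementation rather than this shortcut formula.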

notebooks

  1. workflow_unigrams.ipynb: Jupyter Notebook showing how to download and preprocess unigrams.
  2. workflow_multigrams.ipynb: Jupyter Notebook showing how to download and preprocess multigrams.
  3. workflow_training.ipynb: Jupyter Notebook showing how to train, evaluate, and plot results from word2vec models.

Finally, the training_results folder stores a file containing evaluation metrics for a set of models.

System Requirements

Unless you have an extremely powerful personal computer, the code is probably only practical to run on a high-performance computing (HPC) cluster; efficiently downloading, processing, and training models on ngrams in parallel requires many processors and a lot of memory. On my university's HPC cluster, I typically request 14 cores and 128 GB of RAM. A priority for future development is to streamline the code for individual systems.

Citing hist_w2v

If you use hist_w2v in your research or other publications, I kindly ask you to cite it. Use the GitHub citation feature to generate the citation text.

License

This project is released under the MIT License.
