hist_w2v: Tools for downloading, processing, and training word2vec models on Google Ngrams

I wanted to study the evolution of group stereotypes over time using Google Ngrams corpora, but wasn't satisfied with the existing tools I found online. So, I created a Python package to streamline the process of (1) downloading and pre-processing raw ngrams and (2) training and evaluating word2vec models on the ngrams. After installing, the best way to learn how to use these tools is to work through the provided Jupyter Notebook workflows.

Package Contents

The library consists of the following modules and notebooks:

src/ngram_tools

  1. download_ngrams.py: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
  2. convert_to_jsonl.py: converts the raw-text ngrams from Google into a more flexible JSONL format.
  3. lowercase_ngrams.py: makes the ngrams all lowercase.
  4. lemmatize_ngrams.py: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
  5. filter_ngrams.py: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
  6. sort_ngrams.py: combines multiple ngrams files into a single sorted file.
  7. consolidate_ngrams.py: consolidates duplicate ngrams resulting from the previous steps.
  8. index_and_create_vocabulary.py: numerically indexes a list of unigrams and creates a "vocabulary file" used to screen multigrams.
  9. create_yearly_files.py: splits the master corpus into yearly sub-corpora.
  10. helpers/file_handler.py: helper script to simplify reading and writing files in the other modules.
  11. helpers/print_jsonl_lines.py: helper script to view a snippet of ngrams in a JSONL file.
  12. helpers/verify_sort.py: helper script to confirm whether an ngram file is properly sorted.
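
To make the preprocessing steps above more concrete, here is a minimal, standard-library sketch of the convert, lowercase, filter, sort, and consolidate stages. It does not use the package's own functions. It assumes each raw line has the layout "ngram<TAB>year,match_count,volume_count<TAB>...", and the record fields ("ngram", "freq"), the stop-word list, and the two input lines are all hypothetical, invented purely for illustration. The package's modules may represent and filter ngrams differently.

```python
import json
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a"}


def parse_raw_line(line):
    """Parse one raw line, assuming the layout
    'ngram<TAB>year,match_count,volume_count<TAB>...'."""
    ngram, *year_fields = line.rstrip("\n").split("\t")
    freq = {}
    for field in year_fields:
        year, match_count, _volume_count = field.split(",")
        freq[year] = int(match_count)
    return {"ngram": ngram.split(" "), "freq": freq}


def lowercase(record):
    """Lowercase every token in the ngram."""
    return {**record, "ngram": [tok.lower() for tok in record["ngram"]]}


def drop_stop_words(record):
    """Remove tokens that appear in the (hypothetical) stop-word list."""
    kept = [tok for tok in record["ngram"] if tok not in STOP_WORDS]
    return {**record, "ngram": kept}


def consolidate(records):
    """Merge records whose token lists became identical, summing counts per year.
    The output is sorted by token list."""
    merged = defaultdict(lambda: defaultdict(int))
    for rec in records:
        key = tuple(rec["ngram"])
        for year, count in rec["freq"].items():
            merged[key][year] += count
    return [
        {"ngram": list(key), "freq": dict(sorted(freq.items()))}
        for key, freq in sorted(merged.items())
    ]


# Two made-up raw lines that collapse to the same ngram after preprocessing.
raw_lines = [
    "The cat sat\t1990,3,2\t1991,5,4",
    "the Cat sat\t1990,1,1",
]
records = [drop_stop_words(lowercase(parse_raw_line(line))) for line in raw_lines]
result = consolidate(records)

# One JSON object per line, i.e., JSONL.
jsonl = "\n".join(json.dumps(rec) for rec in result)
print(jsonl)
```

The sketch shows why a consolidation step is needed: lowercasing and filtering make previously distinct ngrams identical, so their yearly counts have to be summed into a single record.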

src/training_tools

  1. train_ngrams.py: trains word2vec models on pre-processed multigram corpora.
  2. evaluate_models.py: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
  3. plotting.py: plots various types of model results.
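
As a rough illustration of what the intrinsic benchmarks mentioned above measure, the following sketch computes cosine similarity between word vectors and solves one analogy by vector arithmetic. It is not the code in evaluate_models.py, whose implementation may differ. The two-dimensional vectors are toy values invented for this example, and the function names are hypothetical.

```python
import math

# Toy 2-D vectors invented for illustration; trained models use many more dimensions.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.3],
    "apple": [0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(a, b, c, vecs):
    """Answer 'a is to b as c is to ?' by finding the word closest to
    vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = (w for w in vecs if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vecs[w], target))


# Similarity test: compare a model's score for a word pair against a human rating.
similarity = cosine(vectors["king"], vectors["queen"])

# Analogy test: "man is to king as woman is to ?"
answer = solve_analogy("man", "king", "woman", vectors)
print(answer)
```

In a full similarity benchmark, scores like `similarity` would be collected for many word pairs and correlated with human judgments; in an analogy benchmark, the share of correctly answered questions would be reported.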

notebooks

  1. workflow_unigrams.ipynb: Jupyter Notebook showing how to download and preprocess unigrams.
  2. workflow_multigrams.ipynb: Jupyter Notebook showing how to download and preprocess multigrams.
  3. workflow_training.ipynb: Jupyter Notebook showing how to train, evaluate, and plot results from word2vec models.

Finally, the training_results folder stores a file containing evaluation metrics for a set of models.

System Requirements

Unless you have a very powerful personal computer, the code is likely only suitable to run on a high-performance computing (HPC) cluster; efficiently downloading, processing, and training models on ngrams in parallel takes many processors and a lot of memory. On my university's HPC, I typically request 14 cores and 128 GB of RAM. A priority for development is refactoring the code to run on individual systems.

Citing hist_w2v

If you use hist_w2v in your research or other publications, I kindly ask you to cite it. Use the citation information on the project's GitHub repository to generate the citation text.

License

This project is released under the MIT License.

Download files

Source Distribution

hist_w2v-0.1.1.tar.gz (36.6 kB view details)

Uploaded Source

Built Distribution

hist_w2v-0.1.1-py3-none-any.whl (53.4 kB view details)

Uploaded Python 3

File details

Details for the file hist_w2v-0.1.1.tar.gz.

File metadata

  • Download URL: hist_w2v-0.1.1.tar.gz
  • Size: 36.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for hist_w2v-0.1.1.tar.gz
Algorithm Hash digest
SHA256 4d979c981af60af630d7694af109f8d4af97c6fbd26459f852d252bbf6d63d5d
MD5 1f908675e323e763424f6faedd244bfa
BLAKE2b-256 b3a6fb0f14f8652c5ff00a8c5c72999a06666a95c03e09cb5bbe8ff3c03dc543

Provenance

The following attestation bundles were made for hist_w2v-0.1.1.tar.gz:

Publisher: publish-to-test-pypi.yml on eric-d-knowles/hist_w2v

File details

Details for the file hist_w2v-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: hist_w2v-0.1.1-py3-none-any.whl
  • Size: 53.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for hist_w2v-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 1e54771821162c175782b36ae56952ae82535a5985f151881bb013df218e21ec
MD5 21b5e1fe2866f77154e3742071bb0fa6
BLAKE2b-256 8f6b5cea34cef6c7877e42676e9fc2ff49e794b514148ed4fc212d5c36d5a5a6

Provenance

The following attestation bundles were made for hist_w2v-0.1.1-py3-none-any.whl:

Publisher: publish-to-test-pypi.yml on eric-d-knowles/hist_w2v
