hist_w2v: Tools for downloading, processing, and training word2vec models on Google Ngrams

This Python package is meant to help researchers use Google Ngrams to examine how words' meanings have changed over time. The tools assist with (1) downloading and pre-processing raw ngrams and (2) training word2vec models on a specified ngram corpus. After installing, the best way to learn how to use these tools is to work through the provided Jupyter Notebook workflows.

Package Contents

The library consists of the following modules and notebooks:

src/ngram_tools

  1. download_ngrams.py: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
  2. convert_to_jsonl.py: converts the raw-text ngrams from Google into a more flexible JSONL format.
  3. lowercase_ngrams.py: makes the ngrams all lowercase.
  4. lemmatize_ngrams.py: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
  5. filter_ngrams.py: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
  6. sort_ngrams.py: combines multiple ngrams files into a single sorted file.
  7. consolidate_ngrams.py: consolidates duplicate ngrams resulting from the previous steps.
  8. index_and_create_vocabulary.py: numerically indexes a list of unigrams and creates a "vocabulary file" used to screen multigrams.
  9. create_yearly_files.py: splits the master corpus into yearly sub-corpora.
  10. helpers/file_handler.py: helper script to simplify reading and writing files in the other modules.
  11. helpers/print_jsonl_lines.py: helper script to view a snippet of ngrams in a JSONL file.
  12. helpers/verify_sort.py: helper script to confirm whether an ngram file is properly sorted.
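
To make the preprocessing steps above concrete, here is a minimal, standard-library sketch of the convert, lowercase, filter, consolidate, and yearly-split stages. It is not the package's own code. It assumes the raw input follows the tab-separated Google Ngrams layout `ngram<TAB>year,match_count,volume_count<TAB>...`, and the JSONL record fields (`ngram`, `counts`), the function names, and the tiny stop-word list are all hypothetical choices made for illustration. Lemmatization is left out because it needs an external NLP library, and filtering here simply drops stop-word tokens, which may differ from what filter_ngrams.py does.

```python
import json
from collections import Counter

STOP_WORDS = {"the", "of", "a"}  # tiny illustrative stop list (assumption)


def raw_to_record(line):
    """Parse 'ngram<TAB>year,match_count,volume_count<TAB>...' into a dict."""
    ngram, *year_fields = line.rstrip("\n").split("\t")
    counts = {}
    for field in year_fields:
        year, match_count, _volume_count = field.split(",")
        counts[year] = int(match_count)
    return {"ngram": ngram, "counts": counts}  # hypothetical record schema


def lowercase(record):
    """Return a copy of the record with the ngram text lowercased."""
    return {**record, "ngram": record["ngram"].lower()}


def drop_stop_words(record):
    """Remove stop-word tokens; return None if no tokens remain."""
    tokens = [t for t in record["ngram"].split() if t not in STOP_WORDS]
    if not tokens:
        return None
    return {**record, "ngram": " ".join(tokens)}


def consolidate(records):
    """Merge records with identical ngram text, summing their yearly counts."""
    merged = {}
    for rec in records:
        merged.setdefault(rec["ngram"], Counter()).update(rec["counts"])
    return [{"ngram": ng, "counts": dict(c)} for ng, c in sorted(merged.items())]


def split_by_year(records):
    """Group (ngram, count) pairs under the year in which they occurred."""
    by_year = {}
    for rec in records:
        for year, count in rec["counts"].items():
            by_year.setdefault(year, []).append(
                {"ngram": rec["ngram"], "count": count}
            )
    return by_year


# Made-up raw lines standing in for a downloaded ngram file.
raw_lines = [
    "The Cat sat\t1900,3,2\t1901,1,1\n",
    "the cat sat\t1900,2,1\n",
    "of the\t1900,9,4\n",
]

records = []
for line in raw_lines:
    rec = drop_stop_words(lowercase(raw_to_record(line)))
    if rec is not None:
        records.append(rec)

corpus = consolidate(records)      # duplicates created by earlier steps are merged
yearly = split_by_year(corpus)     # one sub-corpus per year
jsonl = "\n".join(json.dumps(r) for r in corpus)  # one JSON object per line
```

Lowercasing and filtering turn the first two lines into the same ngram, which is why a consolidation step is needed before the corpus is split by year.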

src/training_tools

  1. train_ngrams.py: trains word2vec models on pre-processed multigram corpora.
  2. evaluate_models.py: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
  3. plotting.py: plots various types of model results.
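
As a rough illustration of what an intrinsic analogy test measures, the sketch below scores "a is to b as c is to d" questions with the common vector-offset method and cosine similarity. It is not taken from evaluate_models.py: the function names are hypothetical, and the two-dimensional vectors are made up so that the example runs without a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Vector-offset method: the answer should lie near b - a + c.
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidates.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly."""
    hits = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return hits / len(questions)


# Made-up toy embeddings (assumption): real models use hundreds of dimensions.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]
accuracy = analogy_accuracy(toy_vectors, questions)
```

A similarity test works along the same lines: the model's `cosine` score for each word pair is compared against human similarity ratings for that pair.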

notebooks

  1. workflow_unigrams.ipynb: Jupyter Notebook showing how to download and preprocess unigrams.
  2. workflow_multigrams.ipynb: Jupyter Notebook showing how to download and preprocess multigrams.
  3. workflow_training.ipynb: Jupyter Notebook showing how to train, evaluate, and plot results from word2vec models.

Finally, the training_results folder stores a file containing evaluation metrics for a set of models.

System Requirements

Unless you have a very powerful personal computer, the code is likely only suitable for a high-performance computing (HPC) cluster: downloading, processing, and training models on ngrams efficiently in parallel takes many processors and a lot of memory. On my university's HPC, I typically request 14 cores and 128 GB of RAM. A development priority is refactoring the code to run on individual systems.

Citing hist_w2v

If you use hist_w2v in your research or other publications, I kindly ask you to cite it. Use the GitHub citation to create citation text.

License

This project is released under the MIT License.

Download files

Source Distribution

hist_w2v-0.1.2.tar.gz (38.6 kB)

Uploaded Source

Built Distribution

hist_w2v-0.1.2-py3-none-any.whl (55.9 kB)

Uploaded Python 3

File details

Details for the file hist_w2v-0.1.2.tar.gz.

File metadata

  • Download URL: hist_w2v-0.1.2.tar.gz
  • Size: 38.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for hist_w2v-0.1.2.tar.gz
  • SHA256: edbaad60d975f5962caf779978926496fdc3770d3b3c9b2e42091f9f0b688485
  • MD5: d389fb02031e4f9ae55bbb916513be3c
  • BLAKE2b-256: 6684075f0dc9537517ca05defe748a6287d4cd529182e334449d1b5d23564535

Provenance

The following attestation bundles were made for hist_w2v-0.1.2.tar.gz:

Publisher: publish-to-test-pypi.yml on eric-d-knowles/hist_w2v

File details

Details for the file hist_w2v-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: hist_w2v-0.1.2-py3-none-any.whl
  • Size: 55.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for hist_w2v-0.1.2-py3-none-any.whl
  • SHA256: f9d810119fb12eebf6e1bb6038946bd3cc5f00521c8afc3cd4404be80c80ca67
  • MD5: a6596236ff0c02f2bfd7ee3fd0bf0e52
  • BLAKE2b-256: 18f0cb8775557b1db1b2942ec82f26a0875dee03ed093f0ea87aaa753e1a0ac0

Provenance

The following attestation bundles were made for hist_w2v-0.1.2-py3-none-any.whl:

Publisher: publish-to-test-pypi.yml on eric-d-knowles/hist_w2v
