Tools for downloading, processing, and training word2vec models on Google Ngrams

hist_w2v

hist_w2v is a Python package that helps researchers use Google Ngrams to examine semantic change over years, decades, and centuries. It automates downloading and pre-processing raw ngrams and training word2vec models on the resulting corpus.

Installation

There are two ways to install hist_w2v:

  1. Clone the GitHub repository (https://github.com/eric-d-knowles/hist_w2v) into your Python environment.
  2. Install from PyPI.org by running pip install hist_w2v in your Python environment.
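
Whichever route you choose, a quick check from Python can confirm that the package is visible to your interpreter. The sketch below relies only on the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of hist_w2v.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "hist_w2v"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"hist_w2v version: {found}")
    else:
        print("hist_w2v is not installed in this environment")
```

If the function returns None after a pip install, the most likely cause is that pip and the interpreter belong to different environments.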

After installing hist_w2v, the best way to learn how to use it is to work through the provided Jupyter Notebook workflows. Together, these notebooks provide a fully documented, end-to-end illustration of the package's functionality.

Package Contents

The library consists of the following modules and notebooks:

src/ngram_tools

  1. download_ngrams.py: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
  2. convert_to_jsonl.py: converts the raw-text ngrams from Google into a more flexible JSONL format.
  3. lowercase_ngrams.py: makes the ngrams all lowercase.
  4. lemmatize_ngrams.py: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
  5. filter_ngrams.py: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
  6. sort_ngrams.py: combines multiple ngrams files into a single sorted file.
  7. consolidate_ngrams.py: consolidates duplicate ngrams resulting from the previous steps.
  8. index_and_create_vocabulary.py: numerically indexes a list of unigrams and creates a "vocabulary file" used to screen multigrams.
  9. create_yearly_files.py: splits the master corpus into yearly sub-corpora.
  10. helpers/file_handler.py: helper script to simplify reading and writing files in the other modules.
  11. helpers/print_jsonl_lines.py: helper script to view a snippet of ngrams in a JSONL file.
  12. helpers/verify_sort.py: helper script to confirm whether an ngram file is properly sorted.
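
To make the order of these steps concrete, here is a minimal standard-library sketch of the convert, lowercase, filter, sort, and consolidate stages. It does not use the hist_w2v modules or reproduce their interfaces. Every function name, the JSON field names, and the stop-word list are hypothetical, and the input layout (ngram, then tab-separated year,match_count,volume_count triples) is an assumption about the raw Google files. Lemmatization is omitted because it requires an NLP library.

```python
import json
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a"}


def raw_line_to_record(line):
    """Parse 'ngram<TAB>year,matches,volumes<TAB>...' into a dict (assumed layout)."""
    ngram, *year_fields = line.rstrip("\n").split("\t")
    freq = {}
    for field in year_fields:
        year, matches, _volumes = field.split(",")
        freq[year] = int(matches)
    return {"ngram": ngram, "freq": freq}


def lowercase(record):
    """Lowercase the ngram text, leaving the yearly counts untouched."""
    return {"ngram": record["ngram"].lower(), "freq": record["freq"]}


def drop_unwanted_tokens(record, min_tokens=2):
    """Remove stop words and numerals; return None if too few tokens remain."""
    tokens = [
        tok for tok in record["ngram"].split()
        if tok not in STOP_WORDS and not tok.isdigit()
    ]
    if len(tokens) < min_tokens:
        return None
    return {"ngram": " ".join(tokens), "freq": record["freq"]}


def consolidate(records):
    """Merge records with identical ngram text by summing their yearly counts."""
    merged = {}
    for rec in records:
        merged.setdefault(rec["ngram"], Counter()).update(rec["freq"])
    # Emit ngrams in sorted order, with years sorted inside each record.
    return [
        {"ngram": ngram, "freq": dict(sorted(merged[ngram].items()))}
        for ngram in sorted(merged)
    ]


def process(lines):
    """Run the stages in order and return one JSON string per surviving ngram."""
    cleaned = []
    for line in lines:
        rec = drop_unwanted_tokens(lowercase(raw_line_to_record(line)))
        if rec is not None:
            cleaned.append(rec)
    return [json.dumps(rec) for rec in consolidate(cleaned)]


if __name__ == "__main__":
    sample = [
        "The Quick fox\t1900,3,2\t1901,5,4",
        "the quick Fox\t1900,1,1",
        "of the 1999\t1950,7,3",
    ]
    for jsonl_line in process(sample):
        print(jsonl_line)
```

The example shows why consolidation comes late in the sequence: lowercasing and filtering turn distinct raw ngrams into identical strings, so their counts can only be merged after those steps have run.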

src/training_tools

  1. train_ngrams.py: trains word2vec models on pre-processed multigram corpora.
  2. evaluate_models.py: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
  3. plotting.py: plots various types of model results.
  4. w2v_model.py: a Python class (W2VModel) to aid in the evaluation, normalization, and alignment of yearly word2vec models.
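
For readers unfamiliar with intrinsic benchmarks, the following standard-library sketch shows the arithmetic behind a similarity score and an analogy test. It is not the code in evaluate_models.py or W2VModel. The function names are hypothetical, and the two-dimensional vectors are invented toy values chosen so the example can be checked by hand; they do not come from a trained model.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity: the dot product of the two unit-length vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the three query words, as analogy benchmarks usually do.
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Toy vectors invented for illustration; real models use many more dimensions.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}

if __name__ == "__main__":
    print(cosine(toy_vectors["king"], toy_vectors["queen"]))
    print(solve_analogy(toy_vectors, "man", "woman", "king"))
```

A similarity benchmark compares scores like the first one against human ratings, while an analogy benchmark counts how often the second kind of query returns the expected word.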

notebooks

  1. workflow_unigrams.ipynb: Jupyter Notebook showing how to download and preprocess unigrams.
  2. workflow_multigrams.ipynb: Jupyter Notebook showing how to download and preprocess multigrams.
  3. workflow_training.ipynb: Jupyter Notebook showing how to train, evaluate, and plot results from word2vec models.

Finally, the training_results folder stores a file containing evaluation metrics for a set of models.

System Requirements

Efficiently downloading, processing, and training models on ngrams requires many processor cores and a large amount of memory. Unless you have a very powerful PC, you should run hist_w2v only on a high-performance computing (HPC) cluster or similar platform. On my university's HPC, I typically request 14 cores and 128 GB of RAM. A development priority is refactoring the code to run on individual systems.
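
When choosing how many worker processes to use on a shared cluster, it can help to check how many cores your job was actually granted rather than how many the node has. The sketch below is a generic standard-library approach, not something hist_w2v is known to do; available_cores is a hypothetical helper name.

```python
import os


def available_cores():
    """Return the number of CPU cores this process is allowed to use."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: reflects the CPU set assigned to this process, which
        # job schedulers often restrict to the cores you requested.
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the machine-wide count.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"Cores available to this process: {available_cores()}")
```

On a cluster node, os.cpu_count() alone may report every core on the machine, which can oversubscribe a job that was allocated only a fraction of them.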

Citing hist_w2v

If you use hist_w2v in your research or other publications, I kindly ask you to cite it. Use the citation provided on the GitHub repository to generate the citation text.

License

This project is released under the MIT License.

Download files

Source Distribution

hist_w2v-0.1.6.tar.gz (46.6 kB view details)

Uploaded Source

Built Distribution

hist_w2v-0.1.6-py3-none-any.whl (62.9 kB view details)

Uploaded Python 3

File details

Details for the file hist_w2v-0.1.6.tar.gz.

File metadata

  • Download URL: hist_w2v-0.1.6.tar.gz
  • Upload date:
  • Size: 46.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for hist_w2v-0.1.6.tar.gz
Algorithm Hash digest
SHA256 7357b3ce0c69b1b630331f81f8f6bb355c6fafb3a2babebfbc352061c2aeced4
MD5 8ffac0ebe87cf5390fcc5b3bec15702d
BLAKE2b-256 8ea8045f8f236186ee26da1fe28af5f8897c1e9515e78cc3be9e633c4026a819

Provenance

The following attestation bundles were made for hist_w2v-0.1.6.tar.gz:

Publisher: publish-to-test-pypi.yml on eric-d-knowles/hist_w2v

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file hist_w2v-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: hist_w2v-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 62.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for hist_w2v-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 ba2545b6ad2f14471bc8cf4901eddcfaaf31b5dde4e3a401bc9026a842abc58e
MD5 ccbe862698a25899a22eb49f75898f70
BLAKE2b-256 c642f2bafcd890141f06e588b6dd88a1d3a89e2dff1229e74462af451f2ea69e

Provenance

The following attestation bundles were made for hist_w2v-0.1.6-py3-none-any.whl:

Publisher: publish-to-test-pypi.yml on eric-d-knowles/hist_w2v
