Tools for downloading, processing, and training word2vec models on Google Ngrams
hist_w2v
Python package to assist researchers in using Google Ngrams to examine semantic change over years, decades, and centuries. hist_w2v automates downloading and pre-processing raw ngrams and training word2vec models on a corpus.
Installation
There are two ways to install hist_w2v:
- Clone the GitHub repository (https://github.com/eric-d-knowles/hist_w2v) to your Python environment.
- Install from PyPI.org by running pip install hist_w2v in your Python environment.
After installing hist_w2v, the best way to learn how to use it is to work through the provided Jupyter Notebook workflows. Together, these notebooks provide a fully documented, end-to-end illustration of the package's functionality.
Package Contents
The library consists of the following modules and notebooks:
src/ngram_tools
- download_ngrams.py: downloads the desired ngram types (e.g., 3-grams with part-of-speech [POS] tags, 5-grams without POS tags).
- convert_to_jsonl.py: converts the raw-text ngrams from Google into a more flexible JSONL format.
- lowercase_ngrams.py: makes the ngrams all lowercase.
- lemmatize_ngrams.py: lemmatizes the ngrams (i.e., reduces them to their base grammatical forms).
- filter_ngrams.py: screens out undesired tokens (e.g., stop words, numbers, words not in a vocabulary file) from the ngrams.
- sort_ngrams.py: combines multiple ngram files into a single sorted file.
- consolidate_ngrams.py: consolidates duplicate ngrams resulting from the previous steps.
- index_and_create_vocabulary.py: numerically indexes a list of unigrams and creates a "vocabulary file" to screen multigrams.
- create_yearly_files.py: splits the master corpus into yearly sub-corpora.
- helpers/file_handler.py: helper script to simplify reading and writing files in the other modules.
- helpers/print_jsonl_lines.py: helper script to view a snippet of ngrams in a JSONL file.
- helpers/verify_sort.py: helper script to confirm whether an ngram file is properly sorted.
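To make the conversion, lowercasing, and consolidation steps concrete, here is a minimal standalone sketch. It does not use hist_w2v's own code. It assumes the raw layout of one ngram per line followed by tab-separated year,match_count,volume_count triples, and the record schema (the "ngram" and "freq" keys) and all function names are hypothetical, chosen only for illustration.

```python
import json
from collections import defaultdict


def parse_raw_line(line):
    """Turn one raw line into {"ngram": str, "freq": {year: match_count}}.

    Assumed layout: ngram<TAB>year,match_count,volume_count<TAB>year,...
    """
    ngram, *year_fields = line.rstrip("\n").split("\t")
    freq = {}
    for field in year_fields:
        year, match_count, _volume_count = field.split(",")
        freq[int(year)] = int(match_count)
    return {"ngram": ngram, "freq": freq}


def lowercase(record):
    """Return a copy of the record with its ngram text lowercased."""
    return {"ngram": record["ngram"].lower(), "freq": dict(record["freq"])}


def consolidate(records):
    """Merge records that share an ngram by summing their yearly counts."""
    merged = defaultdict(lambda: defaultdict(int))
    for record in records:
        for year, count in record["freq"].items():
            merged[record["ngram"]][year] += count
    # Sorting by ngram mirrors the sort-then-consolidate order described above.
    return [
        {"ngram": ngram, "freq": dict(sorted(years.items()))}
        for ngram, years in sorted(merged.items())
    ]


# Two toy lines that differ only in case, so lowercasing creates a duplicate.
raw_lines = [
    "The cat sat\t1900,3,2\t1901,5,4\n",
    "the cat sat\t1900,1,1\t1902,2,2\n",
]
records = consolidate(lowercase(parse_raw_line(line)) for line in raw_lines)

# JSONL is one JSON object per line; json.dumps writes the year keys as strings.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

The example shows why a consolidation step follows lowercasing: two ngrams that were distinct in the raw data collapse into one entry, and their counts must be added together.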
src/training_tools
- train_ngrams.py: trains word2vec models on pre-processed multigram corpora.
- evaluate_models.py: evaluates training quality on intrinsic benchmarks (i.e., similarity and analogy tests).
- plotting.py: plots various types of model results.
- w2v_model.py: a Python class (W2VModel) to aid in the evaluation, normalization, and alignment of yearly word2vec models.
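For readers unfamiliar with intrinsic benchmarks, the sketch below illustrates what similarity and analogy tests measure. It uses hand-made toy vectors and plain Python. It is not the code in evaluate_models.py, and the function names and vectors are assumptions made for illustration. The analogy solver uses the common vector-offset method on unnormalized vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes a : b :: c : d.

    Vector-offset method: pick the word closest to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves.
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(vectors[word], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy 3-dimensional "embeddings" (hypothetical values, not trained vectors).
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [0.2, 0.0, 1.0],
    "queen": [0.2, 1.0, 1.0],
    "apple": [-1.0, 0.1, -0.5],
}

print(solve_analogy(toy_vectors, "man", "woman", "king"))
print(analogy_accuracy(toy_vectors, [("man", "woman", "king", "queen")]))
```

A similarity test works the same way at a smaller scale: it compares cosine scores for word pairs against human ratings, so related pairs such as "king" and "queen" should score higher than unrelated pairs such as "king" and "apple".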
notebooks
- workflow_unigrams.ipynb: Jupyter Notebook showing how to download and preprocess unigrams.
- workflow_multigrams.ipynb: Jupyter Notebook showing how to download and preprocess multigrams.
- workflow_training.ipynb: Jupyter Notebook showing how to train, evaluate, and plot results from word2vec models.
Finally, the training_results folder stores a file containing evaluation metrics for a set of models.
System Requirements
Efficiently downloading, processing, and training models on ngrams requires many processor cores and a large amount of memory. Unless you have a very powerful PC, you should run hist_w2v only on a high-performance computing (HPC) cluster or similar platform. On my university's HPC, I typically request 14 cores and 128 GB of RAM. A development priority is refactoring the code to run on individual systems.
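When running on a shared cluster, it can help to size the number of worker processes to the cores your job was actually granted rather than the node's total. The following is a small sketch using only the Python standard library. The function names and the reserve parameter are hypothetical and are not part of hist_w2v.

```python
import os


def available_cores():
    """Count the CPU cores this process is allowed to use.

    os.sched_getaffinity exists only on some platforms (notably Linux), where
    it often reflects a scheduler's core allocation. Elsewhere, fall back to
    the machine-wide count from os.cpu_count().
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def choose_workers(requested, reserve=1):
    """Cap the requested worker count, leaving `reserve` cores free."""
    usable = max(1, available_cores() - reserve)
    return min(requested, usable)


# Example: ask for 14 workers, matching the core count mentioned above.
workers = choose_workers(14)
print(f"Using {workers} worker process(es)")
```

Leaving one core in reserve is a design choice, not a requirement. It keeps the main process responsive while workers are busy.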
Citing hist_w2v
If you use hist_w2v in your research or other publications, I kindly ask you to cite it. Use the GitHub citation to create citation text.
License
This project is released under the MIT License.