A word2vec preprocessing and training package
Embeddings
Embedding generation with text preprocessing.
Preprocessor
For Word2Vec, we want preprocessing that is light but still meaningful: it should denoise the text while keeping as much variety and information as possible.
The preprocessor transforms a text, or a set of texts, as follows:
- Detects numbers and replaces them with the generic tokens 'INT' and 'FLOAT'.
- Adds spaces around punctuation so that tokenisation adds 'word' and '.' to the vocabulary rather than 'word.'.
- Lowercases words.
- Recursive phrase detection: using a simple probabilistic rule, merges tokens such as 'new', 'york' into a single token 'new_york'.
- Frequency subsampling: discards infrequent words with a probability that depends on their frequency.
Outputs a vocabulary file and the modified files.
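The normalisation and phrase-detection steps listed above can be illustrated with a minimal sketch. It is not the package's implementation: the regular expressions, the function names `normalize` and `merge_phrases`, and the parameters `delta` and `threshold` are all assumptions made for illustration, and the scoring rule is the one from Mikolov et al. (2013), used here as a stand-in for the "simple probabilistic rule".

```python
import re
from collections import Counter

# Hypothetical patterns; the package's real rules may differ.
FLOAT_RE = re.compile(r"\b\d+[.,]\d+\b")
INT_RE = re.compile(r"\b\d+\b")
PUNCT_RE = re.compile(r"([.,;:!?()\"])")


def normalize(line):
    """Lowercase, replace numbers with generic tokens, space out punctuation."""
    line = line.lower()                   # lowercase first so FLOAT/INT stay uppercase
    line = FLOAT_RE.sub(" FLOAT ", line)  # floats before ints, so '8.4' is not split
    line = INT_RE.sub(" INT ", line)
    line = PUNCT_RE.sub(r" \1 ", line)    # 'word.' becomes 'word', '.'
    return line.split()


def merge_phrases(sentences, delta=1, threshold=0.1):
    """Join adjacent tokens a, b into 'a_b' when their score exceeds threshold.

    score(a, b) = (count(a b) - delta) / (count(a) * count(b))
    This is an assumed stand-in rule, not necessarily the one the package uses.
    """
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(f"{a}_{b}")
                    i += 2  # skip both tokens of the merged pair
                    continue
            out.append(sent[i])
            i += 1
        merged.append(out)
    return merged
```

Calling `merge_phrases` again on its own output would let longer phrases form from already merged tokens, which is one way to read the "recursive" part of the detection step.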
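Usage example:
```python
from preprocessing.preprocessor import PreprocessorConfig, Preprocessor

# Create a configuration in the log directory and save it
config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()

# Load the preprocessor from the same log directory, then fit, filter and transform
prep = Preprocessor('/tmp/logdir')
prep.fit('~/mydata/')
prep.filter()
prep.transform('~/mydata')
```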
Word2Vec
For Word2Vec, we wrote a simple CLI wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embeddings as .tsv files that can be visualized with the TensorFlow Embedding Projector (http://projector.tensorflow.org/).
Usage example:
```shell
python training_word2vec.py file_dir writing_dir
```
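To make the .tsv output concrete, here is a minimal sketch of writing files in the layout the Embedding Projector accepts: one tab-separated vector per line, and a metadata file with one label per line in the same order. The function name `write_projector_files` and the file names `vectors.tsv` and `metadata.tsv` are hypothetical; the wrapper's actual output names may differ.

```python
import os


def write_projector_files(embeddings, out_dir):
    """Write a vectors file and a metadata file for the Embedding Projector.

    `embeddings` maps each word to a sequence of floats.
    File names are assumptions chosen for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
            open(meta_path, "w", encoding="utf-8") as meta_file:
        for word, vector in embeddings.items():
            # One line per word: tab-separated components, label on the matching line
            vec_file.write("\t".join(str(x) for x in vector) + "\n")
            meta_file.write(word + "\n")
    return vec_path, meta_path
```

With a trained gensim 4.x model, the input dictionary could presumably be built as `{w: model.wv[w] for w in model.wv.index_to_key}`; older gensim versions expose the vocabulary under a different attribute name.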
Documentation
Documentation is available at https://sally14.github.io/embeddings/
TODO:
- Clean code for CLI wrapper
- Also write a Python Word2Vec model class so that the user doesn't have to switch from Python to the CLI
- Also write a CLI wrapper for preprocessing
- Memory leak in preprocessor.transform
File details
Details for the file embeddings-prep-0.1.0.tar.gz.
File metadata
- Download URL: embeddings-prep-0.1.0.tar.gz
- Upload date:
- Size: 1.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6e9c1cca853dd8564817e26947f78779947827922bbe4cf41dd358c4a5ebc36 |
| MD5 | 3a351554c1e56e307fa31a7e32a80eb6 |
| BLAKE2b-256 | 6a76117d950bae3c6b0a371d0d6b1498c860e69e3112bae0b008090e0033ae9b |
File details
Details for the file embeddings_prep-0.1.0-py3-none-any.whl.
File metadata
- Download URL: embeddings_prep-0.1.0-py3-none-any.whl
- Upload date:
- Size: 2.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf51e2f3ebcfe7b6039fd251f0d59df3beb097ad6e8267188f502cf8d6039c39 |
| MD5 | 5ef26b5874b126769d3e3bfb5541297d |
| BLAKE2b-256 | f7a67b3af2603f78a6d240a996cb7b62b59a327ca3f3a4903750e0c50c46a2ae |