A word2vec preprocessing and training package
Embeddings
Embedding generation with text preprocessing.
Preprocessor
For Word2Vec, we want light but careful preprocessing: denoise the text while keeping as much variety and information as possible.
The preprocessor transforms the text (or set of texts) in the following way:

- Detects numbers and floats and replaces them with the generic tokens 'INT' and 'FLOAT'.
- Adds spaces around punctuation so that tokenisation yields 'word', '.' instead of adding 'word.' to the vocabulary.
- Lowercases words.
- Recursive phrase detection: with a simple probabilistic rule, merges the tokens 'new', 'york' into the single token 'new_york'.
- Frequency subsampling: discards the most frequent words with a probability depending on their frequency.

Outputs a vocabulary file and the modified files.
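The first three cleaning steps above can be sketched with plain regular expressions. This is an illustrative sketch, not the package's actual implementation (the function name and regexes are assumptions):

```python
import re

def preprocess_line(line):
    """Apply number replacement, punctuation spacing, and lowercasing."""
    # Lowercase first so the generic tokens below stay uppercase.
    line = line.lower()
    # Replace floats before integers so "3.14" does not become "INT . INT".
    line = re.sub(r"\d+\.\d+", "FLOAT", line)
    line = re.sub(r"\d+", "INT", line)
    # Space out punctuation so tokenisation yields 'word', '.' not 'word.'.
    line = re.sub(r"([.,!?;:()])", r" \1 ", line)
    # Collapse the extra whitespace introduced by the previous step.
    return " ".join(line.split())
```

For example, `preprocess_line("Hello, I paid 3.14 dollars.")` returns `"hello , i paid FLOAT dollars ."`.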
Usage example:

```python
from preprocessing.preprocessor import PreprocessorConfig, Preprocessor

config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()
prep = Preprocessor('/tmp/logdir')
prep.fit('~/mydata/')
prep.filter()
prep.transform('~/mydata')
```
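The phrase detection step can be sketched with a Mikolov-style bigram score, `score(a, b) = (count(a, b) - min_count) / (count(a) * count(b))`: bigrams scoring above a threshold are merged, and repeating the pass builds longer phrases ('new_york_city'), hence "recursive". The function name, defaults, and threshold below are illustrative, not the package's API:

```python
from collections import Counter

def detect_phrases(tokens, min_count=1, threshold=0.1):
    """Merge high-scoring adjacent token pairs into single '_'-joined tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    merged, i = [], 0
    while i < len(tokens) - 1:
        a, b = tokens[i], tokens[i + 1]
        score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b])
        if score > threshold:
            merged.append(a + "_" + b)  # merge the pair and skip both tokens
            i += 2
        else:
            merged.append(a)
            i += 1
    if i == len(tokens) - 1:  # last token was not merged
        merged.append(tokens[-1])
    return merged
```

On `["i", "love", "new", "york", "and", "new", "york", "loves", "me"]`, the repeated bigram ('new', 'york') scores above the threshold and both occurrences are merged into 'new_york'.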
Word2Vec
For Word2Vec itself, we provide a simple CLI wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes vocab and embeddings .tsv files that can be visualized with the TensorFlow projector (http://projector.tensorflow.org/).
Usage example:

```shell
python training_word2vec.py file_dir writing_dir
```
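The .tsv export expected by the TensorFlow projector is simple: one token per line in vocab.tsv, and the matching tab-separated vector components per line in embeddings.tsv. A minimal sketch, assuming the trained embeddings are available as a token-to-vector mapping (the gensim training call itself is omitted, and the function name is an assumption):

```python
import os

def write_projector_tsv(embeddings, writing_dir):
    """Write vocab.tsv and embeddings.tsv for the TensorFlow projector."""
    os.makedirs(writing_dir, exist_ok=True)
    vocab_path = os.path.join(writing_dir, "vocab.tsv")
    emb_path = os.path.join(writing_dir, "embeddings.tsv")
    with open(vocab_path, "w") as vocab_f, open(emb_path, "w") as emb_f:
        for token, vector in embeddings.items():
            vocab_f.write(token + "\n")  # row i of vocab.tsv labels row i of embeddings.tsv
            emb_f.write("\t".join(str(x) for x in vector) + "\n")
```

Both files can then be loaded directly in the projector's "Load" dialog.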
Documentation
Documentation is available at https://sally14.github.io/embeddings/
TODO:

- Clean up the code of the CLI wrapper
- Add a Python Word2Vec model class so that users don't have to switch from Python to the CLI
- Add a CLI wrapper for preprocessing
- Fix the memory leak in preprocessor.transform