A word2vec preprocessing and training package
Embedding generation with text preprocessing.
For Word2Vec, we want a light but careful preprocessing: we want to denoise the text while keeping as much variety and information as possible.
Preprocesses a text (or set of texts) in the following way:

- Detects numbers and floats and replaces them with the generic tokens 'INT' and 'FLOAT'.
- Adds spaces around punctuation so that tokenization adds 'word', '.' to the vocabulary instead of 'word.'.
- Recursive phrase detection: with a simple probabilistic rule, merges the tokens 'new', 'york' into the single token 'new_york'.
- Frequency subsampling: discards infrequent words with a probability depending on their frequency.
- Outputs a vocabulary file and the modified files.
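The phrase-detection and subsampling steps can be sketched with the scoring and discard rules from the original word2vec papers. This is an illustrative sketch, not the package's exact implementation: the classic subsampling variant shown here preferentially drops *frequent* words, and the `delta` and `t` thresholds are common defaults chosen for the example.

```python
import math
import random
from collections import Counter

def phrase_score(bigram_count, a_count, b_count, delta=5):
    """Mikolov-style phrase score for merging a bigram such as
    ('new', 'york') into 'new_york'. Bigrams scoring above some
    threshold get merged; `delta` discounts rare bigrams.
    Illustrative, not necessarily the package's exact rule."""
    return (bigram_count - delta) / (a_count * b_count)

def subsample(tokens, t=1e-3, seed=0):
    """Mikolov-style frequency subsampling (illustrative): a word
    with relative frequency f is discarded with probability
    max(0, 1 - sqrt(t / f)), so very frequent words are thinned out."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        f = counts[tok] / total
        p_drop = max(0.0, 1.0 - math.sqrt(t / f))
        if rng.random() >= p_drop:
            kept.append(tok)
    return kept

# Toy corpus: 'the' is overwhelmingly frequent and gets thinned out.
corpus = ['the'] * 900 + ['york'] * 100
kept = subsample(corpus)
```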
Usage example:

```python
from preprocessing.preprocessor import PreprocessorConfig, Preprocessor

config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()

prep = Preprocessor('/tmp/logdir')
prep.fit('~/mydata/')
prep.filter()
prep.transform('~/mydata')
```
For Word2Vec training, we wrote a simple CLI wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embeddings as .tsv files that can be visualized with TensorFlow Projector (http://projector.tensorflow.org/):
```shell
python training_word2vec.py file_dir writing_dir
```
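The .tsv export the wrapper performs can be sketched as below. The gensim training call is shown only as a comment, and the toy `embeddings` dict stands in for a trained model's word vectors; the function name and paths are illustrative, not the package's API.

```python
import os
import tempfile

# With gensim >= 4.0 installed, training looks roughly like:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
#   embeddings = {w: model.wv[w] for w in model.wv.index_to_key}

def write_projector_tsv(embeddings, vocab_path, vectors_path):
    """Write a vocab file (one word per line) and an embeddings file
    (tab-separated floats, one row per word) in the two-file format
    TensorFlow Projector can load."""
    with open(vocab_path, 'w') as fv, open(vectors_path, 'w') as fe:
        for word, vector in embeddings.items():
            fv.write(word + '\n')
            fe.write('\t'.join(str(x) for x in vector) + '\n')

# Toy stand-in for trained word vectors:
embeddings = {'new_york': [0.1, 0.2], 'city': [0.3, 0.4]}
vocab_path = os.path.join(tempfile.gettempdir(), 'vocab.tsv')
vectors_path = os.path.join(tempfile.gettempdir(), 'embeddings.tsv')
write_projector_tsv(embeddings, vocab_path, vectors_path)
```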
Documentation is available here: https://sally14.github.io/embeddings/
TODO:

- Clean up the code for the CLI wrapper.
- Write a Python Word2Vec model class so that users don't have to switch from Python to the CLI.
- Write a CLI wrapper for the preprocessing step.
- Fix the memory leak in preprocessor.transform.