A word2vec preprocessing and training package
Embeddings
This package provides easy-to-use Python class and CLI interfaces to:

- clean corpora efficiently in terms of computation time
- generate word2vec embeddings (based on gensim) and write them directly to a format compatible with TensorFlow Projector

Thus, with two classes, or two commands, anyone should be able to clean a corpus and generate embeddings that can be uploaded to and visualized with TensorFlow Projector.
Getting started
Requirements
This package requires gensim, nltk, and docopt to run. If pip doesn't install these dependencies automatically, you can install them by running:
pip install nltk docopt gensim
Installation
To install this package, simply run:
pip install embeddingsprep
Future versions might include conda builds, but that is not currently the case.
Main features
Preprocessing
For Word2Vec, we want preprocessing that is light yet effective: it should denoise the text while keeping as much variety and information as possible. A detailed description of what is done during preprocessing is available here.
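As an illustration only (not the package's actual pipeline, whose exact steps are described in the linked document), light denoising of this kind might lowercase text, strip markup-like noise, and collapse whitespace while leaving the vocabulary largely intact:

```python
import re

def light_clean(text: str) -> str:
    """Illustrative light cleaning: denoise without destroying vocabulary."""
    text = text.lower()                           # normalize case
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML-like tags
    text = re.sub(r"[^a-z0-9'\- ]+", " ", text)   # keep words, digits, apostrophes, hyphens
    text = re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace
    return text

print(light_clean("Hello, <b>World</b>!  It's state-of-the-art."))
# → hello world it's state-of-the-art
```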
Usage example :
Creating and saving a loadable configuration:
from embeddingsprep.preprocessing.preprocessor import PreprocessorConfig, Preprocessor
config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()
prep = Preprocessor('/tmp/logdir') # Loads the config object in /tmp/logdir if it exists
prep.fit('~/mydata/') # Fits the unigram & bigram occurrences
prep.filter() # Filters with all the config parameters
prep.transform('~/mydata') # Transforms the texts with the filtered vocab.
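Conceptually, the fit/filter steps above count unigram and bigram occurrences over the corpus and then prune items according to the config parameters. A minimal stdlib-only sketch of that idea (illustrative only; function names and thresholds are hypothetical, not the package's implementation):

```python
from collections import Counter

def fit_counts(tokens):
    """Count unigram and bigram occurrences in a token stream (illustrative)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # consecutive token pairs
    return unigrams, bigrams

def filter_vocab(unigrams, min_count=2):
    """Keep only tokens seen at least min_count times (illustrative)."""
    return {word for word, count in unigrams.items() if count >= min_count}

tokens = "the cat sat on the mat the cat".split()
unigrams, bigrams = fit_counts(tokens)
vocab = filter_vocab(unigrams, min_count=2)
# vocab → {'the', 'cat'}
```

The real Preprocessor persists these counts and applies the filtered vocabulary when transforming the texts.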
Word2Vec
For the Word2Vec part, we wrote a simple wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embeddings to .tsv files that can be visualized with TensorFlow Projector (http://projector.tensorflow.org/).
Usage example:
from embeddingsprep.models.word2vec import Word2Vec
model = Word2Vec(emb_size=300, window=5, epochs=3)
model.train('./my-preprocessed-data/')
model.save('./my-output-dir')
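For reference, TensorFlow Projector expects two tab-separated files: one holding the vectors and one holding the matching vocabulary, line by line. A minimal sketch of writing that format from an in-memory `{word: vector}` mapping (illustrative; this is not the package's code, and the paths are placeholders):

```python
import csv

def write_projector_tsv(embeddings, vectors_path, metadata_path):
    """Write embeddings in the two-file TSV format TensorFlow Projector expects."""
    with open(vectors_path, "w", newline="") as vf, \
         open(metadata_path, "w", newline="") as mf:
        vec_writer = csv.writer(vf, delimiter="\t")
        for word, vector in embeddings.items():
            vec_writer.writerow(vector)   # one embedding per line
            mf.write(word + "\n")         # matching word on the same line index

embeddings = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
write_projector_tsv(embeddings, "/tmp/vectors.tsv", "/tmp/metadata.tsv")
```

Both files can then be uploaded directly in the Projector's "Load" dialog.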
Contributing
Any GitHub issue, contribution, or suggestion is welcome! You can open issues on the GitHub repository.
Hashes for embeddingsprep-0.1.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5de4e24900b9afaf845dc73105367abe4b6301d714cbeaabd5ae9037ef0778c1
MD5 | 8140e4e87b446bef6c795aab32d1ab66
BLAKE2b-256 | e957afeb276d062e6e21613f820082379190f7a59a2b1b16e218eca26bf9b528