
A word2vec preprocessing and training package

Project description

Embeddings

Embedding generation with text preprocessing.

Preprocessor

For Word2Vec, we want preprocessing that is light but deliberate: denoise the text while keeping as much variety and information as possible.

The preprocessor transforms a text (or set of texts) as follows:

  1. Detects integers and floats and replaces them with the generic tokens 'INT' and 'FLOAT'.

  2. Adds spaces around punctuation so that tokenization yields 'word', '.' instead of adding 'word.' to the vocabulary.

  3. Lowercases words.

  4. Recursive phrase detection: with a simple probabilistic rule, merges tokens such as 'new', 'york' into a single token 'new_york'.

  5. Frequency subsampling: discards infrequent words with a probability that depends on their frequency.
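The "simple probabilistic rule" in step 4 is not spelled out here; a common choice is the word2phrase-style bigram scoring from Mikolov et al., sketched below under that assumption (the function name, `delta`, and `threshold` defaults are illustrative, not the package's actual code):

```python
from collections import Counter

def detect_phrases(sentences, delta=1, threshold=0.001):
    """One pass of word2phrase-style phrase detection (an assumed rule,
    not this package's implementation). Bigrams whose score exceeds
    `threshold` are merged with an underscore; the 'recursive' behaviour
    comes from re-running this pass on the merged output."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                # Discounting by delta keeps rare, accidental bigrams apart.
                score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(a + "_" + b)
                    i += 2
                    continue
            out.append(sent[i])
            i += 1
        merged.append(out)
    return merged

detect_phrases([["new", "york", "is", "big"], ["i", "love", "new", "york"]])
# -> [['new_york', 'is', 'big'], ['i', 'love', 'new_york']]
```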

Outputs a vocabulary file and the modified files.
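To make steps 1-3 concrete, here is a minimal regex-based sketch of what such normalization can look like (illustrative only, not the package's implementation):

```python
import re

def normalize(text):
    """Illustrative sketch of steps 1-3 (not the package's actual code)."""
    # 1. Replace floats first (so '3.5' doesn't become 'INT . INT'), then ints.
    text = re.sub(r"\d+\.\d+", "FLOAT", text)
    text = re.sub(r"\d+", "INT", text)
    # 2. Pad punctuation with spaces so 'word.' tokenizes as 'word', '.'.
    text = re.sub(r"([.,;:!?()])", r" \1 ", text)
    # 3. Lowercase everything except the generic tokens.
    tokens = [t if t in ("FLOAT", "INT") else t.lower() for t in text.split()]
    return " ".join(tokens)

normalize("I bought 2 apples for 3.5 dollars.")
# -> 'i bought INT apples for FLOAT dollars .'
```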

Usage example:

from preprocessing.preprocessor import PreprocessorConfig, Preprocessor
config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()


prep = Preprocessor('/tmp/logdir')
prep.fit('~/mydata/')
prep.filter()
prep.transform('~/mydata')

Word2Vec

For the Word2Vec step itself, we provide a simple CLI wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embedding .tsv files, which can be visualized with the TensorFlow Embedding Projector (http://projector.tensorflow.org/).

Usage example:

python training_word2vec.py file_dir writing_dir
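The projector export format itself is simple: a vocabulary file with one word per line, and a vectors file with one tab-separated embedding per line. A minimal sketch of such an export (the helper name and toy vectors are hypothetical, not the wrapper's actual code):

```python
def write_projector_tsv(embeddings, vocab_path, vectors_path):
    """Write embeddings in the two-file .tsv layout that
    http://projector.tensorflow.org/ expects: one word per line in the
    vocab file, one tab-separated vector per line in the vectors file,
    with matching line order."""
    with open(vocab_path, "w") as vf, open(vectors_path, "w") as ef:
        for word, vector in embeddings.items():
            vf.write(word + "\n")
            ef.write("\t".join(str(x) for x in vector) + "\n")

# Toy usage with 3-dimensional vectors:
write_projector_tsv(
    {"new_york": [0.1, 0.2, 0.3], "paris": [0.4, 0.5, 0.6]},
    "/tmp/vocab.tsv",
    "/tmp/embeddings.tsv",
)
```

Both files are then loaded together in the projector's "Load" dialog (vectors file first, vocabulary as metadata).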

Documentation

Documentation is available here: https://sally14.github.io/embeddings/

TODO:

  • Clean code for CLI wrapper

  • Write a Python Word2Vec model class as well, so that the user doesn't have to switch from Python to the CLI

  • Write a CLI wrapper for the preprocessing as well

  • Fix the memory leak in preprocessor.transform

Download files

Source distribution: embeddings-prep-0.1.0.tar.gz (1.7 kB)

Built distribution: embeddings_prep-0.1.0-py3-none-any.whl (2.1 kB, py3)
