A word2vec preprocessing and training package

Embeddings

Embedding generation with text preprocessing.

Preprocessor

For Word2Vec, we want preprocessing that is light but still meaningful: it should denoise the text while keeping as much variety and information as possible.

The preprocessor handles a text or a set of texts in the following way:

  1. Detects numbers and replaces them with a generic token: 'INT' for integers, 'FLOAT' for floats.

  2. Adds spaces around punctuation so that tokenisation yields 'word', '.' instead of adding 'word.' to the vocabulary.

  3. Lowercases words.

  4. Recursive word phrase detection: with a simple probabilistic rule, merges the tokens 'new', 'york' into a single token 'new_york'.

  5. Frequency subsampling: discards infrequent words with a probability that depends on their frequency.

Outputs a vocabulary file and the modified files.
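
Steps 1 to 3 can be sketched with the standard library alone. This is only an illustration under assumptions, not the package's code: the function name `normalise` and the regular expressions are hypothetical, and the sketch lowercases first so that the 'INT' and 'FLOAT' tokens stay uppercase.

```python
import re

# Hypothetical patterns for this sketch; the package's real rules may differ.
# A dot is treated as punctuation unless it sits between two digits (e.g. 3.14).
PUNCT_RE = re.compile(r'(?<!\d)\.|\.(?!\d)|[,;:!?()"]')
FLOAT_RE = re.compile(r"\d+\.\d+")
INT_RE = re.compile(r"\d+")


def normalise(text):
    """Lowercase, space out punctuation, and replace numbers by generic tokens."""
    text = text.lower()
    # Surround each punctuation mark with spaces so it becomes its own token.
    text = PUNCT_RE.sub(lambda m: f" {m.group(0)} ", text)
    tokens = []
    for tok in text.split():
        if FLOAT_RE.fullmatch(tok):
            tokens.append("FLOAT")
        elif INT_RE.fullmatch(tok):
            tokens.append("INT")
        else:
            tokens.append(tok)
    return tokens


print(normalise("It costs 3.14 dollars, or 3 Euros."))
# ['it', 'costs', 'FLOAT', 'dollars', ',', 'or', 'INT', 'euros', '.']
```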

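Usage example:

from preprocessing.preprocessor import PreprocessorConfig, Preprocessor

# Create a configuration in the log directory, set the output directory, and save it
config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()

# Create the preprocessor from the same log directory
prep = Preprocessor('/tmp/logdir')
# Fit on the raw data, filter, then transform the files
prep.fit('~/mydata/')
prep.filter()
prep.transform('~/mydata')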
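
The "simple probabilistic rule" behind phrase detection (step 4) is not spelled out in this description. As a stand-in, the sketch below uses the bigram score from the original word2vec phrase paper, (count(a b) - delta) / (count(a) * count(b)), and merges pairs whose score exceeds a threshold. The function name `merge_phrases` and the `delta` and `threshold` values are assumptions made for illustration, not the package's API.

```python
from collections import Counter


def merge_phrases(sentences, delta=1, threshold=0.2):
    """Merge adjacent token pairs whose co-occurrence score exceeds threshold.

    Hypothetical sketch: sentences is a list of token lists.
    """
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

    def is_phrase(a, b):
        # delta discounts pairs that are seen only a few times.
        score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
        return score > threshold

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and is_phrase(sent[i], sent[i + 1]):
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


corpus = [
    ["new", "york", "is", "big"],
    ["i", "like", "new", "york"],
    ["i", "am", "in", "new", "york"],
    ["i", "like", "big", "cities"],
]
print(merge_phrases(corpus))
```

Calling the function again on its own output is one way to make the detection recursive: a merged token such as 'new_york' can then pair with a neighbour to form a longer phrase.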

Word2Vec

For Word2Vec, we wrote a simple CLI wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embeddings as .tsv files that can be visualized with the TensorFlow Embedding Projector (http://projector.tensorflow.org/).
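
To give an idea of what such .tsv output can look like, here is a minimal sketch that writes a word-to-vector mapping in the layout the projector accepts: one tab-separated vector per line, plus a metadata file with one label per line and no header. The function name and the file names `vectors.tsv` and `metadata.tsv` are assumptions; the files the wrapper actually writes may be named differently.

```python
import os


def write_projector_files(embeddings, writing_dir):
    """Write vectors and labels as two aligned .tsv files.

    Hypothetical sketch: embeddings maps each word to a sequence of floats.
    """
    os.makedirs(writing_dir, exist_ok=True)
    vectors_path = os.path.join(writing_dir, "vectors.tsv")
    metadata_path = os.path.join(writing_dir, "metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8", newline="") as vec_file, \
            open(metadata_path, "w", encoding="utf-8", newline="") as meta_file:
        for word, vector in embeddings.items():
            # Line i of the vectors file corresponds to line i of the metadata file.
            vec_file.write("\t".join(str(x) for x in vector) + "\n")
            meta_file.write(word + "\n")
    return vectors_path, metadata_path
```

With a trained gensim 4.x model, the mapping could be built as `{w: model.wv[w] for w in model.wv.index_to_key}`; older gensim versions expose the vocabulary differently.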

Usage example:

python training_word2vec.py file_dir writing_dir
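
The command takes two positional arguments. A script with that interface could parse them as in the sketch below, which uses argparse; the help texts and the example paths are assumptions, and the actual training_word2vec.py may accept further options.

```python
import argparse


def build_parser():
    """Parser for a two-argument training script (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="training_word2vec.py",
        description="Train Word2Vec on preprocessed files.",
    )
    parser.add_argument("file_dir", help="directory containing the preprocessed files")
    parser.add_argument("writing_dir", help="directory where the .tsv files are written")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["/tmp/outputs", "/tmp/model"])
print(args.file_dir, args.writing_dir)
```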

Documentation

Documentation is available here: https://sally14.github.io/embeddings/

TODO:

  • Clean up the code of the CLI wrapper

  • Write a Python Word2Vec model class so that users don't have to switch from Python to the CLI

  • Write a CLI wrapper for preprocessing

  • Fix the memory leak in preprocessor.transform
