A word2vec preprocessing and training package

Embeddings

This package provides easy-to-use Python classes and CLI interfaces to:

  • clean corpora efficiently in terms of computation time

  • generate word2vec embeddings (based on gensim) and write them directly to a format compatible with TensorFlow Projector

Thus, with two classes or two commands, anyone should be able to clean a corpus and generate embeddings that can be uploaded to and visualized with TensorFlow Projector.

Getting started

Requirements

This package requires gensim, nltk, and docopt to run. If pip doesn't install these dependencies automatically, you can install them by running:

pip install nltk docopt gensim
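To confirm that the three dependencies are actually importable in the current environment, a small check along these lines may help. This is only a sketch using the standard library; the helper name `missing_dependencies` is made up for illustration and is not part of embeddingsprep.

```python
from importlib.util import find_spec


def missing_dependencies(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # The three packages listed in the requirements above.
    missing = missing_dependencies(["gensim", "nltk", "docopt"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```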

Installation

To install this package, run:

pip install embeddingsprep

Future versions may include conda builds, but none are available yet.

Main features

Preprocessing

For Word2Vec, we want preprocessing that is light but still effective: it should denoise the text while keeping as much variety and information as possible. A detailed description of what is done during preprocessing is available here.
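To give a rough idea of what "light" denoising can mean, here is a minimal sketch in plain Python. It is an assumption for illustration only, not the package's actual cleaning logic: the function `light_clean` and the specific rules (lowercasing, dropping URLs, removing punctuation while keeping apostrophes and hyphens) are hypothetical choices.

```python
import re

# Hypothetical rules, chosen for illustration; embeddingsprep may do something different.
_URL = re.compile(r"https?://\S+")
_PUNCT = re.compile(r"[^\w\s'-]")  # anything that is not a word char, space, ' or -
_SPACES = re.compile(r"\s+")


def light_clean(line):
    """Lowercase a line, drop URLs and most punctuation, and collapse whitespace."""
    line = line.lower()
    line = _URL.sub(" ", line)
    line = _PUNCT.sub(" ", line)
    return _SPACES.sub(" ", line).strip()
```

Keeping apostrophes and hyphens is one way to preserve forms such as "don't" or "state-of-the-art", which matches the stated goal of losing as little information as possible.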

Usage example:

Creating and saving a loadable configuration, then fitting, filtering and transforming a corpus:

from embeddingsprep.preprocessing.preprocessor import PreprocessorConfig, Preprocessor
config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()
prep = Preprocessor('/tmp/logdir')  # Loads the config object in /tmp/logdir if it exists
prep.fit('~/mydata/')  # Fits the unigram & bigram occurrences
prep.filter()  # Filters with all the config parameters
prep.transform('~/mydata/')  # Transforms the texts with the filtered vocab
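The fit, filter and transform steps above can be pictured with a small stand-alone sketch. It is not the Preprocessor implementation: the function names, the `min_count` threshold and the `<unk>` placeholder are assumptions made for this example, and the real configuration parameters may differ.

```python
from collections import Counter


def fit(sentences):
    """Count unigram and bigram occurrences over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return unigrams, bigrams


def filter_vocab(unigrams, min_count):
    """Keep only words seen at least `min_count` times (hypothetical rule)."""
    return {word for word, count in unigrams.items() if count >= min_count}


def transform(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with a placeholder."""
    return [tok if tok in vocab else unk for tok in tokens]


sentences = [["new", "york", "is", "big"], ["new", "york", "rocks"]]
unigrams, bigrams = fit(sentences)
vocab = filter_vocab(unigrams, min_count=2)
print(transform(sentences[1], vocab))
```

Bigram counts such as `bigrams[("new", "york")]` are typically used to decide which word pairs are frequent enough to be merged into a single token; that step is left out here to keep the sketch short.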

Word2Vec

For Word2Vec, we wrote a simple wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embeddings to .tsv files that can be visualized with TensorFlow Projector (http://projector.tensorflow.org/).

Usage example:

from embeddingsprep.models.word2vec import Word2Vec
model = Word2Vec(emb_size=300, window=5, epochs=3)
model.train('./my-preprocessed-data/')
model.save('./my-output-dir')
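For reference, TensorFlow Projector loads a vectors file with one tab-separated vector per line and a metadata file with one label per line (a single-column metadata file has no header row). The sketch below writes such a pair from a plain dictionary. The function name and the file names `vectors.tsv` and `metadata.tsv` are assumptions for illustration; the files written by `model.save` may be named and organized differently.

```python
import os


def write_projector_files(vectors, out_dir):
    """Write word vectors as two TSV files in the layout the Projector accepts.

    `vectors` maps each word to a sequence of floats.
    """
    os.makedirs(out_dir, exist_ok=True)
    vec_path = os.path.join(out_dir, "vectors.tsv")    # hypothetical file name
    meta_path = os.path.join(out_dir, "metadata.tsv")  # hypothetical file name
    with open(vec_path, "w", encoding="utf-8", newline="\n") as vec_file, \
         open(meta_path, "w", encoding="utf-8", newline="\n") as meta_file:
        for word, vector in vectors.items():
            # Line i of the vectors file corresponds to line i of the metadata file.
            vec_file.write("\t".join(str(x) for x in vector) + "\n")
            meta_file.write(word + "\n")
    return vec_path, meta_path
```

Both files can then be uploaded through the "Load" dialog on the Projector page.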

Contributing

GitHub issues, contributions, and suggestions are all welcome! You can open issues on the GitHub repository.
