
Package for working with word vector embeddings


Word Vector Embedding Package

wordvecpy is a library for processing text data, tokenizing it, and building word vector dictionaries and whole word vector embeddings from the corpus text.

TextProcessor takes a corpus of unprocessed text and prepares it for use with word vectors. Punctuation, stopwords, substitutions, contractions, and lemmatization can all be customized. TextProcessor can also store a processed (or unprocessed) corpus, with or without its associated labels, in a specially designed text format that interacts with the other classes in the library. This helps when dealing with large text corpora that would otherwise take up local memory needed for other processes, like building the vector-embedded form of the corpus.

The LoadCorpus class pulls only selected chunks of a saved dataset into memory at a time, so large files can be processed in pieces that fit in your computer's memory. Chunkify works with LoadCorpus, and with datasets stored entirely in memory, to split a corpus into chunks and generate vector embeddings one chunk at a time (vector embeddings for many documents can quickly exhaust a computer's memory if not managed carefully).
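The pieces are meant to fit together roughly as follows. The class names below come from this package, but every method name and argument is an illustrative assumption rather than the exact API; see the notebooks linked under Usage for real examples.

```python
# Illustrative sketch only: TextProcessor and LoadCorpus are real class
# names, but all method names and signatures below are hypothetical.
from wordvecpy import TextProcessor, LoadCorpus

corpus = ["I loved this movie!", "Worst. Film. Ever."]
labels = [1, 0]

processor = TextProcessor(corpus, labels)   # hypothetical signature
processor.save("reviews.corpus")            # hypothetical: write the package's text format

# later, stream the saved corpus back in memory-sized chunks
loader = LoadCorpus("reviews.corpus")       # hypothetical signature
```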

VectorDictionary loads pretrained word embeddings from .txt files so they can be used with other classes. Every class in this package that requires a vector dictionary can take a pymagnitude vector or a VectorDictionary object.
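Loading a GloVe-format text file might look like the following. VectorDictionary is the real class name, but the constructor signature is an assumption and the file path is only a placeholder.

```python
# Illustrative sketch: the constructor argument shown is an assumed
# signature, and the path is a placeholder for any GloVe-style .txt file.
from wordvecpy import VectorDictionary

glove = VectorDictionary("glove.6B.100d.txt")

# anywhere the package asks for a vector dictionary, either this object
# or a pymagnitude vector should work
```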

Vectokenizer and FastVectokenizer both convert a processed text corpus into integer embeddings and create vector dictionaries for those integer embeddings. The two classes do the same thing, except that FastVectokenizer requires Keras to create the integer embeddings and Vectokenizer does not. Both let you quickly generate the integer embeddings and word vector dictionaries needed to import directly into a Keras embedding layer. I've considered removing Vectokenizer, but I have future plans for its functionality that cannot easily be done using Keras to tokenize the corpus.
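A hedged sketch of the intended flow into Keras: the FastVectokenizer method names below are assumptions, while the Embedding layer call is standard Keras.

```python
# Illustrative sketch: FastVectokenizer's methods here are hypothetical
# stand-ins; the Keras Embedding call is real API.
from wordvecpy import FastVectokenizer
from tensorflow.keras.layers import Embedding

tokenizer = FastVectokenizer(processed_corpus, glove)   # hypothetical signature
X = tokenizer.integer_embeddings()                      # hypothetical accessor
weights = tokenizer.embedding_matrix()                  # hypothetical accessor

# seed a Keras Embedding layer with the pretrained weight matrix
embedding_layer = Embedding(
    input_dim=weights.shape[0],    # vocabulary size
    output_dim=weights.shape[1],   # vector dimension
    weights=[weights],
    trainable=False,
)
```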

VectorEmbedder is where the magic happens: a text corpus (or a chunk of one) is embedded with word vectors. This can be done with static vector dictionaries, like Stanford's various GloVe embeddings, or with dynamic, context-dependent embeddings like ELMo.
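For instance, embedding a processed corpus with a static dictionary might look like this; the method name and signature are assumptions:

```python
# Illustrative sketch: VectorEmbedder is a real class name; the method
# name and signature below are hypothetical.
from wordvecpy import VectorEmbedder

embedder = VectorEmbedder(glove)              # a static dictionary, e.g. GloVe
embedded = embedder.embed(processed_corpus)   # hypothetical method
# each document becomes an array of shape (num_tokens, vector_dim)
```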

Current Version is 1.0

The current version 1.0 is significantly different from the previous version. Multiple classes from the previous version were removed and redesigned, or recombined with new classes in a more organic way. I've done significant testing on this version to make sure everything works properly, but it's obviously very time-consuming to test every possible combination of interactions. If you come across a use case where wordvecpy fails, please please please let me know.

Installation

Use the package manager pip to install wordvecpy.

pip install wordvecpy

Usage

Some use cases can be found in the Python notebooks from my exploration of the Rotten Tomatoes dataset, found here: https://github.com/metriczulu/rotten-tomato-reviews

Future Plans

I'm currently working on functionality to reduce the size of integer embeddings by clustering words based on their vector representations. This will hopefully allow smaller vector dictionaries to be used while still performing well; the sketch below illustrates the idea.
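The idea, sketched with scikit-learn rather than with wordvecpy's own (unreleased) implementation: words whose vectors land in the same cluster can share a single row of the embedding matrix.

```python
# Concept sketch using scikit-learn's KMeans, not wordvecpy code:
# words in the same vector cluster share one embedding row.
import numpy as np
from sklearn.cluster import KMeans

words = ["good", "great", "bad", "awful", "movie", "film"]
vectors = np.random.rand(len(words), 100)   # stand-ins for real word vectors

kmeans = KMeans(n_clusters=3, n_init=10).fit(vectors)
word_to_id = dict(zip(words, kmeans.labels_))
# the integer vocabulary shrinks from len(words) rows to n_clusters rows
```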

I'm also working on functionality to build multi-channel (or concatenated, your choice) embeddings by combining multiple embedding frameworks; the sketch below shows the difference between the two.
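In rough terms, in plain numpy rather than wordvecpy code:

```python
# Concept sketch in plain numpy, not wordvecpy code.
import numpy as np

vec_a = np.random.rand(100)   # a word's vector in framework A
vec_b = np.random.rand(100)   # the same word's vector in framework B

concatenated = np.concatenate([vec_a, vec_b])   # one longer vector, shape (200,)
multichannel = np.stack([vec_a, vec_b])         # two channels, shape (2, 100)
```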

Currently, it is possible to use word vector embeddings with PyTorch by converting all documents to embeddings with VectorEmbedder. There is a better, more organic way to work with PyTorch than this, and I plan on implementing it once I'm satisfied with the overall functionality of the package. For now, only Keras (and TensorFlow with it) is directly supported, since it was the easiest way to get the package functioning and out the door quickly.
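In the meantime, precomputed embeddings feed into PyTorch directly; here random arrays stand in for VectorEmbedder output, padded to a common length so they stack into one batch.

```python
# Illustrative sketch: random arrays stand in for VectorEmbedder output.
import numpy as np
import torch

embedded_docs = np.random.rand(32, 50, 100)   # 32 docs x 50 tokens x 100-dim vectors
batch = torch.tensor(embedded_docs, dtype=torch.float32)

lstm = torch.nn.LSTM(input_size=100, hidden_size=64, batch_first=True)
output, _ = lstm(batch)   # runs on precomputed vectors; no nn.Embedding needed
```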

Vectokenizer currently has a memory-intensive, ad-hoc design (which is why FastVectokenizer is currently much better). I will be reworking it so that it is in line with FastVectokenizer's speed while providing the additional capability, mentioned above, of clustering words by vector to reduce dictionary size.

I also need to get the comments and documentation up to par to make it easier for others to get the package working.

License

No license, but I'm just throwing this out there: if this library fails miserably and turns your computer into a black hole that swallows your life's work, I'd like to say, first and foremost, I'm sorry. However, use at your own risk.
