Package for working with word vector embeddings
Word Vector Embedding Package
wordvecpy is a library for processing text data, tokenizing it, and building word vector dictionaries and complete word vector embeddings from corpus text.
TextProcessor takes a corpus of unprocessed text and prepares it for use with word vectors. Punctuation, stopwords, substitutions, contractions, and lemmatization can all be customized. TextProcessor can store a processed (or unprocessed) corpus, with or without its associated labels, in a specially designed text format that interacts with the other classes in the library. This matters for large corpora, which can take up local memory needed for other work, such as building the vector-embedded form of the corpus. The LoadCorpus class pulls only selected chunks of a stored dataset into memory at a time, so large files can be processed in pieces that fit in your computer's memory. Chunkify works with LoadCorpus, or with datasets held entirely in memory, to split a corpus into chunks and generate vector embeddings one chunk at a time (vector embeddings for many documents can quickly exhaust a machine's memory if not managed carefully).
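A minimal sketch of that workflow (the class names come from this README, but the constructor arguments and method names shown are assumptions, not the published API):

# Hypothetical sketch -- argument and method names are assumptions based on
# this README, not wordvecpy's published API.
from wordvecpy import TextProcessor, LoadCorpus

raw_docs = ["This isn't a great film...", "An AMAZING movie!"]
labels = [0, 1]

# Clean the corpus; punctuation, stopwords, contractions, and lemmatization
# are the customizable steps described above.
processor = TextProcessor(raw_docs, labels)
processor.save('reviews.corpus')            # assumed on-disk corpus format

# Later: pull just a slice of the stored corpus back into memory.
chunk = LoadCorpus('reviews.corpus', start=0, stop=1000)   # assumed signature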
VectorDictionary loads pretrained word embeddings from .txt files so they can be used with the other classes. Every class in this package that requires a vector dictionary can take either a pymagnitude Magnitude object or a VectorDictionary object.
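For example (the pymagnitude calls below are that library's documented API; the VectorDictionary signature is a guess from this README):

from pymagnitude import Magnitude
from wordvecpy import VectorDictionary

# pymagnitude's documented interface; files are memory-mapped, so this is fast.
glove = Magnitude('glove.6B.100d.magnitude')
print(glove.query('film').shape)             # (100,)

# Assumed equivalent for a plain-text GloVe file; the signature is a guess.
glove_txt = VectorDictionary('glove.6B.100d.txt')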
Vectokenizer and FastVectokenizer both convert a processed text corpus into integer embeddings and create vector dictionaries for those integer embeddings. The two classes do exactly the same thing, but FastVectokenizer requires Keras to create the integer embeddings and Vectokenizer does not. Both let you quickly generate the integer embeddings and word vector dictionaries needed to load directly into a Keras embedding layer. I've considered removing Vectokenizer, but I have future plans for its functionality that can't easily be done with Keras handling the tokenization.
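A sketch of how that might feed a Keras embedding layer (the FastVectokenizer calls are assumptions; the Embedding call with a pretrained weight matrix is the standard Keras idiom):

from keras.layers import Embedding
from wordvecpy import FastVectokenizer
# 'glove' is a loaded vector dictionary, as in the sketch above.

clean_docs = ["great film", "terrible film"]                 # toy processed corpus
tok = FastVectokenizer(clean_docs, glove, max_vocab=10000)   # assumed signature
X = tok.integer_embed()            # assumed: (num_docs, seq_len) integer array
weights = tok.embedding_matrix()   # assumed: row i is the vector for token i

# Standard Keras idiom for a frozen, pretrained embedding layer.
layer = Embedding(input_dim=weights.shape[0],
                  output_dim=weights.shape[1],
                  weights=[weights],
                  trainable=False)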
VectorEmbedder is where the magic happens: a text corpus (or a chunk of one) is embedded with word vectors. This works with static vector dictionaries, like Stanford's various GloVe embeddings, as well as with dynamic, context-dependent embeddings like the Allen Institute's ELMo.
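Sketched below, with the same caveat that these signatures are assumptions:

from wordvecpy import VectorEmbedder, Chunkify
# 'glove' and 'clean_docs' are as in the sketches above.

embedder = VectorEmbedder(glove)                 # assumed constructor
for chunk in Chunkify(clean_docs, chunks=10):    # assumed chunking interface
    X = embedder.embed(chunk)                    # assumed: (docs, seq_len, dim)
    # ...train on X here; each chunk's array can be freed before the next one
    # is embedded, so peak memory stays bounded by a single chunk.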
Current Version is 1.0
Version 1.0 is significantly different from the previous version. Multiple classes from the previous version were removed and redesigned, or recombined with new classes in a more organic manner. I've done significant testing on this version to make sure everything works properly, but it's obviously very time-consuming to test every possible combination of interactions. If you come across a use case where wordvecpy fails, please please please let me know.
Installation
Use the package manager pip to install wordvecpy.
pip install wordvecpy
Usage
Some use cases can be found in the Python notebooks from my exploration of the Rotten Tomatoes dataset, found here: https://github.com/metriczulu/rotten-tomato-reviews
Future Plans
Currently working on functionality to reduce the size of integer embeddings by clustering words based on their vector representations. This should allow much smaller vector dictionaries to be used while sacrificing little performance.
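The idea, illustrated with scikit-learn rather than wordvecpy code: map each word to its cluster's centroid, so many words share one vector.

import numpy as np
from sklearn.cluster import KMeans

# Illustration only -- not wordvecpy code. Stand-in vectors for four words.
words = ['good', 'great', 'bad', 'awful']
vecs = np.random.rand(4, 100)

km = KMeans(n_clusters=2, n_init=10).fit(vecs)
# Each word now points at its centroid: 2 distinct vectors instead of 4.
small_dict = {w: km.cluster_centers_[c] for w, c in zip(words, km.labels_)}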
Also working on functionality to build multi-channel (or concatenated, your choice) embeddings by combining multiple embedding frameworks together.
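The concatenation itself is simple; the work is in the plumbing. For one word with vectors from two frameworks:

import numpy as np

# Illustration only. Stand-in vectors for the same word from two frameworks.
v_glove = np.random.rand(100)                  # e.g. a 100-d GloVe vector
v_fast = np.random.rand(300)                   # e.g. a 300-d fastText vector
v_concat = np.concatenate([v_glove, v_fast])   # 400-d combined vector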
Currently, it is possible to use word vector embeddings with PyTorch by converting all documents to embeddings with VectorEmbedder. There is a better, more organic way to work with PyTorch, and I plan on implementing it once I'm satisfied with the overall functionality of the package. For now, only Keras (and TensorFlow with it) is directly supported, as it was the easiest library to get the package functioning and out the door quickly.
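The stopgap handoff is just a numpy-to-tensor conversion (torch.from_numpy is PyTorch's documented API; the embedder's output shape here is an assumption):

import numpy as np
import torch

# Stand-in for VectorEmbedder output: (docs, seq_len, vector_dim), float32.
X = np.zeros((32, 50, 100), dtype=np.float32)

batch = torch.from_numpy(X)   # shares memory with the numpy array; no copy
print(batch.shape)            # torch.Size([32, 50, 100])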
Vectokenizer currently has a memory-intensive, ad-hoc design (which is why FastVectokenizer is currently much better). I will be reworking it to match FastVectokenizer's speed while adding the vector-clustering capability mentioned above to reduce dictionary size.
I also need to get the comments and documentation up to par to make it easier for others to get it working.
License
No license, but I'm just throwing this out there: if this library fails miserably and turns your computer into a black hole that swallows your life's work, I'd like to say, first and foremost, I'm sorry. However, use at your own risk.