DeepZensols Natural Language Processing
Deep learning utility library for natural language processing that aids in feature engineering and embedding layers.
- See the full documentation.
- Paper on arXiv.
Features:
- Configurable layers with little to no need to write code.
- Natural language specific layers:
- Easily configurable word embedding layers for GloVe, Word2Vec, and fastText.
- Hugging Face transformer (BERT) context-based word vector layer.
- Full Embedding+BiLSTM-CRF implementation using easy to configure constituent layers.
- NLP-specific vectorizers that generate zensols deeplearn encoded and decoded batched tensors for spaCy parsed features, dependency tree features, overlapping text features, and others.
- Embedding layers that are easily swappable at runtime as batched tensors, along with other linguistic vectorized features.
- Support for token-, document-, and embedding-level vectorized features.
- Transformer word piece to linguistic token mapping.
- Fully documented examples provided as both command line programs and Jupyter notebooks.
- Command line support for training, testing, debugging, and creating predictions.
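As an illustration of the word piece mapping feature listed above, the core alignment idea can be sketched in plain Python. This is a simplified sketch assuming the WordPiece `##` continuation convention; it is not the library's actual API.

```python
# Map transformer word pieces back to the linguistic tokens they came
# from, using the WordPiece convention that continuation pieces start
# with '##'.  Illustrative sketch only, not the library's API.
def map_pieces_to_tokens(pieces):
    """Return a list of (token, piece_indexes) pairs."""
    mapping = []
    for i, piece in enumerate(pieces):
        if piece.startswith('##') and mapping:
            # continuation piece: extend the current token
            token, idxs = mapping[-1]
            mapping[-1] = (token + piece[2:], idxs + [i])
        else:
            # a new token starts here
            mapping.append((piece, [i]))
    return mapping

pieces = ['deep', '##zen', '##sol', '##s', 'parses', 'text']
print(map_pieces_to_tokens(pieces))
# → [('deepzensols', [0, 1, 2, 3]), ('parses', [4]), ('text', [5])]
```

This mapping is what lets token-level linguistic features (part of speech, dependency labels, and so on) line up with transformer output, which is produced per word piece rather than per token.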
Documentation
- Full documentation
- Layers: NLP specific layers such as embeddings and transformers
- Vectorizers: specific vectorizers that digitize natural language text into tensors ready as PyTorch input
- API reference
- Examples
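The vectorizers linked above turn parsed text into tensors. The core idea, mapping tokens to vocabulary indexes and padding each sentence so a batch forms a rectangular tensor, can be sketched as follows. This is a hypothetical illustration of the concept, not the library's vectorizer API.

```python
# Sketch of what an NLP vectorizer does: encode token strings as
# vocabulary indexes and pad each sentence so the batch is
# rectangular.  Hypothetical illustration only.
PAD, UNK = 0, 1

def build_vocab(corpus):
    """Assign an index to every distinct token, reserving 0 and 1."""
    vocab = {'<pad>': PAD, '<unk>': UNK}
    for sent in corpus:
        for tok in sent:
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(corpus, vocab):
    """Encode and pad a batch of tokenized sentences."""
    max_len = max(len(s) for s in corpus)
    return [[vocab.get(t, UNK) for t in s] + [PAD] * (max_len - len(s))
            for s in corpus]

corpus = [['the', 'movie', 'was', 'great'], ['terrible', 'plot']]
vocab = build_vocab(corpus)
batch = vectorize(corpus, vocab)
print(batch)  # → [[2, 3, 4, 5], [6, 7, 0, 0]]
```

Each row has the same length, so the batch can be wrapped directly as a PyTorch tensor and fed to an embedding layer.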
Obtaining
The easiest way to install the command line program is via the pip
installer:
pip3 install zensols.deepnlp
Binaries are also available on PyPI.
Usage and Examples
If you're in a rush, you can dive right into the Clickbate Text Classification example, which is a working project that uses this library. However, you will likely want to read up on the zensols deeplearn library before or during the tutorial.
The usage of this library is explained in terms of the examples:
- The Clickbate Text Classification is the best example to start with because the only code is the corpus reader and a module to remove sentence chunking (the corpus is newline-delimited headlines). Also see the Jupyter clickbate classification notebook.
- The Movie Review Sentiment example, trained and tested on the Stanford movie review and Cornell sentiment polarity data sets, assigns a positive or negative score to a natural language movie review by critics. Also see the Jupyter movie sentiment notebook.
- The Named Entity Recognizer, trained and tested on the CoNLL 2003 data set, labels named entities in natural language text. Also see the Jupyter NER notebook.
The unit test cases are also a good resource for more detailed programming integration with various parts of the library.
Attribution
This project, or example code, uses:
- Gensim for GloVe, Word2Vec, and fastText word embeddings.
- Hugging Face Transformers for BERT contextual word embeddings.
- h5py for fast read access to word embedding vectors.
- zensols nlparse for feature generation from spaCy parsing.
- zensols deeplearn for deep learning network libraries.
Corpora used include:
- The Stanford movie review data set.
- The Cornell sentiment polarity data set.
- The CoNLL 2003 data set.
Citation
If you use this project in your research, please use the following BibTeX entry:
@article{Landes_DiEugenio_Caragea_2021,
  title={DeepZensols: Deep Natural Language Processing Framework},
  url={http://arxiv.org/abs/2109.03383},
  note={arXiv: 2109.03383},
  journal={arXiv:2109.03383 [cs]},
  author={Landes, Paul and Di Eugenio, Barbara and Caragea, Cornelia},
  year={2021},
  month={Sep}
}
Community
Please star the project and let me know how and where you use this API. Contributions as pull requests, feedback, and any other input are welcome.
Changelog
An extensive changelog is available here.
License
Copyright (c) 2020 - 2021 Paul Landes