
A fast framework for pre-processing text: cleaning, vocabulary reduction, feature extraction, and vectorization. Implemented with parallel processing using a user-defined number of processes.


Preprocess NLP Text

Framework Description

A simple and fast framework for

  • Preprocessing or Cleaning of text
  • Extracting top words or reduction of vocabulary
  • Feature Extraction
  • Word Vectorization

Uses parallel execution by leveraging Python's multiprocessing library for the text cleaning, top-word extraction, and feature extraction modules. Each of these offers both a sequential path (for less CPU-intensive workloads) and a parallel path, with a user-defined number of processes.

PS: There is no multiprocessing support for word vectorization
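For the parallel modules, the execution follows the standard multiprocessing fan-out pattern. The sketch below illustrates that pattern only, not the package's internals; clean_record is a hypothetical stand-in for the cleaning stages described in the next section.

    from multiprocessing import Pool

    def clean_record(text):
        # stand-in for the cleaning stages defined below
        return text.strip().lower()

    def clean_corpus(records, n_processes=4):
        # fan records out across a user-defined number of worker processes
        with Pool(processes=n_processes) as pool:
            return pool.map(clean_record, records)

    if __name__ == "__main__":
        print(clean_corpus(["  First DOC  ", "  Second DOC  "], n_processes=2))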

  • Cleaning Text - Cleans text through a series of defined stages implemented with standard Natural Language Processing (NLP) techniques
  • Vocab Reduction - Finds the top words in the corpus, lets you choose a threshold for which words stay in the corpus, and replaces the rest
  • Feature Extraction - Extracts features from a corpus of text using spaCy
  • Word Vectorization - Simple code to convert words to vectors (TF-IDF, Word2Vec, GloVe) using scikit-learn and Gensim

Preprocess/Cleaning Module

Uses NLTK for a few of the stages defined below. The cleaning stages are:

Stage                 Description
remove_tags_nonascii  Removes HTML tags, emails, URLs, and non-ASCII characters; converts accented characters
lower_case            Converts the text to lower case
expand_contractions   Expands word contractions
remove_punctuation    Removes punctuation from text; sentences remain separated by ' . '
remove_esacape_chars  Removes escape characters like \n, \t, etc.
remove_stopwords      Removes stopwords using NLTK
remove_numbers        Removes all digits in the text
lemmatize             Uses WordNetLemmatizer to lemmatize text
stemming              Uses SnowballStemmer to stem text
min_word_len          Minimum word length to keep in text
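As a rough illustration (not the package's actual implementation), a few of these stages can be reproduced with plain NLTK and the standard library:

    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def clean(text, min_word_len=3):
        text = text.lower()                                                # lower_case
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove_punctuation
        text = re.sub(r"\d+", "", text)                                    # remove_numbers
        tokens = [t for t in text.split() if t not in STOPWORDS]           # remove_stopwords
        tokens = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatize
        tokens = [t for t in tokens if len(t) >= min_word_len]             # min_word_len
        return " ".join(tokens)

    print(clean("The 3 quick brown foxes were jumping over the lazy dogs!"))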

Reduction of Vocabulary

Shortlists the top words based on a percentage given as input and efficiently replaces the words that are not shortlisted. Also supports both parallel and sequential processing.
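The core idea can be sketched with collections.Counter; the keep_pct parameter and the <UNK> replacement token below are illustrative assumptions, not the module's actual API:

    from collections import Counter

    def reduce_vocab(tokenized_docs, keep_pct=0.8, unk="<UNK>"):
        # count every token across the corpus
        counts = Counter(t for doc in tokenized_docs for t in doc)
        # shortlist the top keep_pct fraction of distinct words by frequency
        n_keep = int(len(counts) * keep_pct)
        keep = {w for w, _ in counts.most_common(n_keep)}
        # replace everything outside the shortlist
        return [[t if t in keep else unk for t in doc] for doc in tokenized_docs]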

Feature Extraction Module

Uses spaCy's pipe to avoid unnecessary parsing and increase speed. The feature extraction stages are:

Stage         Description
nouns         Extracts the list of nouns from the given string
verbs         Extracts the list of verbs from the given string
adjs          Extracts the list of adjectives from the given string
noun_phrases  Extracts the list of noun phrases (noun chunks) from the given string
keywords      Uses YAKE to extract keywords from text
ner           Extracts Person, Location, and Organization named entities
numbers       Extracts all digits in the text
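The batching pattern behind this module is spaCy's nlp.pipe; here is a minimal sketch of that pattern (assuming the en_core_web_sm model is installed, and not the module's exact code):

    import spacy

    # requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    texts = ["Apple is looking at buying a U.K. startup.",
             "The quick brown fox jumps over the lazy dog."]

    # nlp.pipe batches documents through the pipeline instead of parsing one at a time
    for doc in nlp.pipe(texts):
        nouns = [t.text for t in doc if t.pos_ == "NOUN"]
        verbs = [t.text for t in doc if t.pos_ == "VERB"]
        chunks = [c.text for c in doc.noun_chunks]
        # spaCy labels persons PERSON, locations GPE/LOC, organizations ORG
        ents = [(e.text, e.label_) for e in doc.ents
                if e.label_ in ("PERSON", "GPE", "LOC", "ORG")]
        print(nouns, verbs, chunks, ents)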

Word Vectorization

Functions written in Python to convert words to vectors using scikit-learn and Gensim. Contains four vectorization techniques: CountVectorizer (bag-of-words model), TfidfVectorizer, Word2Vec, and GloVe. Also contains helpers to get the top words by IDF score, similar words with similarity scores, and average sentence-wise vectors.
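The underlying library calls are standard; a minimal sketch of the TF-IDF and Word2Vec paths (not the module's exact wrappers) looks like:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from gensim.models import Word2Vec

    corpus = ["the cat sat on the mat", "the dog sat on the log"]

    # TF-IDF: sparse document-term matrix plus per-term IDF scores
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(corpus)
    top_by_idf = sorted(zip(tfidf.get_feature_names_out(), tfidf.idf_),
                        key=lambda pair: pair[1], reverse=True)

    # Word2Vec: train on tokenized sentences, then query similar words
    model = Word2Vec([doc.split() for doc in corpus],
                     vector_size=50, window=3, min_count=1)
    similar = model.wv.most_similar("cat", topn=3)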


Code - Components

Various Python files and their purposes are listed here:

preprocess_nlp.py         Cleaning/preprocessing of text (sequential and parallel)
vocab_elimination_nlp.py  Reduction of vocabulary
feature_extraction.py     Feature extraction
vectorization_nlp.py      Word vectorization


How to run

  1. pip install -r requirements.txt
  2. Import preprocess_nlp.py and use the functions preprocess_nlp (sequential) and asyn_call_preprocess (parallel) as defined in the example notebook; a usage sketch follows this list
  3. Import vocab_elimination_nlp.py and use the functions as defined in the notebook Vocab_Elimination_Example_Notebook.ipynb
  4. Import feature_extraction.py and use the functions as defined in the notebook Feature_Extraction_Example_Notebook.ipynb
  5. Import vectorization_nlp.py and use the functions as defined in the notebook Vectorization_Example_Notebook.ipynb
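A possible call pattern, assuming signatures along these lines (the docs list and the n_processes argument are assumptions; the docstrings and example notebooks define the actual arguments):

    import preprocess_nlp as pn

    docs = ["<p>First document</p>", "Second document!"]

    # hypothetical sequential call; check the docstrings for the real signature
    cleaned = pn.preprocess_nlp(docs)

    # hypothetical parallel call with a user-defined number of processes
    cleaned_parallel = pn.asyn_call_preprocess(docs, n_processes=4)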

Sequential & Parallel Processing

  1. Sequential - Processes records one after another; uses less memory but is slower than parallel processing
  2. Parallel - Spawns multiple processes (a customizable, user-defined number) to preprocess text in parallel; memory-intensive but faster

Refer to the code for docstrings and other function-related documentation.
Cheers :)
