A text pre-processing package
Husky_Simplex
Text processing package
Data preprocessing is the first and one of the most important stages in developing a machine learning model, because it affects the accuracy and efficiency of the outcome. Raw text data typically contains non-contextual words, noise, misspelled words, symbols, punctuation, and syntactic elements that carry little meaning. To overcome these obstacles, raw text must be cleaned into a form suitable for statistical and computational analysis.
The purpose of this package is to provide a one-stop platform for most of the necessary text preprocessing techniques. These steps make text data more useful computationally for Natural Language Processing tasks.
Package Functions
- Tokenization - Converting string input to a list of words.
- Word counter - Counting the total number of words in the input.
- Stopword removal - Removing non-contextual words that serve only a grammatical function.
- Punctuation removal - Removing punctuation marks.
- Symbol removal - Removing symbols.
- Stemming - Reducing words to their root form by stripping suffixes such as tense endings.
- Bag of words - Representing text as word counts, ignoring word order.
- Count vectorization - Vectorization of text based on term frequency.
- TF-IDF vectorization - Vectorization of text based on term frequency in relation to document frequency.
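The steps listed above can be sketched with the Python standard library alone. This is a minimal illustration of the techniques, not the Husky_Simplex API: every function name here (`tokenize`, `remove_punctuation`, `remove_stopwords`, `naive_stem`, `preprocess`, `count_vectors`, `tfidf_vectors`) is hypothetical, the stopword list is a tiny placeholder, the stemmer is a crude suffix stripper standing in for a real algorithm such as Porter's, and the TF-IDF weighting (`tf = count / document length`, `idf = log(N / df)`) is one common variant among several.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "is", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Tokenization: lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_punctuation(tokens):
    """Strip ASCII punctuation and symbols, dropping tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (tok.translate(table) for tok in tokens)
    return [tok for tok in cleaned if tok]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [tok for tok in tokens if tok not in STOPWORDS]


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, clean, and stem one document."""
    tokens = remove_stopwords(remove_punctuation(tokenize(text)))
    return [naive_stem(tok) for tok in tokens]


def count_vectors(docs):
    """Count vectorization: return (vocabulary, term counts per document)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """TF-IDF with tf = count / len(doc) and idf = log(N / df)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[i] > 0) for i in range(len(vocab))]
    vectors = []
    for doc, vec in zip(docs, counts):
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(c / total) * math.log(n_docs / df[i]) for i, c in enumerate(vec)]
        )
    return vocab, vectors


texts = ["The cats are playing!", "A cat played in the garden."]
docs = [preprocess(t) for t in texts]
# docs == [["cat", "play"], ["cat", "play", "garden"]]
word_counts = [len(doc) for doc in docs]  # word counter: [2, 3]
vocab, counts = count_vectors(docs)
# vocab == ["cat", "garden", "play"]; counts == [[1, 0, 1], [1, 1, 1]]
_, tfidf = tfidf_vectors(docs)
```

In this example only "garden" receives a non-zero TF-IDF weight, because "cat" and "play" occur in both documents and so have an inverse document frequency of zero. That is the sense in which TF-IDF relates term frequency to document frequency.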
Installation
pip install husky_simplex
or
git clone https://github.com/Sudhendra/Husky_Simplex.git
cd Husky_Simplex
pip install -r requirements.txt
Hashes for husky_simplex-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9adf35240cf7e4942319ca646085a0183a0086de1806f30a009211d5001fa590
MD5 | 1ba2afdb9717ba81ebf47f6a100820d5
BLAKE2b-256 | 1b9e024e7b0522789fc9856dcb2b1997a682c6244dd17018e8070e9903da1435