Data-Purifier
A Python library for automated exploratory data analysis (EDA), automated data cleaning, and automated data preprocessing for machine learning and natural language processing applications.
Get Started
Install the packages
pip install data-purifier
python -m spacy download en_core_web_sm
Load the module
import pandas as pd
from datapurifier import Mleda, Nlpeda, Nlpurifier
Automated EDA for Machine Learning
- It reports the shape of the dataset, the number of categorical and numerical features, a description of the dataset, and the number of null values along with their percentage.
- To help you understand the distribution of the data and draw useful insights, it generates interactive plots: select the column you want and the plot is drawn automatically. The available plots include:
- Count plot
- Correlation plot
- Joint plot
- Pair plot
- Pie plot
Code Implementation
Load the dataset and pass the dataframe to Mleda to start the automated EDA:
df = pd.read_csv("./datasets/iris.csv")
ae = Mleda(df)
ae
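To make the summary described above concrete, here is a minimal sketch of how a similar overview could be computed with plain pandas. It is not the library's implementation: the `summarize` helper and the small `sample` frame are hypothetical, introduced only so the example runs without `./datasets/iris.csv`.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect a dataset overview with plain pandas."""
    # Split columns into numerical and non-numerical (treated here as categorical)
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()

    # Null count per column and the matching percentage of rows
    null_counts = df.isnull().sum()
    null_table = pd.DataFrame({
        "null_count": null_counts,
        "null_percent": (null_counts / len(df) * 100).round(2),
    })

    return {
        "shape": df.shape,
        "numerical": numerical,
        "categorical": categorical,
        "nulls": null_table,
        "description": df.describe(include="all"),
    }


# Assumption: a tiny stand-in frame instead of the iris CSV
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None, 6.3],
    "species": ["setosa", "setosa", "virginica", None],
})

summary = summarize(sample)
print(summary["shape"])
print(summary["nulls"])
```

Treating every non-numeric column as categorical is a simplification; the library may classify columns differently.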
Automated EDA for NLP
Basic NLP
- It checks for null rows, drops any it finds, then analyses the text row by row and returns a dataframe containing the following metrics:
- Word Count
- Character Count
- Average Word Length
- Stop Word Count
- Uppercase Word Count
You can then inspect the distribution of any of these metrics by selecting its column from the dropdown list; the plot is generated automatically.
- It can also perform sentiment analysis on the dataframe row by row, giving the polarity of each sentence (or row). You can then view the distribution of the polarity scores.
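The per-row metrics listed above can be illustrated with a short standard-library sketch. This is an approximation under stated assumptions, not the library's code: `basic_text_stats` is a hypothetical helper, `STOP_WORDS` is a tiny illustrative set (the library may use a different list), and single-letter words such as "I" are deliberately not counted as uppercase words.

```python
# Assumption: a tiny illustrative stop-word set, not the list the library uses
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "in", "i"}


def basic_text_stats(text: str) -> dict:
    """Hypothetical helper: compute simple per-row text metrics."""
    words = text.split()
    word_count = len(words)
    # Count characters in words only, ignoring whitespace
    char_count = sum(len(w) for w in words)
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_word_length": char_count / word_count if word_count else 0.0,
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        # Skip one-letter words so that "I" or "A" is not counted as shouting
        "uppercase_word_count": sum(w.isupper() and len(w) > 1 for w in words),
    }


rows = ["I LOVE the NLP library", None, "it is GREAT"]

# Drop null rows first, then analyse each remaining row
stats = [basic_text_stats(r) for r in rows if r is not None]
for row_stats in stats:
    print(row_stats)
```

With pandas, the same function could be applied to a text column after `dropna()` and the resulting dictionaries expanded into columns.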
Word Analysis
- Finds the count of a specific word supplied by the user
- Plots a word cloud
- Performs unigram, bigram, and trigram analysis, returning a dataframe for each and showing its distribution plot.
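As a rough illustration of the word and n-gram counting described above, the following sketch uses only the standard library. The `tokenize` and `ngram_counts` helpers and the sample `tweets` are hypothetical; the library's own tokenisation and output format may differ.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Hypothetical tokenizer: lowercase, keep letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(texts, n: int) -> Counter:
    """Count n-grams of size n across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Zipping n shifted copies of the token list yields consecutive n-grams
        shifted = (tokens[i:] for i in range(n))
        counts.update(" ".join(gram) for gram in zip(*shifted))
    return counts


# Assumption: a few made-up rows standing in for the tweet column
tweets = ["good morning world", "good morning everyone", "morning coffee is good"]

unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)
trigrams = ngram_counts(tweets, 3)

print(unigrams["good"])        # occurrences of one specific word
print(bigrams.most_common(3))  # most frequent bigrams
```

A `Counter` converts easily to a dataframe, for example via `pd.DataFrame(bigrams.most_common(), columns=["ngram", "count"])`, which is one way to get a table that can be plotted.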
Code Implementation
For automated EDA and automated data cleaning of a natural language dataset, load the dataset and pass the dataframe along with the name of the column containing the text.
nlp_df = pd.read_csv("./datasets/twitter16m.csv", header=None, encoding='latin-1')
nlp_df.columns = ["tweets","sentiment"]
Basic Analysis
For basic EDA, pass analyse="basic" to the constructor:
eda = Nlpeda(nlp_df, "tweets", analyse="basic")
eda.df
Word Analysis
For word-based EDA, pass analyse="word" to the constructor:
eda = Nlpeda(nlp_df, "tweets", analyse="word")
eda.unigram_df # view the unigram dataframe
Automated Data Cleaning for NLP
It provides the following cleaning techniques. Tick the checkbox for each one you want, and the operation is applied automatically.
Features | Features | Features |
---|---|---|
Drop Null Rows | Lower all Words | Contraction to Expansion |
Count URLs | Get Word Count | Count Mails |
Remove Special Characters and Punctuation | Remove Numbers and Alphanumeric Words | Remove Stop Words |
Remove Commonly Occurring Words | Remove Mails | Remove HTML Tags |
Remove URLs | Remove Multiple Spaces | Remove Accented Characters |
Lemmatize | Stemming | |
Code Implementation
pure = Nlpurifier(nlp_df, "tweets")
View the processed and purified dataframe
pure.df
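To show what a few of the cleaning options in the table might do to a single string, here is a minimal standard-library sketch. It is an assumption-laden illustration, not the library's implementation: `clean_text` is a hypothetical helper, the regular expressions are simplified, and it covers only lowercasing, HTML tags, URLs, mails, accented characters, special characters, and multiple spaces.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical helper chaining a few common text-cleaning steps."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # remove mail addresses
    # Remove accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove special characters
    text = re.sub(r"\s+", " ", text).strip()             # collapse multiple spaces
    return text


raw = "<b>Café</b>  visit https://example.com or mail me@example.com NOW!!!"
print(clean_text(raw))
```

The order of the steps matters in this sketch: URLs and mail addresses are removed before punctuation is stripped, otherwise their fragments would survive as ordinary words.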
Example:
https://colab.research.google.com/drive/1J932G1uzqxUHCMwk2gtbuMQohYZsze8U?usp=sharing