Library for data cleaning operations
Project description
cleanmydata
This library provides a set of common functions for cleaning text data.
It takes a list of data cleaning operations and either a string or a pandas dataframe as input.
Functions:
1. Remove new lines
2. Remove emails
3. Remove URLs
4. Remove hashtags (#hashtag)
5. Remove the string if it contains only numbers
6. Remove mentions (@user)
7. Remove retweets (RT...)
8. Remove text between square brackets [ ]
9. Replace multiple whitespace characters with a single space
10. Replace characters that occur more than twice with a single occurrence
11. Remove emojis
12. Count characters (only for dataframe; creates a new column)
13. Count words (only for dataframe; creates a new column)
14. Calculate average word length (only for dataframe; creates a new column)
15. Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
16. Detect language (uses fasttext-langdetect) (only for dataframe; creates two new columns, lang and lang_prob)
17. Detect language (uses fasttext-langdetect) (only for dataframe; creates a single column with language and probability; takes less time)
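To give a feel for what a few of these operations do, here is a minimal standard-library sketch. The regular expressions, the function names sketch_clean and sketch_metrics, and the result keys are all assumptions made for illustration; they are not taken from cleanmydata and may differ from what the library actually does.

```python
import re

# Illustrative patterns only; assumptions, not cleanmydata's own regexes.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
BRACKET_RE = re.compile(r"\[[^\]]*\]")
WHITESPACE_RE = re.compile(r"\s+")


def sketch_clean(text: str) -> str:
    """Apply a handful of the listed removals to one string."""
    # Emails go first so the mention pattern does not eat "@example.com".
    for pattern in (EMAIL_RE, URL_RE, HASHTAG_RE, MENTION_RE, BRACKET_RE):
        text = pattern.sub("", text)
    # Collapse runs of whitespace (including new lines) into one space.
    return WHITESPACE_RE.sub(" ", text).strip()


def sketch_metrics(text: str) -> dict:
    """Character count, word count, and average word length (hypothetical keys)."""
    words = text.split()
    avg = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"char_count": len(text), "word_count": len(words), "avg_word_length": avg}


print(sketch_clean("Hello folks. abc@example.com #hashtag"))  # prints: Hello folks.
print(sketch_metrics("Hello folks."))
```

Ordering matters in a sketch like this: running the mention pattern before the email pattern would leave half an email address behind.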
How to install?
pip install cleanmydata
Parameters
- lst (list) - List of data cleaning operations, given as numbers matching their position in the Functions list
- data (string or dataframe) - Data to clean
- column (string) - Dataframe column on which to perform the operations; only used when data is a dataframe
- save (boolean) - Whether to save the results to a new file
- name (string) - Name of the new file if save is True
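The lst parameter can be pictured as a list of keys into a table of operations that are applied in order. The sketch below assumes the numbers follow each operation's position in the Functions list, as the examples in this README suggest (2 for emails, 4 for hashtags). OPERATIONS and apply_operations are hypothetical names, and the transforms are stand-ins rather than the library's code.

```python
import re
from typing import Callable, Dict, List

# Hypothetical mapping from operation number to a string transform.
# Only three operations are sketched; the regexes are assumptions.
OPERATIONS: Dict[int, Callable[[str], str]] = {
    2: lambda s: re.sub(r"\S+@\S+\.\S+", "", s),   # remove emails
    4: lambda s: re.sub(r"#\w+", "", s),           # remove hashtags
    9: lambda s: re.sub(r"\s+", " ", s).strip(),   # collapse whitespace
}


def apply_operations(lst: List[int], data: str) -> str:
    """Apply each requested operation to the string, in the order given."""
    for number in lst:
        if number not in OPERATIONS:
            raise ValueError(f"unsupported operation number: {number}")
        data = OPERATIONS[number](data)
    return data


print(apply_operations([2, 4, 9], "Hello folks. abc@example.com #hashtag"))
# prints: Hello folks.
```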
Usage
- Import the library
from cleanmydata.functions import clean_data
- Call the function clean_data and pass the parameters you need.
- By default, if a dataframe is passed, all NA values are dropped (dropna).
Examples
- To remove emails and hashtags
mydata = "Hello folks. abc@example.com #hashtag"
mydata = clean_data(lst=[2, 4], data=mydata)
print(mydata)
- To count stopwords, remove mentions, and remove emails from a dataframe, then save the result to a file
import pandas as pd
df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')
df = clean_data(lst=[15, 6, 2], data=df, column='comments', save=True, name='my custom file name')
Other notes
If you use the stopword count, make sure the spaCy model en_core_web_sm is installed:
python -m spacy download en_core_web_sm
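A quick way to check for the model before calling the stopword operation is sketched below. It assumes the model is installed as an importable Python package named en_core_web_sm, which is how spaCy distributes its models; model_installed is a hypothetical helper, not part of cleanmydata.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not model_installed():
    print("Model missing. Run: python -m spacy download en_core_web_sm")
```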
More options and enhancements coming soon...
Download files
Source Distributions
No source distribution files available for this release.
Built Distribution
Hashes for cleanmydata-1.0.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | aa34306fd9589100a5c8cca1ee7c67b2b5859f1642fa8a8d59a1038eca5bef88
MD5 | 32d32b7b67af6580aeb038874b465099
BLAKE2b-256 | 32b2d7f357826e96b03dc8e2b87a88c75b34f4f3e06c6732e19787ca435e0fa8