cleanmydata
A data cleaning library for text processing
Project description
This library provides the essential functions for cleaning text data.
It takes a list of data cleaning operations and either a string or a pandas DataFrame as input.
Functions:
- Remove new lines
- Remove emails
- Remove URLs
- Remove hashtags (#hashtag)
- Remove strings that contain only numbers
- Remove mentions (@user)
- Remove retweets (RT...)
- Remove text between square brackets [ ]
- Collapse multiple whitespace characters into a single space
- Collapse characters repeated more than twice into a single occurrence
- Remove emojis
- Count characters (only for dataframe; creates a new column)
- Count words (only for dataframe; creates a new column)
- Calculate average word length (only for dataframe; creates a new column)
- Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
- Detect language (uses fasttext-langdetect) (only for dataframe; creates two new columns, lang and lang_prob)
- Detect language (uses fasttext-langdetect) (only for dataframe; creates just one column with language and probability; takes less time)
- Remove HTML tags
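Most of the removal operations above are the kind of thing typically done with regular expressions. The sketch below is only an illustration of that idea, not the library's actual implementation; the function names here are hypothetical:

```python
import re

def remove_emails(text):
    # Strip anything that looks like an email address.
    return re.sub(r"\S+@\S+\.\S+", "", text)

def remove_mentions(text):
    # Strip @user mentions.
    return re.sub(r"@\w+", "", text)

def remove_hashtags(text):
    # Strip #hashtag tokens.
    return re.sub(r"#\w+", "", text)

def collapse_whitespace(text):
    # Replace runs of whitespace with a single space.
    return re.sub(r"\s+", " ", text).strip()

sample = "Hello @user!   Reach me at abc@example.com #contact"
cleaned = collapse_whitespace(
    remove_hashtags(remove_mentions(remove_emails(sample)))
)
print(cleaned)  # → "Hello ! Reach me at"
```

The real library bundles such steps behind a single clean_data call selected by operation number, as shown in the usage examples below.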
How to install?
pip install cleanmydata
Parameters
- lst (list) - List of data cleaning operations
- data (string or dataframe) - Data to be passed
- column (string) - Dataframe column on which to perform the operations; only for dataframe
- save (boolean) - Whether to save the results to a new file
- name (string) - Name of the new file if save is True
Usage
- Import the library
from cleanmydata.functions import *
- Call the clean_data method with the desired parameters.
- By default, if a dataframe is passed, all NA values are dropped (dropna)
Examples
- To remove emails and hashtags
mydata = "Hello folks. abc@example.com #hashtag"
mydata = clean_data(lst=[2, 4], data=mydata)
print(mydata)
- To count stopwords, remove mentions and URLs, and save the result to a file from a dataframe
df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')
df = clean_data(lst=[15, 6, 2], data=df, column='comments', save=True, name='my custom file name')
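For context, the count-style operations (count characters, count words, average word length) add new columns derived from the text column. The pandas sketch below mirrors that behavior under assumed column names (char_count, word_count, avg_word_len are illustrative, not necessarily the names the library uses):

```python
import pandas as pd

df = pd.DataFrame({"comments": ["hello world", "one two three four"]})

# Character count per row (as with the "count characters" operation).
df["char_count"] = df["comments"].str.len()

# Word count per row (as with the "count words" operation).
df["word_count"] = df["comments"].str.split().str.len()

# Average word length per row.
df["avg_word_len"] = df["comments"].apply(
    lambda s: sum(len(w) for w in s.split()) / len(s.split())
)

print(df)
```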
Other notes
If using the stopword operations, make sure you have the spaCy model en_core_web_sm installed:
python -m spacy download en_core_web_sm
More options and enhancements coming soon...