A data cleaning library for text processing
cleanmydata
This library provides common functions for cleaning text data.
It takes a list of data cleaning operations and either a string or a pandas dataframe as input.
Functions:
1. Remove new lines
2. Remove emails
3. Remove URLs
4. Remove hashtags (#hashtag)
5. Remove the string if it contains only numbers
6. Remove mentions (@user)
7. Remove retweets (RT...)
8. Remove text between square brackets [ ]
9. Replace multiple whitespace characters with a single space
10. Replace characters that occur more than twice with a single occurrence
11. Remove emojis
12. Count characters (only for dataframe; creates a new column)
13. Count words (only for dataframe; creates a new column)
14. Calculate average word length (only for dataframe; creates a new column)
15. Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
16. Detect language (uses fasttext-langdetect) (only for dataframe; creates two new columns, lang and lang_prob)
17. Detect language (uses fasttext-langdetect) (only for dataframe; creates just one column with language and probability; takes less time)
18. Remove HTML tags
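To make a few of these operations concrete, here is a minimal sketch using only the standard library. The patterns and the helper name `sketch_clean` are assumptions made for illustration; they are not taken from cleanmydata, whose own patterns may differ.

```python
import re

# Illustrative patterns only; cleanmydata's own patterns may differ.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
BRACKET_RE = re.compile(r"\[[^\]]*\]")
WHITESPACE_RE = re.compile(r"\s+")


def sketch_clean(text: str) -> str:
    """Apply a few of the listed removals to one string (hypothetical helper)."""
    # Emails go first so the mention pattern does not eat part of an address.
    for pattern in (EMAIL_RE, URL_RE, HASHTAG_RE, MENTION_RE, BRACKET_RE):
        text = pattern.sub("", text)
    # Collapse the gaps left behind into single spaces.
    return WHITESPACE_RE.sub(" ", text).strip()


print(sketch_clean("Hello folks. abc@example.com #hashtag"))  # Hello folks.
```

The order matters in this sketch: removing mentions before emails would leave fragments such as "abc" behind.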
How to install?
pip install cleanmydata
Parameters
- lst (list) - List of integers selecting the data cleaning operations to apply
- data (string or dataframe) - Data to be cleaned
- column (string) - Dataframe column to operate on; only used when data is a dataframe
- save (boolean) - Whether to save the results to a new file
- name (string) - Name of the new file if save is True
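One way to picture how a list of operation numbers could drive the cleaning is a lookup table from number to function. This is only a sketch of the idea, not cleanmydata's implementation: `OPERATIONS` and `apply_operations` are hypothetical names, and only two entries are filled in, using the numbers that appear in the examples in this README.

```python
import re
from typing import Callable, Dict, List

# Hypothetical number-to-operation table; only two entries are filled in,
# numbered as in the README examples (2 = emails, 4 = hashtags).
OPERATIONS: Dict[int, Callable[[str], str]] = {
    2: lambda s: re.sub(r"\S+@\S+\.\S+", "", s),
    4: lambda s: re.sub(r"#\w+", "", s),
}


def apply_operations(lst: List[int], data: str) -> str:
    """Run the selected operations over a string in the order given."""
    for number in lst:
        if number not in OPERATIONS:
            raise ValueError(f"unknown operation number: {number}")
        data = OPERATIONS[number](data)
    return data


cleaned = apply_operations([2, 4], "Hello folks. abc@example.com #hashtag")
print(cleaned.strip())
```

Rejecting unknown numbers with an error is a design choice of this sketch; whether the library does the same is not stated here.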
Usage
- Import the library
from cleanmydata.functions import *
- Call the function clean_data with the parameters you need.
- By default, if a dataframe is passed, all NA values are dropped (dropna).
Examples
- To remove emails and hashtags
mydata = "Hello folks. abc@example.com #hashtag"
mydata = clean_data(lst=[2, 4], data=mydata)
print(mydata)
- To count stopwords and remove mentions and URLs in a dataframe, then save the result to a file
import pandas as pd
df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')
df = clean_data(lst=[15, 6, 3], data=df, column='comments', save=True, name='my custom file name')
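The dataframe-only counting operations (characters, words, average word length) each add a per-row statistic. As a rough sketch of what such statistics look like for a single string, assuming whitespace-separated words; the function name `text_stats` and the key names are hypothetical and are not the column names the library creates:

```python
def text_stats(text: str) -> dict:
    """Character count, word count, and average word length for one string."""
    words = text.split()
    total_letters = sum(len(word) for word in words)
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Guard against empty input to avoid dividing by zero.
        "avg_word_length": total_letters / len(words) if words else 0.0,
    }


print(text_stats("data cleaning"))
```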
Other notes
If you use the stopword operation, make sure the spaCy model en_core_web_sm is installed:
python -m spacy download en_core_web_sm
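Since spaCy models are installed as regular Python packages, you can check for the model before running the stopword operation. A small sketch using only the standard library; the helper name `spacy_model_installed` is made up for this example:

```python
import importlib.util


def spacy_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if the spaCy model package can be found on the import path."""
    return importlib.util.find_spec(name) is not None


if not spacy_model_installed():
    print("Run: python -m spacy download en_core_web_sm")
```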
More options and enhancements coming soon...