cleanmydata

A data cleaning library for text processing

These details have not been verified by PyPI

Project links

Homepage

Project description

cleanmydata

This library provides common functions for cleaning text data.

It takes a list of data cleaning operations and either a string or a pandas dataframe as input.

Functions:

1. Remove new lines
2. Remove emails
3. Remove URLs
4. Remove hashtags (#hashtag)
5. Remove the string if it contains only numbers
6. Remove mentions (@user)
7. Remove retweets (RT...)
8. Remove text between square brackets [ ]
9. Replace multiple whitespace characters with a single whitespace
10. Replace characters repeated more than twice with a single occurrence
11. Remove emojis
12. Count characters (only for dataframe; creates a new column)
13. Count words (only for dataframe; creates a new column)
14. Calculate average word length (only for dataframe; creates a new column)
15. Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
16. Detect language (uses fasttext-langdetect) (only for dataframe; creates two new columns, lang and lang_prob)
17. Detect language (uses fasttext-langdetect) (only for dataframe; creates just one column with language and probability; takes less time)
18. Remove HTML tags
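
Several of the text-removal operations above can be approximated with regular expressions. The sketch below is illustrative only: the function names and patterns are assumptions, not the library's actual implementation, and real emails and URLs have edge cases that these simple patterns miss.

```python
import re

# Illustrative patterns only; the library's own patterns may differ.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
BRACKET_RE = re.compile(r"\[[^\]]*\]")
WHITESPACE_RE = re.compile(r"\s+")


def remove_emails(text: str) -> str:
    """Delete anything that looks like user@host.tld."""
    return EMAIL_RE.sub("", text)


def remove_urls(text: str) -> str:
    """Delete tokens starting with http://, https:// or www."""
    return URL_RE.sub("", text)


def remove_hashtags(text: str) -> str:
    """Delete #hashtag tokens."""
    return HASHTAG_RE.sub("", text)


def remove_mentions(text: str) -> str:
    """Delete @user tokens. Run this after remove_emails, otherwise the
    '@host' part of an email address is stripped as if it were a mention."""
    return MENTION_RE.sub("", text)


def remove_bracketed(text: str) -> str:
    """Delete text enclosed in square brackets, brackets included."""
    return BRACKET_RE.sub("", text)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim both ends."""
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "Hello folks. abc@example.com #hashtag"
cleaned = collapse_whitespace(remove_hashtags(remove_emails(sample)))
print(cleaned)  # Hello folks.
```

Removing a token leaves its surrounding spaces behind, which is why the whitespace step runs last in this sketch.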

How to install?

pip install cleanmydata

Parameters

lst (list) - List of data cleaning operations to apply
data (string or dataframe) - Data to be cleaned
column (string) - Dataframe column on which to perform the operations; only for dataframes
save (boolean) - Whether to save the results to a new file
name (string) - Name of the new file if save is True
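
To make the role of lst concrete, here is a minimal sketch of how a list of operation numbers might be dispatched to cleaning functions. It assumes each number refers to an operation's position in the Functions list, as the examples in this document suggest. The OPERATIONS registry, the apply_operations helper, and the regular expressions are hypothetical and are not taken from the library's source.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry covering only a few operations. The keys are assumed
# to be positions in the Functions list; the bodies are illustrative.
OPERATIONS: Dict[int, Callable[[str], str]] = {
    1: lambda s: s.replace("\n", " "),            # remove new lines
    2: lambda s: re.sub(r"\S+@\S+\.\S+", "", s),  # remove emails
    4: lambda s: re.sub(r"#\w+", "", s),          # remove hashtags
    9: lambda s: re.sub(r"\s+", " ", s).strip(),  # collapse whitespace
}


def apply_operations(lst: List[int], data: str) -> str:
    """Apply each numbered operation to data, in the order given."""
    for number in lst:
        try:
            operation = OPERATIONS[number]
        except KeyError:
            raise ValueError(f"Unsupported operation number: {number}") from None
        data = operation(data)
    return data


print(apply_operations([2, 4, 9], "Hello folks. abc@example.com #hashtag"))
# Hello folks.
```

Applying the operations in list order is a design choice of this sketch; whether clean_data honors the order of lst is not stated here.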

Usage

Import the library
from cleanmydata.functions import *
Call the clean_data function and pass the parameters you need.
By default, if a dataframe is passed, all NA values are dropped (dropna).

Examples

To remove emails and hashtags
mydata = "Hello folks. abc@example.com #hashtag"
mydata = clean_data(lst=[2, 4], data=mydata)
print(mydata)
To count stopwords, remove mentions and URLs, and save the result to a file from a dataframe
import pandas as pd  # needed for read_csv
df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')
# 15 = count stopwords, 6 = remove mentions, 3 = remove URLs
df = clean_data(lst=[15, 6, 3], data=df, column='comments', save=True, name='my custom file name')

Other notes

If you use the stopword count, make sure the spaCy model en_core_web_sm is installed.
python -m spacy download en_core_web_sm

More options and enhancements coming soon...

Project details

Release history

0.1.1 (this version) - Oct 6, 2024
0.1.0 - Oct 6, 2024

Download files

Source Distribution

cleanmydata-0.1.1.tar.gz (6.8 kB)

Uploaded Oct 6, 2024 Source

Built Distribution

cleanmydata-0.1.1-py3-none-any.whl (7.4 kB)

Uploaded Oct 6, 2024 Python 3

File details

Details for the file cleanmydata-0.1.1.tar.gz.

File metadata

Download URL: cleanmydata-0.1.1.tar.gz
Upload date: Oct 6, 2024
Size: 6.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for cleanmydata-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`ffc6b700a342eec8f10f5b0691eef938643f1c80c8883ac8b917625a9d5bcaa5`
MD5	`dc2105b46017d06088d6fd229bb8303f`
BLAKE2b-256	`6b6d20da6dd8d536ada9bc3aefe42531a53e23c2f57008fbae96fc81ba486cc3`

File details

Details for the file cleanmydata-0.1.1-py3-none-any.whl.

File metadata

Download URL: cleanmydata-0.1.1-py3-none-any.whl
Upload date: Oct 6, 2024
Size: 7.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for cleanmydata-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`45f62898883f190d17035d7ba127112c7c874a0e34e22aec047f4b3b44e44187`
MD5	`c9f219086ab5b18d86a88ba136b4733f`
BLAKE2b-256	`f1f0de5d4a675b6262841da02dbf056dae14ecba2da91b18c3721db296b82154`
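
The SHA256 digests listed above can be compared against a downloaded file before installing it. The following is a minimal sketch using only the Python standard library. The helper names sha256_of and matches_digest are illustrative, and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest,
    ignoring case and surrounding whitespace in the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_digest("cleanmydata-0.1.1.tar.gz", "ffc6b700a342eec8f10f5b0691eef938643f1c80c8883ac8b917625a9d5bcaa5") should return True if the source distribution in the current directory is intact.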