A data cleaning library for text processing

cleanmydata

This library provides common functions for cleaning text data.

It takes a list of data cleaning operations and either a string or a pandas dataframe as input.

Functions:

  1. Remove new lines
  2. Remove emails
  3. Remove URLs
  4. Remove hashtags (#hashtag)
  5. Remove the string if it contains only numbers
  6. Remove mentions (@user)
  7. Remove retweets (RT...)
  8. Remove text between the square brackets [ ]
  9. Remove multiple whitespaces and replace with one whitespace
  10. Replace characters that occur more than twice in a row with a single occurrence
  11. Remove emojis
  12. Count characters (only for dataframe; creates a new column)
  13. Count words (only for dataframe; creates a new column)
  14. Calculate average word length (only for dataframe; creates a new column)
  15. Count stopwords (only for dataframe; creates two new columns, stopwords and stopword_count)
  16. Detect language (uses fasttext-langdetect) (only for dataframe; creates two new columns, lang and lang_prob)
  17. Detect language (uses fasttext-langdetect) (only for dataframe; creates just one column with language and probability; takes less time)
  18. Remove HTML tags
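
Several of the operations above are simple pattern substitutions. The following is a minimal sketch of how a few of them (2, 4, 6, 9 and 10) could be written with the standard re module. The patterns and function names are assumptions made for illustration and are not taken from the library's source, so its actual behaviour may differ.

```python
import re

# Illustrative patterns only; the library's own expressions may differ.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # one character repeated three or more times
SPACES_RE = re.compile(r"\s+")


def remove_emails(text: str) -> str:
    """Drop anything that looks like an email address."""
    return EMAIL_RE.sub("", text)


def remove_hashtags(text: str) -> str:
    """Drop #hashtag tokens."""
    return HASHTAG_RE.sub("", text)


def remove_mentions(text: str) -> str:
    """Drop @user tokens. Run this after remove_emails, otherwise the
    '@example' part of an address would be stripped first."""
    return MENTION_RE.sub("", text)


def collapse_repeats(text: str) -> str:
    """Replace a run of three or more identical characters with one."""
    return REPEAT_RE.sub(r"\1", text)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim both ends."""
    return SPACES_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "Hello folks. abc@example.com #hashtag"
    cleaned = collapse_whitespace(remove_hashtags(remove_emails(sample)))
    print(cleaned)
```

Removing a token leaves the surrounding spaces behind, which is why the sketch finishes with the whitespace step.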

How to install?

pip install cleanmydata

Parameters

  1. lst (list) - List of data cleaning operations
  2. data (string or dataframe) - Data to be passed
  3. column (string) - Dataframe column on which to perform the operations; only used when data is a dataframe
  4. save (boolean) - Whether to save the results to a new file
  5. name (string) - Name of the new file; only used when save is True
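
The lst parameter selects operations by their number in the function list. One way such a numbered selection could work is a lookup table applied in order, sketched below for a plain string. The OPERATIONS table and the apply_operations helper are hypothetical names for this illustration; they are not part of the cleanmydata API, and only operations 1 and 9 are filled in.

```python
import re
from typing import Callable, Dict, List

# Hypothetical mapping from operation number to a string function.
# Only two operations are sketched; replacing a newline with a space
# (rather than deleting it) is an assumption.
OPERATIONS: Dict[int, Callable[[str], str]] = {
    1: lambda text: text.replace("\n", " "),    # remove new lines
    9: lambda text: re.sub(r"\s+", " ", text),  # collapse whitespace
}


def apply_operations(lst: List[int], data: str) -> str:
    """Apply the numbered operations to data in the order given."""
    for number in lst:
        try:
            operation = OPERATIONS[number]
        except KeyError:
            raise ValueError(f"Unknown operation number: {number}") from None
        data = operation(data)
    return data


if __name__ == "__main__":
    print(apply_operations([1, 9], "first line\nsecond    line"))
```

In this sketch the order of lst matters, since each operation receives the output of the previous one.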

Usage

  1. Import the library
    from cleanmydata.functions import *
  2. Call the method clean_data, and pass the parameters as you wish.
  3. By default, when a dataframe is passed, all NA values are dropped (dropna).

Examples

  1. To remove emails and hashtags
    mydata = "Hello folks. abc@example.com #hashtag"
    mydata = clean_data(lst=[2, 4], data=mydata)
    print(mydata)
  2. To count stopwords, remove mentions and URLs, and save the result to a file from a dataframe
    import pandas as pd
    df = pd.read_csv('data/my_csv.csv', encoding='ISO-8859-1', dtype='unicode')
    df = clean_data(lst=[15, 6, 3], data=df, column='comments', save=True, name='my custom file name')
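
The stopword count in the second example yields both the matched words and their number. As a rough sketch of that idea for a single string, the snippet below uses a tiny hard-coded stopword set. The set and the find_stopwords helper are assumptions for illustration only; according to the notes below, the library itself relies on the spaCy model en_core_web_sm.

```python
from typing import List, Tuple

# Tiny illustrative stopword set; an assumption for this sketch only.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "to", "of", "in"})


def find_stopwords(text: str) -> Tuple[List[str], int]:
    """Return the stopwords found in text and how many there are."""
    # Split on whitespace, strip common punctuation, compare in lowercase.
    words = [word.strip(".,!?;:").lower() for word in text.split()]
    found = [word for word in words if word in STOPWORDS]
    return found, len(found)


if __name__ == "__main__":
    words, count = find_stopwords("The cat is in the hat.")
    print(words, count)
```

Applied to each row of a dataframe column, the two return values would correspond to the two new columns described for operation 15.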

Other notes

If you use the stopword count (operation 15), make sure the spaCy model en_core_web_sm is installed:
python -m spacy download en_core_web_sm

More options and enhancements coming soon...
