Skip to main content

A Python library for cleaning and preprocessing text data by removing,emojies,internet words, special characters, digits, HTML tags, URLs, and stopwords.

Project description

TextPrettifier

TextPrettifier is a Python library for cleaning text data by removing HTML tags, URLs, numbers, special characters, contractions, and stopwords.

TextPrettifier Key Features

1. Removing Emojis

The remove_emojis method removes emojis from the text.

2. Removing Internet Words

The remove_internet_words method removes internet-specific words from the text.

3. Removing HTML Tags

The remove_html_tags method removes HTML tags from the text.

4. Removing URLs

The remove_urls method removes URLs from the text.

5. Removing Numbers

The remove_numbers method removes numbers from the text.

6. Removing Special Characters

The remove_special_chars method removes special characters from the text.

7. Expanding Contractions

The remove_contractions method expands contractions in the text.

8. Removing Stopwords

The remove_stopwords method removes stopwords from the text.

Additional Functionality

  • If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
  • If only is_lower is True, the text is returned in lowercase.
  • If only is_token is True, the text is returned as a list of tokens.
  • If neither is_lower nor is_token is True, the text is returned as is.

Installation

You can install TextPrettifier using pip:

pip install text-prettifier
from text_prettifier import TextPrettifier

Initialize TextPrettifier

text_prettifier = TextPrettifier()

Example: Remove Emojis

html_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_html = text_prettifier.remove_emojis(html_text)
print(cleaned_html)

Output Hi,Pythonogist! I Python.

Example: Remove HTML tags

html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)

Output Hello,world!

Example: Remove URLs

url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)

Output Visit our webiste at

Example: Remove numbers

number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)

Output There are apples

Example: Remove special characters

special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)

Output Hello world

Example: Remove contractions

contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)

Output I cannot do it

Example: Remove stopwords

stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)

Output This test

Example: Apply all cleaning methods

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)

Output Hello world 123 apples cannot test

If you are interested to tokenized and lower the cleaned text write the code

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text,is_token=True,is_lower=True)
print(all_cleaned)

Output ['Hello','world', '123','apples', 'cannot','test']

Note: I didn't include remove_numbers in sigma_cleaner because sometimes numbers carry useful information in term of NLP. If you want to remove number you can apply this method seperately on output of sigma_cleaner.

Contact Information

Feel free to reach out to me on social media:

GitHub LinkedIn Twitter Facebook

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

text_prettifier-1.1.2.tar.gz (5.8 kB view details)

Uploaded Source

Built Distribution

text_prettifier-1.1.2-py3-none-any.whl (6.2 kB view details)

Uploaded Python 3

File details

Details for the file text_prettifier-1.1.2.tar.gz.

File metadata

  • Download URL: text_prettifier-1.1.2.tar.gz
  • Upload date:
  • Size: 5.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for text_prettifier-1.1.2.tar.gz
Algorithm Hash digest
SHA256 39b8586509df72808caacb62caacd814a25d6b770b119faac9d3377acc26c86d
MD5 216e28a469c8f0619582974b7e481ca8
BLAKE2b-256 394d721f56106195013fbc2af396a4d6b8c27ca55f16547d55c9d241c2a3214c

See more details on using hashes here.

File details

Details for the file text_prettifier-1.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for text_prettifier-1.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d116c6bffbc800930ade65847e717b3c6b0911ca29b17e87c3d67691b4afd999
MD5 efc289f7abc2a7fabcab242e84bbeb70
BLAKE2b-256 36e7d4d978480c265eb100649420178a54bc6ff2a0d14af825b2bed3f50141ad

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page