A Python library for cleaning and preprocessing text data by removing emojis, internet words, special characters, digits, HTML tags, URLs, and stopwords.

TextPrettifier

TextPrettifier is a Python library for cleaning text data by removing emojis, internet words, HTML tags, URLs, numbers, special characters, and stopwords, and by expanding contractions.

TextPrettifier Key Features

1. Removing Emojis

The remove_emojis method removes emojis from the text.

2. Removing Internet Words

The remove_internet_words method removes internet-specific words from the text.

3. Removing HTML Tags

The remove_html_tags method removes HTML tags from the text.

4. Removing URLs

The remove_urls method removes URLs from the text.

5. Removing Numbers

The remove_numbers method removes numbers from the text.

6. Removing Special Characters

The remove_special_chars method removes special characters from the text.

7. Expanding Contractions

Despite its name, the remove_contractions method expands contractions in the text (for example, "can't" becomes "cannot") rather than deleting them.

8. Removing Stopwords

The remove_stopwords method removes stopwords from the text.
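
To make the steps above more concrete, here is a minimal standalone sketch of how a few of them could be approximated with regular expressions from the standard library. It is not TextPrettifier's implementation: the helper names (strip_html_tags, strip_urls, strip_numbers, strip_special_chars) and the patterns are assumptions chosen for illustration, and the library's real output may differ in spacing or edge cases.

```python
import html
import re

# Hypothetical patterns for illustration; these are NOT TextPrettifier's internals.
TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like <tag ...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) links and bare www. links
NUMBER_RE = re.compile(r"\d+")                # runs of digits
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")    # everything except letters, digits, whitespace


def collapse_spaces(text: str) -> str:
    """Squeeze the whitespace runs that a removal step leaves behind."""
    return " ".join(text.split())


def strip_html_tags(text: str) -> str:
    """Delete tags, then decode entities such as &amp;."""
    return collapse_spaces(html.unescape(TAG_RE.sub("", text)))


def strip_urls(text: str) -> str:
    return collapse_spaces(URL_RE.sub("", text))


def strip_numbers(text: str) -> str:
    return collapse_spaces(NUMBER_RE.sub("", text))


def strip_special_chars(text: str) -> str:
    return collapse_spaces(SPECIAL_RE.sub("", text))


print(strip_html_tags("<p>Hello, <b>world</b>!</p>"))            # Hello, world!
print(strip_urls("Visit our website at https://example.com"))    # Visit our website at
print(strip_numbers("There are 123 apples"))                     # There are apples
print(strip_special_chars("Hello, @world!"))                     # Hello world
```

A regex-based tag stripper like this is fine for simple snippets but is not a full HTML parser; malformed markup or text containing a literal "<" may need a dedicated parser.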

Additional Functionality

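The sigma_cleaner method accepts two optional flags, is_lower and is_token, that control the form of the returned text:

  • If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
  • If only is_lower is True, the text is returned in lowercase as a single string.
  • If only is_token is True, the text is returned as a list of tokens with its casing unchanged.
  • If neither is_lower nor is_token is True, the cleaned text is returned as a single string with its casing unchanged.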
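
The four cases above can be sketched as a small post-processing function. This is only an illustration of the described behavior: the name finalize is hypothetical, and splitting on whitespace is an assumption, since the library may tokenize differently.

```python
from typing import List, Union


def finalize(text: str, is_lower: bool = False, is_token: bool = False) -> Union[str, List[str]]:
    """Hypothetical helper mirroring the is_lower / is_token combinations."""
    if is_lower:
        text = text.lower()
    if is_token:
        # Assumption: tokens are whitespace-separated words.
        return text.split()
    return text


print(finalize("Hello World", is_lower=True, is_token=True))  # ['hello', 'world']
print(finalize("Hello World", is_lower=True))                 # hello world
print(finalize("Hello World", is_token=True))                 # ['Hello', 'World']
print(finalize("Hello World"))                                # Hello World
```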

Installation

You can install TextPrettifier using pip:

pip install text-prettifier

Then import the class in your Python code:

from text_prettifier import TextPrettifier

Initialize TextPrettifier

text_prettifier = TextPrettifier()

Example: Remove Emojis

emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)

Output Hi,Pythonogist! I Python.

Example: Remove HTML tags

html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)

Output Hello,world!

Example: Remove URLs

url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)

Output Visit our website at

Example: Remove numbers

number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)

Output There are apples

Example: Remove special characters

special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)

Output Hello world

Example: Expand contractions

contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)

Output I cannot do it

Example: Remove stopwords

stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)

Output This test

Example: Apply all cleaning methods

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)

Output Hello world 123 apples cannot test

To get the cleaned text in lowercase and as a list of tokens, pass is_token=True and is_lower=True:

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text,is_token=True,is_lower=True)
print(all_cleaned)

Output ['hello', 'world', '123', 'apples', 'cannot', 'test']

Note: remove_numbers is deliberately not part of sigma_cleaner, because numbers sometimes carry useful information in NLP tasks. If you want to remove numbers, apply remove_numbers separately to the output of sigma_cleaner.
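
Because sigma_cleaner can return either a string or a list of tokens, a number-removal step applied afterwards has to cope with both shapes. The README does not show whether remove_numbers accepts a token list, so the sketch below uses a hypothetical standalone helper, drop_numbers, built on the standard library rather than on the library's own method.

```python
import re
from typing import List, Union

DIGITS_RE = re.compile(r"\d+")


def drop_numbers(cleaned: Union[str, List[str]]) -> Union[str, List[str]]:
    """Hypothetical helper: remove digit runs from a string or from a token list."""
    if isinstance(cleaned, str):
        # Remove digits, then collapse the double spaces left behind.
        return " ".join(DIGITS_RE.sub("", cleaned).split())
    # Token list: strip digits from each token and drop tokens that end up empty.
    stripped = (DIGITS_RE.sub("", token) for token in cleaned)
    return [token for token in stripped if token]


print(drop_numbers("Hello world 123 apples cannot test"))
# Hello world apples cannot test
print(drop_numbers(["hello", "world", "123", "apples", "cannot", "test"]))
# ['hello', 'world', 'apples', 'cannot', 'test']
```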

Contact Information

Feel free to reach out to me on social media:

GitHub LinkedIn Twitter Facebook

License

This project is licensed under the MIT License - see the LICENSE file for details.
