A Python library for cleaning and preprocessing text data by removing emojis, internet words, special characters, digits, HTML tags, URLs, and stopwords.

TextPrettifier

TextPrettifier is a Python library for cleaning text data by removing emojis, internet words, HTML tags, URLs, numbers, special characters, and stopwords, and by expanding contractions.

TextPrettifier Key Features

1. Removing Emojis

The remove_emojis method removes emojis from the text.

2. Removing Internet Words

The remove_internet_words method removes internet-specific words from the text.

3. Removing HTML Tags

The remove_html_tags method removes HTML tags from the text.

4. Removing URLs

The remove_urls method removes URLs from the text.

5. Removing Numbers

The remove_numbers method removes numbers from the text.

6. Removing Special Characters

The remove_special_chars method removes special characters from the text.

7. Expanding Contractions

The remove_contractions method expands contractions in the text.

8. Removing Stopwords

The remove_stopwords method removes stopwords from the text.
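
As a rough illustration of what steps such as HTML tag, URL, and number removal involve, here is a minimal standard-library sketch. The function names (strip_html_tags, strip_urls, strip_numbers, squeeze_spaces) are hypothetical, and this is not TextPrettifier's own implementation, which may behave differently on edge cases.

```python
import re

# Hypothetical helpers for illustration only; not part of TextPrettifier.

def strip_html_tags(text):
    # Drop anything that looks like <...>. This is not a full HTML parser.
    return re.sub(r"<[^>]+>", "", text)

def strip_urls(text):
    # Remove tokens starting with http://, https://, or www.
    return re.sub(r"(?:https?://|www\.)\S+", "", text)

def strip_numbers(text):
    # Remove runs of digits.
    return re.sub(r"\d+", "", text)

def squeeze_spaces(text):
    # Collapse the gaps left behind by the removals above.
    return re.sub(r"\s+", " ", text).strip()

print(strip_html_tags("<p>Hello, <b>world</b>!</p>"))
print(squeeze_spaces(strip_urls("Visit our website at https://example.com")))
print(squeeze_spaces(strip_numbers("There are 123 apples")))
```

Each step is a single regular-expression substitution, so the steps can be chained in any order. A final whitespace pass keeps the result tidy.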

Additional Functionality

The sigma_cleaner method accepts two optional flags, is_lower and is_token, which control the form of the returned text:

  • If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
  • If only is_lower is True, the text is returned in lowercase.
  • If only is_token is True, the text is returned as a list of tokens.
  • If neither is_lower nor is_token is True, the text is returned as a string with its original casing.
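
The four cases above can be sketched with a small standalone function. The name finalize and the whitespace-based tokenization are assumptions made for illustration; the library may tokenize differently.

```python
def finalize(text, is_lower=False, is_token=False):
    """Hypothetical helper mirroring the is_lower / is_token cases."""
    if is_lower:
        text = text.lower()
    if is_token:
        # Splitting on whitespace is an assumption, not the library's documented tokenizer.
        return text.split()
    return text

sample = "Hello World"
print(finalize(sample, is_lower=True, is_token=True))  # ['hello', 'world']
print(finalize(sample, is_lower=True))                 # hello world
print(finalize(sample, is_token=True))                 # ['Hello', 'World']
print(finalize(sample))                                # Hello World
```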

Installation

You can install TextPrettifier using pip:

pip install text-prettifier

Then import the class in Python:

from text_prettifier import TextPrettifier

Initialize TextPrettifier

text_prettifier = TextPrettifier()

Example: Remove Emojis

emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)

Output Hi,Pythonogist! I Python.

Example: Remove HTML tags

html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)

Output Hello, world!

Example: Remove URLs

url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)

Output Visit our website at

Example: Remove numbers

number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)

Output There are apples

Example: Remove special characters

special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)

Output Hello world

Example: Expand contractions

contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)

Output I cannot do it

Example: Remove stopwords

stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)

Output This test

Example: Apply all cleaning methods

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)

Output Hello world 123 apples cannot test

To get the cleaned text lowercased and tokenized, pass is_token=True and is_lower=True:

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text, is_token=True, is_lower=True)
print(all_cleaned)

Output ['hello', 'world', '123', 'apples', 'cannot', 'test']

Note: I didn't include remove_numbers in sigma_cleaner because numbers sometimes carry useful information in NLP. If you want to remove numbers, apply remove_numbers separately to the output of sigma_cleaner.
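
The note above suggests applying remove_numbers to the result of sigma_cleaner. With is_token=True that result is a list, and this page does not say whether remove_numbers accepts a list. The standard-library sketch below shows one way to drop numeric tokens from either form. drop_numbers is a hypothetical helper, not part of TextPrettifier.

```python
def drop_numbers(cleaned):
    """Hypothetical helper: remove digit-only tokens from a string or a token list."""
    if isinstance(cleaned, list):
        return [tok for tok in cleaned if not tok.isdigit()]
    # For a string, split on whitespace, filter, and join back together.
    return " ".join(tok for tok in cleaned.split() if not tok.isdigit())

# Inputs copied from the sigma_cleaner example outputs shown above.
print(drop_numbers("Hello world 123 apples cannot test"))
print(drop_numbers(["hello", "world", "123", "apples", "cannot", "test"]))

# With the library itself, the string case would presumably look like:
# text_prettifier.remove_numbers(text_prettifier.sigma_cleaner(all_text))
```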

Contact Information

Feel free to reach out to me on social media:

GitHub LinkedIn Twitter Facebook

License

This project is licensed under the MIT License - see the LICENSE file for details.
