A Python library for cleaning and preprocessing text data by removing emojis, internet words, special characters, digits, HTML tags, URLs, and stopwords.
TextPrettifier
TextPrettifier is a Python library for cleaning text data by removing emojis, internet words, HTML tags, URLs, numbers, special characters, and stopwords, and by expanding contractions.
TextPrettifier Key Features
1. Removing Emojis
The remove_emojis method removes emojis from the text.
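To make the idea concrete, here is a minimal standard-library sketch of emoji stripping. It is not TextPrettifier's implementation: the name strip_emojis and the Unicode ranges in EMOJI_PATTERN are assumptions chosen for illustration, and the ranges are not exhaustive (flag sequences, for example, are not covered).

```python
import re

# Hypothetical pattern covering a few common emoji blocks.
# Not exhaustive, and not taken from TextPrettifier's source.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats (includes U+2764)
    "\uFE0F"                 # variation selector-16 that follows many emoji
    "\u200D"                 # zero-width joiner used in emoji sequences
    "]+"
)


def strip_emojis(text: str) -> str:
    """Delete every character matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)


print(strip_emojis("I \u2764\ufe0f Python"))
```

Only the emoji characters are deleted, so the whitespace around them is left as it was.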
2. Removing Internet Words
The remove_internet_words method removes internet-specific words from the text.
3. Removing HTML Tags
The remove_html_tags method removes HTML tags from the text.
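As an illustration of what tag stripping involves, the sketch below keeps only the text nodes of a document using Python's built-in html.parser. The class _TextCollector and the function strip_html_tags are hypothetical names; how TextPrettifier does this internally is not stated here.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        # Called for the text between tags.
        self.parts.append(data)


def strip_html_tags(text: str) -> str:
    collector = _TextCollector()
    collector.feed(text)
    collector.close()  # flush any buffered trailing text
    return "".join(collector.parts)


print(strip_html_tags("<div>Clean <i>me</i></div>"))  # Clean me
```

A parser-based approach tends to cope with attributes and nested tags better than a single regular expression would.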
4. Removing URLs
The remove_urls method removes URLs from the text.
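A rough sketch of URL removal with a regular expression follows. The pattern and the name strip_urls are assumptions for illustration only; the pattern is deliberately simple and will also swallow punctuation that directly follows a URL.

```python
import re

# Hypothetical pattern: "http://", "https://" or "www." followed by
# any run of non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Delete every substring matched by URL_PATTERN."""
    return URL_PATTERN.sub("", text)


print(strip_urls("Docs: https://example.com/page?x=1 and www.example.org now"))
```

The spaces on either side of a removed URL stay in place, so a whitespace-normalizing step may be wanted afterwards.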
5. Removing Numbers
The remove_numbers method removes numbers from the text.
6. Removing Special Characters
The remove_special_chars method removes special characters from the text.
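One plausible way to strip special characters is to keep only letters, digits, and whitespace. The sketch below assumes ASCII text; the name strip_special_chars and the character class are illustrative choices, not the library's documented behavior, and accented or non-Latin letters would be removed by this pattern too.

```python
import re

# Assumption: "special" means anything other than ASCII letters,
# digits, and whitespace.
SPECIAL_PATTERN = re.compile(r"[^A-Za-z0-9\s]")


def strip_special_chars(text: str) -> str:
    """Delete every character matched by SPECIAL_PATTERN."""
    return SPECIAL_PATTERN.sub("", text)


print(strip_special_chars("Hello, @world!"))  # Hello world
```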
7. Expanding Contractions
Despite its name, the remove_contractions method expands contractions in the text (for example, "can't" becomes "cannot").
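Contraction expansion is essentially a dictionary lookup. The following sketch uses a tiny hand-written table, so CONTRACTIONS, expand_contractions, and the capitalization rule are all assumptions for illustration; a real cleaner would need a far larger mapping and would also have to handle curly apostrophes.

```python
import re

# Tiny illustrative mapping, keyed in lowercase.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation that matches any known contraction as a whole word.
_CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital, e.g. "It's" -> "It is".
        return expanded.capitalize() if word[0].isupper() else expanded

    return _CONTRACTION_PATTERN.sub(replace, text)


print(expand_contractions("It's fine, I won't go"))  # It is fine, I will not go
```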
8. Removing Stopwords
The remove_stopwords method removes stopwords from the text.
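Stopword removal can be sketched as a filter over whitespace-separated words. The STOPWORDS set below is a small hand-picked assumption, not the list the library uses, and this sketch compares words case-insensitively, which may differ from how remove_stopwords treats capitalized words.

```python
# Small hand-picked stopword set, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "this", "that", "of", "to", "and"}


def drop_stopwords(text: str) -> str:
    """Keep only the words whose lowercase form is not in STOPWORDS."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(drop_stopwords("The cat is on a mat"))  # cat on mat
```

Because the text is split on whitespace, punctuation attached to a word (such as "the,") would prevent a match; stripping special characters first avoids that.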
Additional Functionality
- If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
- If only is_lower is True, the text is returned in lowercase.
- If only is_token is True, the text is returned as a list of tokens.
- If neither is_lower nor is_token is True, the text is returned as is.
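The four cases above reduce to two independent switches. Here is a minimal sketch of that logic, assuming whitespace tokenization; finalize_output is a hypothetical helper name, and the tokenizer the library actually uses is not specified here.

```python
from typing import List, Union


def finalize_output(
    text: str, is_lower: bool = False, is_token: bool = False
) -> Union[str, List[str]]:
    """Apply optional lowercasing, then optional tokenization."""
    if is_lower:
        text = text.lower()
    if is_token:
        # Assumption: tokens are whitespace-separated words.
        return text.split()
    return text


print(finalize_output("Hello World"))                               # Hello World
print(finalize_output("Hello World", is_lower=True))                # hello world
print(finalize_output("Hello World", is_token=True))                # ['Hello', 'World']
print(finalize_output("Hello World", is_lower=True, is_token=True)) # ['hello', 'world']
```

Note that the return type changes from str to a list when is_token is True, so calling code should be written for whichever form it requests.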
Installation
You can install TextPrettifier using pip:
pip install text-prettifier
from text_prettifier import TextPrettifier
Initialize TextPrettifier
text_prettifier = TextPrettifier()
Example: Remove Emojis
emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)
Output Hi,Pythonogist! I Python.
Example: Remove HTML tags
html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)
Output Hello, world!
Example: Remove URLs
url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)
Output Visit our website at
Example: Remove numbers
number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)
Output There are apples
Example: Remove special characters
special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)
Output Hello world
Example: Expand contractions
contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)
Output I cannot do it
Example: Remove stopwords
stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)
Output This test
Example: Apply all cleaning methods
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)
Output Hello world 123 apples cannot test
To get the cleaned text lowercased and tokenized, pass is_token=True and is_lower=True:
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text, is_token=True, is_lower=True)
print(all_cleaned)
Output ['hello', 'world', '123', 'apples', 'cannot', 'test']
Note: I didn't include remove_numbers in sigma_cleaner because numbers sometimes carry useful information in NLP. If you want to remove numbers, you can apply this method separately to the output of sigma_cleaner.
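Since the cleaned output may be either a string or a list of tokens, a number-stripping post-processing step has to handle both shapes. The sketch below is a standard-library stand-in for that step: drop_numbers is a hypothetical helper, not part of TextPrettifier, and whether remove_numbers itself accepts a token list is not stated here.

```python
import re
from typing import List, Union

_DIGITS = re.compile(r"\d+")


def drop_numbers(cleaned: Union[str, List[str]]) -> Union[str, List[str]]:
    """Strip digit runs from a cleaned string or from each token in a list."""
    if isinstance(cleaned, str):
        without_digits = _DIGITS.sub("", cleaned)
        # Collapse the gaps left behind by removed numbers.
        return " ".join(without_digits.split())
    stripped = (_DIGITS.sub("", token) for token in cleaned)
    # Drop tokens that consisted only of digits.
    return [token for token in stripped if token]


print(drop_numbers("Hello world 123 apples cannot test"))
print(drop_numbers(["hello", "world", "123", "apples"]))
```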
Contact Information
Feel free to reach out to me on social media:
License
This project is licensed under the MIT License - see the LICENSE file for details.