A Python library for cleaning and preprocessing text data by removing emojis, internet words, special characters, digits, HTML tags, URLs, and stopwords.
TextPrettifier
TextPrettifier is a Python library for cleaning text data: it removes emojis, internet words, HTML tags, URLs, numbers, special characters, and stopwords, and it expands contractions.
TextPrettifier Key Features
1. Removing Emojis
The remove_emojis method removes emojis from the text.
2. Removing Internet Words
The remove_internet_words method removes internet-specific words from the text.
3. Removing HTML Tags
The remove_html_tags method removes HTML tags from the text.
4. Removing URLs
The remove_urls method removes URLs from the text.
5. Removing Numbers
The remove_numbers method removes numbers from the text.
6. Removing Special Characters
The remove_special_chars method removes special characters from the text.
7. Expanding Contractions
Despite its name, the remove_contractions method expands contractions in the text rather than deleting them (for example, "can't" becomes "cannot").
8. Removing Stopwords
The remove_stopwords method removes stopwords from the text.
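To give a feel for what several of these cleaning steps involve, here is a minimal standard-library sketch. It is not TextPrettifier's implementation: the function names and regular expressions below are assumptions chosen for illustration, and the library's own patterns and whitespace handling may differ.

```python
import re

# All names and patterns here are hypothetical, for illustration only.
TAG_RE = re.compile(r"<[^>]+>")                    # anything shaped like <...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # http(s) links and bare www. links
NUMBER_RE = re.compile(r"\d+")                     # runs of digits
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")         # everything except letters, digits, whitespace


def _collapse_spaces(text):
    """Squeeze runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_html_tags(text):
    return _collapse_spaces(TAG_RE.sub("", text))


def strip_urls(text):
    return _collapse_spaces(URL_RE.sub("", text))


def strip_numbers(text):
    return _collapse_spaces(NUMBER_RE.sub("", text))


def strip_special_chars(text):
    return _collapse_spaces(SPECIAL_RE.sub("", text))


print(strip_html_tags("<p>Hello, <b>world</b>!</p>"))
print(strip_urls("Visit our website at https://example.com"))
print(strip_numbers("There are 123 apples"))
print(strip_special_chars("Hello, @world!"))
```

Regex-based tag stripping like this is only suitable for simple markup; it does not parse HTML and will not handle edge cases such as a ">" inside an attribute value.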
Additional Functionality
- If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
- If only is_lower is True, the text is returned in lowercase.
- If only is_token is True, the text is returned as a list of tokens.
- If neither is_lower nor is_token is True, the text is returned as is.
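The four cases above can be sketched as a small post-processing step. The helper name finalize is hypothetical, and splitting on whitespace is an assumption; the library may tokenize differently.

```python
def finalize(text, is_lower=False, is_token=False):
    """Hypothetical helper mirroring the four flag combinations listed above."""
    if is_lower:
        text = text.lower()
    if is_token:
        # Assumption: tokens are whitespace-separated words.
        return text.split()
    return text


print(finalize("Hello World", is_lower=True, is_token=True))
print(finalize("Hello World", is_lower=True))
print(finalize("Hello World", is_token=True))
print(finalize("Hello World"))
```

Note that the return type changes with is_token: a list when it is True, a string otherwise, so downstream code needs to handle both.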
Installation
You can install TextPrettifier using pip:
pip install text-prettifier
from text_prettifier import TextPrettifier
Initialize TextPrettifier
text_prettifier = TextPrettifier()
Example: Remove Emojis
emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)
Output Hi,Pythonogist! I Python.
Example: Remove HTML tags
html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)
Output Hello,world!
Example: Remove URLs
url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)
Output Visit our website at
Example: Remove numbers
number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)
Output There are apples
Example: Remove special characters
special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)
Output Hello world
Example: Expand contractions
contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)
Output I cannot do it
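For readers curious how contraction expansion can work in principle, here is a minimal dictionary-based sketch. It is an illustration only, not the library's method: the mapping, the pattern, and the function name expand_contractions are all assumptions, and a real lookup table would be much larger.

```python
import re

# Tiny illustrative mapping (hypothetical); real tables cover far more forms.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
}

# Match any known contraction as a whole word, ignoring case.
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I can't do it"))
```

This sketch does not preserve the capitalization of the original word, and it only recognizes the straight apostrophe ('), not the typographic one.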
Example: Remove stopwords
stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)
Output This test
Example: Apply all cleaning methods
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)
Output Hello world 123 apples cannot test
To lowercase and tokenize the cleaned text, pass is_token=True and is_lower=True:
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text, is_token=True, is_lower=True)
print(all_cleaned)
Output ['hello', 'world', '123', 'apples', 'cannot', 'test']
Note: I didn't include remove_numbers in sigma_cleaner because numbers sometimes carry useful information for NLP. If you want to remove numbers, apply this method separately to the output of sigma_cleaner.
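The source does not say whether remove_numbers accepts a token list, which is what sigma_cleaner returns when is_token=True. As a hedged alternative, the standalone sketch below strips digits from either a string or a list of tokens. The helper name drop_numbers is hypothetical and is not part of TextPrettifier.

```python
import re


def drop_numbers(cleaned):
    """Hypothetical post-processing step for sigma_cleaner-style output."""
    if isinstance(cleaned, list):
        # Token list: delete digits inside each token, then discard empty tokens.
        stripped = (re.sub(r"\d+", "", token) for token in cleaned)
        return [token for token in stripped if token]
    # Plain string: delete digits, then collapse the whitespace left behind.
    return " ".join(re.sub(r"\d+", "", cleaned).split())


print(drop_numbers("Hello world 123 apples cannot test"))
print(drop_numbers(["hello", "world", "123", "apples", "cannot", "test"]))
```

Keeping this as a separate step leaves the decision to the caller, so numbers can be preserved when they matter, such as quantities, years, or model names.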
License
This project is licensed under the MIT License - see the LICENSE file for details.