A package for cleaning and preprocessing text data
Project description
purifytext
purifytext is a flexible and customizable text preprocessing package designed to clean and prepare text data for natural language processing (NLP) tasks. It includes various text preprocessing steps such as removing HTML tags, lowercasing text, removing URLs, emojis, punctuation, special characters, numbers, expanding contractions, removing stopwords, stemming, and lemmatizing.
Installation
You can install purifytext and its dependencies using pip:
pip install purifytext
This command will automatically install the required dependencies: numpy, pandas, nltk, beautifulsoup4, and contractions.
Additionally, ensure that you download the necessary NLTK data:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
Usage
Here's an example of how to use the purifytext package:
import pandas as pd
from purifytext import clean_text
# Example DataFrame
data = {
'text_column': [
'<p>Hello World! Visit http://example.com 😊</p>',
'Python is awesome!!! 123456'
]
}
df = pd.DataFrame(data)
# Cleaning the text
df_cleaned = clean_text(df, 'text_column', remove_HTML=True, lowercase=True, remove_urls=True,
remove_emojis=True, remove_punctuation=True, remove_special_characters=True,
remove_numbers=True, remove_whitespace=True, expand_contractions=True,
remove_stopwords=True, stemming=False, lemmatizing=True)
print(df_cleaned)
Functions
text_lower(text): Lowercases the input text.remove_punctuation(text): Removes punctuation from the text.remove_number(text): Removes numbers from the text.remove_whitespace(text): Removes extra whitespace from the text.remove_contractions(text): Expands contractions in the text.remove_HTML_tag(text): Removes HTML tags from the text.remove_stopwords(text): Removes stopwords from the text.lemmatize_words(text): Lemmatizes the words in the text.stem_words(text): Stems the words in the text.remove_urls(text): Removes URLs from the text.remove_emojis(text): Removes emojis from the text.remove_special_characters(text): Removes special characters from the text.
Cleaning Options
The clean_text function allows you to specify which cleaning steps to perform using the following parameters (default values are shown):
remove_HTML=True: Remove HTML tags.lowercase=True: Lowercase the text.remove_urls=True: Remove URLs.remove_emojis=True: Remove emojis.remove_punctuation=True: Remove punctuation.remove_special_characters=True: Remove special characters.remove_numbers=True: Remove numbers.remove_whitespace=True: Remove extra whitespace.expand_contractions=True: Expand contractions.remove_stopwords=True: Remove stopwords.stemming=False: Stem words.lemmatizing=True: Lemmatize words.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
This package is developed by Aman Kumar Jha. For any questions or suggestions, please contact us at vats.amankumarjha2002@gmail.com.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file purifytext-0.1.0.tar.gz.
File metadata
- Download URL: purifytext-0.1.0.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0ab4e76d34d780e2bff91d30c2fe83db96cb85bc4bd8bbb4e26b332d0acbf68b
|
|
| MD5 |
c7fbffb02ae6d03402eefebbc37e2432
|
|
| BLAKE2b-256 |
4d74aafc549ee6bd23f7479d3f9468dafbac456294d8ebbfece44eb0fe9df88c
|