Text Preprocessing package for NLP projects
PrepGem
PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.
Features
PrepGem offers the following features:
- Handle Missing Values: Easily handle missing values in specified DataFrame columns.
- Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
- Remove URLs: Remove URLs from text or DataFrame columns.
- Remove Punctuation: Remove punctuation from text or DataFrame columns.
- Remove Emojis: Remove emojis from text or DataFrame columns.
- Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
- Remove Numbers: Remove numbers from text or DataFrame columns.
- Lowercasing: Convert text to lowercase in text or DataFrame columns.
- Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
- Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
- Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
- Spell Correction: Perform spell-checking on text or DataFrame columns.
- Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
- Tokenize: Tokenize text using NLTK's word_tokenize function.
- Remove Stopwords: Remove stopwords from text tokens.
- Stemming: Perform stemming on text tokens.
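To make a few of the features above concrete, here is a minimal sketch of three cleaning steps written with only the Python standard library. The function names `strip_urls`, `strip_punctuation`, and `collapse_whitespace` are hypothetical stand-ins chosen for illustration; they are not PrepGem's API, and PrepGem's own implementation may differ.

```python
import re
import string

# Hypothetical stand-ins for illustration only; not PrepGem's implementation.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Delete anything that looks like an http(s) or www URL."""
    return URL_PATTERN.sub("", text)


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim both ends."""
    return " ".join(text.split())


sample = "Visit   https://example.com now!!"
step1 = strip_urls(sample)
step2 = strip_punctuation(step1)
step3 = collapse_whitespace(step2)
print(step3)
```

Note that the order matters in this sketch: removing punctuation before URLs would break the URL pattern, because the `://` and `.` characters would already be gone.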
Installation
You can install PrepGem via pip:
pip install prepgem
Usage
Importing the module
import prepgem
Basic Usage
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)
Preprocessing a single text
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)
Preprocessing a DataFrame
import pandas as pd
# Create a sample DataFrame
data = {
'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)
# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)
Default preprocessing pipeline
When no pipeline is specified, the following default steps are applied:
- clean_html_text
- remove_urls
- remove_punctuation
- remove_emojis
- remove_foreign_letters
- remove_numbers
- lowercasing
- remove_white_spaces
- remove_repeated_characters
- nosense_words_and_spell_check
- tokenize
- remove_stopwords
- stemming
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)
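The idea of a pipeline made of named steps can be sketched as a mapping from step names to functions that are applied one after another. This is only an illustration of the pattern, not PrepGem's internals: the `STEPS` registry and the `run_pipeline` helper are hypothetical, and the two step bodies are simplified placeholders that borrow names from the list above.

```python
from typing import Callable, Dict, List

# Hypothetical registry: step name -> function.
# The names mirror two steps from the default list, but the bodies
# are simplified placeholders, not PrepGem's code.
STEPS: Dict[str, Callable[[str], str]] = {
    "lowercasing": str.lower,
    "remove_white_spaces": lambda s: " ".join(s.split()),
}


def run_pipeline(text: str, pipeline: List[str]) -> str:
    """Apply each named step to the text, in the order given."""
    for name in pipeline:
        text = STEPS[name](text)
    return text


result = run_pipeline("  Hello   WORLD ", ["lowercasing", "remove_white_spaces"])
print(result)  # hello world
```

Keeping the steps in a registry keyed by name is what makes it possible to select or reorder them with a plain list of strings.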
Custom preprocessing pipeline
You can customize the preprocessing steps by passing a list of step names to the pipeline argument of the preprocess_text method. Available steps include:
- clean_html_text
- remove_urls
- remove_punctuation
- remove_emojis
- remove_foreign_letters
- remove_numbers
- lowercasing
- remove_white_spaces
- remove_repeated_characters
- remove_nonsense_words
- spell_corrector
- nosense_words_and_spell_check
- tokenize
- remove_stopwords
- stemming
- handle_missing_values
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)
To remove steps rather than select them, pass remove=True to the preprocess_text method together with the steps to remove in pipeline.
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)
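One plausible reading of the remove flag is "run the default steps minus the ones listed". The sketch below shows that selection logic under this assumption; the README does not spell out the exact semantics, and `DEFAULT_STEPS` and `resolve_steps` are hypothetical names, not part of PrepGem. The default order is copied from the list of default steps in this README.

```python
from typing import List, Optional

# Hypothetical constant: the default step names as listed in this README.
DEFAULT_STEPS = [
    "clean_html_text",
    "remove_urls",
    "remove_punctuation",
    "remove_emojis",
    "remove_foreign_letters",
    "remove_numbers",
    "lowercasing",
    "remove_white_spaces",
    "remove_repeated_characters",
    "nosense_words_and_spell_check",
    "tokenize",
    "remove_stopwords",
    "stemming",
]


def resolve_steps(pipeline: Optional[List[str]] = None, remove: bool = False) -> List[str]:
    """Return the step names to run (assumed semantics, see lead-in)."""
    if pipeline is None:
        return list(DEFAULT_STEPS)  # no customization: run the defaults
    if remove:
        # Assumption: remove=True drops the listed steps from the defaults.
        return [step for step in DEFAULT_STEPS if step not in pipeline]
    return list(pipeline)  # otherwise run exactly the listed steps


steps = resolve_steps(["clean_html_text"], remove=True)
print(steps[0], len(steps))
```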
You can also call any step as a standalone function by passing it the text, or the DataFrame containing the text column, to be cleaned:
from prepgem import remove_urls
# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."
# Remove URLs from the text
cleaned_text = remove_urls(text_with_urls)
print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)
This will output:
Original text:
This is an example text with URLs: https://example.com and http://www.example.org.
Text after removing URLs:
This is an example text with URLs: and .