Text Preprocessing package for NLP projects
PrepGem
PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.
Features
PrepGem offers the following features:
- Handle Missing Values: Easily handle missing values in specified DataFrame columns.
- Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
- Remove URLs: Remove URLs from text or DataFrame columns.
- Remove Punctuation: Remove punctuation from text or DataFrame columns.
- Remove Emojis: Remove emojis from text or DataFrame columns.
- Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
- Remove Numbers: Remove numbers from text or DataFrame columns.
- Lowercasing: Convert text to lowercase in text or DataFrame columns.
- Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
- Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
- Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
- Spell Correction: Perform spell-checking on text or DataFrame columns.
- Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
- Tokenize: Tokenize text using NLTK's word_tokenize function.
- Remove Stopwords: Remove stopwords from text tokens.
- Stemming: Perform stemming on text tokens.
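As an illustration of what two of the simpler features above might do, here is a minimal sketch using only the standard library. The helper names `squeeze_repeats` and `strip_digits`, and the choice to shorten runs of three or more identical characters down to two, are assumptions made for this example; PrepGem's own functions may behave differently.

```python
import re


def squeeze_repeats(text: str, max_run: int = 2) -> str:
    """Shorten any run of one character longer than max_run down to max_run."""
    # (.)\1{max_run,} matches a character followed by at least max_run copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


def strip_digits(text: str) -> str:
    """Delete every run of digits; surrounding whitespace is left untouched."""
    return re.sub(r"\d+", "", text)


squeezed = squeeze_repeats("This is soooo goooood!!!")
no_digits = strip_digits("Order 12345 shipped")
print(squeezed)
print(repr(no_digits))
```

Removing digits leaves a double space behind, which is one reason a whitespace-cleanup step usually runs after the removal steps.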
Installation
You can install PrepGem via pip:
pip install prepgem
Usage
Importing the module
import prepgem
Basic Usage
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)
Preprocessing a single text
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)
Preprocessing a DataFrame
import pandas as pd
# Create a sample DataFrame
data = {
'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)
# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)
Default preprocessing pipeline
The default pipeline includes the following steps:
- clean_html_text
- remove_urls
- remove_punctuation
- remove_emojis
- remove_foreign_letters
- remove_numbers
- lowercasing
- remove_white_spaces
- remove_repeated_characters
- nosense_words_and_spell_check
- tokenize
- remove_stopwords
- stemming
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)
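Conceptually, a pipeline like this is a list of named text-to-text functions applied one after another. The sketch below shows that idea with a handful of stand-in steps. The functions `strip_html_tags`, `strip_punctuation`, `collapse_spaces`, and `run_pipeline` are hypothetical and written for this example; they are not PrepGem's implementation, and the order used here is an assumption.

```python
import re
import string
from typing import Callable, List, Tuple


# Hypothetical stand-ins for a few steps; the real PrepGem functions may differ.
def strip_html_tags(text: str) -> str:
    # Replace each tag with a space so neighbouring words do not merge.
    return re.sub(r"<[^>]+>", " ", text)


def strip_punctuation(text: str) -> str:
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_spaces(text: str) -> str:
    # Split on any whitespace and rejoin with single spaces.
    return " ".join(text.split())


# Step names mirror those in the list above; the mapping itself is an assumption.
STEPS: List[Tuple[str, Callable[[str], str]]] = [
    ("clean_html_text", strip_html_tags),
    ("remove_punctuation", strip_punctuation),
    ("lowercasing", str.lower),
    ("remove_white_spaces", collapse_spaces),
]


def run_pipeline(text: str, steps: List[Tuple[str, Callable[[str], str]]] = STEPS) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for _name, step in steps:
        text = step(text)
    return text


demo = run_pipeline("This is <b>BOLD</b> text,   with tags!")
print(demo)
```

Keeping the steps as (name, function) pairs makes it straightforward to select or skip steps by name.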
Custom preprocessing pipeline
You can customize the preprocessing steps by passing a list of step names as the pipeline argument of preprocess_text. Available step names are:
- clean_html_text
- remove_urls
- remove_punctuation
- remove_emojis
- remove_foreign_letters
- remove_numbers
- lowercasing
- remove_white_spaces
- remove_repeated_characters
- remove_nonsense_words
- spell_corrector
- nosense_words_and_spell_check
- tokenize
- remove_stopwords
- stemming
- handle_missing_values
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)
To remove steps instead of selecting them, pass remove=True to preprocess_text together with the pipeline list; the steps named in pipeline are then removed. The step names are the same as those listed above.
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)
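One plausible reading of how pipeline and remove interact is sketched below: with no pipeline the defaults apply, with a pipeline only the named steps run, and with remove=True the named steps are dropped from the defaults. The function `resolve_steps` is hypothetical and exists only for this illustration; the actual behaviour of preprocess_text should be checked against the package itself.

```python
from typing import List, Optional

# Default step names as listed in this README.
DEFAULT_STEPS = [
    "clean_html_text",
    "remove_urls",
    "remove_punctuation",
    "remove_emojis",
    "remove_foreign_letters",
    "remove_numbers",
    "lowercasing",
    "remove_white_spaces",
    "remove_repeated_characters",
    "nosense_words_and_spell_check",
    "tokenize",
    "remove_stopwords",
    "stemming",
]


def resolve_steps(pipeline: Optional[List[str]] = None, remove: bool = False) -> List[str]:
    """Return the step names that would run (assumed semantics, see lead-in)."""
    if pipeline is None:
        return list(DEFAULT_STEPS)
    if remove:
        # Keep the default order, skipping every step named in `pipeline`.
        return [step for step in DEFAULT_STEPS if step not in pipeline]
    return list(pipeline)


selected = resolve_steps(["clean_html_text", "nosense_words_and_spell_check"])
without_html = resolve_steps(["clean_html_text"], remove=True)
print(selected)
print(without_html)
```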
You can also call each step as a standalone function by passing it the text, or the DataFrame containing the text column, to be cleaned:
from prepgem import remove_urls
# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."
# Remove URLs from the text
cleaned_text = remove_urls(text_with_urls)
print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)
This will output:
Original text:
This is an example text with URLs: https://example.com and http://www.example.org.
Text after removing URLs:
This is an example text with URLs: and .
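For readers curious how this kind of URL stripping can work, the following is a minimal regex-based sketch. The pattern and the helper name `strip_urls` are assumptions made for this example and are not taken from PrepGem's source. Note that `\S+` also swallows punctuation attached to the end of a URL, so the result can differ from the output shown above.

```python
import re

# Assumed pattern: "http://", "https://" or "www." followed by non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Delete every URL-like token, leaving the surrounding spaces in place."""
    return URL_PATTERN.sub("", text)


sample = "Docs at https://example.com and http://www.example.org for details."
result = strip_urls(sample)
print(repr(result))
```

The removed URLs leave double spaces behind, so a step such as remove_white_spaces is a natural follow-up.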
Download files
File details
Details for the file PrepGem-1.0.6.tar.gz.
File metadata
- Download URL: PrepGem-1.0.6.tar.gz
- Upload date:
- Size: 6.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | 283b89290c28479530476f66b570303100784f7fc45ebca27d634d9f770a698b
MD5 | 9a6e75e3ea69e92f5def311bb6b9b3a9
BLAKE2b-256 | e4968a76844283f6fc47b64dd5a1e62112d1416effad24d2b05063310e1f59c2
File details
Details for the file PrepGem-1.0.6-py3-none-any.whl.
File metadata
- Download URL: PrepGem-1.0.6-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | c8a6cb5cd049b9eb744638810f9d6724380f0d9299822aac2129dbf865082c5a
MD5 | 5347492393fa281aef7ce056d36b32da
BLAKE2b-256 | d89666e670c810e8476012d523079673f386e7874a5101f65d5632f683f39ae2