Text Preprocessing package for NLP projects

Project description

PrepGem

PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.

Features

PrepGem offers the following features:

Handle Missing Values: Easily handle missing values in specified DataFrame columns.
Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
Remove URLs: Remove URLs from text or DataFrame columns.
Remove Punctuation: Remove punctuation from text or DataFrame columns.
Remove Emojis: Remove emojis from text or DataFrame columns.
Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
Remove Numbers: Remove numbers from text or DataFrame columns.
Lowercasing: Convert text to lowercase in text or DataFrame columns.
Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
Spell Correction: Perform spell-checking on text or DataFrame columns.
Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
Tokenize: Tokenize text using NLTK's word_tokenize function.
Remove Stopwords: Remove stopwords from text tokens.
Stemming: Perform stemming on text tokens.
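
To give a feel for what a few of these steps involve, here is a minimal standard-library sketch. It is not PrepGem's implementation: the `sketch_` function names and the exact rules (for example, shortening runs of three or more identical characters to two) are assumptions made purely for illustration.

```python
import re
import string


def sketch_remove_punctuation(text: str) -> str:
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def sketch_remove_numbers(text: str) -> str:
    # Delete runs of digits.
    return re.sub(r"\d+", "", text)


def sketch_remove_white_spaces(text: str) -> str:
    # Collapse runs of whitespace to a single space and trim both ends.
    return " ".join(text.split())


def sketch_remove_repeated_characters(text: str) -> str:
    # Assumed rule: a character repeated 3+ times in a row is cut to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


sample = "Sooooo   goooood!!!  Call 12345 now..."
step1 = sketch_remove_repeated_characters(sample)
step2 = sketch_remove_punctuation(step1)
step3 = sketch_remove_numbers(step2)
result = sketch_remove_white_spaces(step3.lower())
print(result)  # soo good call now
```

The real package may apply different rules or a different order; check its output on your own data before relying on any particular behavior.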

Installation

You can install PrepGem via pip:

pip install prepgem

Usage

Importing the module

import prepgem

Basic Usage

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Preprocessing a single text

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)

Preprocessing a DataFrame

import pandas as pd

# Create a sample DataFrame
data = {
    'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)

# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)

Default preprocessing pipeline

When no pipeline is specified, preprocess_text applies the following default steps:

clean_html_text
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Custom preprocessing pipeline

You can customize the preprocessing steps by passing a list of step names through the pipeline parameter of preprocess_text. Available steps:

clean_html_text
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
remove_nonsense_words
spell_corrector
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming
handle_missing_values

Example usage

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)
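
A pipeline given as a list of step names can be pictured as a lookup table from names to functions, applied in order. The sketch below illustrates that idea only; `STEPS`, `run_pipeline`, `strip_tags`, and `to_lower` are hypothetical names, and nothing here is taken from PrepGem's source.

```python
import re
from typing import Callable, Dict, List


# Hypothetical stand-ins for two steps; not PrepGem's implementations.
def strip_tags(text: str) -> str:
    # Remove anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)


def to_lower(text: str) -> str:
    return text.lower()


# Map each step name to the function that performs it.
STEPS: Dict[str, Callable[[str], str]] = {
    "clean_html_text": strip_tags,
    "lowercasing": to_lower,
}


def run_pipeline(text: str, pipeline: List[str]) -> str:
    # Apply the named steps one after another, in list order.
    for name in pipeline:
        if name not in STEPS:
            raise ValueError(f"unknown step: {name}")
        text = STEPS[name](text)
    return text


out = run_pipeline("An <b>Example</b> Text", ["clean_html_text", "lowercasing"])
print(out)  # an example text
```

With a design like this, the order of names in the list determines the order in which steps run, so it is worth checking whether the order matters for your data.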

To skip steps rather than select them, pass remove=True to preprocess_text together with the pipeline list; the listed steps are then removed instead of applied.

Example usage

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)
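
One plausible reading of remove=True is that the listed steps are subtracted from the default pipeline. The sketch below encodes that reading as an assumption; `resolve_steps` and `DEFAULT_STEPS` are hypothetical names, and the actual behavior should be confirmed against the package itself.

```python
from typing import List, Optional

# Default step names as listed in this README.
DEFAULT_STEPS = [
    "clean_html_text",
    "remove_urls",
    "remove_punctuation",
    "remove_emojis",
    "remove_foreign_letters",
    "remove_numbers",
    "lowercasing",
    "remove_white_spaces",
    "remove_repeated_characters",
    "nosense_words_and_spell_check",
    "tokenize",
    "remove_stopwords",
    "stemming",
]


def resolve_steps(pipeline: Optional[List[str]] = None, remove: bool = False) -> List[str]:
    # No pipeline given: use every default step.
    if pipeline is None:
        return list(DEFAULT_STEPS)
    # Assumed meaning of remove=True: defaults minus the listed steps.
    if remove:
        return [step for step in DEFAULT_STEPS if step not in pipeline]
    # Otherwise run exactly the listed steps.
    return list(pipeline)


steps = resolve_steps(["clean_html_text"], remove=True)
print(len(steps), steps[0])  # 12 remove_urls
```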

Every step can also be called as a standalone function: pass it the text, or the DataFrame containing the text column, to be cleaned.

from prepgem import remove_urls

# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."

# Remove URLs from the text
cleaned_text = remove_urls(text_with_urls)

print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)

This will output:

Original text:
This is an example text with URLs: https://example.com and http://www.example.org.

Text after removing URLs:
This is an example text with URLs:  and .
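
Note that the output keeps the sentence's final period while dropping the URL before it. A regular-expression sketch that behaves the same way on this input is shown below. The pattern and the name `strip_urls` are assumptions for illustration; remove_urls may well be implemented differently.

```python
import re

# Match http(s) URLs lazily, stopping before trailing punctuation that is
# followed by whitespace or the end of the string.
URL_PATTERN = re.compile(r"https?://\S+?(?=[.,;:!?)]*(?:\s|$))")


def strip_urls(text: str) -> str:
    # Replace each matched URL with the empty string.
    return URL_PATTERN.sub("", text)


text_with_urls = (
    "This is an example text with URLs: "
    "https://example.com and http://www.example.org."
)
print(strip_urls(text_with_urls))
# This is an example text with URLs:  and .
```

The leftover double space can be tidied by a later whitespace-normalization step.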

Project details

Release history

1.0.6 (Mar 22, 2024)
1.0.5 (Mar 22, 2024)
1.0.4 (Mar 21, 2024)
1.0.3 (Mar 19, 2024)
1.0.2 (Mar 18, 2024, this version)
1.0.1 (Feb 23, 2024)

Download files

Source Distribution: PrepGem-1.0.2.tar.gz (6.2 kB), uploaded Mar 18, 2024
Built Distribution: PrepGem-1.0.2-py3-none-any.whl (7.0 kB), uploaded Mar 18, 2024, Python 3

Hashes for PrepGem-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`b819ce2e82e0a410eb59b351671d14c68bda03c0500e9fb874339e129950d219`
MD5	`d4ccc1739c093afaeb6827e81f759e78`
BLAKE2b-256	`935ecb76c588f6dd22cefae313e3decc3208cafed9324323908cbc3680eb9756`

Hashes for PrepGem-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6856c91bc2cbf7b0251f8aa0fa833c767412c80571cc31941f9e3c56c8e473e4`
MD5	`5a1eb02a96e6867d08a275b99a773a3e`
BLAKE2b-256	`3c67b777fc56298df132f8f40b857f12ea15ee52a6bd1dd4d78f6aec11602918`