Skip to main content

Text Preprocessing package for NLP projects

Project description

PrepGem

PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.

Features

PrepGem offers the following features:

  • Handle Missing Values: Easily handle missing values in specified DataFrame columns.
  • Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
  • Remove URLs: Remove URLs from text or DataFrame columns.
  • Remove Punctuation: Remove punctuation from text or DataFrame columns.
  • Remove Emojis: Remove emojis from text or DataFrame columns.
  • Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
  • Remove Numbers: Remove numbers from text or DataFrame columns.
  • Lowercasing: Convert text to lowercase in text or DataFrame columns.
  • Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
  • Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
  • Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
  • Spell Correction: Perform spell-checking on text or DataFrame columns.
  • Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
  • Tokenize: Tokenize text using NLTK's word_tokenize function.
  • Remove Stopwords: Remove stopwords from text tokens.
  • Stemming: Perform stemming on text tokens.

Installation

You can install PrepGem via pip:

pip install prepgem 

Usage

Importing the module python

import prepgem 

Basic Usage

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Preprocessing a single text

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)

Preprocessing a DataFrame

import pandas as pd

# Create a sample DataFrame
data = {
    'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)

# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)

Default preprocessing pipeline

Default available preprocessing step is:

  • clean_html_text.
  • remove_urls
  • remove_punctuation
  • remove_emojis
  • remove_foreign_letters
  • remove_numbers
  • lowercasing
  • remove_white_spaces
  • remove_repeated_characters
  • nosense_words_and_spell_check
  • tokenize
  • remove_stopwords
  • stemming
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text) 

Custom preprocessing pipeline

You can customize the preprocessing steps by passing a list of parameters to the preprocess_text method. Available parameters include:

  • clean_html_text.
  • remove_urls
  • remove_punctuation
  • remove_emojis
  • remove_foreign_letters
  • remove_numbers
  • lowercasing
  • remove_white_spaces
  • remove_repeated_characters
  • remove_nonsense_words
  • spell_corrector
  • nosense_words_and_spell_check
  • tokenize
  • remove_stopwords
  • stemming
  • handle_missing_values
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)

You can customize the preprocessing steps by passing a parameter remove with value of True remove=True to the preprocess_text method to remove a step. Available parameters include:

Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)

You can use all step as normal function just by passing The text or DataFrame containing the text column to be cleaned

from prepgem import remove_urls

# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."

# Remove URLs from the text

cleaned_text = remove_urls(text_with_urls)

print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)

This will output:

Original text:
This is an example text with URLs: https://example.com and http://www.example.org.

Text after removing URLs:
This is an example text with URLs:  and .

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PrepGem-1.0.4.tar.gz (6.2 kB view hashes)

Uploaded Source

Built Distribution

PrepGem-1.0.4-py3-none-any.whl (7.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page