Text Preprocessing package for NLP projects

PrepGem

PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.

Features

PrepGem offers the following features:

  • Handle Missing Values: Easily handle missing values in specified DataFrame columns.
  • Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
  • Remove URLs: Remove URLs from text or DataFrame columns.
  • Remove Punctuation: Remove punctuation from text or DataFrame columns.
  • Remove Emojis: Remove emojis from text or DataFrame columns.
  • Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
  • Remove Numbers: Remove numbers from text or DataFrame columns.
  • Lowercasing: Convert text to lowercase in text or DataFrame columns.
  • Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
  • Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
  • Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
  • Spell Correction: Perform spell-checking on text or DataFrame columns.
  • Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
  • Tokenize: Tokenize text using NLTK's word_tokenize function.
  • Remove Stopwords: Remove stopwords from text tokens.
  • Stemming: Perform stemming on text tokens.
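To make a few of these steps concrete, here is a minimal standard-library sketch of what punctuation removal, whitespace cleanup, and repeated-character reduction could look like. The function names are illustrative only and are not PrepGem's implementation; in particular, keeping at most two copies of a repeated character is an assumed policy.

```python
import re
import string

def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def squeeze_repeats(text: str, keep: int = 2) -> str:
    """Shorten runs of the same character to at most `keep` copies (assumed policy)."""
    return re.sub(r"(.)\1{%d,}" % keep, lambda m: m.group(1) * keep, text)

sample = "Sooooo   goooood!!!  "
result = collapse_whitespace(squeeze_repeats(strip_punctuation(sample.lower())))
print(result)  # soo good
```

Note that a purely mechanical rule like this cannot tell "good" from "goood"; that is presumably why the package also offers spell correction as a separate step.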

Installation

You can install PrepGem via pip:

pip install prepgem 

Usage

Importing the module

import prepgem 

Basic Usage

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Preprocessing a single text

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)

Preprocessing a DataFrame

import pandas as pd

# Create a sample DataFrame
data = {
    'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)

# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)

Default preprocessing pipeline

The default pipeline applies the following steps:

  • clean_html_text
  • remove_urls
  • remove_punctuation
  • remove_emojis
  • remove_foreign_letters
  • remove_numbers
  • lowercasing
  • remove_white_spaces
  • remove_repeated_characters
  • nosense_words_and_spell_check
  • tokenize
  • remove_stopwords
  • stemming

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text) 
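One way to picture a pipeline is as an ordered list of step names, each mapped to a function that is applied in turn. The sketch below shows that general pattern under stated assumptions; it is not PrepGem's source code. The STEPS registry and run_pipeline helper are hypothetical, and only four steps are mimicked with simplified regular expressions.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry: step name -> function. Simplified stand-ins only.
STEPS: Dict[str, Callable[[str], str]] = {
    "clean_html_text": lambda t: re.sub(r"<[^>]+>", "", t),
    "remove_urls": lambda t: re.sub(r"https?://\S+", "", t),
    "lowercasing": str.lower,
    "remove_white_spaces": lambda t: re.sub(r"\s+", " ", t).strip(),
}

def run_pipeline(text: str, pipeline: List[str]) -> str:
    """Apply each named step to the text, in list order."""
    for name in pipeline:
        text = STEPS[name](text)
    return text

text = "This is an example text with <html> tags and URLs: https://example.com."
print(run_pipeline(text, list(STEPS)))
```

Order matters in a design like this: lowercasing before spell checking, or tokenizing before stopword removal, changes what later steps see.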

Custom preprocessing pipeline

You can customize the preprocessing steps by passing a list of step names as the pipeline argument of preprocess_text. Available steps include:

  • clean_html_text
  • remove_urls
  • remove_punctuation
  • remove_emojis
  • remove_foreign_letters
  • remove_numbers
  • lowercasing
  • remove_white_spaces
  • remove_repeated_characters
  • remove_nonsense_words
  • spell_corrector
  • nosense_words_and_spell_check
  • tokenize
  • remove_stopwords
  • stemming
  • handle_missing_values

Example usage

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)

You can also remove steps instead of selecting them: pass remove=True to preprocess_text, and the steps listed in pipeline are removed rather than applied.

Example usage

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)
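The difference between selecting and removing steps can be sketched with plain lists. This is a hedged illustration only: it assumes that remove=True drops the named steps from the default pipeline, which this README does not spell out. DEFAULT_STEPS and resolve_steps are hypothetical names, not part of PrepGem's API.

```python
from typing import List, Optional

# Default steps as listed in this README.
DEFAULT_STEPS: List[str] = [
    "clean_html_text", "remove_urls", "remove_punctuation", "remove_emojis",
    "remove_foreign_letters", "remove_numbers", "lowercasing",
    "remove_white_spaces", "remove_repeated_characters",
    "nosense_words_and_spell_check", "tokenize", "remove_stopwords", "stemming",
]

def resolve_steps(pipeline: Optional[List[str]] = None, remove: bool = False) -> List[str]:
    """Return the step names that would run (assumed semantics)."""
    if pipeline is None:
        return list(DEFAULT_STEPS)
    if remove:
        # Keep the default order, skipping the named steps.
        return [s for s in DEFAULT_STEPS if s not in pipeline]
    # Otherwise run only the named steps, in the order given.
    return list(pipeline)

print(resolve_steps(["clean_html_text"], remove=True)[:2])
```

Under this reading, the call above would run every default step except clean_html_text.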

Every step can also be called as a standalone function: pass it the text, or the DataFrame containing the text column to be cleaned.

from prepgem import remove_urls

# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."

# Remove URLs from the text

cleaned_text = remove_urls(text_with_urls)

print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)

This will output:

Original text:
This is an example text with URLs: https://example.com and http://www.example.org.

Text after removing URLs:
This is an example text with URLs:  and .
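The leftover " and ." in this output is what a simple pattern-based removal produces: only the URL itself is deleted, while the surrounding words, spaces, and punctuation stay. A minimal sketch with the standard re module follows; the pattern is an assumption chosen to reproduce the output above, not necessarily the one PrepGem uses, and strip_urls is a hypothetical name.

```python
import re

# Assumed pattern: scheme, then non-space characters, ending at a word
# boundary so trailing punctuation such as "." is left in place.
URL_PATTERN = re.compile(r"https?://\S+\b")

def strip_urls(text: str) -> str:
    """Delete each URL match; surrounding text is untouched."""
    return URL_PATTERN.sub("", text)

text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."
print(strip_urls(text_with_urls))
```

If the doubled space matters, a whitespace cleanup step such as remove_white_spaces can be run afterwards.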

