A package that automates text preprocessing

Text Preprocessing Toolkit (TPT)

Version: 0.0.1
Author: Gaurav Jaiswal
A Python toolkit for text preprocessing, designed to simplify NLP workflows. It provides utilities for stopword removal, punctuation handling, spell-checking, lemmatization, stemming, and HTML tag and URL removal, which can be combined into a configurable cleaning pipeline.


Features

  • Remove Punctuation: Strips punctuation marks from text.
  • Remove Stopwords: Removes common stopwords to reduce noise in textual data.
  • Remove Special Characters: Cleans text by removing unnecessary symbols.
  • Lowercase Conversion: Standardizes text to lowercase.
  • Spell Correction: Identifies and corrects misspelled words.
  • Lemmatization: Converts words to their base forms.
  • Stemming: Reduces words to their root forms using a stemming algorithm.
  • HTML Tag Removal: Cleans HTML tags from the text.
  • URL Removal: Detects and removes URLs.
  • Customizable Pipeline: Allows users to apply preprocessing steps in a specified order.
  • Quick Dataset Preview: Provides a summary of text datasets, including word and character counts.
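To make two of the features above concrete, here is a minimal sketch of regex-based HTML tag and URL removal using only the standard library. The helper names `strip_html_tags` and `strip_urls` and both patterns are assumptions made for illustration; they are not taken from the toolkit's source, and its own implementation may differ.

```python
import re

# Hypothetical helpers for illustration only; not the toolkit's own code.
TAG_RE = re.compile(r"<[^>]+>")                 # anything shaped like <...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) and www-style URLs


def strip_html_tags(text: str) -> str:
    """Drop anything that looks like an HTML tag, keeping the inner text."""
    return TAG_RE.sub("", text)


def strip_urls(text: str) -> str:
    """Drop URLs, then collapse any leftover runs of whitespace."""
    return " ".join(URL_RE.sub("", text).split())


sample = "Another <b>example</b> with a URL: https://example.com."
cleaned = strip_urls(strip_html_tags(sample))
print(cleaned)  # Another example with a URL:
```

Note that the simple URL pattern also swallows the trailing period, because `\S+` matches every non-whitespace character after the scheme.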

Installation

Clone the repository or install the package using pip:

pip install Text_Preprocessing_Toolkit

Usage

Import the Package

from TPT import TPT

Initialize the Toolkit

You can add custom stopwords during initialization:

tpt = TPT(custom_stopwords=["example", "custom"])
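
One plausible way custom stopwords could be combined with a built-in list is a simple set union, sketched below. `BASE_STOPWORDS`, `build_stopwords`, and `drop_stopwords` are hypothetical names, and the tiny base list is a stand-in for whatever list the toolkit actually ships with.

```python
# Tiny illustrative base list (an assumption); a real toolkit would
# typically load a fuller list, for example from NLTK.
BASE_STOPWORDS = {"this", "is", "an", "a", "with", "the"}


def build_stopwords(custom_stopwords=None):
    """Return the base stopwords plus any user-supplied ones, lowercased."""
    extra = {word.lower() for word in (custom_stopwords or [])}
    return BASE_STOPWORDS | extra


def drop_stopwords(text, stopwords):
    """Keep only whitespace-separated tokens that are not stopwords."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


stopwords = build_stopwords(["example", "custom"])
result = drop_stopwords("This is an example of a custom rule", stopwords)
print(result)  # of rule
```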

Preprocess Text with Default Pipeline

text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)

Customize Preprocessing Steps

custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
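
A step list like the one above is commonly implemented as a lookup from step names to functions that are applied in order. The sketch below shows that pattern under stated assumptions: `STEP_REGISTRY`, `run_pipeline`, and the three step functions are hypothetical and written here for illustration, not copied from the toolkit.

```python
import string

STOPWORDS = {"this", "is", "a", "with"}  # illustrative list, an assumption


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text):
    # Case-sensitive on purpose, to show why step order matters.
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# Hypothetical registry mapping step names to callables.
STEP_REGISTRY = {
    "lowercase": lowercase,
    "remove_punctuation": remove_punctuation,
    "remove_stopwords": remove_stopwords,
}


def run_pipeline(text, steps):
    """Apply each named step to the text in the order given."""
    for name in steps:
        if name not in STEP_REGISTRY:
            raise ValueError(f"Unknown preprocessing step: {name!r}")
        text = STEP_REGISTRY[name](text)
    return text


raw = "This is a Sample, with punctuation!"
in_order = run_pipeline(raw, ["lowercase", "remove_punctuation", "remove_stopwords"])
reordered = run_pipeline(raw, ["remove_stopwords", "lowercase", "remove_punctuation"])
print(in_order)   # sample punctuation
print(reordered)  # this sample punctuation
```

The second call shows why the order of steps matters: when stopwords are removed before lowercasing, the capitalized "This" does not match the lowercase list and survives.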

Quick Dataset Summary

texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)
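
The kind of summary described for `head` (word and character counts for the first `n` texts) can be approximated in a few lines of plain Python. The `summarize` function and its output layout below are assumptions for illustration; the toolkit's `head` may present its results differently, for example as a pandas table.

```python
def summarize(texts, n=5):
    """Return word and character counts for the first n texts."""
    rows = []
    for text in texts[:n]:
        rows.append(
            {
                "text": text,
                "word_count": len(text.split()),  # whitespace-separated tokens
                "char_count": len(text),          # includes spaces and punctuation
            }
        )
    return rows


sample_texts = [
    "This is a sample text.",
    "Spellngg errors corrected!",
]
summary = summarize(sample_texts, n=2)
for row in summary:
    print(f'{row["word_count"]:>3} words  {row["char_count"]:>3} chars  {row["text"]}')
```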

Available Methods

Method                      Description
remove_punctuation          Removes punctuation from text.
remove_stopwords            Removes stopwords from text.
remove_special_characters   Cleans text by removing special characters.
remove_url                  Removes URLs from the text.
remove_html_tags            Strips HTML tags from text.
correct_spellings           Corrects spelling mistakes in the text.
lowercase                   Converts text to lowercase.
lemmatize_text              Lemmatizes text using WordNet.
stem_text                   Applies stemming to reduce words to their root forms.
preprocess                  Applies a series of preprocessing steps to the input text.
head                        Displays a quick summary of a text dataset.

Example Output

Input

This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!

Output (Default Pipeline)

sample text check spelling errors

Requirements

  • Python >= 3.8
  • Libraries: nltk, pandas, spellchecker, IPython

To install the dependencies:

pip install -r requirements.txt

Contributing

Contributions are welcome! To contribute:

  1. Fork this repository.
  2. Clone your forked repository.
  3. Create a new branch for your feature.
  4. Make your changes, write tests, and ensure the code passes.
  5. Submit a pull request for review.

Testing

To test the package locally:

  1. Install development dependencies:
    pip install pytest
    
  2. Run tests:
    pytest
    

License

This project is licensed under the MIT License. See the LICENSE file for details.

