A package that automates text preprocessing
Text Preprocessing Toolkit (TPT)
Version: 0.0.1
Author: Gaurav Jaiswal
A Python toolkit for preprocessing text, designed to simplify NLP workflows. It provides utilities for stopword removal, punctuation handling, spell-checking, lemmatization, stemming, and HTML and URL removal, which can be run individually or chained in a configurable pipeline.
Features
- Remove Punctuation: Strips punctuation marks from text.
- Remove Stopwords: Removes common stopwords to reduce noise in textual data.
- Remove Special Characters: Cleans text by removing unnecessary symbols.
- Lowercase Conversion: Standardizes text to lowercase.
- Spell Correction: Identifies and corrects misspelled words.
- Lemmatization: Converts words to their base forms.
- Stemming: Reduces words to their root forms using a stemming algorithm.
- HTML Tag Removal: Cleans HTML tags from the text.
- URL Removal: Detects and removes URLs.
- Customizable Pipeline: Allows users to apply preprocessing steps in a specified order.
- Quick Dataset Preview: Provides a summary of text datasets, including word and character counts.
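To make the cleaning features above concrete, here is a minimal standard-library sketch of tag, URL, and punctuation removal. The helper names and regular expressions are illustrative assumptions, not the package's actual implementation, which may behave differently on edge cases.

```python
import re
import string

# Hypothetical patterns for illustration; TPT's own patterns may differ.
_HTML_TAG = re.compile(r"<[^>]+>")
_URL = re.compile(r"https?://\S+|www\.\S+")


def strip_html_tags(text: str) -> str:
    """Drop anything shaped like an HTML tag, keeping the text between tags."""
    return _HTML_TAG.sub("", text)


def strip_urls(text: str) -> str:
    """Drop http(s) and www-style URLs, up to the next whitespace."""
    return _URL.sub("", text)


def strip_punctuation(text: str) -> str:
    """Drop every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of whitespace left behind by the removals above."""
    return " ".join(text.split())


sample = "This is an <b>example</b> sentence with a URL: https://example.com."
cleaned = strip_html_tags(sample)
cleaned = strip_urls(cleaned)
cleaned = strip_punctuation(cleaned.lower())
cleaned = collapse_whitespace(cleaned)
print(cleaned)  # this is an example sentence with a url
```

Stopword removal, spell correction, lemmatization, and stemming need word lists or language models, so they are left out of this sketch.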
Installation
Clone the repository, or install the package using pip:
pip install Text_Preprocessing_Toolkit
Usage
Import the Package
from TPT import TPT
Initialize the Toolkit
You can add custom stopwords during initialization:
tpt = TPT(custom_stopwords=["example", "custom"])
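As a rough illustration of what custom stopwords do, the sketch below merges a user-supplied list into a base set and filters tokens against the result. The function name `drop_stopwords` and the tiny hard-coded base set are assumptions made to keep the example self-contained; the package presumably relies on a fuller stopword list.

```python
from typing import Iterable, Optional

# Tiny illustrative base set; a real stopword list is far larger.
BASE_STOPWORDS = {"a", "an", "is", "the", "this", "with"}


def drop_stopwords(text: str, custom_stopwords: Optional[Iterable[str]] = None) -> str:
    """Remove base and custom stopwords, comparing case-insensitively."""
    stopwords = BASE_STOPWORDS | {w.lower() for w in (custom_stopwords or [])}
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


result = drop_stopwords(
    "This is an example with custom words",
    custom_stopwords=["example", "custom"],
)
print(result)  # words
```

Custom entries are useful for domain terms that carry little signal in a given corpus, such as a product name that appears in every document.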
Preprocess Text with Default Pipeline
text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)
Customize Preprocessing Steps
custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
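The order of the steps can change the result. The sketch below shows one way an ordered list of step names could be dispatched to functions; `STEP_REGISTRY` and `run_pipeline` are hypothetical names, and this is not necessarily how `preprocess` is implemented. It also shows why URL removal is best placed before punctuation removal.

```python
import re
import string
from typing import Callable, Dict, List

# Hypothetical registry mapping step names to plain text-to-text functions.
STEP_REGISTRY: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "remove_punctuation": lambda t: t.translate(
        str.maketrans("", "", string.punctuation)
    ),
    "remove_url": lambda t: re.sub(r"https?://\S+", "", t),
}


def run_pipeline(text: str, steps: List[str]) -> str:
    """Apply each named step in the order given, then tidy whitespace."""
    for name in steps:
        try:
            step = STEP_REGISTRY[name]
        except KeyError:
            raise ValueError(f"Unknown preprocessing step: {name!r}") from None
        text = step(text)
    return " ".join(text.split())


text = "Visit https://example.com NOW!"
url_first = run_pipeline(text, ["remove_url", "lowercase", "remove_punctuation"])
punct_first = run_pipeline(text, ["remove_punctuation", "remove_url", "lowercase"])
print(url_first)    # visit now
print(punct_first)  # visit httpsexamplecom now
```

In the second ordering, stripping punctuation first destroys the `://` that the URL pattern depends on, so the URL survives as a single junk token.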
Quick Dataset Summary
texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)
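The word and character counts in such a summary can be approximated with plain Python. This is a sketch under the assumption that words are split on whitespace; `summarize` is a hypothetical stand-in, and the actual `head` output format may differ.

```python
from typing import Dict, List, Union


def summarize(texts: List[str], n: int = 5) -> List[Dict[str, Union[str, int]]]:
    """Return word and character counts for the first n texts."""
    rows = []
    for text in texts[:n]:
        rows.append(
            {
                "text": text,
                "word_count": len(text.split()),  # whitespace-separated tokens
                "char_count": len(text),          # includes spaces and punctuation
            }
        )
    return rows


texts = ["This is a sample text.", "Spellngg errors corrected!"]
summary = summarize(texts, n=2)
for row in summary:
    print(f"{row['word_count']:>5} {row['char_count']:>5}  {row['text']}")
```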
Available Methods
Method | Description |
---|---|
remove_punctuation | Removes punctuation from text. |
remove_stopwords | Removes stopwords from text. |
remove_special_characters | Cleans text by removing special characters. |
remove_url | Removes URLs from the text. |
remove_html_tags | Strips HTML tags from text. |
correct_spellings | Corrects spelling mistakes in the text. |
lowercase | Converts text to lowercase. |
lemmatize_text | Lemmatizes text using WordNet. |
stem_text | Applies stemming to reduce words to their root forms. |
preprocess | Applies a series of preprocessing steps to the input text. |
head | Displays a quick summary of a text dataset. |
Example Output
Input
This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!
Output (Default Pipeline)
sample text check spelling errors
Requirements
- Python >= 3.8
- Libraries: nltk, pandas, spellchecker, IPython
To install the dependencies:
pip install -r requirements.txt
Contributing
Contributions are welcome! To contribute:
- Fork this repository.
- Clone your forked repository.
- Create a new branch for your feature.
- Make your changes, write tests, and ensure the code passes.
- Submit a pull request for review.
Testing
To test the package locally:
- Install development dependencies:
pip install pytest
- Run tests:
pytest
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
- Gaurav Jaiswal (GitHub)