TextPreprocessor

A Python package for automating text preprocessing tasks.

TextPreprocessor is a comprehensive Python library for text preprocessing in NLP tasks. It provides tokenization, punctuation removal, stopword removal, lemmatization, spell correction, and more. The package is designed to streamline text preprocessing for data analysis, machine learning, and natural language processing projects.


Features

  • Tokenization
  • Stopword removal (with customizable stopwords)
  • Punctuation removal
  • Special character removal
  • URL and HTML tag removal
  • Lowercasing
  • Lemmatization (WordNet-based)
  • Spell correction
  • Modular preprocessing pipeline

Installation

Ensure the following dependencies are installed:

  • Python 3.7 or higher
  • Required Python packages:
    pip install nltk pyspellchecker pandas
    

Additionally, download required NLTK resources:

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

Usage

Initialization

Import the TextPreprocessor class and initialize it. Optionally, pass a list of custom stopwords:

from textPreprocessingToolkit import TextPreprocessor

# Initialize with optional custom stopwords
tpt = TextPreprocessor(custom_stopwords=["example", "test"])

Preprocessing Text

Preprocess a single piece of text by specifying a sequence of preprocessing steps:

text = "Hello! This is an <b>example</b> sentence. Visit https://example.com for more info!"
processed_text = tpt.preprocess(
    text, 
    steps=[
        "lowercase",
        "remove_punctuation",
        "remove_special_characters",
        "remove_url",
        "remove_html_tags",
        "correct_spellings",
        "lemmatize_text"
    ]
)
print("Processed Text:", processed_text)

Output:

Processed Text: hello this is an example sentence visit for more info

Batch Processing

You can preprocess a batch of texts and view a summary:

texts = [
    "NLP preprocessing includes tokenization, lemmatization, and stemming.",
    "Special characters like @, $, %, &, should be removed!",
    "Spelling erorrs in this sentense should be fixed.",
]
tpt.head(texts, n=3)

This will display a table (in Jupyter or IPython environments) with the following columns:

  • Original Text
  • Processed Text
  • Word Count
  • Character Count
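As a rough illustration of what such a summary contains, the table above can be sketched in plain Python. This is not the package's implementation; the `preprocess` function below is a trivial stand-in (lowercasing only) for the real pipeline:

```python
# Minimal sketch of a batch summary like tpt.head().
# preprocess() here is a placeholder, not the package's method.
def preprocess(text: str) -> str:
    return text.lower()

def summarize(texts):
    rows = []
    for original in texts:
        processed = preprocess(original)
        rows.append({
            "Original Text": original,
            "Processed Text": processed,
            "Word Count": len(processed.split()),
            "Character Count": len(processed),
        })
    return rows

rows = summarize(["Hello World!"])
print(rows[0]["Word Count"])  # 2
```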

Modular Methods

You can also use individual methods for specific preprocessing tasks:

text = "Check for spelling erorrs in this sentense."
print("Tokenized:", tpt.tokenize(text))
print("Spell-corrected:", tpt.correct_spellings(text))
print("Lemmatized:", tpt.lemmatize_text(text))

Class Documentation

TextPreprocessor

Initialization:

TextPreprocessor(custom_stopwords: Optional[List[str]] = None)
  • custom_stopwords: (Optional) A list of additional stopwords to remove.

Methods:

  • preprocess(text: str, steps: Optional[List[str]] = None) -> str
    • Preprocesses the input text according to the specified pipeline steps.
  • tokenize(text: str) -> List[str]
    • Tokenizes text into words.
  • remove_punctuation(text: str) -> str
    • Removes punctuation from the text.
  • remove_stopwords(tokens: List[str]) -> List[str]
    • Removes stopwords from a tokenized list.
  • remove_special_characters(text: str) -> str
    • Removes non-alphanumeric characters from the text.
  • correct_spellings(text: str) -> str
    • Corrects misspellings in the text.
  • lemmatize_text(text: str) -> str
    • Lemmatizes text using WordNet.
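Note that tokenize returns a token list while remove_stopwords consumes one, so the two chain naturally. The hand-off can be sketched in pure Python (this is an illustration only, not the package's implementation; the whitespace tokenizer and the toy stopword set are assumptions):

```python
from typing import List

# Toy stopword subset for illustration; the package uses NLTK's list
# plus any custom_stopwords passed at initialization.
STOPWORDS = {"is", "an", "the", "for", "this"}

def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer standing in for a real word tokenizer.
    return text.lower().split()

def remove_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("This is an example sentence")
print(remove_stopwords(tokens))  # ['example', 'sentence']
```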

Logging

The package includes built-in logging for debugging and tracking progress. A log message is emitted for each completed preprocessing step.


Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature-branch).
  3. Commit your changes (git commit -m 'Add feature').
  4. Push to the branch (git push origin feature-branch).
  5. Open a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Author

Developed by [Your Name]. Feel free to reach out for suggestions or collaboration!


Feedback

If you encounter any issues or have suggestions for improvement, please open an issue on GitHub or contact jaiswalgaurav863@gmail.com.


