TextPreprocessor

A Python package for automating text preprocessing tasks.

TextPreprocessor is a comprehensive Python library for text preprocessing in NLP tasks. It provides tokenization, punctuation removal, stopword removal, lemmatization, spell correction, and more. The package is designed to streamline text preprocessing for data analysis, machine learning, and natural language processing projects.


Features

  • Tokenization
  • Stopword removal (with customizable stopwords)
  • Punctuation removal
  • Special character removal
  • URL and HTML tag removal
  • Lowercasing
  • Lemmatization (WordNet-based)
  • Spell correction
  • Modular preprocessing pipeline

Installation

Ensure the following dependencies are installed:

  • Python 3.7 or higher
  • Required Python packages:
    pip install nltk pyspellchecker pandas
    

Additionally, download required NLTK resources:

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

Usage

Initialization

Import the TextPreprocessor class and initialize it. Optionally, pass a list of custom stopwords:

from textPreprocessingToolkit import TextPreprocessor

# Initialize with optional custom stopwords
tpt = TextPreprocessor(custom_stopwords=["example", "test"])

Preprocessing Text

Preprocess a single piece of text by specifying a sequence of preprocessing steps:

text = "Hello! This is an <b>example</b> sentence. Visit https://example.com for more info!"
processed_text = tpt.preprocess(
    text, 
    steps=[
        "lowercase",
        "remove_punctuation",
        "remove_special_characters",
        "remove_url",
        "remove_html_tags",
        "correct_spellings",
        "lemmatize_text"
    ]
)
print("Processed Text:", processed_text)

Output:

Processed Text: hello this is an example sentence visit for more info

Batch Processing

You can preprocess a batch of texts and view a summary:

texts = [
    "NLP preprocessing includes tokenization, lemmatization, and stemming.",
    "Special characters like @, $, %, &, should be removed!",
    "Spelling erorrs in this sentense should be fixed.",
]
tpt.head(texts, n=3)

This will display a table (in Jupyter or IPython environments) with the following columns:

  • Original Text
  • Processed Text
  • Word Count
  • Character Count
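As a rough illustration of what such a summary contains, the table above can be sketched in plain Python. This is not the package's implementation; the `preprocess` function below is a trivial stand-in (lowercasing only) for the real pipeline:

```python
# Minimal sketch of a batch summary like tpt.head().
# preprocess() here is a placeholder, not the package's method.
def preprocess(text: str) -> str:
    return text.lower()

def summarize(texts):
    rows = []
    for original in texts:
        processed = preprocess(original)
        rows.append({
            "Original Text": original,
            "Processed Text": processed,
            "Word Count": len(processed.split()),
            "Character Count": len(processed),
        })
    return rows

rows = summarize(["Hello World!"])
print(rows[0]["Word Count"])  # 2
```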

Modular Methods

You can also use individual methods for specific preprocessing tasks:

text = "Check for spelling erorrs in this sentense."
print("Tokenized:", tpt.tokenize(text))
print("Spell-corrected:", tpt.correct_spellings(text))
print("Lemmatized:", tpt.lemmatize_text(text))

Class Documentation

TextPreprocessor

Initialization:

TextPreprocessor(custom_stopwords: Optional[List[str]] = None)
  • custom_stopwords: (Optional) A list of additional stopwords to remove.

Methods:

  • preprocess(text: str, steps: Optional[List[str]] = None) -> str
    • Preprocesses the input text according to the specified pipeline steps.
  • tokenize(text: str) -> List[str]
    • Tokenizes text into words.
  • remove_punctuation(text: str) -> str
    • Removes punctuation from the text.
  • remove_stopwords(tokens: List[str]) -> List[str]
    • Removes stopwords from a tokenized list.
  • remove_special_characters(text: str) -> str
    • Removes non-alphanumeric characters from the text.
  • correct_spellings(text: str) -> str
    • Corrects misspellings in the text.
  • lemmatize_text(text: str) -> str
    • Lemmatizes text using WordNet.
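Note that tokenize returns a token list while remove_stopwords consumes one, so the two chain naturally. The hand-off can be sketched in pure Python (this is an illustration only, not the package's implementation; the whitespace tokenizer and the toy stopword set are assumptions):

```python
from typing import List

# Toy stopword subset for illustration; the package uses NLTK's list
# plus any custom_stopwords passed at initialization.
STOPWORDS = {"is", "an", "the", "for", "this"}

def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer standing in for a real word tokenizer.
    return text.lower().split()

def remove_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("This is an example sentence")
print(remove_stopwords(tokens))  # ['example', 'sentence']
```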

Logging

The package includes built-in logging for debugging and tracking progress. A log message is emitted for each completed preprocessing step.


Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature-branch).
  3. Commit your changes (git commit -m 'Add feature').
  4. Push to the branch (git push origin feature-branch).
  5. Open a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Author

Developed by [Your Name]. Feel free to reach out for suggestions or collaboration!


Feedback

If you encounter any issues or have suggestions for improvement, please open an issue on GitHub or contact jaiswalgaurav863@gmail.com.


