
A Python package for automating text preprocessing tasks.


TPTK: Text Preprocessing Toolkit

TPTK (Text Preprocessing Toolkit) is a Python library for text preprocessing in Natural Language Processing (NLP). It offers a comprehensive set of tools to clean, tokenize, lemmatize, and otherwise preprocess text efficiently, and lets you apply individual preprocessing steps or run an end-to-end pipeline.


Features

  • Text Cleaning: Remove punctuation, special characters, URLs, and HTML tags.
  • Tokenization: Convert text into individual tokens.
  • Lemmatization: Reduce words to their base forms using WordNet.
  • Spell Correction: Detect and correct misspelled words.
  • Stopword Removal: Filter out common stopwords.
  • Customizable Pipelines: Define the sequence of preprocessing steps.
  • Text Statistics: Summarize text with the head function.
  • Modular and user-friendly design.

Installation

Install the package and its dependencies:

pip install tptk

Getting Started

Importing the Library

from TextPreprocessingToolkit.tptk import TextPreprocessor

Usage Guide

1. Initialize the Preprocessor

# Initialize the TextPreprocessor
tp = TextPreprocessor(custom_stopwords=["custom", "words"])

You can provide additional stopwords using the custom_stopwords parameter.
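Conceptually, custom stopwords extend the default stopword list before filtering. A minimal self-contained sketch (not the library's implementation; the default-list subset here is purely illustrative):

```python
# Illustrative subset of a default stopword list.
DEFAULT_STOPWORDS = {"is", "an", "the", "a"}

def remove_stopwords(tokens, custom_stopwords=None):
    # Union the defaults with any user-supplied stopwords, then filter.
    stopwords = DEFAULT_STOPWORDS | set(custom_stopwords or [])
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords(["This", "is", "a", "custom", "example"],
                       custom_stopwords=["custom"]))
# ['This', 'example']
```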


2. Core Functions

Each function targets a specific aspect of preprocessing:

Tokenization

Break text into individual tokens (words).

text = "This is an example sentence."
tokens = tp.tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'sentence', '.']
Remove Punctuation

Strip punctuation marks.

text = "Hello, world! How's it going?"
cleaned_text = tp.remove_punctuation(text)
print(cleaned_text)
# Output: "Hello world Hows it going"
Stopword Removal

Remove common stopwords from tokenized text.

tokens = ['This', 'is', 'an', 'example']
filtered_tokens = tp.remove_stopwords(tokens)
print(filtered_tokens)
# Output: ['example']
Lemmatization

Reduce words to their base form.

text = "running faster"
lemmatized_text = tp.lemmatize_text(text)
print(lemmatized_text)
# Output: "run fast"
Spell Correction

Correct misspelled words.

text = "Ths is an exampel."
corrected_text = tp.correct_spellings(text)
print(corrected_text)
# Output: "This is an example."
Lowercase Conversion

Standardize text to lowercase.

text = "THIS IS A TEST."
lowercase_text = tp.lowercase(text)
print(lowercase_text)
# Output: "this is a test."
Remove URLs

Eliminate URLs from the text.

text = "Check this link: https://example.com"
url_removed = tp.remove_url(text)
print(url_removed)
# Output: "Check this link"
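Under the hood, URL removal is typically regex-based. A simple sketch of the idea (the library's actual pattern may differ, e.g. it may also trim trailing punctuation):

```python
import re

# Match http(s) URLs and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_url(text):
    return URL_PATTERN.sub("", text).strip()

print(remove_url("Check this link: https://example.com"))
# "Check this link:"
```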
Remove HTML Tags

Clean out HTML tags.

text = "<div>Hello World!</div>"
cleaned_text = tp.remove_html_tags(text)
print(cleaned_text)
# Output: "Hello World!"
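Tag stripping can likewise be sketched with a naive regex. Real-world HTML is more robust with a parser (e.g. Python's `html.parser`), but a regex illustrates the idea; this is not necessarily how the library implements it:

```python
import re

# Match anything of the form <...> and delete it.
TAG_PATTERN = re.compile(r"<[^>]+>")

def remove_html_tags(text):
    return TAG_PATTERN.sub("", text)

print(remove_html_tags("<div>Hello World!</div>"))
# "Hello World!"
```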

3. Using the Preprocessing Pipeline

Apply multiple preprocessing steps sequentially.

text = "Ths is an <b>example</b> of text preprocessing! Visit https://example.com"

# Apply a preprocessing pipeline
processed_text = tp.preprocess(
    text, steps=[
        "lowercase",
        "remove_url",
        "remove_html_tags",
        "remove_punctuation",
        "correct_spellings",
        "lemmatize_text"
    ]
)
print(processed_text)
# Output: "this example text preprocess"

By default, the pipeline includes:

  • Lowercase conversion
  • URL removal
  • HTML tag removal
  • Punctuation removal
  • Special character removal
  • Spell correction
  • Lemmatization
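A step-name pipeline like the one above can be implemented by dispatching each name to a method that takes and returns a string. `MiniPreprocessor` below is a hypothetical stand-in, not the library's actual class:

```python
import string

class MiniPreprocessor:
    def lowercase(self, text):
        return text.lower()

    def remove_punctuation(self, text):
        return text.translate(str.maketrans("", "", string.punctuation))

    def preprocess(self, text, steps):
        # Apply each named step in order; each step is a method on self.
        for step in steps:
            text = getattr(self, step)(text)
        return text

mp = MiniPreprocessor()
print(mp.preprocess("Hello, World!", steps=["lowercase", "remove_punctuation"]))
# "hello world"
```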

4. Analyze Text Using head

Summarize multiple text entries with head. It displays the original text, preprocessed text, word count, and character count.

texts = [
    "Ths is the frst example.",
    "Preprocessing is <b>important</b>!",
    "Visit https://example.com for details."
]

tp.head(texts, n=3)

Output Table (Rendered in Jupyter Notebook or IPython):

| Original Text | Processed Text | Word Count | Character Count |
|---|---|---|---|
| Ths is the frst example. | this first example | 3 | 19 |
| Preprocessing is important! | preprocess important | 2 | 19 |
| Visit https://example.com for details. | visit details | 2 | 13 |
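A head-style summary boils down to computing word and character counts over the processed text. A minimal sketch (the processing step and column semantics are assumed for illustration):

```python
def head_summary(texts, process=lambda s: s.lower(), n=5):
    """Summarize the first n texts: original, processed, and counts."""
    rows = []
    for text in texts[:n]:
        processed = process(text)
        rows.append({
            "original": text,
            "processed": processed,
            "word_count": len(processed.split()),
            "char_count": len(processed),
        })
    return rows

for row in head_summary(["Hello World"]):
    print(row)
```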

Custom Stopwords

Add specific stopwords to the default list:

tp = TextPreprocessor(custom_stopwords=["specific", "stopwords"])

Why Use TPTK?

  • Modular Design: Use only the components you need.
  • Customizable Pipelines: Tailor preprocessing steps to your project's needs.
  • Scalable: Suitable for everything from small prototypes to production-grade systems.
  • Easy Integration: Compatible with common Python-based NLP workflows.

Author and Credits

This package was developed by Gaurav Jaiswal with a focus on user-friendly text preprocessing solutions for NLP tasks. Contributions, feedback, and suggestions are welcome.


License

This project is licensed under the MIT License.
