A Python package for automating text preprocessing tasks.

Project description

TPTK: Text Preprocessing Toolkit

TPTK (Text Preprocessing Toolkit) is a Python library designed for text preprocessing in Natural Language Processing (NLP). It offers a comprehensive set of tools to clean, tokenize, lemmatize, and preprocess text efficiently. The library lets users apply individual preprocessing steps or run a full pipeline for end-to-end text preprocessing.


Features

  • Text Cleaning: Remove punctuation, special characters, URLs, and HTML tags.
  • Tokenization: Convert text into individual tokens.
  • Lemmatization: Reduce words to their base forms using WordNet.
  • Spell Correction: Detect and correct misspelled words.
  • Stopword Removal: Filter out common stopwords.
  • Customizable Pipelines: Define the sequence of preprocessing steps.
  • Text Statistics: Summarize text with the head function.
  • Modular and user-friendly design.
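The "Customizable Pipelines" feature above can be sketched in plain Python. This is a minimal illustration of the name-driven dispatch pattern, not TPTK's actual internals; the helper functions and the STEPS registry here are hypothetical:

```python
import re

# Two illustrative steps (stand-ins, not TPTK's own implementations).
def lowercase(text):
    return text.lower()

def remove_url(text):
    # Strip http/https URLs and trim leftover whitespace.
    return re.sub(r"https?://\S+", "", text).strip()

# Registry mapping step names to functions, so a pipeline can be
# configured as a simple list of strings.
STEPS = {"lowercase": lowercase, "remove_url": remove_url}

def run_pipeline(text, steps):
    # Apply each named step in order.
    for name in steps:
        text = STEPS[name](text)
    return text

print(run_pipeline("Visit https://example.com", ["remove_url", "lowercase"]))
# Output: "visit"
```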

Installation

Install the package and its dependencies:

pip install tptk

Getting Started

Importing the Library

from TextPreprocessingToolkit.tptk import TextPreprocessor

Usage Guide

1. Initialize the Preprocessor

# Initialize the TextPreprocessor
tp = TextPreprocessor(custom_stopwords=["custom", "words"])

You can provide additional stopwords using the custom_stopwords parameter.


2. Core Functions

Each function targets a specific aspect of preprocessing:

Tokenization

Break text into individual tokens (words).

text = "This is an example sentence."
tokens = tp.tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'sentence', '.']

Remove Punctuation

Strip punctuation marks.

text = "Hello, world! How's it going?"
cleaned_text = tp.remove_punctuation(text)
print(cleaned_text)
# Output: "Hello world Hows it going"

Stopword Removal

Remove common stopwords from tokenized text.

tokens = ['This', 'is', 'an', 'example']
filtered_tokens = tp.remove_stopwords(tokens)
print(filtered_tokens)
# Output: ['example']

Lemmatization

Reduce words to their base form.

text = "running faster"
lemmatized_text = tp.lemmatize_text(text)
print(lemmatized_text)
# Output: "run fast"

Spell Correction

Correct misspelled words.

text = "Ths is an exampel."
corrected_text = tp.correct_spellings(text)
print(corrected_text)
# Output: "This is an example."

Lowercase Conversion

Standardize text to lowercase.

text = "THIS IS A TEST."
lowercase_text = tp.lowercase(text)
print(lowercase_text)
# Output: "this is a test."

Remove URLs

Eliminate URLs from the text.

text = "Check this link: https://example.com"
url_removed = tp.remove_url(text)
print(url_removed)
# Output: "Check this link"

Remove HTML Tags

Clean out HTML tags.

text = "<div>Hello World!</div>"
cleaned_text = tp.remove_html_tags(text)
print(cleaned_text)
# Output: "Hello World!"

3. Using the Preprocessing Pipeline

Apply multiple preprocessing steps sequentially.

text = "Ths is an <b>example</b> of text preprocessing! Visit https://example.com"

# Apply a preprocessing pipeline
processed_text = tp.preprocess(
    text, steps=[
        "lowercase",
        "remove_url",
        "remove_html_tags",
        "remove_punctuation",
        "correct_spellings",
        "lemmatize_text"
    ]
)
print(processed_text)
# Output: "this example text preprocess"

By default, the pipeline includes:

  • Lowercase conversion
  • URL removal
  • HTML tag removal
  • Punctuation removal
  • Special character removal
  • Spell correction
  • Lemmatization
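As an illustration of how such a default chain can behave, here is a self-contained sketch using only the standard library. The regexes are stand-ins rather than TPTK's code, and spell correction and lemmatization are omitted because they require external models:

```python
import re
import string

def default_chain(text):
    text = text.lower()                                   # lowercase conversion
    text = re.sub(r"https?://\S+", "", text)              # URL removal
    text = re.sub(r"<[^>]+>", "", text)                   # HTML tag removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    text = re.sub(r"\s+", " ", text).strip()              # collapse leftover whitespace
    return text

print(default_chain("Check <b>THIS</b> out: https://example.com!"))
# Output: "check this out"
```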

4. Analyze Text Using head

Summarize multiple text entries with head. It displays the original text, preprocessed text, word count, and character count.

texts = [
    "Ths is the frst example.",
    "Preprocessing is <b>important</b>!",
    "Visit https://example.com for details."
]

tp.head(texts, n=3)

Output Table (Rendered in Jupyter Notebook or IPython):

| Original Text | Processed Text | Word Count | Character Count |
| --- | --- | --- | --- |
| Ths is the frst example. | this first example | 3 | 19 |
| Preprocessing is important! | preprocess important | 2 | 19 |
| Visit https://example.com for details. | visit details | 2 | 13 |

Custom Stopwords

Add specific stopwords to the default list:

tp = TextPreprocessor(custom_stopwords=["specific", "stopwords"])

Why Use TPTK?

  • Modular Design: Use only the components you need.
  • Customizable Pipelines: Tailor preprocessing steps to your project's needs.
  • Scalable: Suitable for everything from small prototypes to production systems.
  • Easy Integration: Compatible with common Python-based NLP workflows.

Author and Credits

This package was developed by Gaurav Jaiswal with a focus on user-friendly text preprocessing solutions for NLP tasks. Contributions, feedback, and suggestions are welcome.


License

This project is licensed under the MIT License.



Download files

Download the file for your platform.

Source Distribution

tptk-1.0.0.tar.gz (6.1 kB)

Uploaded Source

Built Distribution


TPTK-1.0.0-py3-none-any.whl (5.4 kB)

Uploaded Python 3

File details

Details for the file tptk-1.0.0.tar.gz.

File metadata

  • Download URL: tptk-1.0.0.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for tptk-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 994a424c8a993b4810dde6c746ee254c57ac3f14f9d961a412e8bf342c257ede |
| MD5 | 5818ad1f810e8c33f2ab0da74204b383 |
| BLAKE2b-256 | f8dc781d42127a6f6d3a760b5542fb8df03f04c5534297dbd1f84f91f014f108 |


File details

Details for the file TPTK-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: TPTK-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 5.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for TPTK-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | ba4c81a46098737d4222e4708d1f3ad87e20a181001922695742a15dd72fda3e |
| MD5 | dd9850b79416e72fc9f65ce8eaa2c05a |
| BLAKE2b-256 | c682772ecf9d7e4eb565b7553a858899878858703bb91ba6a72e9c779ef9a04c |

