A Python package for automating text preprocessing tasks.

Project description

TPTK: Text Preprocessing Toolkit

TPTK (Text Preprocessing Toolkit) is a Python library designed for text preprocessing in Natural Language Processing (NLP). It offers a comprehensive set of tools to clean, tokenize, lemmatize, and preprocess text efficiently. The library lets users apply individual preprocessing steps or run a full pipeline for end-to-end text preprocessing.


Features

  • Text Cleaning: Remove punctuation, special characters, URLs, and HTML tags.
  • Tokenization: Convert text into individual tokens.
  • Lemmatization: Reduce words to their base forms using WordNet.
  • Spell Correction: Detect and correct misspelled words.
  • Stopword Removal: Filter out common stopwords.
  • Customizable Pipelines: Define the sequence of preprocessing steps.
  • Text Statistics: Summarize text with the head function.
  • Modular and user-friendly design.
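The "Customizable Pipelines" feature above can be sketched in plain Python. This is a minimal illustration of the name-driven dispatch pattern, not TPTK's actual internals; the helper functions and the STEPS registry here are hypothetical:

```python
import re

# Two illustrative steps (stand-ins, not TPTK's own implementations).
def lowercase(text):
    return text.lower()

def remove_url(text):
    # Strip http/https URLs and trim leftover whitespace.
    return re.sub(r"https?://\S+", "", text).strip()

# Registry mapping step names to functions, so a pipeline can be
# configured as a simple list of strings.
STEPS = {"lowercase": lowercase, "remove_url": remove_url}

def run_pipeline(text, steps):
    # Apply each named step in order.
    for name in steps:
        text = STEPS[name](text)
    return text

print(run_pipeline("Visit https://example.com", ["remove_url", "lowercase"]))
# Output: "visit"
```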

Installation

Install the package and its dependencies:

pip install tptk

Getting Started

Importing the Library

from TextPreprocessingToolkit.tptk import TextPreprocessor

Usage Guide

1. Initialize the Preprocessor

# Initialize the TextPreprocessor
tp = TextPreprocessor(custom_stopwords=["custom", "words"])

You can provide additional stopwords using the custom_stopwords parameter.


2. Core Functions

Each function targets a specific aspect of preprocessing:

Tokenization

Break text into individual tokens (words).

text = "This is an example sentence."
tokens = tp.tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'sentence', '.']

Remove Punctuation

Strip punctuation marks.

text = "Hello, world! How's it going?"
cleaned_text = tp.remove_punctuation(text)
print(cleaned_text)
# Output: "Hello world Hows it going"

Stopword Removal

Remove common stopwords from tokenized text.

tokens = ['This', 'is', 'an', 'example']
filtered_tokens = tp.remove_stopwords(tokens)
print(filtered_tokens)
# Output: ['example']

Lemmatization

Reduce words to their base form.

text = "running faster"
lemmatized_text = tp.lemmatize_text(text)
print(lemmatized_text)
# Output: "run fast"

Spell Correction

Correct misspelled words.

text = "Ths is an exampel."
corrected_text = tp.correct_spellings(text)
print(corrected_text)
# Output: "This is an example."

Lowercase Conversion

Standardize text to lowercase.

text = "THIS IS A TEST."
lowercase_text = tp.lowercase(text)
print(lowercase_text)
# Output: "this is a test."

Remove URLs

Eliminate URLs from the text.

text = "Check this link: https://example.com"
url_removed = tp.remove_url(text)
print(url_removed)
# Output: "Check this link"

Remove HTML Tags

Clean out HTML tags.

text = "<div>Hello World!</div>"
cleaned_text = tp.remove_html_tags(text)
print(cleaned_text)
# Output: "Hello World!"

3. Using the Preprocessing Pipeline

Apply multiple preprocessing steps sequentially.

text = "Ths is an <b>example</b> of text preprocessing! Visit https://example.com"

# Apply a preprocessing pipeline
processed_text = tp.preprocess(
    text, steps=[
        "lowercase",
        "remove_url",
        "remove_html_tags",
        "remove_punctuation",
        "correct_spellings",
        "lemmatize_text"
    ]
)
print(processed_text)
# Output: "this example text preprocess"

By default, the pipeline includes:

  • Lowercase conversion
  • URL removal
  • HTML tag removal
  • Punctuation removal
  • Special character removal
  • Spell correction
  • Lemmatization
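As an illustration of how such a default chain can behave, here is a self-contained sketch using only the standard library. The regexes are stand-ins rather than TPTK's code, and spell correction and lemmatization are omitted because they require external models:

```python
import re
import string

def default_chain(text):
    text = text.lower()                                   # lowercase conversion
    text = re.sub(r"https?://\S+", "", text)              # URL removal
    text = re.sub(r"<[^>]+>", "", text)                   # HTML tag removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    text = re.sub(r"\s+", " ", text).strip()              # collapse leftover whitespace
    return text

print(default_chain("Check <b>THIS</b> out: https://example.com!"))
# Output: "check this out"
```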

4. Analyze Text Using head

Summarize multiple text entries with head. It displays the original text, preprocessed text, word count, and character count.

texts = [
    "Ths is the frst example.",
    "Preprocessing is <b>important</b>!",
    "Visit https://example.com for details."
]

tp.head(texts, n=3)

Output Table (Rendered in Jupyter Notebook or IPython):

| Original Text | Processed Text | Word Count | Character Count |
| --- | --- | --- | --- |
| Ths is the frst example. | this first example | 3 | 19 |
| Preprocessing is important! | preprocess important | 2 | 19 |
| Visit https://example.com for details. | visit details | 2 | 13 |

Custom Stopwords

Add specific stopwords to the default list:

tp = TextPreprocessor(custom_stopwords=["specific", "stopwords"])

Why Use TPTK?

  • Modular Design: Use only the components you need.
  • Customizable Pipelines: Tailor preprocessing steps to your project's needs.
  • Scalable: Suitable for everything from small prototypes to production systems.
  • Easy Integration: Compatible with common Python-based NLP workflows.

Author and Credits

This package was developed by Gaurav Jaiswal with a focus on user-friendly text preprocessing solutions for NLP tasks. Contributions, feedback, and suggestions are welcome.


License

This project is licensed under the MIT License.



Download files

Download the file for your platform.

Source Distribution

tptk-1.0.0.tar.gz (6.1 kB)

Uploaded Source

Built Distribution


TPTK-1.0.0-py3-none-any.whl (5.4 kB)

Uploaded Python 3

File details

Details for the file tptk-1.0.0.tar.gz.

File metadata

  • Download URL: tptk-1.0.0.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for tptk-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 994a424c8a993b4810dde6c746ee254c57ac3f14f9d961a412e8bf342c257ede |
| MD5 | 5818ad1f810e8c33f2ab0da74204b383 |
| BLAKE2b-256 | f8dc781d42127a6f6d3a760b5542fb8df03f04c5534297dbd1f84f91f014f108 |


File details

Details for the file TPTK-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: TPTK-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 5.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for TPTK-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | ba4c81a46098737d4222e4708d1f3ad87e20a181001922695742a15dd72fda3e |
| MD5 | dd9850b79416e72fc9f65ce8eaa2c05a |
| BLAKE2b-256 | c682772ecf9d7e4eb565b7553a858899878858703bb91ba6a72e9c779ef9a04c |

