A Python package for automating text preprocessing tasks.
Project description
TPTK: Text Preprocessing Toolkit
TPTK (Text Preprocessing Toolkit) is a Python library designed for text preprocessing in Natural Language Processing (NLP). It offers a comprehensive set of tools to clean, tokenize, lemmatize, and otherwise preprocess text efficiently. Users can apply individual preprocessing steps or run a configurable pipeline for end-to-end text preprocessing.
Features
- Text Cleaning: Remove punctuation, special characters, URLs, and HTML tags.
- Tokenization: Convert text into individual tokens.
- Lemmatization: Reduce words to their base forms using WordNet.
- Spell Correction: Detect and correct misspelled words.
- Stopword Removal: Filter out common stopwords.
- Customizable Pipelines: Define the sequence of preprocessing steps.
- Text Statistics: Summarize text with the `head` function.
- Modular and user-friendly design.
Installation
Install the package and its dependencies:
pip install tptk
Getting Started
Importing the Library
from TextPreprocessingToolkit.tptk import TextPreprocessor
Usage Guide
1. Initialize the Preprocessor
# Initialize the TextPreprocessor
tp = TextPreprocessor(custom_stopwords=["custom", "words"])
You can provide additional stopwords using the custom_stopwords parameter.
2. Core Functions
Each function targets a specific aspect of preprocessing:
Tokenization
Split text into individual tokens (words and punctuation marks).
text = "This is an example sentence."
tokens = tp.tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'sentence', '.']
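Note that punctuation comes out as separate tokens. TPTK's internals aren't shown here, but a minimal regex-based sketch with the same behavior (the `simple_tokenize` helper below is illustrative, not part of TPTK) could look like:

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # Match runs of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This is an example sentence."))
# ['This', 'is', 'an', 'example', 'sentence', '.']
```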
Remove Punctuation
Strip punctuation marks.
text = "Hello, world! How's it going?"
cleaned_text = tp.remove_punctuation(text)
print(cleaned_text)
# Output: "Hello world Hows it going"
Stopword Removal
Remove common stopwords from tokenized text.
tokens = ['This', 'is', 'an', 'example']
filtered_tokens = tp.remove_stopwords(tokens)
print(filtered_tokens)
# Output: ['example']
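Under the hood, stopword filtering amounts to membership tests against a stopword set. A minimal sketch (the tiny `STOPWORDS` set here is illustrative; TPTK presumably ships a full list):

```python
# Illustrative subset of stopwords, not TPTK's actual list.
STOPWORDS = {"this", "is", "an", "the", "a"}

def drop_stopwords(tokens: list[str]) -> list[str]:
    # Keep only tokens whose lowercase form is not a stopword.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(drop_stopwords(["This", "is", "an", "example"]))
# ['example']
```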
Lemmatization
Reduce words to their base form.
text = "running faster"
lemmatized_text = tp.lemmatize_text(text)
print(lemmatized_text)
# Output: "run fast"
Spell Correction
Correct misspelled words.
text = "Ths is an exampel."
corrected_text = tp.correct_spellings(text)
print(corrected_text)
# Output: "This is an example."
Lowercase Conversion
Standardize text to lowercase.
text = "THIS IS A TEST."
lowercase_text = tp.lowercase(text)
print(lowercase_text)
# Output: "this is a test."
Remove URLs
Eliminate URLs from the text.
text = "Check this link: https://example.com"
url_removed = tp.remove_url(text)
print(url_removed)
# Output: "Check this link"
Remove HTML Tags
Clean out HTML tags.
text = "<div>Hello World!</div>"
cleaned_text = tp.remove_html_tags(text)
print(cleaned_text)
# Output: "Hello World!"
3. Using the Preprocessing Pipeline
Apply multiple preprocessing steps sequentially.
text = "Ths is an <b>example</b> of text preprocessing! Visit https://example.com"
# Apply a preprocessing pipeline
processed_text = tp.preprocess(
    text,
    steps=[
        "lowercase",
        "remove_url",
        "remove_html_tags",
        "remove_punctuation",
        "correct_spellings",
        "lemmatize_text",
    ],
)
print(processed_text)
# Output: "this example text preprocess"
By default, the pipeline includes:
- Lowercase conversion
- URL removal
- HTML tag removal
- Punctuation removal
- Special character removal
- Spell correction
- Lemmatization
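Conceptually, such a pipeline is just left-to-right function composition: each step's output feeds the next step. A TPTK-independent sketch of the pattern (the step functions here are illustrative stand-ins):

```python
def run_pipeline(text, steps):
    # Apply each preprocessing function in order, feeding each output forward.
    for step in steps:
        text = step(text)
    return text

# Illustrative steps standing in for TPTK's named steps.
lowercase = str.lower
remove_exclaims = lambda s: s.replace("!", "")

print(run_pipeline("HELLO World!", [lowercase, remove_exclaims]))
# hello world
```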
4. Analyze Text Using head
Summarize multiple text entries with head. It displays the original text, preprocessed text, word count, and character count.
texts = [
    "Ths is the frst example.",
    "Preprocessing is <b>important</b>!",
    "Visit https://example.com for details.",
]
tp.head(texts, n=3)
Output Table (Rendered in Jupyter Notebook or IPython):
| Original Text | Processed Text | Word Count | Character Count |
|---|---|---|---|
| Ths is the frst example. | this first example | 3 | 19 |
| Preprocessing is important! | preprocess important | 2 | 19 |
| Visit https://example.com for details. | visit details | 2 | 13 |
Custom Stopwords
Add specific stopwords to the default list:
tp = TextPreprocessor(custom_stopwords=["specific", "stopwords"])
Why Use TPTK?
- Modular Design: Use only the components you need.
- Customizable Pipelines: Tailor preprocessing steps to your project's needs.
- Scalable: Suitable for everything from small prototypes to production-grade systems.
- Easy Integration: Compatible with common Python-based NLP workflows.
Author and Credits
This package was developed by Gaurav Jaiswal with a focus on user-friendly text preprocessing solutions for NLP tasks. Contributions, feedback, and suggestions are welcome.
License
This project is licensed under the MIT License.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file tptk-0.0.9.tar.gz.
File metadata
- Download URL: tptk-0.0.9.tar.gz
- Upload date:
- Size: 6.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `44ce57b41cfcbc8455e5b29a1ccb409d60c58de05879f823a1fd3a3f5e7e6fdb` |
| MD5 | `ec236e5f275516d8c64724057cde83d3` |
| BLAKE2b-256 | `a53b37f0a3c674d899211d38ada8e1d8c5834c64cf34d3db7be06adce636192e` |
File details
Details for the file TPTK-0.0.9-py3-none-any.whl.
File metadata
- Download URL: TPTK-0.0.9-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `834d178460f615b038b6e72c219bee940ebffa7abebe0e5cf0d25ffeacef3282` |
| MD5 | `fd4b02c775ae5e65744fed3b142de962` |
| BLAKE2b-256 | `30c84a22bc371fb9789821155c743491012de1fb496666d9d30b447cdadd0576` |