A Python package for automating text preprocessing tasks.

Text Preprocessing Toolkit (TPTK)

TPTK is a Python package designed to automate data preprocessing tasks for machine learning and data analysis. It supports text cleaning, numerical data handling (imputation, outlier removal, scaling), and categorical encoding (label or one-hot). The package provides both a programmatic API and a command-line interface (CLI). It processes large datasets in chunks to limit memory use and generates reports on the preprocessing steps it applied.

Features

Text Preprocessing: Clean, tokenize, remove stopwords, lemmatize, and spell-check text data.
Numerical Preprocessing: Impute missing values (mean/median), remove outliers (IQR/Z-score), and scale features (standard/min-max).
Categorical Preprocessing: Label encoding or one-hot encoding with support for saving/loading encoders.
Pipeline: Configurable preprocessing pipeline using YAML/JSON files for batch processing CSV files.
Chunked Processing: Handles large datasets by processing in chunks.
Reporting: Generates JSON reports summarizing preprocessing actions.
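
The idea behind chunked processing can be sketched with the standard library alone. This is an illustration of the general technique, not TPTK's internal code: the helper name iter_chunks and the sample data are made up for the example. With pandas, the same effect is usually obtained by passing chunksize to pandas.read_csv, which then yields one DataFrame per chunk.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large CSV file (hypothetical data).
sample = io.StringIO("review,age\ngood,35\nbad,\nfine,42\nok,29\nmeh,51\n")
reader = csv.DictReader(sample)

chunk_sizes = []
for chunk in iter_chunks(reader, chunk_size=2):
    # Only one chunk is held in memory at a time; a real pipeline
    # would append the processed rows to the output file here.
    processed = [{**row, "review": row["review"].upper()} for row in chunk]
    chunk_sizes.append(len(processed))

print(chunk_sizes)  # [2, 2, 1]
```

Because each chunk is handled independently, statistics that depend on the whole column (a global median, for example) need a separate pass or an approximation.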

Installation

From PyPI

Install the package using pip:

pip install TPTK

From Source

Clone the repository and install:

git clone https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
cd Text-Preprocessing-Toolkit
pip install .

During installation, NLTK resources (e.g., stopwords, wordnet) are automatically downloaded.

Dependencies

nltk >= 3.6.0
pyspellchecker >= 0.7.1
pandas >= 1.2.0
scikit-learn (for encoding and scaling)
joblib (for saving encoders)
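
To see which of these dependencies are already installed, and at which version, a short standard-library script is enough. This is a sketch and not part of TPTK; the helper name installed_versions is made up for illustration, and the distribution names are the ones listed above.

```python
from importlib import metadata

# Distribution names as listed in the Dependencies section.
REQUIRED = ["nltk", "pyspellchecker", "pandas", "scikit-learn", "joblib"]


def installed_versions(names):
    """Return {name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```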

Quick Start

Step 1: Prepare Your Data

Assume you have a CSV file input.csv with columns like review (text), age (numerical), rating (numerical), gender (categorical).

Example input.csv:

review,age,rating,gender
"This is a great product!",35,4.5,Male
"Bad experience, won't buy again.",,3.0,Female
"Excellent quality.",42,,Male

Step 2: Programmatic Usage

For more control, use the API in your Python scripts.

Example: Text Preprocessing Only

from tptk.text_preprocessor import TextPreprocessor
import pandas as pd

# Download the IMDB reviews dataset and keep a 1,000-row sample
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
df = pd.read_csv(url)
df = df.head(1000)
df.to_csv("imdb_raw.csv", index=False)

# Run the selected steps on the "review" column (spell correction disabled)
tp = TextPreprocessor(spell_correction=False)
tp.process_csv(
    input_path="imdb_raw.csv",
    text_column="review",
    output_path="imdb_clean.csv",
    steps=['clean', 'punctuation', 'lowercase', 'tokenize', 'stopwords', 'lemmatize']
)
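
To give a feel for what steps such as 'clean', 'punctuation', 'lowercase', 'tokenize', and 'stopwords' typically do, here is a rough standard-library sketch. It is not TPTK's implementation: the function names, the tag-stripping rule, and the tiny stopword set are assumptions made for illustration. Lemmatization is left out because it needs WordNet data.

```python
import re
import string

# Tiny illustrative stopword set; a real pipeline would normally
# use a full list such as NLTK's English stopwords.
STOPWORDS = {"this", "is", "a", "the", "and", "to"}


def clean(text):
    """Replace HTML-like tags with spaces and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def preprocess(text):
    """Clean, strip punctuation, lowercase, tokenize, drop stopwords."""
    text = remove_punctuation(clean(text)).lower()
    tokens = text.split()  # naive whitespace tokenizer
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("This is a <b>great</b> product!"))  # ['great', 'product']
```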

Example: Numerical Preprocessing Only

import os

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from tptk.numerical_preprocessor import NumericalPreprocessor

# Replace these placeholders with your own directories
INPUT_DIR = "Input directory path"
OUTPUT_DIR = "Output directory path"

# Create the input and output directories if they do not exist yet
os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Download the California housing data and save a 1,000-row sample
data = fetch_california_housing(as_frame=True)
df = data.frame.sample(1000, random_state=42)
df.to_csv(f"{INPUT_DIR}/housing_raw.csv", index=False)

# Impute with the median, remove IQR outliers, then standardize
np_prep = NumericalPreprocessor()
df_clean = np_prep.fit_transform(
    df, columns=['MedInc', 'HouseAge', 'AveRooms', 'Population', 'AveOccup'],
    impute="median", scale="standard", remove_outliers="iqr"
)
df_clean.to_csv(f"{OUTPUT_DIR}/housing_clean.csv", index=False)

# Compare the MedInc distribution before and after preprocessing
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.boxplot(data=df[['MedInc']])
plt.title("Before")
plt.subplot(1, 2, 2)
sns.boxplot(data=df_clean[['MedInc']])
plt.title("After")
plt.savefig(f"{OUTPUT_DIR}/housing_plot.png")
plt.close()

print("Housing: Done")
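
The options impute="median", remove_outliers="iqr", and scale="standard" correspond to well-known techniques. The sketch below shows the usual definitions with the standard library only; it is not TPTK's code, and the function names, the 1.5 multiplier, and the sample values are assumptions. Quartiles here come from statistics.quantiles, whose default method can give slightly different cut points than pandas or NumPy.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standard_scale(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [35, None, 42, 29, 51, 38, 120]   # hypothetical column
imputed = impute_median(ages)             # None becomes 40.0
filtered = remove_outliers_iqr(imputed)   # 120 falls outside the fences
scaled = standard_scale(filtered)

print(imputed)
print(filtered)
```

Note that outlier removal drops rows, so the cleaned data can have fewer rows than the input.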

Example: Categorical Preprocessing Only

import os

import pandas as pd
from tptk.categorical_preprocessor import CategoricalPreprocessor

# Replace these placeholders with your own directories
INPUT_DIR = "Input directory path"
OUTPUT_DIR = "Output directory path"
os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Download the Titanic dataset and keep the first 500 complete rows
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df = df[['Pclass', 'Sex', 'Embarked', 'Survived']].dropna().head(500)
df.to_csv(f"{INPUT_DIR}/titanic_raw.csv", index=False)

columns = ['Pclass', 'Sex', 'Embarked']

# Label encoding: fit, then transform; encoders are saved under save_dir
label_enc = CategoricalPreprocessor("label", save_dir="../encoders")
label_enc.fit(df, columns)
df_label = label_enc.transform(df, columns)
df_label.to_csv(f"{OUTPUT_DIR}/titanic_label.csv", index=False)

# One-hot encoding of the same columns
ohe_enc = CategoricalPreprocessor("onehot", save_dir="../encoders")
ohe_enc.fit(df, columns)
df_ohe = ohe_enc.transform(df, columns)
df_ohe.to_csv(f"{OUTPUT_DIR}/titanic_ohe.csv", index=False)

print("Titanic: Label →", df_label['Sex'].iloc[0], "| OHE →", df_ohe.filter(like='Sex_').columns)
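
The difference between the two encodings is easy to see on a tiny column. The following is a plain-Python sketch of the general idea, not TPTK's implementation; the helper names and the column-naming scheme (prefix plus category) are assumptions. Sorting the categories mirrors how scikit-learn's LabelEncoder orders its classes.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: i for i, category in enumerate(sorted(set(values)))}


def one_hot(values, prefix):
    """Return one dict per value with a 0/1 indicator per category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


sex = ["male", "female", "female", "male"]  # hypothetical column

mapping = fit_label_encoder(sex)
labels = [mapping[v] for v in sex]
print(mapping)  # {'female': 0, 'male': 1}
print(labels)   # [1, 0, 0, 1]

print(one_hot(sex, "Sex")[0])  # {'Sex_female': 0, 'Sex_male': 1}
```

Label encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that at the cost of one column per category.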

Troubleshooting NLTK Data

TPTK bundles punkt_tab, stopwords, wordnet, and averaged_perceptron_tagger_eng. On first import, it:

Checks for the bundled data.
Falls back to nltk.download(..., quiet=True) if the bundled data is not found.

If you see a LookupError or the automatic download fails, download the resources manually:

import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
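
To download only what is actually missing, the resources can be probed first. This is a sketch under assumptions: the helper names are made up, and the resource paths follow NLTK's standard data layout rather than anything stated in this README. nltk.data.find raises LookupError when a resource cannot be located.

```python
# Resource name -> path inside the NLTK data directory (assumed layout).
RESOURCES = {
    "punkt_tab": "tokenizers/punkt_tab",
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng",
}


def missing_resources(find):
    """Return names whose path makes the finder raise LookupError."""
    missing = []
    for name, path in RESOURCES.items():
        try:
            find(path)
        except LookupError:
            missing.append(name)
    return missing


def ensure_nltk_data():
    """Download any missing resource; requires nltk to be installed."""
    import nltk

    for name in missing_resources(nltk.data.find):
        nltk.download(name, quiet=True)
```

Calling ensure_nltk_data() once before importing TPTK should leave all four resources in place, provided the machine has network access.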

Step 3: View Reports

After processing, check preprocessing_report.json for details like imputed values, outliers removed, etc.

Example Report:

{
  "steps": ["text", "numerical", "categorical"],
  "stats": {
    "numerical": {
      "age": {"imputed_with": 38.5, "outliers_removed": 0},
      "rating": {"imputed_with": 3.75, "outliers_removed": 1}
    }
  }
}
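
A report with this structure can be summarized in a few lines. The sketch below parses the example above from a string so that it runs on its own; in practice the same dictionary would come from json.load on preprocessing_report.json. Only the keys shown in the example are assumed to exist.

```python
import json

# The example report from above, embedded so the sketch is self-contained.
REPORT = """
{
  "steps": ["text", "numerical", "categorical"],
  "stats": {
    "numerical": {
      "age": {"imputed_with": 38.5, "outliers_removed": 0},
      "rating": {"imputed_with": 3.75, "outliers_removed": 1}
    }
  }
}
"""

report = json.loads(REPORT)

print("Steps run:", ", ".join(report["steps"]))
for column, stats in report["stats"]["numerical"].items():
    print(f"{column}: imputed with {stats['imputed_with']}, "
          f"{stats['outliers_removed']} outlier(s) removed")

total_outliers = sum(
    stats["outliers_removed"]
    for stats in report["stats"]["numerical"].values()
)
print("Total outliers removed:", total_outliers)
```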

Development and Testing

Setup: Run ./init_setup.sh to create a virtual environment and install dev dependencies.

Linting and Testing: Use tox or manually:

flake8 src/
mypy src/
pytest -v tests/unit
pytest -v tests/integration

Build Package: python setup.py sdist bdist_wheel

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions, contact Gaurav Jaiswal.

Release history

1.0.4 (this version): Nov 13, 2025
1.0.3: Nov 9, 2025
1.0.1: Dec 8, 2024
1.0.0: Dec 8, 2024
0.0.9: Dec 8, 2024
0.0.8: Dec 7, 2024
0.0.7: Dec 5, 2024
0.0.6: Dec 4, 2024
0.0.5: Dec 4, 2024
0.0.4: Dec 4, 2024

Download files

Source Distribution

tptk-1.0.4.tar.gz (11.8 kB)

Uploaded Nov 13, 2025 Source

Built Distribution

tptk-1.0.4-py3-none-any.whl (12.3 kB)

Uploaded Nov 13, 2025 Python 3

File details

Details for the file tptk-1.0.4.tar.gz.

File metadata

Download URL: tptk-1.0.4.tar.gz
Upload date: Nov 13, 2025
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for tptk-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`ea7c029f0dad344b9d668f5929204c5f2208b30d9b2d40ebe6e5c8ac58f79fbd`
MD5	`c2cdda1d41642b0e4adba5bd7a114873`
BLAKE2b-256	`8e517d3377a86c838adebb93c6a9c1239c60e300ab0655fe3c8c7f0e5df1947e`

File details

Details for the file tptk-1.0.4-py3-none-any.whl.

File metadata

Download URL: tptk-1.0.4-py3-none-any.whl
Upload date: Nov 13, 2025
Size: 12.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for tptk-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4377666fa8dbc1bd819ebc7e40c254658ec37e1bc70a53f94d7dea8ab810c673`
MD5	`dd6c796a0787d924b8c8b18dd876d008`
BLAKE2b-256	`4d9b9ba802fca19853f6d3347eee71e51a3d7a1d86d34d0d2f1a7a5797cdab4d`
