A Python package for automating text preprocessing tasks.

Text Preprocessing Toolkit (TPTK)

TPTK is a Python package that automates data preprocessing for machine learning and data analysis. It supports text cleaning, numerical data handling (imputation, outlier removal, scaling), and categorical encoding (label or one-hot). The package provides both a programmatic API and a command-line interface (CLI), processes large datasets in chunks to keep memory usage manageable, and generates reports summarizing each preprocessing step.

Features

  • Text Preprocessing: Clean, tokenize, remove stopwords, lemmatize, and spell-check text data.
  • Numerical Preprocessing: Impute missing values (mean/median), remove outliers (IQR/Z-score), and scale features (standard/min-max).
  • Categorical Preprocessing: Label encoding or one-hot encoding with support for saving/loading encoders.
  • Pipeline: Configurable preprocessing pipeline using YAML/JSON files for batch processing CSV files.
  • Chunked Processing: Handles large datasets by processing in chunks.
  • Reporting: Generates JSON reports summarizing preprocessing actions.
  • CLI Support: Run preprocessing via command-line arguments.
  • Extensible: Modular classes for custom workflows.

Installation

From PyPI

Install the package using pip:

pip install TPTK
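
You can verify the installation with pip (the name is case-insensitive):

pip show tptk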

From Source

Clone the repository and install:

git clone https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
cd Text-Preprocessing-Toolkit
pip install .

During installation, NLTK resources (e.g., stopwords, wordnet) are automatically downloaded.
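
If the automatic download fails (for example, on a restricted network), the same resources can be fetched manually with NLTK's downloader; the exact resource list below is an assumption based on the features above:

import nltk

# Corpora assumed for stopword removal and lemmatization; 'punkt' for tokenization.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')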

Dependencies

  • nltk >= 3.6.0
  • pyspellchecker >= 0.7.1
  • pandas >= 1.2.0
  • scikit-learn (for encoding and scaling)
  • joblib (for saving encoders)

For development/testing:

  • pytest, flake8, mypy, etc. (install via pip install -r requirements_dev.txt)

Quick Start

Step 1: Prepare Your Data

Assume you have a CSV file input.csv with columns like review (text), age (numerical), rating (numerical), gender (categorical).

Example input.csv:

review,age,rating,gender
"This is a great product!",35,4.5,Male
"Bad experience, won't buy again.",,3.0,Female
"Excellent quality.",42,,Male

Step 2: Create a Configuration File (Optional but Recommended for Pipeline)

Create a YAML or JSON config file (e.g., pipeline_example.yaml):

text:
  column: review
  steps: [clean, tokenize, stopwords, lemmatize]
  spell: false
numerical:
  columns: [age, rating]
  impute: median
  scale: standard
  outliers: iqr
categorical:
  columns: [gender]
  type: onehot

  • Text Section: Specify the text column and steps (clean, tokenize, stopwords, lemmatize, spell).
  • Numerical Section: List columns, imputation strategy, scaling, and outlier removal.
  • Categorical Section: List columns and encoding type (label or onehot).
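
Because the pipeline accepts JSON as well as YAML, the same configuration can be expressed in JSON; the one-to-one field mapping below is an assumption, so verify it against your version of the package:

{
  "text": {"column": "review", "steps": ["clean", "tokenize", "stopwords", "lemmatize"], "spell": false},
  "numerical": {"columns": ["age", "rating"], "impute": "median", "scale": "standard", "outliers": "iqr"},
  "categorical": {"columns": ["gender"], "type": "onehot"}
}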

Step 3: Run Preprocessing via CLI

Use the CLI to preprocess your CSV file:

dataprepkit preprocess --input input.csv --output output.csv --config pipeline_example.yaml --chunksize 5000

  • --input: Path to input CSV.
  • --output: Path to output CSV.
  • --config: Path to YAML/JSON config (optional; if omitted, use --text for text-only processing; see the example after this list).
  • --chunksize: Process in chunks of this size (default: 10000).
  • --text: Text column name (for text-only mode).
  • --steps: Text processing steps (default: clean, tokenize, lemmatize).
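
For text-only mode without a config file, an invocation might look like the following; the exact value format for --steps is an assumption (it may be comma-separated instead):

dataprepkit preprocess --input input.csv --output output.csv --text review --steps clean tokenize lemmatize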

This will apply the pipeline, save the processed data to output.csv, and generate a preprocessing_report.json.

Example Output (output.csv after processing):

review,age,rating,gender_Female,gender_Male
great product,35.0,4.5,0.0,1.0
bad experience wont buy,0.0,3.0,1.0,0.0
excellent quality,42.0,0.0,0.0,1.0

(Note: text is cleaned, missing numerical values are imputed, and categoricals are one-hot encoded; the exact numerical values depend on your imputation and scaling settings.)

Step 4: Programmatic Usage

For more control, use the API in your Python scripts.

Example: Full Pipeline

from TPTK.pipeline import PreprocessingPipeline

# Initialize with config
pipeline = PreprocessingPipeline('pipeline_example.yaml')

# Fit and transform CSV
pipeline.fit_transform('input.csv', 'output.csv', chunksize=10000)

# Get report
report = pipeline.report
print(report)

Example: Text Preprocessing Only

from TPTK.text_preprocessor import TextPreprocessor
import pandas as pd

# Download a small sample of the IMDB dataset
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
df = pd.read_csv(url)
df = df.head(1000)  # small sample
df.to_csv("imdb_raw.csv", index=False)

# Clean the review column
tp = TextPreprocessor(spell_correction=False)
tp.process_csv(
    input_path="imdb_raw.csv",
    text_column="review",
    output_path="imdb_clean.csv",
    steps=['clean', 'punctuation', 'lowercase', 'tokenize', 'stopwords', 'lemmatize']
)

Example: Numerical Preprocessing Only

import pandas as pd
from TPTK.numerical_preprocessor import NumericalPreprocessor
import seaborn as sns
import matplotlib.pyplot as plt
import os

# If you are downloading the dataset, set your own paths here
INPUT_DIR = "Input directory path"
OUTPUT_DIR = "Output directory path"

# Create the input and output directories if they don't exist
os.makedirs(INPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Download
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)
df = data.frame.sample(1000, random_state=42)
df.to_csv(f"{INPUT_DIR}/housing_raw.csv", index=False)

# Process
np_prep = NumericalPreprocessor()
df_clean = np_prep.fit_transform(
    df, columns=['MedInc', 'HouseAge', 'AveRooms', 'Population', 'AveOccup'],
    impute="median", scale="standard", remove_outliers="iqr"
)
df_clean.to_csv(f"{OUTPUT_DIR}/housing_clean.csv", index=False)

# Plot
plt.figure(figsize=(10,4))
plt.subplot(1,2,1); sns.boxplot(data=df[['MedInc']]); plt.title("Before")
plt.subplot(1,2,2); sns.boxplot(data=df_clean[['MedInc']]); plt.title("After")
plt.savefig(f"{OUTPUT_DIR}/housing_plot.png")
plt.close()

print("Housing: Done")
print(np_prep.report)

Example: Categorical Preprocessing Only

from TPTK.categorical_preprocessor import CategoricalPreprocessor
import pandas as pd
import os

# If you are downloading the dataset, set your own paths here
INPUT_DIR = "Input directory path"
OUTPUT_DIR = "Output directory path"
os.makedirs(INPUT_DIR, exist_ok=True); os.makedirs(OUTPUT_DIR, exist_ok=True)

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df = df[['Pclass', 'Sex', 'Embarked', 'Survived']].dropna().head(500)
df.to_csv(f"{INPUT_DIR}/titanic_raw.csv", index=False)

# Label encoding
label_enc = CategoricalPreprocessor("label", save_dir="../encoders")
label_enc.fit(df, ['Pclass', 'Sex', 'Embarked'])
df_label = label_enc.transform(df, ['Pclass', 'Sex', 'Embarked'])
df_label.to_csv(f"{OUTPUT_DIR}/titanic_label.csv", index=False)

# One-hot encoding
ohe_enc = CategoricalPreprocessor("onehot", save_dir="../encoders")
ohe_enc.fit(df, ['Pclass', 'Sex', 'Embarked'])
df_ohe = ohe_enc.transform(df, ['Pclass', 'Sex', 'Embarked'])
df_ohe.to_csv(f"{OUTPUT_DIR}/titanic_ohe.csv", index=False)

print("Titanic: Label →", df_label['Sex'].iloc[0], "| OHE →", df_ohe.filter(like='Sex_').columns)

Step 5: View Reports

After processing, check preprocessing_report.json for details like imputed values, outliers removed, etc.

Example Report:

{
  "steps": ["text", "numerical", "categorical"],
  "stats": {
    "numerical": {
      "age": {"imputed_with": 38.5, "outliers_removed": 0},
      "rating": {"imputed_with": 3.75, "outliers_removed": 1}
    }
  }
}
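
To inspect the report programmatically, the standard-library json module is enough:

import json

# Load the report written alongside the output CSV and pretty-print it.
with open("preprocessing_report.json") as f:
    report = json.load(f)
print(json.dumps(report, indent=2))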

Development and Testing

  • Setup: Run ./init_setup.sh to create a virtual environment and install dev dependencies.
  • Linting and Testing: Use tox or run the tools manually:
    flake8 src/
    mypy src/
    pytest -v tests/unit
    pytest -v tests/integration
  • Build Package: python setup.py sdist bdist_wheel

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions, contact Gaurav Jaiswal.

Download files


Source Distribution

tptk-1.0.3.tar.gz (13.2 kB)


Built Distribution


tptk-1.0.3-py3-none-any.whl (12.8 kB)


File details

Details for the file tptk-1.0.3.tar.gz.

File metadata

  • Download URL: tptk-1.0.3.tar.gz
  • Upload date:
  • Size: 13.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for tptk-1.0.3.tar.gz:

  • SHA256: 192cbfca98530bb161699a4b24fec8bda86f1df138fad54e015133590c8345b6
  • MD5: 2a6dc5ff0ed347d6c9da7d2cc8d80d5f
  • BLAKE2b-256: a6c5f8e8f597f6acbf3447db6be7594215a6ad848de796bd93e155015151ddbe


File details

Details for the file tptk-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: tptk-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 12.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for tptk-1.0.3-py3-none-any.whl:

  • SHA256: 9e6222e2094173e5b1ffa6b8194df1255efe4f24bdbc870bbb88879aa8bf5227
  • MD5: bf9fbc983388d151dc6e245f20b83b2e
  • BLAKE2b-256: 0eb4a60a7bf1f4ab467275fcd0347e293d9bba3294ba0045cc9a33f3e5aa769d

