
Dapropy

A lightweight Python library for automated preprocessing of datasets containing mixed numeric, categorical, and text features. It cleans text, handles missing values, encodes categories, scales numerics, and persists transformers for consistent training and inference.


✨ Features

  • Mixed-type handling: numeric, categorical, and text columns in one pipeline
  • Missing values: simple fill strategies or mixed-type KNN imputation
  • Categorical encoding: Label encoding or One-Hot encoding
  • Scaling: standardization for numeric features
  • Text processing: clean HTML/URLs/emojis, lowercase, spell-correct, remove stopwords, stem, then Bag-of-Words
  • Optional data cleanup: basic inconsistency fixes, partial outlier capping, partial noise reduction
  • Persistence: saves encoders, vectorizer, scalers, imputer, and feature order for inference reproducibility
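The text-cleaning steps listed above (strip HTML, URLs, and emojis, then lowercase) can be sketched with plain `re` and string operations. This is a simplified illustration of the idea, not Dapropy's exact implementation, which also spell-corrects, removes stopwords, and stems:

```python
import re

def clean_text(text: str) -> str:
    """Simplified text cleanup: drop HTML tags, URLs, and non-ASCII
    symbols such as emojis, then lowercase and keep letters only."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)       # drop URLs
    text = text.encode("ascii", "ignore").decode()  # drop emojis/non-ASCII
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep lowercase letters
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(clean_text("Loved it! 😊 <b>Great</b> service http://example.com"))
# → "loved it great service"
```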

📦 Installation

pip install git+https://github.com/BlackIIIWhite/Dapropy

Python 3.8+ is required.

Note: On first use, NLTK resources are downloaded automatically (e.g., punkt, stopwords).


🚀 Quickstart

import pandas as pd
from Dapropy.Dapropy import Dapropy

# Example data
train_df = pd.DataFrame({
    "age": [25, 30, None, 40],
    "city": ["NY", "LA", "LA", None],
    "review": [
        "Loved it! 😊 <b>Great</b> service",
        "Okay visit, would return",
        None,
        "Terrible... won't go again http://example.com"
    ],
    "label": [1, 0, 1, 0]
})

# Fit-time processing (saves transformers into ./transformers by default)
p = Dapropy(
    target="label",
    strategyED="Label",           # or "One-Hot"
    imputer_strategy="KNN",       # or fill/remove strategies
    enable_text_processing=True,
    strategyNLP="bag_of_words",
    fix_datainconsistencies=False,
    partialnoisereduction=False,
    partialcap_outliersiqr=False,
    folder_name="transformers"
)
X_train = p.full_process(train_df)

# Inference-time processing (reuses saved transformers)
new_df = pd.DataFrame({
    "age": [28],
    "city": ["LA"],
    "review": ["Service was fine, nothing special"],
})
X_infer = p.pipeline(new_df)

⚙️ Configuration

  • target: name of the target column (kept unscaled in output)
  • strategyED: categorical encoding strategy
    • "Label" (default)
    • "One-Hot"
  • imputer_strategy: missing value handling
    • "KNN" (default) – mixed-type KNN with safe label encoding/decoding
    • "remove", "fillna_mean", "fillna_median", "fillna_mode"
  • cap_ratio: fraction of detected outliers to cap (0–1)
  • smooth_ratio: fraction of detected noise to smooth (0–1)
  • window_size: rolling window for smoothing
  • enable_text_processing: enable/disable text cleaning and vectorization
  • strategyNLP: text representation, currently "bag_of_words"
  • fix_datainconsistencies: normalize common string representations, convert dates, drop duplicates
  • partialnoisereduction: apply optional smoothing to noisy numeric series
  • partialcap_outliersiqr: optionally cap a portion of IQR outliers
  • folder_name: directory to store transformers and metadata
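To make `cap_ratio` concrete, here is a minimal sketch of capping only a fraction of IQR outliers, written with plain pandas/numpy. The function name and details are illustrative assumptions, not Dapropy's internals:

```python
import numpy as np
import pandas as pd

def partial_cap_iqr(s: pd.Series, cap_ratio: float = 0.9, seed: int = 42) -> pd.Series:
    """Cap a random fraction (cap_ratio) of IQR outliers at the fences.
    Illustrative sketch only; Dapropy's implementation may differ."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = s.index[(s < lo) | (s > hi)]
    rng = np.random.default_rng(seed)
    n_cap = int(round(len(outliers) * cap_ratio))
    to_cap = rng.choice(outliers, size=n_cap, replace=False)
    # Keep original values except at the sampled outlier positions
    return s.where(~s.index.isin(to_cap), s.clip(lo, hi))

s = pd.Series([1, 2, 3, 2, 100, -50, 2, 3])
print(partial_cap_iqr(s, cap_ratio=1.0).tolist())
```

With `cap_ratio=1.0` every detected outlier is capped; lower values leave a random portion of outliers untouched, which can preserve some natural variance.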

🧠 What happens under the hood

Fit-time (full_process):

  1. Optional basic cleanup (fix_data_inconsistencies)
  2. Missing values (handle_missing_values, default KNN)
  3. Text preprocessing + Bag-of-Words (text_processing)
  4. Categorical encoding (encodingcategorical)
  5. Optional noise/outlier handling
  6. Scaling (scaling) for numerics
  7. Persist artifacts: encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

Inference-time (pipeline):

  1. Load persisted artifacts
  2. Same transformations as fit-time (without refitting)
  3. Align columns to saved feature_order (missing columns filled with 0)
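Step 3 above can be sketched with `DataFrame.reindex`. The feature names below are hypothetical examples of what encoded/vectorized columns might look like; the point is the alignment mechanics:

```python
import pandas as pd

# Hypothetical training-time column order, as persisted in feature_order.pkl
feature_order = ["age", "city_LA", "city_NY", "bow_great", "bow_service"]

# Columns produced for a new record at inference time
X_new = pd.DataFrame({"age": [28.0], "city_LA": [1], "bow_service": [1]})

# Align to the training layout: missing columns are filled with 0,
# extra columns are dropped, and the order matches training exactly
X_aligned = X_new.reindex(columns=feature_order, fill_value=0)
print(X_aligned.columns.tolist())
# → ['age', 'city_LA', 'city_NY', 'bow_great', 'bow_service']
```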

📚 API Reference

  • class Dapropy(target=None, strategyED='Label', imputer_strategy='KNN', cap_ratio=0.9, smooth_ratio=0.9, window_size=3, enable_text_processing=True, strategyNLP='bag_of_words', fix_datainconsistencies=False, partialnoisereduction=False, partialcap_outliersiqr=False, folder_name='transformers')

    • Creates a preprocessing pipeline instance.
  • full_process(data: pd.DataFrame) -> pd.DataFrame

    • Runs the full fit-time pipeline and saves transformers/metadata.
  • pipeline(data: Union[dict, pd.Series, pd.DataFrame]) -> pd.DataFrame

    • Transforms new data for inference using saved artifacts.
  • handle_missing_values(data, strategy=None, n_neighbors=5)

    • Strategies: KNN, remove, fillna_mean, fillna_median, fillna_mode.
  • encodingcategorical(data, strategyED=None, fit_mode=True)

    • Encodes object/category columns via Label or One-Hot.
  • scaling(data, target=None, fit_mode=True)

    • Standardizes numeric columns; preserves target column values.
  • text_processing(data, column, fit_mode=True)

    • Cleans and vectorizes one text column; called internally for all text columns when enabled.
  • partial_cap_outliers_iqr(data, cap_ratio=None, random_state=42)

    • Caps a portion of IQR-defined outliers.
  • partial_noise_reduction(data, target=None, smooth_ratio=None, window_size=None, random_state=0)

    • Smooths a portion of detected noisy points in numeric series.
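The two encoding strategies accepted by `strategyED` can be illustrated with plain pandas (a conceptual sketch, not Dapropy's `encodingcategorical` itself):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "LA", "SF"]})

# Label encoding: each category mapped to an integer code
label_encoded = df["city"].astype("category").cat.codes

# One-Hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

print(label_encoded.tolist())    # → [1, 0, 0, 2]
print(one_hot.columns.tolist())  # → ['city_LA', 'city_NY', 'city_SF']
```

Label encoding keeps a single column but imposes an arbitrary ordering; One-Hot avoids that at the cost of one column per category.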

📁 Persistence details

Artifacts are saved to folder_name (default: transformers):

  • encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

You may delete this folder to reset the pipeline or change folder_name to maintain multiple versions.
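The round-trip idea behind these artifacts can be sketched with the standard library's `pickle` (Dapropy's actual serialization format and object types may differ; the artifact contents here are made up):

```python
import os
import pickle
import tempfile

# Illustrative stand-ins for fitted transformers and metadata
scalers = {"age": {"mean": 31.7, "std": 6.2}}
feature_order = ["age", "city", "review"]

folder = tempfile.mkdtemp()  # stand-in for folder_name
for name, obj in [("scalers.pkl", scalers), ("feature_order.pkl", feature_order)]:
    with open(os.path.join(folder, name), "wb") as f:
        pickle.dump(obj, f)

# At inference time, load the artifacts back instead of refitting
with open(os.path.join(folder, "feature_order.pkl"), "rb") as f:
    loaded_order = pickle.load(f)
print(loaded_order)  # → ['age', 'city', 'review']
```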


✅ Tips

  • Ensure that free-text columns (other than target) have dtype object so they are cleaned and vectorized when enable_text_processing=True.
  • For stable inference, keep the same preprocessing configuration and folder_name between training and serving.
  • If you pass a dict or pd.Series to pipeline, it will be converted to a one-row DataFrame.
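The dict-to-DataFrame conversion mentioned in the last tip is equivalent to wrapping the record in a list (shown here with plain pandas as an illustration):

```python
import pandas as pd

record = {"age": 28, "city": "LA", "review": "Service was fine"}

# A dict becomes a single-row frame; keys become column names
row_df = pd.DataFrame([record])
print(row_df.shape)  # → (1, 3)
```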

🔧 Development

  • Python: 3.8+
  • Key dependencies: pandas, numpy, scikit-learn, emoji, nltk, textblob, joblib
  • Install locally for development:
pip install -e .

📝 License

MIT License. See LICENSE.txt for details.
