
Dapropy

A lightweight Python library for automated preprocessing of datasets containing mixed numeric, categorical, and text features. It cleans text, handles missing values, encodes categories, scales numerics, and persists transformers for consistent training and inference.


✨ Features

  • Mixed-type handling: numeric, categorical, and text columns in one pipeline
  • Missing values: simple fill strategies or mixed-type KNN imputation
  • Categorical encoding: Label encoding or One-Hot encoding
  • Scaling: standardization for numeric features
  • Text processing: clean HTML/URLs/emojis, lowercase, spell-correct, remove stopwords, stem, then Bag-of-Words
  • Optional data cleanup: basic inconsistency fixes, partial outlier capping, partial noise reduction
  • Persistence: saves encoders, vectorizer, scalers, imputer, and feature order for inference reproducibility
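
The text-processing feature above can be illustrated with a hand-rolled sketch. The regexes and the tiny stopword list here are illustrative stand-ins: Dapropy itself relies on NLTK, emoji, and textblob for cleaning, and this sketch also skips spell-correction and stemming.

```python
import re
from collections import Counter

# Illustrative stand-in for NLTK's stopword list.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "i"}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # drop emojis/punctuation/digits
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

def bag_of_words(docs):
    """Count token occurrences per document against a shared vocabulary."""
    cleaned = [clean_text(d).split() for d in docs]
    vocab = sorted({tok for doc in cleaned for tok in doc})
    return [[Counter(doc)[v] for v in vocab] for doc in cleaned], vocab

vectors, vocab = bag_of_words([
    "Loved it! <b>Great</b> service",
    "Terrible... won't go again http://example.com",
])
```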

📦 Installation

pip install git+https://github.com/BlackIIIWhite/Dapropy

Python 3.8+ is required.

Note: On first use, NLTK resources are downloaded automatically (e.g., punkt, stopwords).


🚀 Quickstart

import pandas as pd
from dapropy import Dapropy

# Example data
train_df = pd.DataFrame({
    "age": [25, 30, None, 40],
    "city": ["NY", "LA", "LA", None],
    "review": [
        "Loved it! 😊 <b>Great</b> service",
        "Okay visit, would return",
        None,
        "Terrible... won't go again http://example.com"
    ],
    "label": [1, 0, 1, 0]
})

# Fit-time processing (saves transformers into ./transformers by default)
p = Dapropy(
    target="label",
    strategyED="Label",           # or "One-Hot"
    imputer_strategy="KNN",       # or fill/remove strategies
    enable_text_processing=True,
    strategyNLP="bag_of_words",
    fix_datainconsistencies=False,
    partialnoisereduction=False,
    partialcap_outliersiqr=False,
    folder_name="transformers"
)
X_train = p.full_process(train_df)

# Inference-time processing (reuses saved transformers)
new_df = pd.DataFrame({
    "age": [28],
    "city": ["LA"],
    "review": ["Service was fine, nothing special"],
})
X_infer = p.pipeline(new_df)

⚙️ Configuration

  • target: name of the target column (kept unscaled in output)
  • strategyED: categorical encoding strategy
    • "Label" (default)
    • "One-Hot"
  • imputer_strategy: missing value handling
    • "KNN" (default) – mixed-type KNN with safe label encoding/decoding
    • "remove", "fillna_mean", "fillna_median", "fillna_mode"
  • cap_ratio: fraction of detected outliers to cap, between 0 and 1 (default 0.9)
  • smooth_ratio: fraction of detected noise to smooth, between 0 and 1 (default 0.9)
  • window_size: rolling window size for smoothing (default 3)
  • enable_text_processing: enable/disable text cleaning and vectorization
  • strategyNLP: text representation, currently "bag_of_words"
  • fix_datainconsistencies: normalize common string representations, convert dates, drop duplicates
  • partialnoisereduction: apply optional smoothing to noisy numeric series
  • partialcap_outliersiqr: optionally cap a portion of IQR outliers
  • folder_name: directory to store transformers and metadata
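
The Quickstart defaults can be swapped for any combination of the options above; for example, One-Hot encoding with median fill for purely tabular data. All keyword names come from the list above; the constructor call is left commented so the snippet stands alone.

```python
# An alternative configuration: One-Hot encoding, median imputation,
# text processing disabled, artifacts kept in a separate folder.
config = dict(
    target="label",
    strategyED="One-Hot",
    imputer_strategy="fillna_median",
    enable_text_processing=False,   # skip text cleaning for purely tabular data
    folder_name="transformers_v2",  # keep a second artifact set alongside the default
)
# from dapropy import Dapropy
# p = Dapropy(**config)
```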

🧠 What happens under the hood

Fit-time (full_process):

  1. Optional basic cleanup (fix_data_inconsistencies)
  2. Missing values (handle_missing_values, default KNN)
  3. Text preprocessing + Bag-of-Words (text_processing)
  4. Categorical encoding (encodingcategorical)
  5. Optional noise/outlier handling
  6. Scaling (scaling) for numerics
  7. Persist artifacts: encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

Inference-time (pipeline):

  1. Load persisted artifacts
  2. Same transformations as fit-time (without refitting)
  3. Align columns to saved feature_order (missing columns filled with 0)
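
Step 3 amounts to a pandas column reindex. A minimal sketch, where `feature_order` is a hand-written stand-in for the persisted list:

```python
import pandas as pd

# Align inference-time columns to the saved feature order; columns the
# new data lacks (e.g. one-hot columns for unseen categories) become 0.
feature_order = ["age", "city", "review_good", "review_bad"]

new_data = pd.DataFrame({"age": [28], "city": [1]})  # already encoded
aligned = new_data.reindex(columns=feature_order, fill_value=0)
```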

📚 API Reference

  • class Dapropy(target=None, strategyED='Label', imputer_strategy='KNN', cap_ratio=0.9, smooth_ratio=0.9, window_size=3, enable_text_processing=True, strategyNLP='bag_of_words', fix_datainconsistencies=False, partialnoisereduction=False, partialcap_outliersiqr=False, folder_name='transformers')

    • Creates a preprocessing pipeline instance.
  • full_process(data: pd.DataFrame) -> pd.DataFrame

    • Runs the full fit-time pipeline and saves transformers/metadata.
  • pipeline(data: Union[dict, pd.Series, pd.DataFrame]) -> pd.DataFrame

    • Transforms new data for inference using saved artifacts.
  • handle_missing_values(data, strategy=None, n_neighbors=5)

    • Strategies: KNN, remove, fillna_mean, fillna_median, fillna_mode.
  • encodingcategorical(data, strategyED=None, fit_mode=True)

    • Encodes object/category columns via Label or One-Hot.
  • scaling(data, target=None, fit_mode=True)

    • Standardizes numeric columns; preserves target column values.
  • text_processing(data, column, fit_mode=True)

    • Cleans and vectorizes one text column; called internally for all text columns when enabled.
  • partial_cap_outliers_iqr(data, cap_ratio=None, random_state=42)

    • Caps a portion of IQR-defined outliers.
  • partial_noise_reduction(data, target=None, smooth_ratio=None, window_size=None, random_state=0)

    • Smooths a portion of detected noisy points in numeric series.
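
To make the partial-capping idea concrete, here is an independent sketch of the IQR rule behind partial_cap_outliers_iqr: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are candidates, and only a cap_ratio fraction of them (sampled with a fixed seed) is clipped. The index-based quartile estimate is a simplification; the library's exact quartile method and sampling scheme may differ.

```python
import random

def partial_iqr_cap(values, cap_ratio=0.9, random_state=42):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]  # crude index-based quartiles
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    outliers = [i for i, v in enumerate(values) if v < lo or v > hi]
    rng = random.Random(random_state)
    capped = set(rng.sample(outliers, int(len(outliers) * cap_ratio)))
    # Clip only the sampled subset of outliers to the IQR fence.
    return [min(max(v, lo), hi) if i in capped else v
            for i, v in enumerate(values)]
```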

📁 Persistence details

Artifacts are saved to folder_name (default: transformers):

  • encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

Delete this folder to reset the pipeline, or point folder_name at a different directory to keep multiple transformer versions side by side.
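
For debugging, the artifacts can be inspected by hand. This sketch assumes the .pkl files are joblib dumps, which the dependency list suggests but the source does not state explicitly:

```python
from pathlib import Path
import joblib  # listed as a Dapropy dependency

def load_artifacts(folder: Path) -> dict:
    """Load whichever of the persisted artifacts exist in `folder`."""
    names = ["encoders", "vectorizer", "scalers", "imputer", "feature_order"]
    return {n: joblib.load(folder / f"{n}.pkl")
            for n in names if (folder / f"{n}.pkl").exists()}

# e.g. load_artifacts(Path("transformers")).get("feature_order")
```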


✅ Tips

  • Ensure every free-text column other than target has dtype object so it is cleaned and vectorized when enable_text_processing=True.
  • For stable inference, keep the same preprocessing configuration and folder_name between training and serving.
  • If you pass a dict or pd.Series to pipeline, it will be converted to a one-row DataFrame.
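
The dict/Series conversion mentioned in the last tip plausibly works like the standard pandas idioms below (a sketch of the behavior described, not Dapropy's actual code):

```python
import pandas as pd

record = {"age": 28, "city": "LA", "review": "Service was fine"}

as_frame = pd.DataFrame([record])           # dict -> one-row DataFrame
as_series = pd.Series(record).to_frame().T  # Series -> same one-row shape
```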

🔧 Development

  • Python: 3.8+
  • Key dependencies: pandas, numpy, scikit-learn, emoji, nltk, textblob, joblib
  • Install locally for development:
pip install -e .

📝 License

MIT License. See LICENSE.txt for details.
