
Dapropy

A lightweight Python library for automated preprocessing of datasets containing mixed numeric, categorical, and text features. It cleans text, handles missing values, encodes categories, scales numerics, and persists transformers for consistent training and inference.


✨ Features

  • Mixed-type handling: numeric, categorical, and text columns in one pipeline
  • Missing values: simple fill strategies or mixed-type KNN imputation
  • Categorical encoding: Label encoding or One-Hot encoding
  • Scaling: standardization for numeric features
  • Text processing: clean HTML/URLs/emojis, lowercase, spell-correct, remove stopwords, stem, then Bag-of-Words
  • Optional data cleanup: basic inconsistency fixes, partial outlier capping, partial noise reduction
  • Persistence: saves encoders, vectorizer, scalers, imputer, and feature order for inference reproducibility

📦 Installation

pip install git+https://github.com/BlackIIIWhite/Dapropy

Python 3.8+ is required.

Note: On first use, NLTK resources are downloaded automatically (e.g., punkt, stopwords).


🚀 Quickstart

import pandas as pd
from dapropy import Dapropy

# Example data
train_df = pd.DataFrame({
    "age": [25, 30, None, 40],
    "city": ["NY", "LA", "LA", None],
    "review": [
        "Loved it! 😊 <b>Great</b> service",
        "Okay visit, would return",
        None,
        "Terrible... won't go again http://example.com"
    ],
    "label": [1, 0, 1, 0]
})

# Fit-time processing (saves transformers into ./transformers by default)
p = Dapropy(
    target="label",
    strategyED="Label",           # or "One-Hot"
    imputer_strategy="KNN",       # or fill/remove strategies
    enable_text_processing=True,
    strategyNLP="bag_of_words",
    fix_datainconsistencies=False,
    partialnoisereduction=False,
    partialcap_outliersiqr=False,
    folder_name="transformers"
)
X_train = p.full_process(train_df)

# Inference-time processing (reuses saved transformers)
new_df = pd.DataFrame({
    "age": [28],
    "city": ["LA"],
    "review": ["Service was fine, nothing special"],
})
X_infer = p.pipeline(new_df)

⚙️ Configuration

  • target: name of the target column (kept unscaled in output)
  • strategyED: categorical encoding strategy
    • "Label" (default)
    • "One-Hot"
  • imputer_strategy: missing value handling
    • "KNN" (default) – mixed-type KNN with safe label encoding/decoding
    • "remove", "fillna_mean", "fillna_median", "fillna_mode"
  • cap_ratio: fraction of detected outliers to cap (0–1)
  • smooth_ratio: fraction of detected noise to smooth (0–1)
  • window_size: rolling window for smoothing
  • enable_text_processing: enable/disable text cleaning and vectorization
  • strategyNLP: text representation, currently "bag_of_words"
  • fix_datainconsistencies: normalize common string representations, convert dates, drop duplicates
  • partialnoisereduction: apply optional smoothing to noisy numeric series
  • partialcap_outliersiqr: optionally cap a portion of IQR outliers
  • folder_name: directory to store transformers and metadata

🧠 What happens under the hood

Fit-time (full_process):

  1. Optional basic cleanup (fix_data_inconsistencies)
  2. Missing values (handle_missing_values, default KNN)
  3. Text preprocessing + Bag-of-Words (text_processing)
  4. Categorical encoding (encodingcategorical)
  5. Optional noise/outlier handling
  6. Scaling (scaling) for numerics
  7. Persist artifacts: encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

Inference-time (pipeline):

  1. Load persisted artifacts
  2. Same transformations as fit-time (without refitting)
  3. Align columns to saved feature_order (missing columns filled with 0)
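The alignment in step 3 can be sketched with a plain pandas reindex; feature_order below is a toy stand-in for the list persisted in feature_order.pkl at fit time, not Dapropy's internal code.

```python
import pandas as pd

# Toy stand-in for the column order saved during full_process().
feature_order = ["age", "city", "review_great", "review_service"]

# Inference data that is missing some training-time columns.
X_new = pd.DataFrame({"age": [0.5], "review_great": [1]})

# Reindex to the saved layout, filling absent columns with 0.
X_aligned = X_new.reindex(columns=feature_order, fill_value=0)
```

This guarantees the model sees the same columns, in the same order, as during training.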

📚 API Reference

  • class Dapropy(target=None, strategyED='Label', imputer_strategy='KNN', cap_ratio=0.9, smooth_ratio=0.9, window_size=3, enable_text_processing=True, strategyNLP='bag_of_words', fix_datainconsistencies=False, partialnoisereduction=False, partialcap_outliersiqr=False, folder_name='transformers')

    • Creates a preprocessing pipeline instance.
  • full_process(data: pd.DataFrame) -> pd.DataFrame

    • Runs the full fit-time pipeline and saves transformers/metadata.
  • pipeline(data: Union[dict, pd.Series, pd.DataFrame]) -> pd.DataFrame

    • Transforms new data for inference using saved artifacts.
  • handle_missing_values(data, strategy=None, n_neighbors=5)

    • Strategies: KNN, remove, fillna_mean, fillna_median, fillna_mode.
  • encodingcategorical(data, strategyED=None, fit_mode=True)

    • Encodes object/category columns via Label or One-Hot.
  • scaling(data, target=None, fit_mode=True)

    • Standardizes numeric columns; preserves target column values.
  • text_processing(data, column, fit_mode=True)

    • Cleans and vectorizes one text column; called internally for all text columns when enabled.
  • partial_cap_outliers_iqr(data, cap_ratio=None, random_state=42)

    • Caps a portion of IQR-defined outliers.
  • partial_noise_reduction(data, target=None, smooth_ratio=None, window_size=None, random_state=0)

    • Smooths a portion of detected noisy points in numeric series.
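For reference, the standard IQR capping rule that partial_cap_outliers_iqr is built on looks like the sketch below. Note this caps every outlier; Dapropy's partial variant is described as capping only a cap_ratio fraction, sampled with random_state, and its exact implementation may differ.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    # Clip values outside [Q1 - k*IQR, Q3 + k*IQR], the usual IQR rule.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# The extreme value 100 is capped at Q3 + 1.5*IQR = 4 + 3 = 7.
capped = cap_outliers_iqr(pd.Series([1, 2, 3, 4, 100]))
```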

📁 Persistence details

Artifacts are saved to folder_name (default: transformers):

  • encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

You may delete this folder to reset the pipeline or change folder_name to maintain multiple versions.
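The persist/restore pattern behind these .pkl files is the usual serialize-to-disk round trip. The sketch below uses stdlib pickle with a toy encoder mapping; which serializer Dapropy actually uses is an internal detail (joblib is listed among its dependencies).

```python
import pickle
from pathlib import Path

folder = Path("transformers_demo")
folder.mkdir(exist_ok=True)

# Toy stand-in for a fitted encoder artifact.
encoders = {"city": {"NY": 0, "LA": 1}}
(folder / "encoders.pkl").write_bytes(pickle.dumps(encoders))

# Later (e.g., at inference time), restore it from disk.
restored = pickle.loads((folder / "encoders.pkl").read_bytes())
```

Pointing folder_name at a new directory during a second training run leaves earlier artifacts untouched, which is how multiple pipeline versions can coexist.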


✅ Tips

  • Ensure free-text columns (other than target) have dtype object so they are cleaned and vectorized when enable_text_processing=True.
  • For stable inference, keep the same preprocessing configuration and folder_name between training and serving.
  • If you pass a dict or pd.Series to pipeline, it will be converted to a one-row DataFrame.
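The dict-to-DataFrame conversion mentioned in the last tip amounts to the following (a sketch of the documented behavior, not Dapropy's internal code):

```python
import pandas as pd

record = {"age": 28, "city": "LA", "review": "Service was fine"}

# A dict becomes a one-row DataFrame with columns taken from its keys.
row = pd.DataFrame([record])

# A Series gets the equivalent treatment.
series_row = pd.Series(record).to_frame().T
```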

🔧 Development

  • Python: 3.8+
  • Key dependencies: pandas, numpy, scikit-learn, emoji, nltk, textblob, joblib
  • Install locally for development:
pip install -e .

📝 License

MIT License. See LICENSE.txt for details.
