
Dapropy

A lightweight Python library for automated preprocessing of datasets containing mixed numeric, categorical, and text features. It cleans text, handles missing values, encodes categories, scales numerics, and persists transformers for consistent training and inference.


✨ Features

  • Mixed-type handling: numeric, categorical, and text columns in one pipeline
  • Missing values: simple fill strategies or mixed-type KNN imputation
  • Categorical encoding: Label encoding or One-Hot encoding
  • Scaling: standardization for numeric features
  • Text processing: clean HTML/URLs/emojis, lowercase, spell-correct, remove stopwords, stem, then Bag-of-Words
  • Optional data cleanup: basic inconsistency fixes, partial outlier capping, partial noise reduction
  • Persistence: saves encoders, vectorizer, scalers, imputer, and feature order for inference reproducibility

📦 Installation

pip install Dapropy

Alternatively, install the latest version straight from GitHub:

pip install git+https://github.com/BlackIIIWhite/Dapropy

Python 3.8+ is required.

Note: On first use, NLTK resources are downloaded automatically (e.g., punkt, stopwords).


🚀 Quickstart

import pandas as pd
from dapropy import Dapropy

# Example data
train_df = pd.DataFrame({
    "age": [25, 30, None, 40],
    "city": ["NY", "LA", "LA", None],
    "review": [
        "Loved it! 😊 <b>Great</b> service",
        "Okay visit, would return",
        None,
        "Terrible... won't go again http://example.com"
    ],
    "label": [1, 0, 1, 0]
})

# Fit-time processing (saves transformers into ./transformers by default)
p = Dapropy(
    target="label",
    strategyED="Label",           # or "One-Hot"
    imputer_strategy="KNN",       # or fill/remove strategies
    enable_text_processing=True,
    strategyNLP="bag_of_words",
    fix_datainconsistencies=False,
    partialnoisereduction=False,
    partialcap_outliersiqr=False,
    folder_name="transformers"
)
X_train = p.full_process(train_df)

# Inference-time processing (reuses saved transformers)
new_df = pd.DataFrame({
    "age": [28],
    "city": ["LA"],
    "review": ["Service was fine, nothing special"],
})
X_infer = p.pipeline(new_df)

⚙️ Configuration

  • target: name of the target column (kept unscaled in output)
  • strategyED: categorical encoding strategy
    • "Label" (default)
    • "One-Hot"
  • imputer_strategy: missing value handling
    • "KNN" (default) – mixed-type KNN with safe label encoding/decoding
    • "remove", "fillna_mean", "fillna_median", "fillna_mode"
  • cap_ratio: fraction of detected outliers to cap (0–1)
  • smooth_ratio: fraction of detected noise to smooth (0–1)
  • window_size: rolling window for smoothing
  • enable_text_processing: enable/disable text cleaning and vectorization
  • strategyNLP: text representation, currently "bag_of_words"
  • fix_datainconsistencies: normalize common string representations, convert dates, drop duplicates
  • partialnoisereduction: apply optional smoothing to noisy numeric series
  • partialcap_outliersiqr: optionally cap a portion of IQR outliers
  • folder_name: directory to store transformers and metadata
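
To make the two strategyED options concrete, here is a minimal pandas sketch of the difference (illustrative only — the column names and encoder objects here are not Dapropy's internals):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "LA", "SF"]})

# "Label": each category becomes one integer code, keeping a single column
codes = df["city"].astype("category").cat.codes

# "One-Hot": one indicator column per category level
one_hot = pd.get_dummies(df["city"], prefix="city")
```

Note that One-Hot widens the frame (one column per level), which is why inference-time column alignment against the saved feature order matters for categories unseen at training time.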

🧠 What happens under the hood

Fit-time (full_process):

  1. Optional basic cleanup (fix_data_inconsistencies)
  2. Missing values (handle_missing_values, default KNN)
  3. Text preprocessing + Bag-of-Words (text_processing)
  4. Categorical encoding (encodingcategorical)
  5. Optional noise/outlier handling
  6. Scaling (scaling) for numerics
  7. Persist artifacts: encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl
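
Step 7 boils down to serializing fitted objects to disk. A minimal sketch using the standard library's pickle and feature_order.pkl as the example (the artifact contents shown are placeholders, not Dapropy's actual structures):

```python
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "transformers"
    folder.mkdir()

    # Placeholder artifact: the column order captured at fit time
    feature_order = ["age", "city", "review_great", "review_service"]
    with open(folder / "feature_order.pkl", "wb") as f:
        pickle.dump(feature_order, f)

    # Inference side: load the artifact and reuse it unchanged
    with open(folder / "feature_order.pkl", "rb") as f:
        restored = pickle.load(f)
```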

Inference-time (pipeline):

  1. Load persisted artifacts
  2. Same transformations as fit-time (without refitting)
  3. Align columns to saved feature_order (missing columns filled with 0)
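
Step 3 can be sketched with pandas reindex; this is a guess at the mechanism, not Dapropy's exact code:

```python
import pandas as pd

saved_feature_order = ["age", "city_LA", "city_NY", "review_great"]

# Inference row missing a one-hot level that only appeared at training time
new_row = pd.DataFrame({"age": [28], "city_LA": [1]})

# Align to the saved order: missing columns filled with 0, extras dropped
aligned = new_row.reindex(columns=saved_feature_order, fill_value=0)
```

Alignment like this is what keeps the inference matrix shape-compatible with whatever model was trained on the fit-time output.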

📚 API Reference

  • class Dapropy(target=None, strategyED='Label', imputer_strategy='KNN', cap_ratio=0.9, smooth_ratio=0.9, window_size=3, enable_text_processing=True, strategyNLP='bag_of_words', fix_datainconsistencies=False, partialnoisereduction=False, partialcap_outliersiqr=False, folder_name='transformers')

    • Creates a preprocessing pipeline instance.
  • full_process(data: pd.DataFrame) -> pd.DataFrame

    • Runs the full fit-time pipeline and saves transformers/metadata.
  • pipeline(data: Union[dict, pd.Series, pd.DataFrame]) -> pd.DataFrame

    • Transforms new data for inference using saved artifacts.
  • handle_missing_values(data, strategy=None, n_neighbors=5)

    • Strategies: KNN, remove, fillna_mean, fillna_median, fillna_mode.
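
One common way to implement mixed-type KNN imputation — and presumably roughly what the KNN strategy does — is to label-encode categoricals, run scikit-learn's KNNImputer, then round and decode. A sketch under those assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25.0, 30.0, None, 40.0],
    "city": ["NY", "LA", "LA", None],
})

# Encode the categorical column as float codes, preserving missing values
cats = df["city"].astype("category")
codes = cats.cat.codes.astype(float)   # missing values come out as -1
codes[codes == -1] = np.nan

work = pd.DataFrame({"age": df["age"], "city": codes})
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(work),
                       columns=work.columns)

# Round imputed codes back to the nearest valid category label
decoded = cats.cat.categories[imputed["city"].round().astype(int)]
result = pd.DataFrame({"age": imputed["age"], "city": decoded})
```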
  • encodingcategorical(data, strategyED=None, fit_mode=True)

    • Encodes object/category columns via Label or One-Hot.
  • scaling(data, target=None, fit_mode=True)

    • Standardizes numeric columns; preserves target column values.
  • text_processing(data, column, fit_mode=True)

    • Cleans and vectorizes one text column; called internally for all text columns when enabled.
  • partial_cap_outliers_iqr(data, cap_ratio=None, random_state=42)

    • Caps a portion of IQR-defined outliers.
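
Partial capping can be sketched as: compute the usual 1.5×IQR bounds, then clip only a cap_ratio share of the offending points, chosen at random. Illustrative only — the detection and sampling details are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 2.0, 100.0, 2.0, -50.0])

# Standard 1.5*IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap only a cap_ratio fraction of the detected outliers
outliers = s.index[(s < lo) | (s > hi)]
cap_ratio = 0.9
n_cap = int(round(cap_ratio * len(outliers)))
to_cap = rng.choice(outliers, size=n_cap, replace=False)

s.loc[to_cap] = s.loc[to_cap].clip(lo, hi)
```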
  • partial_noise_reduction(data, target=None, smooth_ratio=None, window_size=None, random_state=0)

    • Smooths a portion of detected noisy points in numeric series.
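
Similarly, partial noise reduction can be sketched as a centered rolling mean applied to a sampled fraction of points flagged as noisy; the detection rule below (distance from a rolling median) is my assumption, not Dapropy's:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([1.0, 1.1, 5.0, 1.2, 1.0, 1.1, 4.8, 1.0])

# Flag points far from their rolling median as "noisy"
med = s.rolling(window=3, center=True, min_periods=1).median()
noisy = s.index[(s - med).abs() > 1.0]

# Smooth only a smooth_ratio fraction of them with a rolling mean
smooth_ratio, window_size = 0.9, 3
n = int(round(smooth_ratio * len(noisy)))
chosen = rng.choice(noisy, size=n, replace=False)
smoothed = s.rolling(window=window_size, center=True, min_periods=1).mean()
s.loc[chosen] = smoothed.loc[chosen]
```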

📁 Persistence details

Artifacts are saved to folder_name (default: transformers):

  • encoders.pkl, vectorizer.pkl, scalers.pkl, imputer.pkl, feature_order.pkl

Delete this folder to reset the pipeline, or use a different folder_name to keep multiple transformer versions side by side.


✅ Tips

  • Ensure every free-text column other than target has dtype object; only object columns are cleaned and vectorized when enable_text_processing=True.
  • For stable inference, keep the same preprocessing configuration and folder_name between training and serving.
  • If you pass a dict or pd.Series to pipeline, it will be converted to a one-row DataFrame.
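
The dict/Series conversion mentioned in the last tip presumably amounts to something like:

```python
import pandas as pd

record = {"age": 28, "city": "LA", "review": "Service was fine"}

# A dict of scalars wrapped in a list becomes a one-row DataFrame
row = pd.DataFrame([record])

# A Series converts via to_frame().T
row_from_series = pd.Series(record).to_frame().T
```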

🔧 Development

  • Python: 3.8+
  • Key dependencies: pandas, numpy, scikit-learn, emoji, nltk, textblob, joblib
  • Install locally for development:
pip install -e .

📝 License

MIT License. See LICENSE.txt for details.

Project details

  • Source distribution: Dapropy-0.1.2.tar.gz (9.2 kB)
  • Built distribution: Dapropy-0.1.2-py3-none-any.whl (9.3 kB)
  • Uploaded via twine/6.1.0 (CPython 3.11.0)