Dapropy
A lightweight Python library for automated preprocessing of datasets containing mixed numeric, categorical, and text features. It cleans text, handles missing values, encodes categories, scales numerics, and persists transformers for consistent training and inference.
✨ Features
- Mixed-type handling: numeric, categorical, and text columns in one pipeline
- Missing values: simple fill strategies or mixed-type KNN imputation
- Categorical encoding: Label encoding or One-Hot encoding
- Scaling: standardization for numeric features
- Text processing: clean HTML/URLs/emojis, lowercase, spell-correct, remove stopwords, stem, then Bag-of-Words
- Optional data cleanup: basic inconsistency fixes, partial outlier capping, partial noise reduction
- Persistence: saves encoders, vectorizer, scalers, imputer, and feature order for inference reproducibility
📦 Installation
```bash
pip install git+https://github.com/BlackIIIWhite/Dapropy
```
Python 3.8+ is required.
Note: On first use, NLTK resources are downloaded automatically (e.g., punkt, stopwords).
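If automatic downloads are not possible (for example, in an offline or locked-down environment), the resources named above can be fetched ahead of time with NLTK directly:

```python
import nltk

# Pre-fetch the NLTK resources mentioned above so Dapropy's first run
# does not need network access.
nltk.download("punkt")
nltk.download("stopwords")
```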
🚀 Quickstart
```python
import pandas as pd
from dapropy import Dapropy

# Example data
train_df = pd.DataFrame({
    "age": [25, 30, None, 40],
    "city": ["NY", "LA", "LA", None],
    "review": [
        "Loved it! 😊 <b>Great</b> service",
        "Okay visit, would return",
        None,
        "Terrible... won't go again http://example.com"
    ],
    "label": [1, 0, 1, 0]
})

# Fit-time processing (saves transformers into ./transformers by default)
p = Dapropy(
    target="label",
    strategyED="Label",          # or "One-Hot"
    imputer_strategy="KNN",      # or fill/remove strategies
    enable_text_processing=True,
    strategyNLP="bag_of_words",
    fix_datainconsistencies=False,
    partialnoisereduction=False,
    partialcap_outliersiqr=False,
    folder_name="transformers"
)
X_train = p.full_process(train_df)

# Inference-time processing (reuses saved transformers)
new_df = pd.DataFrame({
    "age": [28],
    "city": ["LA"],
    "review": ["Service was fine, nothing special"],
})
X_infer = p.pipeline(new_df)
```
⚙️ Configuration
- `target`: name of the target column (kept unscaled in the output)
- `strategyED`: categorical encoding strategy; `"Label"` (default) or `"One-Hot"`
- `imputer_strategy`: missing-value handling; `"KNN"` (default, mixed-type KNN with safe label encoding/decoding), `"remove"`, `"fillna_mean"`, `"fillna_median"`, or `"fillna_mode"`
- `cap_ratio`: fraction of detected outliers to cap (0–1)
- `smooth_ratio`: fraction of detected noise to smooth (0–1)
- `window_size`: rolling window size for smoothing
- `enable_text_processing`: enable/disable text cleaning and vectorization
- `strategyNLP`: text representation; currently only `"bag_of_words"`
- `fix_datainconsistencies`: normalize common string representations, convert dates, drop duplicates
- `partialnoisereduction`: apply optional smoothing to noisy numeric series
- `partialcap_outliersiqr`: optionally cap a portion of IQR outliers
- `folder_name`: directory in which transformers and metadata are stored
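To illustrate how these options combine, here is a sketch of an alternative configuration built only from the parameters documented above; the exact runtime interaction of the options is up to the library, not guaranteed by this example:

```python
from dapropy import Dapropy

# Alternative setup: one-hot encoding, median imputation, and partial
# outlier capping, with artifacts kept in a separate folder per config.
p_alt = Dapropy(
    target="label",
    strategyED="One-Hot",
    imputer_strategy="fillna_median",
    partialcap_outliersiqr=True,
    cap_ratio=0.5,                       # cap half of the detected IQR outliers
    folder_name="transformers_onehot",
)
```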
🧠 What happens under the hood
Fit-time (`full_process`):
- Optional basic cleanup (`fix_data_inconsistencies`)
- Missing values (`handle_missing_values`, default `KNN`)
- Text preprocessing + Bag-of-Words (`text_processing`)
- Categorical encoding (`encodingcategorical`)
- Optional noise/outlier handling
- Scaling (`scaling`) for numerics
- Persist artifacts: `encoders.pkl`, `vectorizer.pkl`, `scalers.pkl`, `imputer.pkl`, `feature_order.pkl`

Inference-time (`pipeline`):
- Load persisted artifacts
- Apply the same transformations as at fit time (without refitting)
- Align columns to the saved `feature_order` (missing columns filled with 0)
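The alignment step behaves like a pandas reindex against the saved column list. A minimal illustration of the idea, with hypothetical column names (this is not Dapropy's internal code):

```python
import pandas as pd

# Hypothetical saved feature order, for illustration only.
feature_order = ["age", "city", "review_great", "review_service"]

new_row = pd.DataFrame({"age": [0.4], "city": [1]})

# Missing columns are added and filled with 0; order matches training.
aligned = new_row.reindex(columns=feature_order, fill_value=0)
```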
📚 API Reference
- `class Dapropy(target=None, strategyED='Label', imputer_strategy='KNN', cap_ratio=0.9, smooth_ratio=0.9, window_size=3, enable_text_processing=True, strategyNLP='bag_of_words', fix_datainconsistencies=False, partialnoisereduction=False, partialcap_outliersiqr=False, folder_name='transformers')`: creates a preprocessing pipeline instance.
- `full_process(data: pd.DataFrame) -> pd.DataFrame`: runs the full fit-time pipeline and saves transformers/metadata.
- `pipeline(data: Union[dict, pd.Series, pd.DataFrame]) -> pd.DataFrame`: transforms new data for inference using saved artifacts.
- `handle_missing_values(data, strategy=None, n_neighbors=5)`: supported strategies: `KNN`, `remove`, `fillna_mean`, `fillna_median`, `fillna_mode`.
- `encodingcategorical(data, strategyED=None, fit_mode=True)`: encodes object/category columns via Label or One-Hot.
- `scaling(data, target=None, fit_mode=True)`: standardizes numeric columns; preserves `target` column values.
- `text_processing(data, column, fit_mode=True)`: cleans and vectorizes one text column; called internally for all text columns when enabled.
- `partial_cap_outliers_iqr(data, cap_ratio=None, random_state=42)`: caps a portion of IQR-defined outliers.
- `partial_noise_reduction(data, target=None, smooth_ratio=None, window_size=None, random_state=0)`: smooths a portion of detected noisy points in numeric series.
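Based on the signatures above, the individual steps can presumably also be invoked one at a time. A hedged sketch; whether intermediate results match a `full_process` run is an assumption:

```python
import pandas as pd
from dapropy import Dapropy

df = pd.DataFrame({
    "age": [25, None, 40],
    "city": ["NY", "LA", None],
    "label": [1, 0, 1],
})

p = Dapropy(target="label")
df = p.handle_missing_values(df, strategy="fillna_mode")
df = p.encodingcategorical(df, strategyED="Label", fit_mode=True)
df = p.scaling(df, target="label", fit_mode=True)
```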
📁 Persistence details
Artifacts are saved to `folder_name` (default: `transformers`):
`encoders.pkl`, `vectorizer.pkl`, `scalers.pkl`, `imputer.pkl`, `feature_order.pkl`
You may delete this folder to reset the pipeline, or change `folder_name` to maintain multiple versions.
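Since `joblib` is among the dependencies, the saved artifacts can most likely be opened directly for inspection; a sketch assuming they are plain joblib pickles:

```python
import joblib

# Inspect persisted artifacts (assumes standard joblib serialization).
feature_order = joblib.load("transformers/feature_order.pkl")
encoders = joblib.load("transformers/encoders.pkl")
print(feature_order)  # column order used to align inference data
```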
✅ Tips
- Ensure all columns besides `target` that contain free text are of dtype `object` so they are cleaned/vectorized when `enable_text_processing=True`.
- For stable inference, keep the same preprocessing configuration and `folder_name` between training and serving.
- If you pass a `dict` or `pd.Series` to `pipeline`, it is converted to a one-row `DataFrame` (see the example below).
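For instance, reusing the fitted `p` from the Quickstart, a single record can be passed as a plain dict:

```python
# Dapropy converts the dict to a one-row DataFrame internally.
X_one = p.pipeline({"age": 33, "city": "NY", "review": "Quick and friendly staff"})
```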
🔧 Development
- Python: 3.8+
- Key dependencies: `pandas`, `numpy`, `scikit-learn`, `emoji`, `nltk`, `textblob`, `joblib`
- Install locally for development: `pip install -e .`
📝 License
MIT License. See LICENSE.txt for details.