Flexible tabular data preprocessing utility with a single AutoSweep API
Project description
autosweep-preprocessing
A lightweight preprocessing library built around a single flexible API: AutoSweep.
Usage
from autosweep_preprocessing import AutoSweep
result = AutoSweep(
    file_path="data.csv",
    target_column="target",
    encode_categorical="onehot",
    remove_correlated=True,
    structured_output=True,
)
X = result["X"]
y = result["y"]
info = result["info"]
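
The arguments guide describes two return shapes: a dict when `structured_output=True`, and a tuple of either three or two items when it is `False`. The helper below is a minimal sketch of how calling code could accept either shape. `Prepared` and `normalize_output` are hypothetical names introduced for illustration and are not part of the package; the example feeds in hand-built values rather than a real `AutoSweep` call.

```python
from typing import Any, NamedTuple, Optional, Sequence


class Prepared(NamedTuple):
    """Uniform view over both return shapes (hypothetical helper type)."""
    X: Any
    y: Optional[Any]
    feature_names: Sequence[str]


def normalize_output(result: Any) -> Prepared:
    """Turn a dict or tuple result into a single Prepared record."""
    if isinstance(result, dict):
        # Dict form: keys 'X', 'y', 'feature_names', 'info' per the guide.
        return Prepared(result["X"], result.get("y"), result["feature_names"])
    if len(result) == 3:
        # Tuple form with a target: (X, y, feature_names).
        X, y, names = result
        return Prepared(X, y, names)
    # Tuple form without a target: (X, feature_names).
    X, names = result
    return Prepared(X, None, names)


# Hand-built stand-ins for the two shapes, not real AutoSweep output.
as_dict = {"X": [[1.0]], "y": [0], "feature_names": ["a"], "info": {}}
as_tuple = ([[1.0]], ["a"])
print(normalize_output(as_dict).y)   # [0]
print(normalize_output(as_tuple).y)  # None
```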
Function
AutoSweep supports:
- CSV/Excel loading
- Missing value handling and imputation
- Numeric scaling (`standard`, `minmax`, `robust`)
- Categorical encoding (`onehot`, `ordinal`, `label`)
- Optional datetime feature extraction
- Optional outlier handling (`iqr`, `zscore`)
- Optional correlation and low-variance filtering
- Structured output for pipeline diagnostics
AutoSweep Arguments Guide
Required / Core
- `file_path` (required)
  - What it does: Path to the input dataset (`.csv` or Excel file).
  - Use case: Point to your raw training file before preprocessing.
  - Example: `file_path="data/train.csv"`
- `target_column` (default: `None`)
  - What it does: Separates the target variable from the features and returns it as `y`.
  - Use case: Set this when you want to train or evaluate models after preprocessing.
  - Example: `target_column="price"`
Column cleaning
- `drop_columns` (default: `None`)
  - What it does: Drops specific columns by name.
  - Use case: Remove IDs, leakage columns, or metadata fields.
  - Example: `drop_columns=["id", "created_at"]`
- `drop_threshold` (default: `1.0`)
  - What it does: Drops columns whose missing-value fraction is greater than this threshold.
  - Use case: Use `0.4` or `0.5` to remove heavily incomplete columns.
  - Example: `drop_threshold=0.5`
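
To make the `drop_threshold` rule concrete, here is a small stdlib-only sketch of "drop columns whose missing-value fraction is greater than the threshold". It is not the package's code: `columns_to_drop` is a hypothetical helper, and representing missing values as `None` in a list of dicts is an assumption made for the example.

```python
def columns_to_drop(rows, threshold):
    """Return names of columns whose missing fraction is strictly above threshold."""
    if not rows:
        return []
    n = len(rows)
    dropped = []
    for col in rows[0]:
        missing = sum(1 for row in rows if row[col] is None)
        if missing / n > threshold:
            dropped.append(col)
    return dropped


rows = [
    {"age": 31, "income": None, "notes": None},
    {"age": None, "income": 52000, "notes": None},
    {"age": 45, "income": 61000, "notes": None},
    {"age": 28, "income": None, "notes": "vip"},
]

# Missing fractions: age 0.25, income 0.50, notes 0.75.
print(columns_to_drop(rows, 0.5))  # ['notes']
print(columns_to_drop(rows, 1.0))  # []
```

Because the comparison is "greater than", a threshold of `1.0` can never be exceeded, so under this reading the default drops no columns at all.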
Missing values
- `impute_strategy_num` (default: `'mean'`)
  - What it does: Numeric imputation strategy.
  - Allowed: `'mean'`, `'median'`, `'most_frequent'`, `'constant'`, `'knn'`, `'mode'`.
  - Use case: Use `'median'` for skewed numeric data, `'knn'` for richer local patterns.
  - Example: `impute_strategy_num="median"`
- `impute_strategy_cat` (default: `'most_frequent'`)
  - What it does: Categorical imputation strategy.
  - Allowed: any `SimpleImputer` categorical strategy (commonly `'most_frequent'`, `'constant'`).
  - Use case: Use `'most_frequent'` for stable categories.
  - Example: `impute_strategy_cat="most_frequent"`
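
The advice to prefer `'median'` for skewed data is easier to see with numbers. The sketch below uses only the standard library and covers the simpler strategies; it does not reproduce the package's implementation and leaves out `'knn'`. The `impute` function, its `fill_value` parameter, and treating `'mode'` as an alias of `'most_frequent'` are all assumptions made for illustration.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", fill_value=0):
    """Replace None entries using one of the simpler strategies."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy in ("most_frequent", "mode"):  # alias is an assumption
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value  # hypothetical parameter name
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


incomes = [30, 32, 35, None, 400]  # one large value skews the mean
print(impute(incomes, "mean"))    # [30, 32, 35, 124.25, 400]
print(impute(incomes, "median"))  # [30, 32, 35, 33.5, 400]

colors = ["red", None, "blue", "red"]
print(impute(colors, "most_frequent"))  # ['red', 'red', 'blue', 'red']
```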
Scaling and encoding
- `scaler` (default: `None`)
  - What it does: Scales numeric features.
  - Allowed: `None`, `'none'`, `'passthrough'`, `'standard'`, `'minmax'`, `'robust'`.
  - Behavior: No scaling is applied unless you explicitly choose a scaler.
  - Use case: Use `'robust'` when outliers are present.
  - Example: `scaler="robust"`
- `encode_categorical` (default: `None`)
  - What it does: Encodes categorical columns.
  - Allowed: `None`, `'none'`, `'passthrough'`, `'onehot'`, `'ordinal'`, `'label'`.
  - Use case: Use `'onehot'` for linear/tree models; `'label'` for compact numeric conversion.
  - Example: `encode_categorical="onehot"`
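
The three scaler names correspond to well-known transforms, and a side-by-side run shows why `'robust'` is suggested when outliers are present. This is a stdlib sketch of the textbook definitions, not the package's code. The population standard deviation and the "inclusive" quartile method are assumptions, so exact values from the library may differ. Each function assumes a non-constant column.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax_scale(values):
    """Map the column linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    return [(v - center) / (q3 - q1) for v in values]


values = [1, 2, 3, 4, 100]  # 100 is an outlier
print([round(v, 2) for v in standard_scale(values)])
print([round(v, 2) for v in minmax_scale(values)])  # [0.0, 0.01, 0.02, 0.03, 1.0]
print(robust_scale(values))                         # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With min-max scaling the four ordinary values are squeezed into the bottom 3% of the range, while the robust version keeps them evenly spread and lets only the outlier land far away.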
Feature selection
- `remove_low_variance` (default: `False`)
  - What it does: Removes low-variance numeric features after preprocessing.
  - Use case: Enable when many near-constant numeric features exist.
  - Example: `remove_low_variance=True`
- `variance_thresh` (default: `0.0`)
  - What it does: Variance cutoff used by low-variance filtering.
  - Use case: Increase (e.g., `0.01`) to remove weak/noisy features.
  - Example: `variance_thresh=0.01`
- `remove_correlated` (default: `False`)
  - What it does: Drops highly correlated numeric features.
  - Use case: Reduce multicollinearity and redundant columns.
  - Example: `remove_correlated=True`
- `corr_threshold` (default: `0.95`)
  - What it does: Absolute correlation threshold for dropping features.
  - Use case: Use `0.85`-`0.95` depending on how aggressively you want feature pruning.
  - Example: `corr_threshold=0.9`
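
Correlation filtering can be pictured as a greedy pass over the columns. The sketch below uses Pearson correlation and keeps the first column of each highly correlated pair. Both choices are assumptions for illustration, as the description above does not say which correlation measure is used or which member of a pair is dropped. `pearson` and `drop_correlated` are hypothetical helpers.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not too correlated with any kept column."""
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same quantity, other unit
    "age": [40, 22, 35, 51, 28],
}
kept, dropped = drop_correlated(columns, threshold=0.9)
print(kept)     # ['height_cm', 'age']
print(dropped)  # ['height_in']
```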
Outlier handling
- `outlier_method` (default: `None`)
  - What it does: Enables outlier detection.
  - Allowed: `None`, `'iqr'`, `'zscore'` (also `'z-score'`, `'z_score'`).
  - Use case: Use `'iqr'` for non-normal data; `'zscore'` for roughly normal distributions.
  - Example: `outlier_method="iqr"`
- `outlier_threshold` (default: `1.5`)
  - What it does: Threshold used by the outlier method.
  - Use case: Increase to keep more rows, decrease to be stricter.
  - Example: `outlier_threshold=3.0` (common for z-score)
- `cap_outliers` (default: `False`)
  - What it does: Caps outliers to bounds instead of dropping rows.
  - Use case: Set `True` when you want to preserve dataset size.
  - Example: `cap_outliers=True`
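
The interaction between the method, the threshold, and `cap_outliers` can be sketched on a single column. This is a stdlib illustration of the usual IQR and z-score rules, not the package's code. The function names, the "inclusive" quartile method, and the population standard deviation are assumptions.

```python
from statistics import mean, pstdev, quantiles


def iqr_bounds(values, threshold=1.5):
    """Bounds at Q1 - t*IQR and Q3 + t*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - threshold * spread, q3 + threshold * spread


def zscore_bounds(values, threshold=3.0):
    """Bounds at mean +/- t standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return mu - threshold * sigma, mu + threshold * sigma


def handle_outliers(values, bounds, cap=False):
    """Drop values outside the bounds, or clip them to the bounds if cap=True."""
    low, high = bounds
    if cap:
        return [min(max(v, low), high) for v in values]
    return [v for v in values if low <= v <= high]


values = [10, 12, 11, 13, 12, 95]
bounds = iqr_bounds(values, threshold=1.5)
print(bounds)                                    # (9.0, 15.0)
print(handle_outliers(values, bounds))           # [10, 12, 11, 13, 12]
print(handle_outliers(values, bounds, cap=True)) # [10, 12, 11, 13, 12, 15.0]

# On a sample this small, the outlier inflates the standard deviation so much
# that a z-score threshold of 3.0 flags nothing.
print(handle_outliers(values, zscore_bounds(values, 3.0)) == values)  # True
```

Dropping shortens the column (and, in a table, removes the whole row), while capping keeps the length and pulls the extreme value back to the nearest bound.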
Datetime features
- `extract_datetime` (default: `False`)
  - What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
  - Use case: Enable when date fields carry predictive signal.
  - Example: `extract_datetime=True`
- `drop_datetime_original` (default: `False`)
  - What it does: Drops original datetime columns after extraction.
  - Use case: Keep only engineered datetime parts to simplify model input.
  - Example: `drop_datetime_original=True`
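
As a rough picture of what the two datetime options do to a single row, the sketch below extracts the five listed parts with the standard library. The `<column>_<part>` naming, the Monday-is-0 weekday convention, and the ISO-style input string are assumptions for illustration; the package's actual output column names may differ.

```python
from datetime import datetime


def datetime_parts(value):
    """Split an ISO-style timestamp string into the five listed parts."""
    ts = datetime.fromisoformat(value)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.weekday(),  # Monday is 0 in Python's convention
        "hour": ts.hour,
    }


def expand_row(row, column, drop_original=False):
    """Add <column>_<part> fields; optionally remove the source column."""
    out = dict(row)
    for part, value in datetime_parts(row[column]).items():
        out[f"{column}_{part}"] = value
    if drop_original:
        del out[column]
    return out


row = {"id": 7, "created_at": "2024-03-15 14:30:00"}
print(expand_row(row, "created_at"))
print(expand_row(row, "created_at", drop_original=True))
```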
Target encoding and output format
- `target_encode` (default: `False`)
  - What it does: Applies mean target encoding to categorical features.
  - Use case: Helpful for high-cardinality categorical variables.
  - Important: Requires `target_column`; avoid leakage by fitting only on training data in production workflows.
  - Example: `target_encode=True`
- `structured_output` (default: `True`)
  - What it does: Controls the return format.
  - If `True`: returns a dict with the keys `'X'`, `'y'`, `'feature_names'`, and `'info'`.
  - If `False`: returns a tuple, either `(X, y, feature_names)` or `(X, feature_names)`.
  - Use case: Keep `True` for debugging and pipeline introspection.
- `verbose` (default: `True`)
  - What it does: Prints detailed preprocessing diagnostics.
  - Use case: Set `False` for cleaner logs in training pipelines.
  - Example: `verbose=False`
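
The leakage warning on `target_encode` comes down to where the category means are computed. A minimal stdlib sketch of mean target encoding, fitted on training rows and only applied to new rows, might look like this. The helper names and the fallback to the global training mean for unseen categories are assumptions, not documented behavior of the package.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn per-category target means from training data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def apply_target_means(categories, means, fallback):
    """Encode categories with learned means; unseen ones get the fallback."""
    return [means.get(cat, fallback) for cat in categories]


train_city = ["A", "A", "B", "B", "B"]
train_price = [100, 200, 300, 300, 600]
means, global_mean = fit_target_means(train_city, train_price)
print(means)  # {'A': 150.0, 'B': 400.0}

# New rows are encoded with the training means; their own targets are never used.
test_city = ["A", "C"]
print(apply_target_means(test_city, means, global_mean))  # [150.0, 300.0]
```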
Notes
- If you use Excel input, keep `openpyxl` installed.
- If `target_encode=True`, provide a valid `target_column`.
Download files
File details
Details for the file autosweep_preprocessing-0.1.2.tar.gz.
File metadata
- Download URL: autosweep_preprocessing-0.1.2.tar.gz
- Upload date:
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72f72de2a06b4876ae6486bb8c6f626096871008d7c9e193954acebd8ae0702b |
| MD5 | eeaf35000619f4007c2ea3a2558fe98a |
| BLAKE2b-256 | 2da05baf11d4f1f0d20a4e9b398b531fb49811299e47ec1879915cb1e518529b |
File details
Details for the file autosweep_preprocessing-0.1.2-py3-none-any.whl.
File metadata
- Download URL: autosweep_preprocessing-0.1.2-py3-none-any.whl
- Upload date:
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 508cdc86b278fad47e6f9e4ba6fb21de89ec41a9c261a920b056a47131d13886 |
| MD5 | 314d73d1918d0f4922933005ca495324 |
| BLAKE2b-256 | 557586803bccd5c6919c21a4b8331902957338fdc16613391030cecd55adaeb9 |