Flexible tabular data preprocessing utility with a single AutoSweep API

autosweep-preprocessing

A lightweight preprocessing library built around a single flexible API: AutoSweep.

Usage

from autosweep_preprocessing import AutoSweep

result = AutoSweep(
    file_path="data.csv",
    target_column="target",
    encode_categorical="onehot",
    remove_correlated=True,
    structured_output=True,
)

X = result["X"]
y = result["y"]
info = result["info"]

Features

AutoSweep supports:

- CSV/Excel loading
- Missing value handling and imputation
- Numeric scaling (standard, minmax, robust)
- Categorical encoding (onehot, ordinal, label)
- Optional datetime feature extraction
- Optional outlier handling (iqr, zscore)
- Optional correlation and low-variance filtering
- Structured output for pipeline diagnostics

AutoSweep Arguments Guide

Required / Core

file_path (required)
- What it does: Path to input dataset (.csv or Excel file).
- Use case: Point to your raw training file before preprocessing.
- Example: file_path="data/train.csv"
target_column (default: None)
- What it does: Separates target variable from features and returns it as y.
- Use case: Set this when you want to train/evaluate models after preprocessing.
- Example: target_column="price"

Column cleaning

drop_columns (default: None)
- What it does: Drops specific columns by name.
- Use case: Remove IDs, leakage columns, or metadata fields.
- Example: drop_columns=["id", "created_at"]
drop_threshold (default: 1.0)
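- What it does: Drops columns whose missing-value fraction is greater than this threshold. With the default of 1.0, no column is dropped on this basis, since a fraction cannot exceed 1.0.
- Use case: Use a value such as 0.4 or 0.5 to remove heavily incomplete columns.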
- Example: drop_threshold=0.5
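
The drop_threshold rule can be illustrated with a small standalone sketch. It uses only the standard library and is not AutoSweep's own code: the helper names (missing_fraction, drop_sparse_columns) and the toy table are hypothetical, and None is assumed to mark a missing entry.

```python
def missing_fraction(column):
    """Fraction of entries in `column` that are None (assumed missing marker)."""
    if not column:
        return 0.0
    return sum(value is None for value in column) / len(column)


def drop_sparse_columns(table, drop_threshold=1.0):
    """Keep only columns whose missing fraction is NOT greater than the threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= drop_threshold
    }


# Hypothetical toy table: column name -> list of values.
table = {
    "age": [34, None, 51, 29],          # 25% missing
    "notes": [None, None, None, "ok"],  # 75% missing
}

kept_default = drop_sparse_columns(table)                     # threshold 1.0 keeps both
kept_strict = drop_sparse_columns(table, drop_threshold=0.5)  # drops "notes"
print(list(kept_default), list(kept_strict))
```

With a threshold of 0.5 only the "notes" column, which is 75% missing, is removed.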

Missing values

impute_strategy_num (default: 'mean')
- What it does: Numeric imputation strategy.
- Allowed: 'mean', 'median', 'most_frequent', 'constant', 'knn', 'mode'.
- Use case: Use 'median' for skewed numeric data, 'knn' for richer local patterns.
- Example: impute_strategy_num="median"
impute_strategy_cat (default: 'most_frequent')
- What it does: Categorical imputation strategy.
- Allowed: any SimpleImputer categorical strategy (commonly 'most_frequent', 'constant').
- Use case: Use 'most_frequent' for stable categories.
- Example: impute_strategy_cat="most_frequent"
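
To see why 'median' is suggested for skewed numeric data, the sketch below compares mean and median fills on a toy column with one extreme value. It is a plain-Python illustration, not the library's imputer: the impute helper and the sample data are hypothetical, and None is assumed to mark a missing entry.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical skewed column: one very large value, one missing entry.
incomes = [30, 32, 35, None, 400]

mean_filled = impute(incomes, "mean")      # fill value is pulled up by 400
median_filled = impute(incomes, "median")  # fill value stays near the bulk
print(mean_filled[3], median_filled[3])
```

The mean fill lands far from the typical values because of the single large entry, while the median fill stays close to them.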

Scaling and encoding

scaler (default: 'standard')
- What it does: Scales numeric features.
- Allowed: 'standard', 'minmax', 'robust', or any other value for passthrough.
- Use case: Use 'robust' when outliers are present.
- Example: scaler="robust"
encode_categorical (default: None)
- What it does: Encodes categorical columns.
- Allowed: None, 'none', 'passthrough', 'onehot', 'ordinal', 'label'.
- Use case: Use 'onehot' for linear/tree models; 'label' for compact numeric conversion.
- Example: encode_categorical="onehot"
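
The three scaler names correspond to well-known formulas, sketched below with the standard library only. This is an assumption-laden illustration rather than AutoSweep's implementation: the scale helper is hypothetical, and the quartile method (inclusive, linear interpolation) is one common convention that the library may or may not share.

```python
from statistics import fmean, median, pstdev, quantiles


def scale(values, scaler="standard"):
    """Scale a numeric column; unknown scaler names pass values through."""
    if scaler == "standard":
        center, spread = fmean(values), pstdev(values)   # mean / population std
    elif scaler == "minmax":
        center, spread = min(values), max(values) - min(values)
    elif scaler == "robust":
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        center, spread = median(values), q3 - q1         # median / IQR
    else:
        return list(values)                              # passthrough
    spread = spread or 1.0  # avoid dividing by zero for constant columns
    return [(v - center) / spread for v in values]


values = [1, 2, 3, 4, 100]  # hypothetical column with one outlier
standard = scale(values, "standard")
minmax = scale(values, "minmax")
robust = scale(values, "robust")
print(robust)
```

In the robust result the four typical values stay within a narrow range around zero and only the outlier is far away, which is the behavior the 'robust' recommendation relies on.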

Feature selection

remove_low_variance (default: False)
- What it does: Removes low-variance numeric features after preprocessing.
- Use case: Enable when many near-constant numeric features exist.
- Example: remove_low_variance=True
variance_thresh (default: 0.0)
- What it does: Variance cutoff used by low-variance filtering.
- Use case: Increase (e.g., 0.01) to also remove near-constant features that carry little information.
- Example: variance_thresh=0.01
remove_correlated (default: False)
- What it does: Drops highly correlated numeric features.
- Use case: Reduce multicollinearity and redundant columns.
- Example: remove_correlated=True
corr_threshold (default: 0.95)
- What it does: Absolute correlation threshold for dropping features.
- Use case: Use a value between 0.85 and 0.95; lower values prune features more aggressively.
- Example: corr_threshold=0.9
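
Correlation-based pruning can be sketched as a greedy pass over the columns. The code below is a minimal illustration and not the library's algorithm: the helper names and toy data are hypothetical, it assumes a column is dropped when its absolute Pearson correlation with an already kept column exceeds the threshold, and it assumes no column is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def drop_correlated(table, corr_threshold=0.95):
    """Keep a column only if it is not too correlated with any kept column."""
    kept = []
    for name, values in table.items():
        if all(abs(pearson(values, table[k])) <= corr_threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy table: the two height columns are nearly redundant.
table = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 37, 44, 39, 42],
}

kept = drop_correlated(table, corr_threshold=0.95)
print(kept)
```

Which member of a correlated pair survives depends on column order in this sketch; the library's own choice may differ.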

Outlier handling

outlier_method (default: None)
- What it does: Enables outlier detection.
- Allowed: None, 'iqr', 'zscore' (also 'z-score', 'z_score').
- Use case: Use 'iqr' for non-normal data; 'zscore' for roughly normal distributions.
- Example: outlier_method="iqr"
outlier_threshold (default: 1.5)
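- What it does: Threshold used by the selected outlier method. The default of 1.5 matches the usual IQR multiplier, while values around 3.0 are typical for z-score cutoffs.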
- Use case: Increase to keep more rows, decrease to be stricter.
- Example: outlier_threshold=3.0 (common for z-score)
cap_outliers (default: False)
- What it does: Caps outliers to bounds instead of dropping rows.
- Use case: Set True when you want to preserve dataset size.
- Example: cap_outliers=True
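
The difference between dropping and capping outliers can be shown with the classic IQR rule. This is a standalone sketch under stated assumptions, not AutoSweep's code: the helper names and data are hypothetical, and the quartiles use the inclusive (linear interpolation) convention.

```python
from statistics import quantiles


def iqr_bounds(values, outlier_threshold=1.5):
    """Lower/upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - outlier_threshold * spread, q3 + outlier_threshold * spread


def handle_outliers(values, outlier_threshold=1.5, cap_outliers=False):
    """Drop values outside the fences, or clip them to the fences."""
    low, high = iqr_bounds(values, outlier_threshold)
    if cap_outliers:
        return [min(max(v, low), high) for v in values]  # size preserved
    return [v for v in values if low <= v <= high]       # rows removed


values = [10, 12, 13, 15, 90]  # hypothetical column with one outlier

dropped = handle_outliers(values)                    # outlier row removed
capped = handle_outliers(values, cap_outliers=True)  # outlier clipped
print(dropped, capped)
```

Capping keeps all five entries and replaces 90 with the upper fence, while dropping leaves four entries.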

Datetime features

extract_datetime (default: False)
- What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
- Use case: Enable when date fields carry predictive signal.
- Example: extract_datetime=True
drop_datetime_original (default: False)
- What it does: Drops original datetime columns after extraction.
- Use case: Keep only engineered datetime parts to simplify model input.
- Example: drop_datetime_original=True
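
The year/month/day/weekday/hour parts can be derived from a timestamp as sketched below. This is only an illustration using the standard library: the datetime_parts helper is hypothetical, ISO-formatted input is assumed, weekday is assumed to be numbered with Monday as 0, and the names AutoSweep gives to the generated columns are not documented here.

```python
from datetime import datetime


def datetime_parts(text):
    """Split an ISO-formatted timestamp string into simple numeric parts."""
    moment = datetime.fromisoformat(text)
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "weekday": moment.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": moment.hour,
    }


parts = datetime_parts("2026-02-25 14:30:00")
print(parts)
```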

Target encoding and output format

target_encode (default: False)
- What it does: Applies mean target encoding to categorical features.
- Use case: Helpful for high-cardinality categorical variables.
- Important: Requires target_column; avoid leakage by fitting only on training data in production workflows.
- Example: target_encode=True
structured_output (default: True)
- What it does: Controls return format.
- If True: returns a dict with the keys 'X', 'y', 'feature_names', and 'info'.
- If False: returns a tuple, either (X, y, feature_names) or (X, feature_names).
- Use case: Keep True for debugging and pipeline introspection.
verbose (default: True)
- What it does: Prints detailed preprocessing diagnostics.
- Use case: Set False for cleaner logs in training pipelines.
- Example: verbose=False
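
Because the return shape depends on structured_output, downstream code may want one small adapter that accepts either form. The helper below is a hypothetical sketch, not part of the library; it assumes the two-item tuple is the form returned when no target is available, and it is exercised here with stand-in values rather than a real AutoSweep call.

```python
def unpack_result(result):
    """Normalize either return shape to (X, y, feature_names); y may be None."""
    if isinstance(result, dict):
        # structured_output=True: dict with 'X', 'y', 'feature_names', 'info'
        return result["X"], result.get("y"), result.get("feature_names")
    if len(result) == 3:
        # structured_output=False with a target: (X, y, feature_names)
        X, y, feature_names = result
        return X, y, feature_names
    # structured_output=False without a target (assumed): (X, feature_names)
    X, feature_names = result
    return X, None, feature_names


# Stand-in values shaped like the documented outputs.
as_dict = {"X": [[1.0]], "y": [0], "feature_names": ["a"], "info": {}}
as_tuple = ([[1.0]], [0], ["a"])
no_target = ([[1.0]], ["a"])

print(unpack_result(as_dict) == unpack_result(as_tuple))
```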

Notes

If you use Excel input, keep openpyxl installed.
If target_encode=True, provide a valid target_column.
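
The leakage warning for target_encode can be made concrete with a sketch of mean target encoding that is fitted on training rows only and then applied to unseen rows. This is an illustration, not the library's encoder: the helper names and data are hypothetical, and falling back to the global training mean for unseen categories is an assumed design choice.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Per-category mean of the target, plus the global mean as a fallback."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for category, target in zip(categories, targets):
        totals[category][0] += target
        totals[category][1] += 1
    means = {c: total / count for c, (total, count) in totals.items()}
    return means, sum(targets) / len(targets)


def apply_target_means(categories, means, fallback):
    """Encode categories with the fitted means; unseen ones get the fallback."""
    return [means.get(category, fallback) for category in categories]


# Fit on training data only, so test targets never influence the encoding.
train_city = ["paris", "paris", "rome", "rome", "rome"]
train_price = [10.0, 20.0, 30.0, 30.0, 60.0]
means, global_mean = fit_target_means(train_city, train_price)

test_city = ["rome", "oslo"]  # "oslo" never appeared in training
encoded = apply_target_means(test_city, means, global_mean)
print(encoded)
```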

Release history

- 0.1.2 (Feb 25, 2026)
- 0.1.1 (Feb 25, 2026), this version

Download files

Source Distribution

autosweep_preprocessing-0.1.1.tar.gz (12.1 kB)

Uploaded Feb 25, 2026 Source

Built Distribution

autosweep_preprocessing-0.1.1-py3-none-any.whl (10.3 kB)

Uploaded Feb 25, 2026 Python 3

File details

Details for the file autosweep_preprocessing-0.1.1.tar.gz.

File metadata

Download URL: autosweep_preprocessing-0.1.1.tar.gz
Upload date: Feb 25, 2026
Size: 12.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for autosweep_preprocessing-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`33cc5b4d8212e23e6eaffc570579c4ffee1477e4c92d2fd7b77ed87c60ce9564`
MD5	`cc7a92fb2b3c8efd24f68c6dc00cdf6a`
BLAKE2b-256	`8be7e08415ca671a1a0924c62bd050e4f0825b0b4cf61013da29d34618f241da`

File details

Details for the file autosweep_preprocessing-0.1.1-py3-none-any.whl.

File metadata

Download URL: autosweep_preprocessing-0.1.1-py3-none-any.whl
Upload date: Feb 25, 2026
Size: 10.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for autosweep_preprocessing-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`78da8300d81295ed63ff36b142d4771282c93acc0fc465a604c380e5433bcff1`
MD5	`2543d3516795dd40bd0f61f953a9bef9`
BLAKE2b-256	`421256f6b6a1fac248eea87989d2016fe81f6827b0c117a8721a3c1596b01050`
