Flexible tabular data preprocessing utility with a single AutoSweep API

Project description

autosweep-preprocessing

A lightweight preprocessing library built around a single flexible API: AutoSweep.

Usage

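from autosweep_preprocessing import AutoSweep

# Load data.csv, separate the target, one-hot encode categorical columns,
# and drop highly correlated numeric features.
result = AutoSweep(
    file_path="data.csv",
    target_column="target",
    encode_categorical="onehot",
    remove_correlated=True,
    structured_output=True,
)

# With structured_output=True the result is a dict.
X = result["X"]        # preprocessed features
y = result["y"]        # target column
info = result["info"]  # preprocessing diagnostics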

Features

AutoSweep supports:

  • CSV/Excel loading
  • Missing value handling and imputation
  • Numeric scaling (standard, minmax, robust)
  • Categorical encoding (onehot, ordinal, label)
  • Optional datetime feature extraction
  • Optional outlier handling (iqr, zscore)
  • Optional correlation and low-variance filtering
  • Structured output for pipeline diagnostics

AutoSweep Arguments Guide

Required / Core

  • file_path (required)

    • What it does: Path to input dataset (.csv or Excel file).
    • Use case: Point to your raw training file before preprocessing.
    • Example: file_path="data/train.csv"
  • target_column (default: None)

    • What it does: Separates target variable from features and returns it as y.
    • Use case: Set this when you want to train/evaluate models after preprocessing.
    • Example: target_column="price"

Column cleaning

  • drop_columns (default: None)

    • What it does: Drops specific columns by name.
    • Use case: Remove IDs, leakage columns, or metadata fields.
    • Example: drop_columns=["id", "created_at"]
  • drop_threshold (default: 1.0)

    • What it does: Drops columns whose missing-value fraction is greater than this threshold. At the default of 1.0 no column is dropped, because a fraction cannot exceed 1.
    • Use case: Use a value such as 0.4 or 0.5 to remove heavily incomplete columns.
    • Example: drop_threshold=0.5
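
To make the drop_threshold rule concrete, here is a minimal standard-library sketch of dropping columns by missing-value fraction. It is an illustration only, not AutoSweep's actual implementation; the helper names missing_fraction and drop_sparse_columns are hypothetical, and missing values are assumed to be represented as None.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
# Assumption: missing values are represented as None.

def missing_fraction(values):
    """Return the fraction of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_columns(table, drop_threshold=1.0):
    """Drop columns whose missing fraction is greater than drop_threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= drop_threshold
    }

table = {
    "age": [31, None, 45, 52],                 # 25% missing
    "income": [None, None, None, 40],          # 75% missing
    "city": ["Oslo", "Rome", "Lima", "Pune"],  # 0% missing
}

kept = drop_sparse_columns(table, drop_threshold=0.5)
print(list(kept))  # ['age', 'city']
```

With the default drop_threshold=1.0 the same call keeps all three columns.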

Missing values

  • impute_strategy_num (default: 'mean')

    • What it does: Numeric imputation strategy.
    • Allowed: 'mean', 'median', 'most_frequent', 'constant', 'knn', 'mode'.
    • Use case: Use 'median' for skewed numeric data, 'knn' for richer local patterns.
    • Example: impute_strategy_num="median"
  • impute_strategy_cat (default: 'most_frequent')

    • What it does: Categorical imputation strategy.
    • Allowed: any categorical strategy accepted by scikit-learn's SimpleImputer (commonly 'most_frequent', 'constant').
    • Use case: Use 'most_frequent' for stable categories.
    • Example: impute_strategy_cat="most_frequent"
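
The difference between the imputation strategies is easiest to see on a small skewed column. The sketch below uses only the standard library and is an assumption about what each strategy name means, not the library's code; the impute helper and its fill_value parameter are hypothetical, and 'knn' is left out because it needs a neighbor search.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=0):
    """Replace None entries using the chosen strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy in ("most_frequent", "mode"):
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value  # hypothetical parameter for this sketch
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

incomes = [30, 32, 35, None, 400]   # one extreme value skews the mean
print(impute(incomes, "mean"))      # [30, 32, 35, 124.25, 400]
print(impute(incomes, "median"))    # [30, 32, 35, 33.5, 400]

colors = ["red", None, "blue", "red"]
print(impute(colors, "most_frequent"))  # ['red', 'red', 'blue', 'red']
```

The median fill (33.5) stays close to the typical values, while the mean fill (124.25) is pulled up by the single large income, which is why 'median' suits skewed data.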

Scaling and encoding

  • scaler (default: None)

    • What it does: Scales numeric features.
    • Allowed: None, 'none', 'passthrough', 'standard', 'minmax', 'robust'.
    • Behavior: No scaling is applied unless you explicitly choose a scaler.
    • Use case: Use 'robust' when outliers are present.
    • Example: scaler="robust"
  • encode_categorical (default: None)

    • What it does: Encodes categorical columns.
    • Allowed: None, 'none', 'passthrough', 'onehot', 'ordinal', 'label'.
    • Use case: Use 'onehot' for linear/tree models; 'label' for compact numeric conversion.
    • Example: encode_categorical="onehot"
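
As a rough guide to what the scaler and encoder names usually denote, the following standard-library sketch applies the conventional formulas to toy columns. It is not AutoSweep's implementation: the scale, one_hot, and ordinal helpers are hypothetical, the population standard deviation and linearly interpolated quartiles are assumptions, and the generated column names are made up for illustration.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
from statistics import mean, median, pstdev, quantiles

def scale(values, scaler=None):
    """Scale a numeric column; None, 'none', 'passthrough' leave it unchanged."""
    if scaler in (None, "none", "passthrough"):
        return list(values)
    if scaler == "standard":      # (x - mean) / standard deviation
        center, spread = mean(values), pstdev(values)
    elif scaler == "minmax":      # (x - min) / (max - min)
        center, spread = min(values), max(values) - min(values)
    elif scaler == "robust":      # (x - median) / interquartile range
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        center, spread = median(values), q3 - q1
    else:
        raise ValueError(f"unknown scaler: {scaler!r}")
    return [(v - center) / spread for v in values]

def one_hot(name, values):
    """One 0/1 indicator column per category (column names are hypothetical)."""
    categories = sorted(set(values))
    return {f"{name}_{c}": [int(v == c) for v in values] for c in categories}

def ordinal(values):
    """Map each category to its position in sorted order."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

sizes = [1, 2, 3, 4, 100]           # 100 is an outlier
print(scale(sizes, "robust"))       # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(scale(sizes, "minmax")[:2])   # the outlier squeezes typical values toward 0

colors = ["red", "blue", "red"]
print(one_hot("color", colors))     # {'color_blue': [0, 1, 0], 'color_red': [1, 0, 1]}
print(ordinal(colors))              # [1, 0, 1]
```

Robust scaling keeps the four typical values spread over a unit-sized range even though 100 is present, which is the reason to prefer 'robust' when outliers are expected.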

Feature selection

  • remove_low_variance (default: False)

    • What it does: Removes low-variance numeric features after preprocessing.
    • Use case: Enable when many near-constant numeric features exist.
    • Example: remove_low_variance=True
  • variance_thresh (default: 0.0)

    • What it does: Variance cutoff used by low-variance filtering.
    • Use case: Increase (e.g., 0.01) to also remove near-constant features.
    • Example: variance_thresh=0.01
  • remove_correlated (default: False)

    • What it does: Drops highly correlated numeric features.
    • Use case: Reduce multicollinearity and redundant columns.
    • Example: remove_correlated=True
  • corr_threshold (default: 0.95)

    • What it does: Absolute correlation threshold for dropping features.
    • Use case: Use 0.85-0.95 depending on how aggressively you want feature pruning.
    • Example: corr_threshold=0.9
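
The two filters above can be sketched with the standard library as follows. This is one plausible reading, not the library's code: the select_features helper is hypothetical, the variance filter is assumed to run first, a column is assumed to be dropped only when its variance is not strictly above the cutoff, and of two highly correlated columns the sketch keeps whichever appears first. The source does not say which column AutoSweep itself keeps.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
from math import sqrt
from statistics import pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(table, variance_thresh=0.0, corr_threshold=0.95):
    # Step 1: keep columns whose variance is strictly above the cutoff.
    varied = {n: v for n, v in table.items() if pvariance(v) > variance_thresh}
    # Step 2: keep a column only if it is not highly correlated
    # (in absolute value) with any column kept so far.
    kept = []
    for name, values in varied.items():
        if all(abs(pearson(values, varied[k])) <= corr_threshold for k in kept):
            kept.append(name)
    return {name: varied[name] for name in kept}

table = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same signal as height_cm
    "shoe_size": [38, 44, 39, 45, 40],
    "constant": [1, 1, 1, 1, 1],                  # zero variance
}

print(list(select_features(table)))  # ['height_cm', 'shoe_size']
```

Running the variance step first also avoids a division by zero in pearson, since a constant column has zero standard deviation.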

Outlier handling

  • outlier_method (default: None)

    • What it does: Enables outlier detection.
    • Allowed: None, 'iqr', 'zscore' (also 'z-score', 'z_score').
    • Use case: Use 'iqr' for non-normal data; 'zscore' for roughly normal distributions.
    • Example: outlier_method="iqr"
  • outlier_threshold (default: 1.5)

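    • What it does: Threshold used by the outlier method. The default of 1.5 matches the conventional IQR multiplier; a z-score cutoff is typically larger.
    • Use case: Increase to flag fewer values as outliers (keeping more rows), decrease to be stricter.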
    • Example: outlier_threshold=3.0 (common for z-score)
  • cap_outliers (default: False)

    • What it does: Caps outliers to bounds instead of dropping rows.
    • Use case: Set True when you want to preserve dataset size.
    • Example: cap_outliers=True
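
The interaction between outlier_threshold and cap_outliers can be sketched for the IQR method as below. This is a standard-library illustration under stated assumptions, not AutoSweep's code: the helper names are hypothetical, the threshold is assumed to act as a multiplier on the interquartile range, and quartiles are computed with linear interpolation.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
from statistics import quantiles

def iqr_bounds(values, outlier_threshold=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - outlier_threshold * spread, q3 + outlier_threshold * spread

def handle_outliers(values, outlier_threshold=1.5, cap_outliers=False):
    """Drop values outside the fences, or clip them to the fences."""
    low, high = iqr_bounds(values, outlier_threshold)
    if cap_outliers:
        return [min(max(v, low), high) for v in values]
    return [v for v in values if low <= v <= high]

readings = [10, 12, 11, 13, 12, 95]

print(iqr_bounds(readings))                          # (9.0, 15.0)
print(handle_outliers(readings))                     # [10, 12, 11, 13, 12]
print(handle_outliers(readings, cap_outliers=True))  # [10, 12, 11, 13, 12, 15.0]
```

Dropping shortens the column from six values to five, while capping keeps all six and replaces 95 with the upper fence.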

Datetime features

  • extract_datetime (default: False)

    • What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
    • Use case: Enable when date fields carry predictive signal.
    • Example: extract_datetime=True
  • drop_datetime_original (default: False)

    • What it does: Drops original datetime columns after extraction.
    • Use case: Keep only engineered datetime parts to simplify model input.
    • Example: drop_datetime_original=True
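
For orientation, this is roughly what extracting year/month/day/weekday/hour from a datetime column looks like with the standard library. It is a sketch, not AutoSweep's implementation: the extract_datetime_parts helper and the generated column names are hypothetical, input strings are assumed to be ISO formatted, and weekday numbering (Monday = 0) follows Python's datetime convention, which may differ from the library's.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
from datetime import datetime

def extract_datetime_parts(column_name, values, drop_original=False):
    """Split ISO-formatted datetime strings into numeric feature columns."""
    parsed = [datetime.fromisoformat(v) for v in values]
    features = {
        f"{column_name}_year": [d.year for d in parsed],
        f"{column_name}_month": [d.month for d in parsed],
        f"{column_name}_day": [d.day for d in parsed],
        f"{column_name}_weekday": [d.weekday() for d in parsed],  # Monday = 0
        f"{column_name}_hour": [d.hour for d in parsed],
    }
    if not drop_original:
        # Keep the raw column alongside the engineered parts.
        features = {column_name: list(values), **features}
    return features

created = ["2024-03-15 08:30:00", "2023-12-31 23:59:00"]

parts = extract_datetime_parts("created", created, drop_original=True)
print(parts["created_year"])     # [2024, 2023]
print(parts["created_weekday"])  # [4, 6]  (Friday, Sunday)
print(parts["created_hour"])     # [8, 23]
```

With drop_original=False the returned mapping also contains the untouched "created" column.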

Target encoding and output format

  • target_encode (default: False)

    • What it does: Applies mean target encoding to categorical features.
    • Use case: Helpful for high-cardinality categorical variables.
    • Important: Requires target_column; avoid leakage by fitting only on training data in production workflows.
    • Example: target_encode=True
  • structured_output (default: True)

    • What it does: Controls return format.
    • If True: returns { 'X', 'y', 'feature_names', 'info' }.
    • If False: returns a tuple, (X, y, feature_names) when target_column is set, otherwise (X, feature_names).
    • Use case: Keep True for debugging and pipeline introspection.
  • verbose (default: True)

    • What it does: Prints detailed preprocessing diagnostics.
    • Use case: Set False for cleaner logs in training pipelines.
    • Example: verbose=False
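
The leakage warning for target_encode is easier to follow with a small example. The sketch below shows mean target encoding fitted on training rows only and then applied to unseen rows. It is not AutoSweep's implementation: the helper names are hypothetical, and falling back to the global training mean for unseen categories is an assumption made for this illustration.

```python
# Illustrative sketch only; not AutoSweep's actual implementation.
from collections import defaultdict

def fit_target_means(categories, targets):
    """Learn the mean target per category from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def apply_target_means(categories, means, global_mean):
    """Encode categories; unseen ones fall back to the global mean (assumption)."""
    return [means.get(c, global_mean) for c in categories]

train_city = ["Oslo", "Oslo", "Rome", "Rome", "Lima"]
train_price = [100, 120, 200, 220, 150]
test_city = ["Rome", "Pune"]  # "Pune" never appears in training

means, global_mean = fit_target_means(train_city, train_price)
print(means)        # {'Oslo': 110.0, 'Rome': 210.0, 'Lima': 150.0}
print(apply_target_means(test_city, means, global_mean))  # [210.0, 158.0]
```

Because the means come only from the training rows, the test targets never influence the encoded features, which is the separation the leakage warning asks for.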

Notes

  • If you use Excel input, keep openpyxl installed.
  • If target_encode=True, provide a valid target_column.

Download files

Source Distribution

autosweep_preprocessing-0.1.2.tar.gz (12.1 kB)

Uploaded Source

Built Distribution

autosweep_preprocessing-0.1.2-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file autosweep_preprocessing-0.1.2.tar.gz.

File metadata

  • Download URL: autosweep_preprocessing-0.1.2.tar.gz
  • Size: 12.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for autosweep_preprocessing-0.1.2.tar.gz
Algorithm Hash digest
SHA256 72f72de2a06b4876ae6486bb8c6f626096871008d7c9e193954acebd8ae0702b
MD5 eeaf35000619f4007c2ea3a2558fe98a
BLAKE2b-256 2da05baf11d4f1f0d20a4e9b398b531fb49811299e47ec1879915cb1e518529b

File details

Details for the file autosweep_preprocessing-0.1.2-py3-none-any.whl.

File hashes

Hashes for autosweep_preprocessing-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 508cdc86b278fad47e6f9e4ba6fb21de89ec41a9c261a920b056a47131d13886
MD5 314d73d1918d0f4922933005ca495324
BLAKE2b-256 557586803bccd5c6919c21a4b8331902957338fdc16613391030cecd55adaeb9
