Flexible tabular data preprocessing utility with a single AutoSweep API

Project description

autosweep-preprocessing

A lightweight preprocessing library built around a single flexible API: AutoSweep.

Usage

from autosweep_preprocessing import AutoSweep

result = AutoSweep(
    file_path="data.csv",
    target_column="target",
    encode_categorical="onehot",
    remove_correlated=True,
    structured_output=True,
)

X = result["X"]
y = result["y"]
info = result["info"]

Function

AutoSweep supports:

  • CSV/Excel loading
  • Missing value handling and imputation
  • Numeric scaling (standard, minmax, robust)
  • Categorical encoding (onehot, ordinal, label)
  • Optional datetime feature extraction
  • Optional outlier handling (iqr, zscore)
  • Optional correlation and low-variance filtering
  • Structured output for pipeline diagnostics

AutoSweep Arguments Guide

Required / Core

  • file_path (required)

    • What it does: Path to input dataset (.csv or Excel file).
    • Use case: Point to your raw training file before preprocessing.
    • Example: file_path="data/train.csv"
  • target_column (default: None)

    • What it does: Separates target variable from features and returns it as y.
    • Use case: Set this when you want to train/evaluate models after preprocessing.
    • Example: target_column="price"

Column cleaning

  • drop_columns (default: None)

    • What it does: Drops specific columns by name.
    • Use case: Remove IDs, leakage columns, or metadata fields.
    • Example: drop_columns=["id", "created_at"]
  • drop_threshold (default: 1.0)

    • What it does: Drops columns whose missing-value fraction is greater than this threshold. With the default of 1.0 nothing is dropped, because a fraction cannot exceed 1.
    • Use case: Use a value such as 0.4 or 0.5 to remove heavily incomplete columns.
    • Example: drop_threshold=0.5
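
The drop_threshold rule can be sketched in plain Python. This is only an illustration of the rule as described above, not AutoSweep's own code; the helper name drop_sparse_columns and the use of None to mark a missing value are assumptions made for the example.

```python
def drop_sparse_columns(columns, drop_threshold=1.0):
    """Keep columns whose missing fraction does not exceed drop_threshold.

    `columns` maps a column name to a list of values; None marks a
    missing entry (an assumption made for this sketch).
    """
    kept = {}
    for name, values in columns.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        # Drop only when the fraction is strictly greater than the threshold.
        if missing_fraction <= drop_threshold:
            kept[name] = values
    return kept


data = {
    "age": [31, None, 45, 52],          # 25% missing
    "notes": [None, None, None, "ok"],  # 75% missing
}

kept_default = drop_sparse_columns(data)      # threshold 1.0 keeps everything
kept_strict = drop_sparse_columns(data, 0.5)  # "notes" exceeds 0.5 and is dropped
```

With the default threshold both columns survive; at 0.5 only "age" remains.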

Missing values

  • impute_strategy_num (default: 'mean')

    • What it does: Numeric imputation strategy.
    • Allowed: 'mean', 'median', 'most_frequent', 'constant', 'knn', 'mode'.
    • Use case: Use 'median' for skewed numeric data, 'knn' for richer local patterns.
    • Example: impute_strategy_num="median"
  • impute_strategy_cat (default: 'most_frequent')

    • What it does: Categorical imputation strategy.
    • Allowed: any strategy that scikit-learn's SimpleImputer accepts for categorical data (commonly 'most_frequent', 'constant').
    • Use case: Use 'most_frequent' for stable categories.
    • Example: impute_strategy_cat="most_frequent"
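
To see why 'median' is suggested for skewed numeric data, the following minimal sketch fills one missing value with each strategy. It uses only the standard library; the impute helper is hypothetical and does not reflect how AutoSweep implements imputation internally.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median}[strategy](observed)
    return [fill if v is None else v for v in values]


# One extreme value (400) skews the distribution.
incomes = [30, 32, 35, None, 400]

mean_filled = impute(incomes, "mean")      # fill value is pulled up by 400
median_filled = impute(incomes, "median")  # fill value stays near the bulk
```

Here the mean-based fill is 124.25, while the median-based fill is 33.5, which is much closer to the typical observed value.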

Scaling and encoding

  • scaler (default: 'standard')

    • What it does: Scales numeric features.
    • Allowed: 'standard', 'minmax', 'robust', or any other value for passthrough.
    • Use case: Use 'robust' when outliers are present.
    • Example: scaler="robust"
  • encode_categorical (default: None)

    • What it does: Encodes categorical columns.
    • Allowed: None, 'none', 'passthrough', 'onehot', 'ordinal', 'label'.
    • Use case: Use 'onehot' for linear/tree models; 'label' for compact numeric conversion.
    • Example: encode_categorical="onehot"
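
The three scaler options correspond to well-known formulas, sketched below with the standard library. The function names are hypothetical, and the assumption that AutoSweep follows the usual definitions (population standard deviation for 'standard', median and interquartile range for 'robust') is not confirmed by this page.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(xs):
    """(x - mean) / population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def minmax_scale(xs):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """(x - median) / interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


values = [1, 2, 3, 4, 100]  # 100 is an outlier

standard = standard_scale(values)
minmax = minmax_scale(values)
robust = robust_scale(values)
```

With the outlier present, min-max scaling squeezes the first four values into a narrow band near 0, while robust scaling keeps them evenly spread around 0. That is the motivation for scaler="robust" when outliers are present.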

Feature selection

  • remove_low_variance (default: False)

    • What it does: Removes low-variance numeric features after preprocessing.
    • Use case: Enable when many near-constant numeric features exist.
    • Example: remove_low_variance=True
  • variance_thresh (default: 0.0)

    • What it does: Variance cutoff used by low-variance filtering.
    • Use case: Increase (e.g., 0.01) to remove weak/noisy features.
    • Example: variance_thresh=0.01
  • remove_correlated (default: False)

    • What it does: Drops highly correlated numeric features.
    • Use case: Reduce multicollinearity and redundant columns.
    • Example: remove_correlated=True
  • corr_threshold (default: 0.95)

    • What it does: Absolute correlation threshold for dropping features.
    • Use case: Use 0.85-0.95 depending on how aggressively you want feature pruning.
    • Example: corr_threshold=0.9

Outlier handling

  • outlier_method (default: None)

    • What it does: Enables outlier detection.
    • Allowed: None, 'iqr', 'zscore' (also 'z-score', 'z_score').
    • Use case: Use 'iqr' for non-normal data; 'zscore' for roughly normal distributions.
    • Example: outlier_method="iqr"
  • outlier_threshold (default: 1.5)

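    • What it does: Sets the cutoff used by the selected outlier_method.
    • Use case: Increase to flag fewer values as outliers (more rows kept); decrease to be stricter.
    • Example: outlier_threshold=3.0 (a common cutoff for z-score; 1.5 is the conventional multiplier for IQR)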
  • cap_outliers (default: False)

    • What it does: Caps outliers to bounds instead of dropping rows.
    • Use case: Set True when you want to preserve dataset size.
    • Example: cap_outliers=True
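
The difference between dropping and capping can be shown with the conventional IQR rule, where the bounds are Q1 - k * IQR and Q3 + k * IQR and k plays the role of outlier_threshold. The helper names below are hypothetical, and the exact quantile method AutoSweep uses is not stated here, so treat this as a sketch of the general technique.

```python
from statistics import quantiles


def iqr_bounds(xs, outlier_threshold=1.5):
    """Return (lower, upper) bounds from the conventional IQR rule."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - outlier_threshold * spread, q3 + outlier_threshold * spread


def handle_outliers(xs, outlier_threshold=1.5, cap_outliers=False):
    """Drop values outside the bounds, or clip them when cap_outliers is True."""
    lo, hi = iqr_bounds(xs, outlier_threshold)
    if cap_outliers:
        return [min(max(x, lo), hi) for x in xs]
    return [x for x in xs if lo <= x <= hi]


values = [1, 2, 3, 4, 100]

bounds = iqr_bounds(values)                          # (-1.0, 7.0)
dropped = handle_outliers(values)                    # the value 100 is removed
capped = handle_outliers(values, cap_outliers=True)  # 100 is clipped to 7.0
```

Dropping shortens the data to four values, while capping keeps all five and replaces the outlier with the upper bound.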

Datetime features

  • extract_datetime (default: False)

    • What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
    • Use case: Enable when date fields carry predictive signal.
    • Example: extract_datetime=True
  • drop_datetime_original (default: False)

    • What it does: Drops original datetime columns after extraction.
    • Use case: Keep only engineered datetime parts to simplify model input.
    • Example: drop_datetime_original=True
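
As an illustration of the datetime options, the sketch below derives year, month, day, weekday, and hour from ISO-formatted strings. The helper and the "<column>_<part>" naming scheme are assumptions for this example; the names AutoSweep gives to the generated columns are not documented on this page.

```python
from datetime import datetime


def extract_datetime_parts(name, values, drop_original=False):
    """Expand a column of ISO datetime strings into numeric part columns."""
    parsed = [datetime.fromisoformat(v) for v in values]
    parts = {
        f"{name}_year": [d.year for d in parsed],
        f"{name}_month": [d.month for d in parsed],
        f"{name}_day": [d.day for d in parsed],
        f"{name}_weekday": [d.weekday() for d in parsed],  # Monday = 0
        f"{name}_hour": [d.hour for d in parsed],
    }
    if not drop_original:
        # Keep the raw column alongside the engineered parts.
        parts = {name: values, **parts}
    return parts


stamps = ["2024-03-15 09:30:00", "2024-12-31 23:00:00"]

with_original = extract_datetime_parts("created_at", stamps)
parts_only = extract_datetime_parts("created_at", stamps, drop_original=True)
```

Keeping the original column is mostly useful for inspection; most models cannot consume raw datetime strings, which is the usual reason to set drop_datetime_original=True.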

Target encoding and output format

  • target_encode (default: False)

    • What it does: Applies mean target encoding to categorical features.
    • Use case: Helpful for high-cardinality categorical variables.
    • Important: Requires target_column; avoid leakage by fitting only on training data in production workflows.
    • Example: target_encode=True
  • structured_output (default: True)

    • What it does: Controls return format.
    • If True: returns a dict with the keys 'X', 'y', 'feature_names', and 'info'.
    • If False: returns a tuple: (X, y, feature_names), or (X, feature_names) when there is no y to return.
    • Use case: Keep True for debugging and pipeline introspection.
  • verbose (default: True)

    • What it does: Prints detailed preprocessing diagnostics.
    • Use case: Set False for cleaner logs in training pipelines.
    • Example: verbose=False
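
The leakage warning under target_encode can be made concrete with a small sketch of mean target encoding that is fitted on training rows only and then applied to unseen rows. The function names and the fallback to the global training mean for unseen categories are assumptions for illustration; they are not taken from AutoSweep.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category from training data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the training mean."""
    return [mapping.get(c, global_mean) for c in categories]


train_city = ["paris", "paris", "rome", "rome", "oslo"]
train_price = [10.0, 20.0, 30.0, 50.0, 40.0]

mapping, fallback = fit_target_encoding(train_city, train_price)

# Test rows never contribute to the mapping, so their targets cannot leak.
test_encoded = apply_target_encoding(["rome", "lima"], mapping, fallback)
```

Because the mapping is computed from the training targets alone, the encoded test features contain no information about the test targets.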

Notes

  • If you use Excel input, keep openpyxl installed.
  • If target_encode=True, provide a valid target_column.
