A Python package for cleaning and preprocessing data in pandas DataFrames

DataScrub (v2.0)

DataScrub is a Python package for data cleaning, feature engineering, and memory-optimized preprocessing of pandas DataFrames. It covers tasks ranging from basic string formatting to out-of-core processing of files larger than RAM, and it integrates with scikit-learn pipelines.

Key Features

  1. Massive Data Optimization:
    • PyArrow Backend: Opt-in PyArrow-backed storage (use_pyarrow=True), reported to accelerate string processing by 3x-4x.
    • Out-of-Core Chunking: Process files larger than RAM. DataClean('10GB_file.csv', chunksize=100000) creates a generator pipeline that cleans the file chunk by chunk and writes the results to disk, so the whole file never has to fit in memory.
    • Memory Downcasting: Detects float64/int64 columns whose values fit in a smaller dtype and shrinks them (e.g., to float32/int8) via .downcast().
  2. Scikit-Learn Integration: DataClean extends BaseEstimator and TransformerMixin and supports fit_transform(), so you can place it directly inside an sklearn.pipeline.Pipeline.
  3. Advanced Feature Engineering:
    • Regex-based noise stripping (HTML, URLs, Punctuation).
    • Time-series temporal unpacking (YYYY-MM-DD to Year, Month, Day, Is_Weekend features).
    • Categorical encodings for machine learning (one-hot, label, and target encoding).
  4. Data Profiling: Generate sparsity, variance, skewness, and memory-usage reports using .summary().

Installation

DataScrub supports Python 3.10+ and works with pandas 2.0+ and scikit-learn.

pip install datascrub

Basic Usage

To use DataScrub in your Python projects, import the package and create an instance of the DataClean class:

from datascrub.cleaner import DataClean
import pandas as pd

# Load an in-memory DataFrame (or pass the file path as a string to use chunking)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True)  # Opt in to the PyArrow backend

# Run a pipeline combining imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all', 
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'}, 
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)

# Print the data profile
cleaner.summary()

Scikit-Learn Pipeline Usage

from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier

# Configure DataClean with keyword arguments
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])

# Bind it directly into an ML pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])

# Fit on your training data
# pipe.fit(X_train, y_train)
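
For readers unfamiliar with how a class plugs into a Pipeline, the following is a minimal sketch of the general scikit-learn transformer pattern. MedianFiller is a hypothetical class written for illustration only; it is not part of DataScrub and does not reflect how DataClean is implemented internally.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MedianFiller(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: learns column medians in fit(), fills NaNs in transform()."""

    def __init__(self, columns=None):
        # scikit-learn expects __init__ to only store its arguments.
        self.columns = columns

    def fit(self, X, y=None):
        cols = self.columns if self.columns is not None else list(X.columns)
        # Learned state carries a trailing underscore by scikit-learn convention.
        self.medians_ = X[cols].median()
        return self

    def transform(self, X):
        # fillna with a Series fills each column using the matching label.
        return X.fillna(self.medians_)


X = pd.DataFrame({"age": [20.0, None, 40.0, 60.0], "income": [1.0, 2.0, None, 4.0]})
y = [0, 0, 1, 1]

pipe = Pipeline([("filler", MedianFiller()), ("model", LogisticRegression())])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Because the medians are learned in fit() and only applied in transform(), statistics from the training data are reused on new data rather than recomputed.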

Comprehensive API Guide

Below is a reference to the capabilities you can invoke through the prep() orchestrator or by passing keyword arguments to DataClean(obj, **kwargs).

Memory & File Handling Configurations

When initializing DataClean(), the following I/O options are available:

  • obj (pd.DataFrame, str): An in-memory pandas DataFrame, or a string path to a local .csv or .xlsx file. A file path is required to use chunking.
  • use_pyarrow (bool): Set to True to switch to the PyArrow backend, which speeds up string operations and lowers memory use. Defaults to False.
  • chunksize (int): If provided, DataScrub acts as an out-of-core generator: it processes the target .csv file in batches of chunksize rows and writes the cleaned output to a single *_cleaned.csv file on disk. Chunking is enabled automatically if the dataset is larger than 100 MB.
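
The chunked processing described above can be sketched with plain pandas. This is an illustration of the general out-of-core technique, not DataScrub's implementation; clean_csv_in_chunks, its parameters, and the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd


def clean_csv_in_chunks(source, destination, text_columns, chunksize=100_000):
    """Read `source` in batches, clean each batch, and append it to `destination`."""
    rows_written = 0
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Example cleaning step: trim whitespace and lowercase the text columns.
        for col in text_columns:
            chunk[col] = chunk[col].str.strip().str.lower()
        # Create the file (with header) for the first chunk, then append.
        chunk.to_csv(destination, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)
        rows_written += len(chunk)
    return rows_written


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.csv")
    dst = os.path.join(tmp, "data_cleaned.csv")
    pd.DataFrame({"city": [" Paris", "ROME ", " Oslo "],
                  "age": [31, 45, 27]}).to_csv(src, index=False)

    # A tiny chunksize is used here only to force several batches.
    total = clean_csv_in_chunks(src, dst, text_columns=["city"], chunksize=2)
    result = pd.read_csv(dst)
```

Only one chunk is held in memory at a time, which is what allows the input file to exceed available RAM.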

Telemetry Profiling

  • .summary(): Prints a DataFrame profile showing memory usage (in MB), dataset shape, sparsity (percentage of NaNs), variance, and skewness across all columns.
  • .downcast(): Scans the DataFrame and downcasts float64 / int64 columns to the smallest dtype that can hold their values (e.g., int8, float32).
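
As a rough illustration of what downcasting does, the sketch below uses pandas' own pd.to_numeric(..., downcast=...). The helper downcast_numeric is hypothetical and is not the .downcast() method itself.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with integer and float columns shrunk to smaller dtypes where possible."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({
    "count": np.array([1, 50, 120], dtype="int64"),
    "ratio": np.array([0.5, 1.25, 3.0], dtype="float64"),
})
small = downcast_numeric(df)

print(small.dtypes)
print(df.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

Note that moving from float64 to float32 reduces precision, so this trade-off is worth checking for columns where small differences matter.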

prep() Parameters

The prep() method controls the execution order of the steps in your data engineering pipeline. It accepts the following arguments:

Standard Cleaning

  • clean (str, list): Defaults to 'all'. Cleans string columns by stripping leading/trailing whitespace, converting text to lowercase, and demojizing emojis.

Imputation & Missing Values

  • missing_values (dict): Provide a dictionary where the key is the column name and the value is the imputation strategy.
    • Example Options: 'drop', 'fill with mean', 'fill with median', 'fill with mode', 'fill with backward fill along columns/rows'.
    • Machine Learning Strategies: Provide 'knn-imputer' (5 nearest neighbors, via scikit-learn) or 'iterative-imputer' (MICE, via scikit-learn) for multivariate, feature-aware imputation.
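
To make the difference between the simple and the multivariate strategies concrete, here is a small sketch that uses pandas and scikit-learn directly rather than DataScrub. The column names mirror the usage example above; n_neighbors=2 is chosen only because the sample has four rows (the README states that DataScrub uses 5).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 45.0],
    "Salary": [40000.0, 52000.0, np.nan, 80000.0],
})

# Univariate: every missing Age receives the same value, the column median.
median_filled = df["Age"].fillna(df["Age"].median())

# Multivariate: each missing value is estimated from the rows most similar
# to it, measured on the features that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)
```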

Analytics Extrapolations

  • parse_date (list): A list of columns to convert to datetime64[ns], formatted as YYYY-MM-DD.
  • extract_datetime (list): Splits a date-string column into four integer features: [Col]_Year, [Col]_Month, [Col]_Day, and [Col]_Is_Weekend.
  • explode (dict): Splits string values on a delimiter and expands them into separate rows. Example: {'Tags': ','} splits 'A,B' into row 1 'A' and row 2 'B'.
  • translate_column_names (dict): Uses the googletrans API to asynchronously translate string records to English. Map a column to True to overwrite the existing column, or to False to generate a separate [Col]_translated feature.
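
The date extraction and explode behaviors described above correspond roughly to the following plain-pandas operations. This is a sketch for orientation; the Signup and Tags columns are made up, and DataScrub's exact output may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Signup": ["2024-01-06", "2024-03-11"],  # a Saturday and a Monday
    "Tags": ["A,B", "C"],
})

# Unpack the date string into integer features.
dates = pd.to_datetime(df["Signup"], format="%Y-%m-%d")
df["Signup_Year"] = dates.dt.year
df["Signup_Month"] = dates.dt.month
df["Signup_Day"] = dates.dt.day
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
df["Signup_Is_Weekend"] = (dates.dt.dayofweek >= 5).astype(int)

# Split on the delimiter, then give each element its own row.
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
      .explode("Tags")
      .reset_index(drop=True)
)
```

After exploding, the other columns are repeated for each tag, so the first input row appears twice.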

Feature Engineering & Mitigations

  • noise_columns (list): A list of columns from which to strip HTML tags (e.g., <p>foo</p>), URLs, and punctuation using regular expressions.
  • outliers_method (str): Caps (clips) outliers at the computed bounds instead of dropping rows, so the row count is preserved. Options are 'iqr' and 'z-score'.
  • outlier_columns (list): Columns to check for outliers. Defaults to 'all' numeric columns.
  • perform_scaling_normalization_bool (bool): Applies a Box-Cox transformation to numeric columns. Note: Box-Cox requires strictly positive input, so non-positive values are shifted (to min+1) before the transform.
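
A minimal sketch of noise stripping and IQR clipping follows, assuming the conventional 1.5 x IQR fences. The regular expressions and the helpers strip_noise and clip_iqr are illustrative assumptions, not DataScrub's actual patterns or functions.

```python
import re

import pandas as pd

TAG_RE = re.compile(r"<[^>]+>")       # HTML tags such as <p> or </p>
URL_RE = re.compile(r"https?://\S+")  # http(s) URLs up to the next whitespace
PUNCT_RE = re.compile(r"[^\w\s]")     # anything that is not a word char or space


def strip_noise(text):
    """Remove tags, then URLs, then punctuation, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub("", text)
    return " ".join(text.split())


def clip_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] without dropping any rows."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


cleaned = strip_noise("<p>Hello, world!</p> see https://example.com/page now")
clipped = clip_iqr(pd.Series([10, 12, 11, 13, 12, 100]))
```

URLs are removed before punctuation so that the pattern can still recognize the "://" in them. Clipping keeps all six rows and only pulls the extreme value back to the upper fence.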

Machine Learning Encoding

  • encoding_method (str): Converts string/categorical columns into ML-compatible features using scikit-learn. Options:
    • 'one-hot': Expands each column into dummy (indicator) columns, with drop_first=True enabled.
    • 'label': Assigns an integer ID to each unique text value.
    • 'target': Target encoding, which replaces each category with a value derived from the target column.
  • encoding_columns (list): Columns to encode. Defaults to 'all'.
  • target_col (str): The target column. Required only when using target encoding.
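
The three encodings can be compared side by side on a toy column. The sketch below uses plain pandas for brevity, whereas the README states that DataScrub relies on scikit-learn, so treat it as an illustration of the concepts rather than of the package's output. The target encoding shown is the simplest mean-per-category variant, which is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Oslo", "Paris", "Rome", "Paris"],
    "Bought": [1, 0, 1, 1],
})

# One-hot: one indicator column per category; drop_first=True drops the
# first category ("Oslo") to avoid a redundant column.
one_hot = pd.get_dummies(df, columns=["City"], drop_first=True)

# Label: an integer code per unique value (categories are sorted first).
labels = df["City"].astype("category").cat.codes

# Target: replace each category with the mean of the target for that category.
target_means = df.groupby("City")["Bought"].mean()
target_encoded = df["City"].map(target_means)
```

Computing target means on the same rows that are later used for training leaks label information, so in practice the means are usually estimated on training folds only.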

Contributing

Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for more information.
