A Python package for cleaning and preprocessing data in pandas DataFrames

DataScrub (v2.0)

DataScrub is a data cleaning and feature engineering toolkit for pandas DataFrames. Designed for speed and memory efficiency, it handles tasks ranging from basic string formatting to out-of-core pipelines for files larger than RAM, and it integrates natively with scikit-learn.

Key Features

  1. Massive Data Optimization:
    • PyArrow Backend: Enable the C++ PyArrow backend (use_pyarrow=True) to process string columns significantly faster.
    • Out-of-Core Processing: Process files larger than your RAM. Passing a file path with chunksize=100000 creates a generator that cleans data in batches and writes directly to disk.
    • Memory Downcasting: Use .downcast() to automatically shrink float64/int64 columns to float32/int8 where possible. A float32 column uses half the memory of float64, and int8 uses one-eighth the memory of int64.
  2. Scikit-Learn Integration: DataClean extends BaseEstimator and TransformerMixin, so it acts as a standard transformer. You can drop it directly into a sklearn.pipeline.Pipeline.
  3. Advanced Feature Engineering:
    • Regex-based text cleaning (removes HTML, URLs, and punctuation).
    • Date feature extraction (splits YYYY-MM-DD into Year, Month, Day, and Is_Weekend columns).
    • Built-in categorical encodings (One-Hot, Label, and Target encoding).
  4. Data Profiling: Run .summary() to quickly print out sparsity, memory usage, variance, and skewness for all columns.

Installation

DataScrub natively supports Python 3.10+ and integrates seamlessly with pandas 2.0+ and scikit-learn.

pip install datascrub

Basic Usage

To use DataScrub in your Python projects, import the package and create an instance of the DataClean class:

from datascrub.cleaner import DataClean
import pandas as pd

# Load standard in-memory DataFrame (or pass the string path directly for chunking!)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True) # Opt-in for C++ speeds

# Run a single pipeline that combines imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all', 
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'}, 
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)

# View telemetry
cleaner.summary()

Scikit-Learn Pipeline Usage

from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier

# Set up the cleaner with your chosen parameters
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])

# Include it in your pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])

# Fit and predict as usual
# pipe.fit(X_train, y_train)
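
To illustrate what "acts as a standard transformer" means, here is a minimal sketch of a class that follows the same BaseEstimator/TransformerMixin contract. It is not DataScrub's code: LowercaseStrings and its columns parameter are hypothetical names used only to show the fit/transform interface that a Pipeline expects.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseStrings(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a cleaning transformer (not part of DataScrub)."""

    def __init__(self, columns=None):
        # scikit-learn convention: store constructor arguments unchanged.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn for this stateless step.
        return self

    def transform(self, X):
        out = X.copy()
        cols = self.columns
        if cols is None:
            # Fall back to every string-like column.
            cols = [c for c in out.columns if pd.api.types.is_string_dtype(out[c])]
        for col in cols:
            out[col] = out[col].str.strip().str.lower()
        return out


pipe = Pipeline([("lower", LowercaseStrings(columns=["Status"]))])
result = pipe.fit_transform(pd.DataFrame({"Status": [" Active", "CLOSED "]}))
```

Because the class stores its constructor arguments as attributes and returns self from fit, scikit-learn utilities such as Pipeline can clone and chain it like any built-in transformer.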

API & Configuration Guide

Below is a quick reference for the arguments you can pass to the DataClean constructor and the prep() method.

Initialization & Memory Handlers

When creating a DataClean instance, you have access to the following I/O and memory arguments:

  • obj (pd.DataFrame, str): Provide a Pandas DataFrame in memory, or a file path (string) pointing to a .csv or .xlsx file. To use out-of-core chunking, you must pass a file path instead of a DataFrame.
  • use_pyarrow (bool): Set to True to convert the underlying datatypes to PyArrow. This greatly speeds up string operations and reduces the memory footprint. Defaults to False.
  • chunksize (int): Pass an integer to process large files out-of-core. DataScrub reads the file in chunks of this size, processes each chunk independently, and appends the results to a *_cleaned.csv file on disk. Chunking is enabled automatically when the file is larger than 100 MB.
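
The chunked read, clean, and append cycle described for chunksize can be sketched with plain pandas. This is an illustration of the general technique under stated assumptions, not DataScrub's implementation: clean_chunk and clean_csv_in_chunks are hypothetical helpers, and the per-chunk step here only trims and lowercases strings.

```python
import os
import tempfile

import pandas as pd


def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-chunk step: trim and lowercase string columns."""
    out = chunk.copy()
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip().str.lower()
    return out


def clean_csv_in_chunks(src: str, dst: str, chunksize: int) -> int:
    """Read src in batches, clean each batch, and append it to dst."""
    rows = 0
    for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
        cleaned = clean_chunk(chunk)
        # First chunk creates the file with a header; later chunks append rows only.
        cleaned.to_csv(dst, mode="w" if i == 0 else "a", header=(i == 0), index=False)
        rows += len(cleaned)
    return rows


# Small demonstration file written to a temporary directory.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "data.csv")
dst_path = os.path.join(tmpdir, "data_cleaned.csv")
pd.DataFrame({"City": [" Paris ", "ROME", " Oslo"], "Age": [31, 45, 28]}).to_csv(
    src_path, index=False
)

total_rows = clean_csv_in_chunks(src_path, dst_path, chunksize=2)
result = pd.read_csv(dst_path)
```

Only one chunk is held in memory at a time, which is what makes this pattern usable on files larger than RAM. Steps that need global statistics (a column median, for example) cannot be computed correctly from a single chunk and would need a separate pass.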

Telemetry & Inspection

  • .summary(): Prints a profile of your DataFrame. Shows Memory Usage (MB), dataset shape, sparsity (% of NaN values), variance, and skewness.
  • .downcast(): Scans the numeric columns and downcasts them to the smallest possible type (e.g., changing float64 to float32), saving RAM.
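
As a rough idea of what numeric downcasting involves, the sketch below uses pandas.to_numeric with its downcast option. It assumes nothing about how .downcast() is implemented internally; downcast_numeric is a hypothetical helper written for this example.

```python
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink integer and float columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({"small_int": [1, 2, 3], "price": [0.5, 1.5, 2.5]})
small = downcast_numeric(df)

bytes_before = df.memory_usage(deep=True).sum()
bytes_after = small.memory_usage(deep=True).sum()
```

Downcasting floats to float32 reduces precision to roughly seven significant decimal digits, so it is worth checking that downstream calculations tolerate that.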

prep() Arguments

The prep() method handles the actual data cleaning and transformation pipeline. You can provide the following arguments:

General Cleaning

  • clean (str, list): Defaults to 'all'. Cleans up string columns by trimming whitespace, casting to lowercase, and handling emojis.

Missing Values

  • missing_values (dict): Dictionary mapping column names to your chosen imputation technique.
    • Options include: 'drop', 'fill with mean', 'fill with median', 'fill with mode', 'fill with backward fill along columns/rows'.
    • ML Strategies: Use 'knn-imputer' (utilizes sklearn's KNNImputer with 5 neighbors) or 'iterative-imputer' (sklearn's IterativeImputer) for feature-aware filling.
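
For reference, the two strategies used in the Basic Usage example correspond roughly to the operations below, written directly against pandas and scikit-learn. This is a sketch, not DataScrub's code path. The README states that 'knn-imputer' uses 5 neighbors; the toy frame here has only four rows, so the example uses n_neighbors=2, and the Tenure column is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame(
    {
        "Age": [25.0, np.nan, 35.0, 45.0],
        "Salary": [40000.0, 50000.0, np.nan, 80000.0],
        "Tenure": [1.0, 2.0, 3.0, 4.0],  # illustrative complete column
    }
)

# Roughly what 'fill with median' does for one column.
age_filled = df["Age"].fillna(df["Age"].median())

# Roughly what 'knn-imputer' does: each gap is filled from the nearest rows,
# with distance measured on the features both rows have in common.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

KNNImputer works on raw feature distances, so columns on very different scales (Salary versus Tenure here) will dominate the neighbor search unless the data is scaled first.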

Transformations

  • parse_date (list): List of columns to cast to datetime64[ns] (expects the format YYYY-MM-DD).
  • extract_datetime (list): List of date columns to explode into integer features: [Col]_Year, [Col]_Month, [Col]_Day, and [Col]_Is_Weekend.
  • explode (dict): Dict for splitting delimited string values into separate rows. Example: {'Tags': ','} turns a single 'A,B' row into two rows: 'A' and 'B'.
  • translate_column_names (dict): Maps columns to boolean values for translation to English via googletrans. Set True to overwrite the existing column, or False to create a new [Col]_translated column.
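
The parse_date, extract_datetime, and explode options map onto familiar pandas operations. The following sketch shows the equivalent steps in plain pandas on a hypothetical two-row frame; the output column names follow the [Col]_Year pattern described above, but the code is not taken from DataScrub.

```python
import pandas as pd

df = pd.DataFrame({"Signup": ["2024-01-06", "2024-01-08"], "Tags": ["A,B", "C"]})

# parse_date: cast YYYY-MM-DD strings to datetimes.
df["Signup"] = pd.to_datetime(df["Signup"], format="%Y-%m-%d")

# extract_datetime: split the date into integer features.
df["Signup_Year"] = df["Signup"].dt.year
df["Signup_Month"] = df["Signup"].dt.month
df["Signup_Day"] = df["Signup"].dt.day
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
df["Signup_Is_Weekend"] = (df["Signup"].dt.dayofweek >= 5).astype(int)

# explode: {'Tags': ','} splits on the delimiter and emits one row per value.
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
    .explode("Tags")
    .reset_index(drop=True)
)
```

Note that exploding duplicates every other column for each new row, so row counts and any per-row aggregates change after this step.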

Feature Engineering & Outliers

  • noise_columns (list): List of columns where you want to strip out HTML tags, URLs, and arbitrary punctuation via Regex.
  • outliers_method (str): Technique used to cap extreme outliers. Options are 'iqr' and 'z-score'.
  • outlier_columns (list): Columns to check for outliers. Defaults to 'all', meaning every numeric column.
  • perform_scaling_normalization_bool (bool): If True, applies a Box-Cox transformation to normalise numeric distributions.
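
To make the noise_columns and outliers_method='iqr' options concrete, here is a small sketch of regex-based noise stripping and IQR capping. The regular expressions, the 1.5 multiplier, and the helper names strip_noise and clip_iqr are assumptions chosen for illustration; DataScrub's actual patterns and thresholds may differ.

```python
import re

import pandas as pd


def strip_noise(text: str) -> str:
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^\w\s]", "", text)        # punctuation
    return re.sub(r"\s+", " ", text).strip()


def clip_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = strip_noise("<b>Hello</b>, visit https://example.com now!")

salaries = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
capped = clip_iqr(salaries)
```

Capping keeps every row and only pulls extreme values back to the boundary, which is why it is often preferred over dropping outlier rows when the dataset is small.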

Encodings

  • encoding_method (str): Strategy for transforming strings/categories into numeric features via scikit-learn.
    • 'one-hot': Creates dummy variables (drops the first column to avoid collinearity).
    • 'label': Binds an integer ID to each unique text value.
    • 'target': Uses Target Encoding based on the target column's distribution.
  • encoding_columns (list): Columns to encode. Defaults to 'all'.
  • target_col (str): Required if you are utilizing 'target' encoding.
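
The three encodings can be illustrated with plain pandas equivalents. The README says DataScrub performs them via scikit-learn, so treat this as a sketch of the idea rather than the library's behavior; in particular, the target encoding below is a simple per-category mean without the smoothing that scikit-learn's TargetEncoder applies. The City and Bought columns are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"City": ["Oslo", "Paris", "Oslo", "Rome"], "Bought": [1, 0, 1, 0]}
)

# 'one-hot': one dummy column per category, dropping the first to avoid collinearity.
one_hot = pd.get_dummies(df["City"], prefix="City", drop_first=True)

# 'label': an integer ID per unique value (IDs follow sorted category order).
labels = df["City"].astype("category").cat.codes

# 'target': replace each category with the mean of the target column (unsmoothed).
target_means = df.groupby("City")["Bought"].mean()
target_encoded = df["City"].map(target_means)
```

Target encoding computed on the full dataset leaks the target into the features, so in practice the means should be learned on training data only and then applied to validation and test data.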

Contributing

Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for more information.
