A Python package for cleaning and preprocessing data in pandas DataFrames

These details have not been verified by PyPI

Project links

Homepage

Project description

DataScrub (v2.0)

DataScrub is a data cleaning and feature engineering toolkit for pandas DataFrames. Designed for speed and memory efficiency, it covers tasks from basic string formatting to out-of-core pipelines for larger-than-memory files, and it integrates directly with Scikit-Learn.

Key Features

Massive Data Optimization:
- PyArrow Backend: Enable the C++ PyArrow backend (use_pyarrow=True) to process string columns significantly faster.
- Out-of-Core Processing: Process files larger than your RAM. Passing a file path with chunksize=100000 creates a generator that cleans data in batches and writes directly to disk.
- Memory Downcasting: Use .downcast() to automatically shrink float64/int64 columns to float32/int8 where possible, drastically reducing memory usage.
Scikit-Learn Integration: DataClean extends BaseEstimator and TransformerMixin, so it acts as a standard transformer. You can drop it directly into a sklearn.pipeline.Pipeline().
Advanced Feature Engineering:
- Regex-based text cleaning (removes HTML, URLs, and punctuation).
- Date feature extraction (splits YYYY-MM-DD into Year, Month, Day, and Is_Weekend columns).
- Built-in categorical encodings (One-Hot, Label, and Target encoding).
Data Profiling: Run .summary() to quickly print out sparsity, memory usage, variance, and skewness for all columns.

Installation

DataScrub natively supports Python 3.10+ and integrates seamlessly with pandas 2.0+ and scikit-learn.

pip install datascrub

Basic Usage

To use DataScrub in your Python projects, import the package and create an instance of the DataClean class:

from datascrub.cleaner import DataClean
import pandas as pd

# Load a standard in-memory DataFrame (or pass the file path as a string to enable chunking)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True)  # Opt in to the C++ PyArrow backend

# Run a pipeline combining imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all',
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'},
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)

# View telemetry
cleaner.summary()

Scikit-Learn Pipeline Usage

from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier

# Set up the cleaner with your chosen parameters
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])

# Include it in your pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])

# Fit and predict as usual
# pipe.fit(X_train, y_train)
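
For readers unfamiliar with the transformer contract, the following is a minimal sketch of what "extends BaseEstimator and TransformerMixin" implies in practice. It does not use DataScrub: the class TextLabelEncoder and the sample columns are hypothetical stand-ins that show the fit/transform shape a pipeline step needs.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class TextLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a cleaning step (not DataScrub's DataClean)."""

    def __init__(self, columns=()):
        # scikit-learn expects __init__ to store parameters unchanged.
        self.columns = columns

    @staticmethod
    def _normalise(series):
        # Trim whitespace and lowercase so ' Active' and 'ACTIVE ' match.
        return series.astype(str).str.strip().str.lower()

    def fit(self, X, y=None):
        # Learn one integer code per distinct normalised value.
        self.mapping_ = {
            col: {v: i for i, v in enumerate(sorted(self._normalise(X[col]).unique()))}
            for col in self.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            codes = self._normalise(X[col]).map(self.mapping_[col])
            # Values unseen during fit are encoded as -1.
            X[col] = codes.fillna(-1).astype(int)
        return X


X = pd.DataFrame({
    "Status": [" Active", "inactive", "ACTIVE ", "Inactive"],
    "Age": [25, 40, 31, 52],
})
y = [1, 0, 1, 0]

pipe = Pipeline([
    ("scrubber", TextLabelEncoder(columns=("Status",))),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Because the mapping is learned in fit() and only applied in transform(), the same encoding is reused on validation and test data.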

API & Configuration Guide

Below is a quick reference for the arguments you can pass to the DataClean constructor and the prep() method.

Initialization & Memory Handlers

When creating a DataClean instance, you have access to the following IO and memory arguments:

obj (pd.DataFrame, str): Provide a Pandas DataFrame in memory, or a file path (string) pointing to a .csv or .xlsx file. To use out-of-core chunking, you must pass a file path instead of a DataFrame.
use_pyarrow (bool): Set to True to convert underlying datatypes to PyArrow. Greatly speeds up string operations and reduces the memory footprint. Defaults to False.
chunksize (int): Pass an integer to process large files out-of-core. DataScrub will read the file in chunks of this size, process them independently, and append the results to a *_cleaned.csv file on disk. Chunking is enabled automatically when the file is larger than 100 MB.
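
To make the chunked read, clean, and append pattern concrete, here is a minimal sketch in plain pandas. It is not DataScrub's internal implementation: the helpers clean_chunk and clean_csv_in_chunks and the sample column name are hypothetical, chosen only to illustrate the idea.

```python
import os
import tempfile

import pandas as pd


def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-chunk cleaner: trim and lowercase the 'name' column."""
    chunk = chunk.copy()
    chunk["name"] = chunk["name"].str.strip().str.lower()
    return chunk


def clean_csv_in_chunks(path: str, chunksize: int) -> str:
    """Stream `path` in chunks and append cleaned rows to '<name>_cleaned.csv'."""
    root, _ = os.path.splitext(path)
    out_path = f"{root}_cleaned.csv"
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        clean_chunk(chunk).to_csv(
            out_path,
            mode="w" if i == 0 else "a",  # overwrite on first chunk, then append
            header=(i == 0),              # write the header only once
            index=False,
        )
    return out_path


# Demo on a small temporary file; only one chunk is held in memory at a time.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "data.csv")
pd.DataFrame(
    {"name": [" Ada ", "GRACE", " Linus"], "age": [36, 45, 28]}
).to_csv(src, index=False)

out = clean_csv_in_chunks(src, chunksize=2)
result = pd.read_csv(out)
```

Each chunk is cleaned independently, so this pattern only suits steps that need no global statistics (a median over the whole file, for example, would need a separate pass).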

Telemetry & Inspection

.summary(): Prints a profile of your DataFrame. Shows Memory Usage (MB), dataset shape, sparsity (% of NaN values), variance, and skewness.
.downcast(): Scans the numeric columns and downcasts them to the smallest possible type (e.g., changing float64 to float32), saving RAM.
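
The two helpers above can be approximated in plain pandas. The sketch below is an assumption about what such profiling and downcasting might look like, not DataScrub's own code; the function names profile and downcast_numeric and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column sparsity (% NaN), plus variance and skewness for numeric columns."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "sparsity_pct": df.isna().mean() * 100,
        "variance": numeric.var(),
        "skewness": numeric.skew(),
    })


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink each numeric column to the smallest dtype that holds its values."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({
    "count": [1, 2, 3, 4],
    "price": [1.5, 2.25, 3.0, 4.5],
    "city": ["Oslo", None, "Rome", "Oslo"],
})

report = profile(df)
small = downcast_numeric(df)

bytes_before = df[["count", "price"]].memory_usage(index=False).sum()
bytes_after = small[["count", "price"]].memory_usage(index=False).sum()
```

Note that downcasting floats to float32 reduces precision, so it is worth checking that downstream computations tolerate it.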

`prep()` Arguments

The prep() method handles the actual data cleaning and transformation pipeline. You can provide the following arguments:

General Cleaning

clean (str, list): Defaults to 'all'. Cleans up string columns by trimming whitespace, casting to lowercase, and handling emojis.

Missing Values

missing_values (dict): Dictionary mapping column names to your chosen imputation technique.
- Options include: 'drop', 'fill with mean', 'fill with median', 'fill with mode', 'fill with backward fill along columns/rows'.
- ML Strategies: Use 'knn-imputer' (utilizes sklearn's KNNImputer with 5 neighbors) or 'iterative-imputer' (sklearn's IterativeImputer) for feature-aware filling.
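
As an illustration of the difference between a simple fill and a feature-aware fill, the sketch below applies a median fill and scikit-learn's KNNImputer directly. It shows the underlying techniques under the assumption that the option strings map to them roughly this way; the sample data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 45.0, 30.0, 50.0],
    "Salary": [40000.0, 52000.0, np.nan, 90000.0, 50000.0, 98000.0],
})

# Roughly what 'fill with median' implies: one constant per column.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Roughly what 'knn-imputer' implies: each gap is filled with the mean of
# that feature over the 5 nearest rows, measured on the other features.
imputer = KNNImputer(n_neighbors=5)
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])
```

With only five complete Salary values in this toy frame, the KNN fill equals their plain mean; on larger data the neighbours differ per row, which is what makes the fill feature-aware.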

Transformations

parse_date (list): List of columns to cast to datetime64[ns] (expects the format YYYY-MM-DD).
extract_datetime (list): List of date columns to explode into integer features: [Col]_Year, [Col]_Month, [Col]_Day, and [Col]_Is_Weekend.
explode (dict): Dict for splitting delimited string values into separate rows. Example: {'Tags': ','} turns a single 'A,B' row into two rows: 'A' and 'B'.
translate_column_names (dict): Maps columns to boolean values for translation to English via googletrans. Set True to overwrite the existing column, or False to create a new [Col]_translated column.
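
The date and explode transformations described above correspond to standard pandas operations. The following sketch reproduces them by hand on made-up data, assuming the output column names follow the [Col]_Year pattern given above; it is not DataScrub's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "Signup": ["2024-03-02", "2024-03-04"],  # a Saturday and a Monday
    "Tags": ["A,B", "C"],
})

# parse_date: cast YYYY-MM-DD strings to datetime.
df["Signup"] = pd.to_datetime(df["Signup"], format="%Y-%m-%d")

# extract_datetime: derive integer features from the date column.
df["Signup_Year"] = df["Signup"].dt.year
df["Signup_Month"] = df["Signup"].dt.month
df["Signup_Day"] = df["Signup"].dt.day
df["Signup_Is_Weekend"] = (df["Signup"].dt.dayofweek >= 5).astype(int)

# explode: split 'A,B' on the delimiter and emit one row per value.
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
      .explode("Tags")
      .reset_index(drop=True)
)
```

After exploding, the other columns are repeated for each new row, so row counts grow and any per-row aggregates should be computed beforehand.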

Feature Engineering & Outliers

noise_columns (list): List of columns where you want to strip out HTML tags, URLs, and arbitrary punctuation via Regex.
outliers_method (str): Technique used to cap extreme outliers. Options are 'iqr' and 'z-score'.
outlier_columns (list): Columns to check for outliers. Defaults to 'all' numeric columns.
perform_scaling_normalization_bool (bool): If True, applies a Box-Cox transformation to normalise numeric distributions.
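
A minimal sketch of the regex noise stripping and IQR capping described above is shown here. The helper names strip_noise and clip_iqr, the regex patterns, and the 1.5 multiplier are assumptions chosen for illustration; DataScrub's own patterns and thresholds may differ.

```python
import re

import pandas as pd


def strip_noise(text: str) -> str:
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^\w\s]", " ", text)       # remaining punctuation
    return re.sub(r"\s+", " ", text).strip()


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = "<p>Hello, world!</p> See https://example.com/x?y=1 now."
cleaned_bio = strip_noise(bio)

salaries = pd.Series([10, 12, 11, 13, 12, 100])
capped = clip_iqr(salaries)
```

Capping keeps every row and only pulls extreme values to the fence, which is why it is often preferred over dropping outliers when rows are scarce.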

Encodings

encoding_method (str): Strategy for transforming strings/categories into numeric features via scikit-learn.
- 'one-hot': Creates dummy variables (drops the first column to avoid collinearity).
- 'label': Binds an integer ID to each unique text value.
- 'target': Uses Target Encoding based on the target column's distribution.
encoding_columns (list): Columns to encode. Defaults to 'all'.
target_col (str): Required if you are utilizing 'target' encoding.
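
The three encodings can be illustrated on a toy column. The sketch below uses plain pandas rather than scikit-learn encoders, so it is a stand-in for the idea and not the code path DataScrub uses; the column names City and Bought are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Oslo", "Rome", "Oslo", "Lima"],
    "Bought": [1, 0, 1, 0],  # hypothetical target column
})

# 'one-hot': one dummy column per category; the first category (Lima,
# alphabetically) is dropped to avoid collinearity.
one_hot = pd.get_dummies(df, columns=["City"], drop_first=True)

# 'label': one integer ID per unique value (sorted: Lima=0, Oslo=1, Rome=2).
label_codes, label_classes = pd.factorize(df["City"], sort=True)

# 'target': replace each category with the mean of the target for that category.
target_means = df.groupby("City")["Bought"].mean()
target_encoded = df["City"].map(target_means)
```

For target encoding, the per-category means should be computed on training data only and then applied to validation or test data, otherwise the target leaks into the features.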

Contributing

Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Project details

Release history

Version	Release date
2.0.1 (this version)	Mar 7, 2026
2.0.0	Mar 7, 2026
1.1.5	Jul 3, 2023
1.1.4	Jul 3, 2023
1.1.3	Jul 3, 2023
1.1.2	Jun 22, 2023
1.1.1	Jun 22, 2023
1.1.0	Jun 22, 2023
1.0.1	Jun 22, 2023
1.0.1b0 (pre-release)	Jun 22, 2023
1.0.1a0 (pre-release)	Jun 22, 2023

Download files

Source Distribution

datascrub-2.0.1.tar.gz (16.5 kB)

Uploaded Mar 7, 2026 Source

Built Distribution

datascrub-2.0.1-py3-none-any.whl (11.6 kB)

Uploaded Mar 7, 2026 Python 3

File details

Details for the file datascrub-2.0.1.tar.gz.

File metadata

Download URL: datascrub-2.0.1.tar.gz
Upload date: Mar 7, 2026
Size: 16.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for datascrub-2.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c039a84217aff953b2b1d051675aaac3969ca1c11efece3d5a75f54488fe5c74`
MD5	`a3a9826f9f46b3d49137d834d3016988`
BLAKE2b-256	`94afd2227997bc844786cf3ad7a884200337dcc3efd392982ec6e8faaa243b6d`

File details

Details for the file datascrub-2.0.1-py3-none-any.whl.

File metadata

Download URL: datascrub-2.0.1-py3-none-any.whl
Upload date: Mar 7, 2026
Size: 11.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for datascrub-2.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`837e03e0eca6862c42e029a9deef65dc01edf474fb43a9a8eebde40d7c6bff4d`
MD5	`5cbd69c6661b7d3ef8c8d10f0beb548f`
BLAKE2b-256	`ba72e8d82c08f55da42191c7244e3630b8d10accbe310205f2653e36f8630b93`
