A Python package for cleaning and preprocessing data in pandas DataFrames

DataScrub (v2.0)

DataScrub is a Python package for data cleaning, feature engineering, and memory-optimized preprocessing of pandas DataFrames. It covers tasks ranging from basic string formatting to out-of-core processing of files larger than RAM, and it integrates with scikit-learn pipelines.

Key Features

Large-Data Optimization:
- PyArrow Backend: Opt-in PyArrow (C++) backend (use_pyarrow=True) that accelerates string processing by 3x-4x.
- Out-of-Core Chunking: Process files larger than RAM. DataClean('10GB_file.csv', chunksize=100000) creates a generator pipeline that cleans the file chunk by chunk and writes the results to disk, so the full dataset never has to fit in memory.
- Memory Downcasting: .downcast() detects float64/int64 columns whose values fit in a narrower type and shrinks them (for example, to float32 or int8).
Scikit-Learn Integration: DataClean extends BaseEstimator and TransformerMixin and supports fit_transform(), so you can place DataScrub directly inside an sklearn.pipeline.Pipeline().
Advanced Feature Engineering:
- Regex-based noise stripping (HTML, URLs, punctuation).
- Time-series unpacking (YYYY-MM-DD to Year, Month, Day, Is_Weekend features).
- Machine learning categorical encodings (One-Hot, Label, and Target encoding).
Data Profiling: Generate sparsity, variance, skewness, and memory-usage reports using .summary().

Installation

DataScrub supports Python 3.10+ and works with pandas 2.0+ and scikit-learn.

pip install datascrub

Basic Usage

To use DataScrub in your Python projects, import the package and create an instance of the DataClean class:

from datascrub.cleaner import DataClean
import pandas as pd

# Load standard in-memory DataFrame (or pass the string path directly for chunking!)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True) # Opt-in for C++ speeds

# Run a single pipeline combining imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all', 
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'}, 
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)

# View telemetry
cleaner.summary()

Scikit-Learn Pipeline Usage

from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier

# Configure DataClean with keyword arguments
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])

# Bind it directly into an ML pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])

# Fit on your own training data
# pipe.fit(X_train, y_train)
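
For readers who want to see why extending BaseEstimator and TransformerMixin is enough to sit inside a Pipeline, here is a minimal sketch. It does not use DataScrub at all: the StatusEncoder class, its columns parameter, and the toy data are hypothetical stand-ins, written only to illustrate the fit/transform contract that scikit-learn expects.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class StatusEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning step: normalize text, then label-encode it."""

    def __init__(self, columns=()):
        # scikit-learn expects __init__ to store parameters unchanged.
        self.columns = columns

    @staticmethod
    def _normalize(series):
        return series.astype(str).str.strip().str.lower()

    def fit(self, X, y=None):
        # Learn one value -> integer mapping per column from the training data.
        self.mappings_ = {}
        for col in self.columns:
            values = sorted(self._normalize(X[col]).unique())
            self.mappings_[col] = {value: code for code, value in enumerate(values)}
        return self

    def transform(self, X):
        out = X.copy()
        for col in self.columns:
            # Values not seen during fit are encoded as -1.
            codes = self._normalize(out[col]).map(self.mappings_[col])
            out[col] = codes.fillna(-1).astype(int)
        return out


X_train = pd.DataFrame({
    "Status": [" Active", "inactive", "ACTIVE ", "Inactive"],
    "Visits": [10, 1, 12, 2],
})
y_train = [1, 0, 1, 0]

pipe = Pipeline([
    ("scrubber", StatusEncoder(columns=["Status"])),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_train)
```

Because the mapping is learned in fit() and only applied in transform(), the same encoding is reused on validation or test data, which is the main reason to put cleaning steps inside the pipeline rather than in front of it.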

Comprehensive API Guide

Below is a reference to the capabilities of the DataClean class. You can invoke them through the prep() orchestrator or by passing arguments to DataClean(obj, **kwargs).

Memory & File Handling Configurations

When initializing DataClean(), the following IO options are available:

obj (pd.DataFrame, str): An in-memory pandas DataFrame, or a string path to a local .csv or .xlsx file. A file path is required for chunking.
use_pyarrow (bool): Set to True to switch to the PyArrow (C++) backend, which speeds up string operations and lowers memory use. Defaults to False.
chunksize (int): If provided, DataScrub runs as an out-of-core generator: it processes the target .csv file in batches of chunksize rows and writes the cleaned output to a single *_cleaned.csv file on disk. Chunking is enabled automatically if the dataset is larger than 100 MB.
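
The out-of-core behaviour described for chunksize can be approximated with plain pandas. The sketch below is not DataScrub's implementation: the helper name clean_csv_in_chunks, the file names, and the per-chunk cleaning step (whitespace stripping) are assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd


def clean_csv_in_chunks(src_path: str, dst_path: str, chunksize: int) -> int:
    """Clean a CSV batch by batch and write every batch to dst_path.

    Hypothetical helper: only one chunk is held in memory at a time.
    Returns the total number of rows written.
    """
    rows_written = 0
    reader = pd.read_csv(src_path, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Placeholder cleaning step: trim whitespace in string columns.
        for col in chunk.columns:
            if pd.api.types.is_string_dtype(chunk[col].dtype):
                chunk[col] = chunk[col].str.strip()
        # First chunk creates the file with a header; later chunks append.
        chunk.to_csv(
            dst_path,
            mode="w" if i == 0 else "a",
            header=(i == 0),
            index=False,
        )
        rows_written += len(chunk)
    return rows_written


# Small demonstration on a temporary file.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "data.csv")
dst = os.path.join(tmp_dir, "data_cleaned.csv")
pd.DataFrame(
    {"name": ["  Alice ", "Bob", " Carol"], "age": [30, 41, 25]}
).to_csv(src, index=False)

total = clean_csv_in_chunks(src, dst, chunksize=2)
result = pd.read_csv(dst)
```

Note that statistics such as a column median cannot be computed from a single chunk, so a chunked pipeline either needs a first pass over the file or has to restrict itself to row-local operations.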

Telemetry Profiling

.summary(): Prints a DataFrame profile covering memory usage (in MB), dataset shape, sparsity (percentage of NaNs), variance, and skewness across all columns.
.downcast(): Scans the DataFrame and converts float64 / int64 columns to the narrowest dtype that can still hold their values (e.g., int8, float32).
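
The two methods above can be pictured with standard pandas calls. This is a sketch under assumptions, not the package's code: the function names profile and downcast_numeric and the sample data are hypothetical, and the downcasting relies on pd.to_numeric with its downcast argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_int": np.array([1, 2, 3, 100], dtype="int64"),
    "ratio": np.array([0.5, 1.5, np.nan, 2.5], dtype="float64"),
})


def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column memory (MB), sparsity (% NaN), variance, and skewness."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "memory_mb": frame.memory_usage(index=False, deep=True) / 1024**2,
        "sparsity_pct": frame.isna().mean() * 100,
        "variance": numeric.var(),
        "skewness": numeric.skew(),
    })


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Shrink integer and float columns to the narrowest dtype that fits."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


report = profile(df)
compact = downcast_numeric(df)
```

In this example small_int fits in int8 and ratio can be stored as float32. Downcasting floats trades precision for memory, so it is worth checking columns where small differences matter.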

`prep()` Parameters

The prep() method controls the execution order of your data engineering pipeline. It accepts the following arguments:

Standard Cleaning

clean (str, list): Defaults to 'all'. Cleans string columns by stripping leading/trailing whitespace, converting text to lowercase, and converting emojis to text (demojizing).
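
As a rough picture of what this step does to a DataFrame, the sketch below strips whitespace and lowercases string columns with pandas string methods. The clean_strings helper and its columns argument are hypothetical, and emoji conversion is left out because it needs an extra dependency.

```python
import pandas as pd


def clean_strings(frame: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Strip whitespace and lowercase the given (or all) string columns."""
    out = frame.copy()
    if columns is None:
        # Default: every column with a string-like dtype.
        columns = [
            col for col in out.columns
            if pd.api.types.is_string_dtype(out[col].dtype)
        ]
    for col in columns:
        out[col] = out[col].str.strip().str.lower()
    return out


df = pd.DataFrame({"City": ["  Paris", "LONDON  ", None], "Age": [30, 41, 25]})
cleaned = clean_strings(df)
```

Missing values pass through unchanged, so imputation can still be applied afterwards.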

Imputation & Missing Values

missing_values (dict): Provide a dictionary where the key is the column name and the value is the imputation strategy.
- Example Options: 'drop', 'fill with mean', 'fill with median', 'fill with mode', 'fill with backward fill along columns/rows'.
- Machine Learning Strategies: Provide 'knn-imputer' (5 nearest neighbors via sklearn) or 'iterative-imputer' (MICE via sklearn) for multivariate, feature-aware imputation.
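
To make the difference between a univariate and a feature-aware strategy concrete, the sketch below fills one column with its median and then imputes both columns with scikit-learn's KNNImputer using 5 neighbors. It calls pandas and scikit-learn directly rather than DataScrub, and the sample data is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 45.0, 52.0, 23.0, 40.0],
    "Salary": [50000.0, 60000.0, np.nan, 80000.0, 95000.0, 48000.0, 72000.0],
})

# Univariate strategy: every missing Age gets the same column median.
median_filled = df.copy()
median_filled["Age"] = median_filled["Age"].fillna(df["Age"].median())

# Multivariate strategy: each missing value is estimated from the rows
# that are closest on the features that are present.
imputer = KNNImputer(n_neighbors=5)
knn_filled = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
```

KNN distances are sensitive to feature scale, so columns measured in very different units (such as age and salary here) are often scaled before imputation.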

Analytics Extrapolations

parse_date (list): Columns to cast to datetime64[ns], parsed in YYYY-MM-DD format.
extract_datetime (list): Expands a date-string column into four integer features for machine learning: [Col]_Year, [Col]_Month, [Col]_Day, and [Col]_Is_Weekend.
explode (dict): Splits string values on a delimiter and expands them into separate rows. Example: {'Tags': ','} splits 'A,B' into row 1 'A' and row 2 'B'.
translate_column_names (dict): Uses the googletrans API to asynchronously translate string records to English. Map each column to True to overwrite the existing column, or to False to create a separate [Col]_translated feature.
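
The date extraction and row explosion described above map closely onto built-in pandas operations. The following sketch shows one way to get the same shape of output; the helper names extract_datetime_features and explode_column are hypothetical, and the column names mirror the examples in the text.

```python
import pandas as pd

df = pd.DataFrame({
    "Signup": ["2024-01-06", "2024-03-11"],  # a Saturday and a Monday
    "Tags": ["A,B", "C"],
})


def extract_datetime_features(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add [col]_Year, [col]_Month, [col]_Day, and [col]_Is_Weekend."""
    out = frame.copy()
    parsed = pd.to_datetime(out[col], format="%Y-%m-%d")
    out[f"{col}_Year"] = parsed.dt.year
    out[f"{col}_Month"] = parsed.dt.month
    out[f"{col}_Day"] = parsed.dt.day
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    out[f"{col}_Is_Weekend"] = (parsed.dt.dayofweek >= 5).astype(int)
    return out


def explode_column(frame: pd.DataFrame, col: str, delimiter: str) -> pd.DataFrame:
    """Split a delimited string column and emit one row per value."""
    out = frame.copy()
    out[col] = out[col].str.split(delimiter, regex=False)
    return out.explode(col, ignore_index=True)


features = extract_datetime_features(df, "Signup")
exploded = explode_column(df, "Tags", ",")
```

Exploding duplicates the other columns of each row, so it changes the row count and should usually run before any step that depends on one row per record.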

Feature Engineering & Mitigations

noise_columns (list): Columns from which to strip HTML tags (e.g., <p>foo</p>), URLs, and punctuation using regex patterns.
outliers_method (str): Limits the influence of extreme outliers by clipping values to computed bounds instead of dropping rows. Options are 'iqr' and 'z-score'.
outlier_columns (list): Columns to clip. Defaults to 'all' numeric columns.
perform_scaling_normalization_bool (bool): Applies a Box-Cox transformation to numeric columns. Note: Because Box-Cox requires strictly positive input, columns with non-positive values are shifted (to min+1) before the transform.
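
Two of the options above, regex noise stripping and IQR clipping, are easy to sketch with the standard library and pandas. The patterns, the function names strip_noise and clip_iqr, and the 1.5 multiplier are assumptions for illustration; the package's actual patterns and thresholds may differ.

```python
import re

import pandas as pd

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCTUATION = re.compile(r"[^\w\s]")


def strip_noise(text: str) -> str:
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = PUNCTUATION.sub("", text)
    return " ".join(text.split())


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; no rows are dropped."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = strip_noise("<p>Hello, world!</p> see https://example.com/page now")

salaries = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
clipped = clip_iqr(salaries)
```

URLs are removed before punctuation on purpose: stripping punctuation first would break the URL pattern and leave fragments such as "httpsexamplecompage" in the text.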

Machine Learning Encoding

encoding_method (str): Converts string/categorical columns into ML-compatible numeric features using scikit-learn. Options:
- 'one-hot': Expands each column into dummy indicator columns (drop_first=True enabled).
- 'label': Assigns an integer ID to each unique value.
- 'target': Replaces each category with a value derived from the target column (target encoding).
encoding_columns (list): Columns to encode. Defaults to 'all'.
target_col (str): Required when using target encoding.
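
The three encodings can be compared side by side on a tiny example. The sketch below uses plain pandas rather than the scikit-learn encoders the package relies on, and it implements target encoding as a simple per-category mean of the target, which is one common variant and an assumption here. The column names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Paris", "London", "Paris", "Rome"],
    "Bought": [1, 0, 0, 1],
})

# One-hot: one indicator column per category, first category dropped.
one_hot = pd.get_dummies(df, columns=["City"], drop_first=True)

# Label: one integer per unique value (sorted alphabetically here).
label_codes, label_classes = pd.factorize(df["City"], sort=True)

# Target: replace each category with the mean of the target column.
target_means = df.groupby("City")["Bought"].mean()
target_encoded = df["City"].map(target_means)
```

Computing target means on the same rows that are later used for training leaks the label into the features, so in practice the means are fitted on training folds only.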

Contributing

Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Release history

- 2.0.1: Mar 7, 2026
- 2.0.0: Mar 7, 2026 (this version)
- 1.1.5: Jul 3, 2023
- 1.1.4: Jul 3, 2023
- 1.1.3: Jul 3, 2023
- 1.1.2: Jun 22, 2023
- 1.1.1: Jun 22, 2023
- 1.1.0: Jun 22, 2023
- 1.0.1: Jun 22, 2023
- 1.0.1b0 (pre-release): Jun 22, 2023
- 1.0.1a0 (pre-release): Jun 22, 2023
