A Python package for cleaning and preprocessing data in pandas DataFrames
DataScrub (v2.0)
DataScrub is a data cleaning and feature engineering toolkit for pandas DataFrames. Designed for speed and memory efficiency, it covers tasks ranging from basic string formatting to out-of-core pipelines on files larger than memory, and it plugs directly into Scikit-Learn.
Key Features
- Massive Data Optimization:
  - PyArrow Backend: Enable the C++ PyArrow backend (`use_pyarrow=True`) to process string columns significantly faster.
  - Out-of-Core Processing: Process files larger than your RAM. Passing a file path with `chunksize=100000` creates a generator that cleans data in batches and writes directly to disk.
  - Memory Downcasting: Use `.downcast()` to automatically shrink `float64`/`int64` columns to `float32`/`int8` where possible, drastically reducing memory usage.
- Scikit-Learn Integration: `DataClean` extends `BaseEstimator` and `TransformerMixin`, so it acts as a standard transformer. You can drop it directly into a `sklearn.pipeline.Pipeline()`.
- Advanced Feature Engineering:
  - Regex-based text cleaning (removes HTML, URLs, and punctuation).
  - Date feature extraction (splits `YYYY-MM-DD` into Year, Month, Day, and Is_Weekend columns).
  - Built-in categorical encodings (One-Hot, Label, and Target encoding).
- Data Profiling: Run `.summary()` to quickly print out sparsity, memory usage, variance, and skewness for all columns.
Installation
DataScrub natively supports Python 3.10+ and integrates seamlessly with pandas 2.0+ and scikit-learn.
pip install datascrub
Basic Usage
To use DataScrub in your Python projects, import the package and create an instance of the DataClean class:
from datascrub.cleaner import DataClean
import pandas as pd
# Load standard in-memory DataFrame (or pass the string path directly for chunking!)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True) # Opt-in for C++ speeds
# Execute a massive pipeline combining imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all',
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'},
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)
# View telemetry
cleaner.summary()
Scikit-Learn Pipeline Usage
from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier
# Set up the cleaner with your chosen parameters
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])
# Include it in your pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])
# Fit and predict as usual
# pipe.fit(X_train, y_train)
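For readers unfamiliar with the transformer interface, here is a minimal sketch of a class that extends `BaseEstimator` and `TransformerMixin` and can therefore sit inside a `Pipeline`. `LowercaseStrings` is a hypothetical stand-in written for illustration. It is not part of DataScrub and makes no claim about how `DataClean` is implemented internally.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseStrings(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: trims and lowercases the given string columns."""

    def __init__(self, columns=None):
        # scikit-learn expects __init__ to only store its arguments.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn here; returning self keeps the estimator chainable.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X.copy()
        for col in self.columns or []:
            X[col] = X[col].str.strip().str.lower()
        return X


pipe = Pipeline([("lowercase", LowercaseStrings(columns=["Status"]))])
frame = pd.DataFrame({"Status": [" Active", "INACTIVE "], "Score": [1, 2]})
result = pipe.fit_transform(frame)
```

Because the class stores its constructor arguments unchanged and implements `fit` and `transform`, scikit-learn utilities such as `Pipeline` and `clone` can treat it like any built-in transformer.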
API & Configuration Guide
Below is a quick reference for the arguments you can pass to the DataClean constructor and the prep() method.
Initialization & Memory Handlers
When creating a `DataClean` instance, you have access to the following I/O and memory arguments:
- `obj` (pd.DataFrame, str): Provide a pandas DataFrame in memory, or a file path (string) pointing to a `.csv` or `.xlsx` file. To use out-of-core chunking, you must pass a file path instead of a DataFrame.
- `use_pyarrow` (bool): Set to `True` to convert underlying datatypes to PyArrow. Greatly speeds up string operations and reduces the memory footprint. Defaults to `False`.
- `chunksize` (int): Pass an integer to process large files out-of-core. DataScrub will read the file in chunks of this size, process them independently, and append the results to a `*_cleaned.csv` file on disk. Chunking is enabled automatically if the file is larger than 100 MB.
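The chunked read, clean, and append cycle described for `chunksize` can be sketched with plain pandas. This is an illustration of the general pattern only, not DataScrub's code: the function name `clean_in_chunks`, the `name` column, and the stand-in cleaning step are all assumptions made for the example.

```python
import os
import tempfile

import pandas as pd


def clean_in_chunks(src_path, dst_path, chunksize):
    """Read src_path in batches, clean each batch, and append it to dst_path."""
    rows_written = 0
    reader = pd.read_csv(src_path, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Stand-in cleaning step: trim and lowercase the 'name' column.
        chunk["name"] = chunk["name"].str.strip().str.lower()
        # Write the header only once, then append subsequent batches.
        chunk.to_csv(dst_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
        rows_written += len(chunk)
    return rows_written


# Small demo file so the sketch runs end to end.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.csv")
dst = os.path.join(workdir, "data_cleaned.csv")
pd.DataFrame(
    {"name": [" Ann", "BOB ", "Cy", " Dee "], "age": [31, 42, 27, 55]}
).to_csv(src, index=False)

total = clean_in_chunks(src, dst, chunksize=2)
cleaned = pd.read_csv(dst)
```

Only one batch is held in memory at a time, which is what makes the approach usable on files larger than RAM. The trade-off is that each batch is cleaned independently, so statistics such as a column median would be per-chunk unless computed in a separate pass.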
Telemetry & Inspection
- `.summary()`: Prints a profile of your DataFrame. Shows memory usage (MB), dataset shape, sparsity (% of NaN values), variance, and skewness.
- `.downcast()`: Scans the numeric columns and downcasts them to the smallest possible type (e.g., changing `float64` to `float32`), saving RAM.
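To show what numeric downcasting does, the sketch below uses pandas' own `pd.to_numeric(..., downcast=...)`. The helper name `downcast_numeric` is hypothetical, and this is an approximation of the idea rather than the body of `.downcast()`.

```python
import pandas as pd


def downcast_numeric(df):
    """Return a copy with int64/float64 columns stored in the smallest safe dtype."""
    out = df.copy()
    for col in out.select_dtypes(include="int64").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float64").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({"count": [1, 2, 3], "price": [0.5, 1.5, 2.5]}).astype(
    {"count": "int64", "price": "float64"}
)
small = downcast_numeric(df)

# Compare the footprint before and after, in bytes.
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Here `count` fits in `int8` and `price` in `float32`. Note that `float32` carries roughly seven significant decimal digits, so downcasting floats is best reserved for columns where that precision is sufficient.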
prep() Arguments
The prep() method handles the actual data cleaning and transformation pipeline. You can provide the following arguments:
General Cleaning
- `clean` (str, list): Defaults to `'all'`. Cleans up string columns by trimming whitespace, casting to lowercase, and handling emojis.
Missing Values
- `missing_values` (dict): Dictionary mapping column names to your chosen imputation technique.
  - Options include: `'drop'`, `'fill with mean'`, `'fill with median'`, `'fill with mode'`, `'fill with backward fill along columns/rows'`.
  - ML Strategies: Use `'knn-imputer'` (utilizes sklearn's KNNImputer with 5 neighbors) or `'iterative-imputer'` (sklearn's IterativeImputer) for feature-aware filling.
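As a rough picture of what a per-column strategy mapping does, here is a small pandas sketch covering the `'drop'`, `'fill with mean'`, `'fill with median'`, and `'fill with mode'` options. The `impute` function is hypothetical and written only for this example. The ML-based strategies are omitted, and DataScrub's actual behavior may differ.

```python
import pandas as pd


def impute(df, strategies):
    """Apply one imputation strategy per column, in dictionary order."""
    out = df.copy()
    for col, how in strategies.items():
        if how == "drop":
            # Remove rows where this column is missing.
            out = out.dropna(subset=[col])
        elif how == "fill with mean":
            out[col] = out[col].fillna(out[col].mean())
        elif how == "fill with median":
            out[col] = out[col].fillna(out[col].median())
        elif how == "fill with mode":
            # mode() can return several values; take the first.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            raise ValueError(f"unknown strategy: {how}")
    return out


df = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "City": ["Oslo", None, "Oslo"],
    "Salary": [1.0, 2.0, None],
})
filled = impute(df, {
    "Age": "fill with median",
    "City": "fill with mode",
    "Salary": "drop",
})
```

In this sketch the fill values are computed before any rows are dropped, because `'drop'` comes last in the mapping. Reordering the dictionary would change which rows contribute to the median.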
Transformations
- `parse_date` (list): List of columns to cast to `datetime64[ns]` (expects the format `YYYY-MM-DD`).
- `extract_datetime` (list): List of date columns to explode into integer features: `[Col]_Year`, `[Col]_Month`, `[Col]_Day`, and `[Col]_Is_Weekend`.
- `explode` (dict): Dict for splitting delimited string values into separate rows. Example: `{'Tags': ','}` turns a single `'A,B'` row into two rows: `'A'` and `'B'`.
- `translate_column_names` (dict): Maps columns to boolean values for translation to English via `googletrans`. Set `True` to overwrite the existing column, or `False` to create a new `[Col]_translated` column.
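The date extraction and row explosion described above correspond to standard pandas operations. The sketch below reproduces the documented column naming (`[Col]_Year` and so on) on a made-up `Signup` column. It is meant as an illustration of the output shape, not as DataScrub's internals.

```python
import pandas as pd

df = pd.DataFrame({
    "Signup": ["2024-03-09", "2024-03-11"],
    "Tags": ["A,B", "C"],
})

# Parse strictly as YYYY-MM-DD, then derive integer calendar features.
signup = pd.to_datetime(df["Signup"], format="%Y-%m-%d")
df["Signup_Year"] = signup.dt.year
df["Signup_Month"] = signup.dt.month
df["Signup_Day"] = signup.dt.day
# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
df["Signup_Is_Weekend"] = (signup.dt.dayofweek >= 5).astype(int)

# Split the delimited strings into lists, then emit one row per element.
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
    .explode("Tags")
    .reset_index(drop=True)
)
```

After the explode step, the `'A,B'` row appears twice with every other column repeated, so row counts grow. That is worth keeping in mind if aggregates are computed afterwards.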
Feature Engineering & Outliers
- `noise_columns` (list): List of columns where you want to strip out HTML tags, URLs, and arbitrary punctuation via Regex.
- `outliers_method` (str): Technique used to cap extreme outliers. Options are `'iqr'` and `'z-score'`.
- `outlier_columns` (list): Columns to check for outliers. Defaults to `'all'` numeric columns.
- `perform_scaling_normalization_bool` (bool): If `True`, applies a Box-Cox transformation to normalise numeric distributions.
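Two of the steps above, regex noise removal and IQR-based capping, are easy to sketch in isolation. The patterns, the helper names `strip_noise` and `clip_iqr`, and the multiplier `k=1.5` (the common convention for IQR fences) are assumptions for this example. DataScrub's exact patterns and thresholds are not documented here.

```python
import re

import pandas as pd

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT = re.compile(r"[^\w\s]")


def strip_noise(text):
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    # URLs are removed before punctuation so they are matched intact.
    text = URL.sub(" ", text)
    text = PUNCT.sub(" ", text)
    return " ".join(text.split())


def clip_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = strip_noise("<p>Hello, world!</p> See https://example.com/page now.")
salary = clip_iqr(pd.Series([10.0, 11.0, 12.0, 13.0, 100.0]))
```

In the salary example, Q1 is 11 and Q3 is 13, so the upper fence is 16 and the value 100 is capped to 16 rather than dropped. Capping keeps the row count stable, which matters when the cleaned frame feeds a model alongside a separate target vector.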
Encodings
- `encoding_method` (str): Strategy for transforming strings/categories into numeric features via `scikit-learn`.
  - `'one-hot'`: Creates dummy variables (drops the first column to avoid collinearity).
  - `'label'`: Binds an integer ID to each unique text value.
  - `'target'`: Uses Target Encoding based on the target column's distribution.
- `encoding_columns` (list): Columns to encode. Defaults to `'all'`.
- `target_col` (str): Required if you are utilizing `'target'` encoding.
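To make the three encodings concrete, the sketch below applies each one to a toy `City` column using plain pandas. The source states that DataScrub performs encoding via scikit-learn, so this pandas version is a swapped-in illustration of the resulting values, and the `Bought` target column is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Oslo", "Rome", "Oslo", "Lima"],
    "Bought": [1, 0, 0, 1],
})

# One-hot: one indicator column per category, dropping the first to avoid collinearity.
one_hot = pd.get_dummies(df["City"], prefix="City", drop_first=True, dtype=int)

# Label: map each unique value to an integer ID (in order of first appearance).
codes, uniques = pd.factorize(df["City"])

# Target: replace each category with the mean of the target within that category.
target_means = df.groupby("City")["Bought"].mean()
target_encoded = df["City"].map(target_means)
```

The target encoding shown here is the naive per-category mean computed on the full data. In practice it tends to leak the target into the features, which is why production encoders usually add smoothing or cross-fitting.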
Contributing
Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the GitHub repository.
License
This project is licensed under the MIT License. See the LICENSE file for more information.