A Python package for cleaning and preprocessing data in pandas DataFrames
DataScrub (v2.0)
DataScrub is a Python package that provides data cleaning, feature engineering, and memory-optimized preprocessing for pandas DataFrames. It covers tasks ranging from basic string formatting to out-of-core pipelines for files larger than RAM, and it integrates with Scikit-Learn.
Key Features
- Massive Data Optimization:
  - PyArrow Backend: Opt-in C++ memory architectures (use_pyarrow=True) to accelerate string processing by 3x-4x.
  - Out-of-Core Chunking: Pass files larger than RAM natively. DataClean('10GB_file.csv', chunksize=100000) creates a generator pipeline that sequentially cleans and writes to disk without crashing.
  - Memory Downcasting: Automatically detects strictly bounded float64/int64 arrays and shrinks them to float32/int8 via .downcast().
- Scikit-Learn Integration: DataClean extends BaseEstimator and TransformerMixin. It cleanly passes fit_transform() validations, meaning you can place DataScrub directly inside an sklearn.pipeline.Pipeline().
- Advanced Feature Engineering:
  - Regex-based noise stripping (HTML, URLs, Punctuation).
  - Time-series temporal unpacking (YYYY-MM-DD to Year, Month, Day, Is_Weekend features).
  - Machine Learning Categorical Encodings (One-Hot, Label, and Target bounding).
- Data Profiling: Generate immediate sparsity, variance, skewness, and payload reports using .summary().
Installation
DataScrub natively supports Python 3.10+ and integrates seamlessly with pandas 2.0+ and scikit-learn.
pip install datascrub
Basic Usage
To use DataScrub in your Python projects, import the package and create an instance of the DataClean class:
from datascrub.cleaner import DataClean
import pandas as pd
# Load standard in-memory DataFrame (or pass the string path directly for chunking!)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True) # Opt-in for C++ speeds
# Execute a massive pipeline combining imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all',
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'},
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)
# View telemetry
cleaner.summary()
Scikit-Learn Pipeline Usage
from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier
# Configure DataClean cleanly using kwargs
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])
# Bind it directly into an ML pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])
# Fit on training sets dynamically
# pipe.fit(X_train, y_train)
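For readers unfamiliar with how a cleaning step plugs into a pipeline, the following is a minimal sketch of the general scikit-learn transformer pattern (a class that extends BaseEstimator and TransformerMixin). The StripLower class, its columns argument, and the toy data are hypothetical and exist only for illustration; this is not DataScrub's implementation of DataClean.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class StripLower(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning step; a stand-in, not DataScrub's DataClean."""

    def __init__(self, columns):
        # scikit-learn convention: store constructor arguments unchanged
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless step: nothing is learned from the training data
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].str.strip().str.lower()
        return X


# Toy training data (made up for this example)
X_train = pd.DataFrame({"Status": [" Active", "active ", "INACTIVE", "inactive"]})
y_train = [1, 1, 0, 0]

pipe = Pipeline([
    ("scrubber", StripLower(columns=["Status"])),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_train, y_train)

cleaned = pipe.named_steps["scrubber"].transform(X_train)
predictions = pipe.predict(X_train)
```

Because TransformerMixin supplies fit_transform() from fit() and transform(), any class following this shape can sit in front of an estimator inside a Pipeline.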
Comprehensive API Guide
Below is a reference to the capabilities you can invoke through the prep() orchestrator or by passing arguments to DataClean(obj, **kwargs).
Memory & File Handling Configurations
When initializing DataClean(), you have multiple advanced IO options:
- obj (pd.DataFrame, str): Pass an in-memory pandas DataFrame, or provide a string pointing to a local .csv or .xlsx file. A string file path is required to utilize chunking.
- use_pyarrow (bool): Set to True to swap the backend to C++ PyArrow (drastically speeds up string evaluation and lowers memory footprints). Defaults to False.
- chunksize (int): If provided, DataScrub acts as an out-of-core generator, processing the target .csv file in batches of chunksize rows and safely pushing the consolidated memory-light output to a centralized *_cleaned.csv on your disk. (Auto-enables if the dataset is >100MB).
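The chunked, out-of-core idea can be sketched with plain pandas. This is an assumption-laden illustration, not DataScrub's internals: the clean_csv_in_chunks helper, the dropna step used as the "cleaning", and the sample file are all hypothetical. Only pd.read_csv(chunksize=...) and DataFrame.to_csv(mode="a") are real pandas features.

```python
import os
import tempfile

import pandas as pd


def clean_csv_in_chunks(src_path, chunksize):
    """Clean a CSV batch by batch, appending each batch to <name>_cleaned.csv."""
    root, ext = os.path.splitext(src_path)
    out_path = f"{root}_cleaned{ext}"
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for i, chunk in enumerate(pd.read_csv(src_path, chunksize=chunksize)):
        # Illustrative cleaning step: drop rows that are entirely empty
        chunk = chunk.dropna(how="all")
        # First batch creates the file with a header; later batches append
        chunk.to_csv(out_path, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "people.csv")
    with open(src, "w", newline="") as fh:
        fh.write("name,age\nann,30\n,\nbob,25\ncy,41\ndee,19\n")
    out_path = clean_csv_in_chunks(src, chunksize=2)
    cleaned = pd.read_csv(out_path)
```

Only one batch is held in memory at a time, which is what lets this pattern handle files larger than RAM. A real pipeline would need care with steps that depend on whole-column statistics, such as medians.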
Telemetry Profiling
- .summary(): Prints a detailed DataFrame profile outlining memory usage (in MB), dataset shape, sparsity (percentage of NaNs), variance, and skewness across all columns.
- .downcast(): Scans the DataFrame and compresses unoptimized float64/int64 columns to the smallest datatype that can hold their values (e.g., int8, float32).
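To make the profiling and downcasting ideas concrete, here is a rough pandas-only sketch. The helper names downcast_numeric, profile, and memory_mb are hypothetical, and the logic is an approximation of what the text describes rather than DataScrub's own code. It relies on pd.to_numeric with its downcast option.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Shrink integer and float columns to the smallest dtype that fits."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def profile(df):
    """Per-column sparsity (% NaN), variance, and skewness."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "sparsity_pct": df.isna().mean() * 100,
        "variance": numeric.var(),
        "skewness": numeric.skew(),
    })


def memory_mb(df):
    """Total memory footprint in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Toy data (made up for this example)
df = pd.DataFrame({
    "count": [1, 2, 100, 7],
    "price": [1.5, 2.25, np.nan, 3.0],
})

small = downcast_numeric(df)
report = profile(df)
```

In this toy frame the count column fits in int8 and price fits in float32, so memory_mb(small) comes out lower than memory_mb(df).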
prep() Parameters
The prep() method orchestrates the precise execution order of your data engineering pipeline. It accepts the following arguments:
Standard Cleaning
- clean (str, list): Defaults to 'all'. Cleans string columns by stripping leading/trailing whitespace, converting text to lowercase, and demojizing emojis.
Imputation & Missing Values
- missing_values (dict): Provide a dictionary where the key is the column name and the value is the imputation strategy.
  - Example Options: 'drop', 'fill with mean', 'fill with median', 'fill with mode', 'fill with backward fill along columns/rows'.
  - Machine Learning Strategies: Provide 'knn-imputer' (utilizes 5-neighbors via sklearn) or 'iterative-imputer' (utilizes MICE via sklearn) for multivariate feature-aware imputing.
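The difference between the univariate strategies and the multivariate ones can be sketched directly with pandas and scikit-learn. The data below is invented, and the sketch uses n_neighbors=2 only because the toy frame is tiny (the text above mentions 5 neighbors). How DataScrub maps its strategy strings onto these calls is not shown in the source, so treat this as an illustration of the underlying techniques only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (made up for this example)
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0, 58.0],
    "Salary": [30000.0, 42000.0, np.nan, 39000.0, 61000.0],
    "City": ["Oslo", None, "Oslo", "Bergen", "Oslo"],
})

# Univariate: each column is filled from its own statistics
ages = df["Age"].fillna(df["Age"].median())
cities = df["City"].fillna(df["City"].mode().iloc[0])

# Multivariate: each gap is estimated from the most similar rows,
# using the other numeric columns to measure similarity
numeric = df[["Age", "Salary"]]
knn = KNNImputer(n_neighbors=2)  # assumption: small k for a 5-row toy frame
imputed = pd.DataFrame(
    knn.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)
```

The median fill gives every missing Age the same value, while the KNN estimate for a row depends on that row's Salary, which is why the text calls it feature-aware.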
Analytics Extrapolations
- parse_date (list): Provide a list of columns to natively cast to datetime64[ns] formatted rigidly as YYYY-MM-DD.
- extract_datetime (list): Explodes a date-string column into four distinct integers for Machine Learning: [Col]_Year, [Col]_Month, [Col]_Day, and [Col]_Is_Weekend.
- explode (dict): Splits string values based on delimiters and cascades them downward into distinct rows. Example: {'Tags': ','} splits 'A,B' into row 1 'A' and row 2 'B'.
- translate_column_names (dict): Utilizes the googletrans API to asynchronously translate string records to English. Set dictionary boolean values to True to overwrite the existing column, or False to generate a dedicated [Col]_translated feature.
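The date unpacking and row explosion described above correspond to standard pandas operations. The sketch below shows one way to produce equivalent columns. The Signup and Tags columns are invented, and encoding Is_Weekend as 0/1 is an assumption, since the source only says the result is an integer.

```python
import pandas as pd

# Toy data (made up for this example)
df = pd.DataFrame({
    "Signup": ["2024-03-15", "2024-03-16", "2024-12-01"],
    "Tags": ["A,B", "C", "A,C"],
})

# Parse YYYY-MM-DD strings into datetime64 values
dates = pd.to_datetime(df["Signup"], format="%Y-%m-%d")

# Unpack the date into separate integer features
df["Signup_Year"] = dates.dt.year
df["Signup_Month"] = dates.dt.month
df["Signup_Day"] = dates.dt.day
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df["Signup_Is_Weekend"] = (dates.dt.dayofweek >= 5).astype(int)

# Split delimited strings into lists, then emit one row per list element
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
      .explode("Tags", ignore_index=True)
)
```

Exploding repeats the other columns for every tag, so the three input rows become five output rows here.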
Feature Engineering & Mitigations
- noise_columns (list): A vector of columns to strip messy HTML tags (e.g., <p>foo</p>), broken URL requests, and arbitrary string punctuation via C-level Regex patterns.
- outliers_method (str): Mitigate extreme outlier biases by capping/clipping boundaries instead of destroying row parity. Options are 'iqr' and 'z-score'.
- outlier_columns (list): Define bounds. Defaults to 'all' numeric arrays.
- perform_scaling_normalization_bool (bool): Applies strict mathematical Box-Cox transformations across numeric matrices. Note: Shifts non-positive values cleanly to min+1 iteratively.
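Two of the ideas above, regex noise stripping and IQR clipping, are easy to sketch in isolation. The patterns, the strip_noise and clip_iqr helpers, and the conventional 1.5 multiplier are assumptions chosen for illustration; the source does not state which patterns or thresholds DataScrub uses. Only the IQR variant of outlier handling is shown.

```python
import re

import pandas as pd

# Illustrative patterns (assumed, not taken from DataScrub)
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT = re.compile(r"[^\w\s]")  # keeps letters, digits, underscores, whitespace


def strip_noise(text):
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = PUNCT.sub(" ", text)
    return " ".join(text.split())


def clip_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = "<p>Hello, world!</p> See https://example.com/page now."
clean_bio = strip_noise(bio)

salaries = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
clipped = clip_iqr(salaries)
```

Clipping keeps the row count unchanged: the extreme value is pulled in to the upper fence rather than removed, which matches the "capping instead of destroying row parity" description.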
Machine Learning Encoding
- encoding_method (str): Translates strings/categorical classes into ML-compatible features natively relying on scikit-learn. Options:
  - 'one-hot': Expands columns to dummy boundaries (drop_first=True enabled).
  - 'label': Binds strict Ordinal IDs to unique text values globally.
  - 'target': Maps distinct feature ratios based precisely on bounding biases via Target Encoded boundaries.
- encoding_columns (list): Columns to evaluate. Defaults to 'all'.
- target_col (str): Required strictly if utilizing target encoding.
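As a rough guide to what the three encodings produce, the sketch below uses pandas equivalents rather than the scikit-learn encoders the text mentions, so exact column names and ID assignments may differ from DataScrub's output. The City and Churned columns are invented, and the target encoding shown is the simplest form (per-category mean of the target).

```python
import pandas as pd

# Toy data (made up for this example)
df = pd.DataFrame({
    "City": ["Oslo", "Bergen", "Oslo", "Tromso"],
    "Churned": [1, 0, 0, 1],
})

# One-hot: one indicator column per category, first category dropped
one_hot = pd.get_dummies(df["City"], prefix="City", drop_first=True)

# Label: one integer ID per distinct value (sorted alphabetically here)
label_ids, label_classes = pd.factorize(df["City"], sort=True)

# Target: replace each category with the mean of the target for that category
target_means = df.groupby("City")["Churned"].mean()
city_target = df["City"].map(target_means)
```

Target encoding needs the target column, which is why target_col is required for that option. In practice the category means should be computed on training data only, otherwise the encoded feature leaks the label.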
Contributing
Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the GitHub repository.
License
This project is licensed under the MIT License. See the LICENSE file for more information.