PyScrub is a Python library for data preprocessing and pipeline automation. It provides tools for data cleaning, transformation, feature engineering, and visualization, all integrated into a reproducible pipeline framework.
PyScrub
PyScrub is a flexible library that simplifies data preprocessing, transformation, and visualization workflows. It combines data cleaning, feature engineering, and visualization in a single automated pipeline, which saves time and keeps results consistent from run to run. PyScrub is intended for machine learning, data analysis, and research projects.
Features:
- Automated data preprocessing pipeline for handling missing values, removing duplicates, correcting data types, and more.
- Data normalization, standardization, and feature engineering built into the pipeline.
- Powerful visualization tools for quickly understanding your data.
- Customizable and modular design, allowing you to extend the pipeline with your own functions.
- Focus on automation and reproducibility to streamline your data workflow.
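The feature list mentions both normalization and standardization, which are easy to confuse. The sketch below shows the usual meaning of each term in plain Python. The function names `min_max_normalize` and `standardize` are hypothetical and are not part of PyScrub; whether PyScrub uses exactly these formulas is an assumption.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1] (illustrative helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40]
print(min_max_normalize(ages))  # [0.0, 0.5, 1.0]
print(standardize(ages))
```

Normalization bounds the values, while standardization centers them; which one suits a dataset depends on the downstream model.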
Installation
You can install the PyScrub package using pip:
```
pip install PyScrub
```
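To confirm that the installation succeeded, one option is to query the installed distribution's metadata with the standard library. This is a minimal sketch; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if PyScrub is installed, otherwise None.
print(installed_version("PyScrub"))
```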
Usage
Pipeline Setup
Set up a data processing pipeline with PyScrub to clean, transform, and visualize your dataset.
```python
from PyScrub.pipeline_integration import DataPipeline, PipelineMonitor
import PyScrub.data_cleaning as dc
import PyScrub.data_transformation as dt
import PyScrub.feature_engineering as fe
import PyScrub.visualization as viz

# Create your pipeline and add steps; keyword arguments are passed to each step
pipeline = DataPipeline()
pipeline.add_step(dc.handle_missing_values, method='ffill')
pipeline.add_step(dc.remove_duplicates)
pipeline.add_step(dc.correct_data_types)
pipeline.add_step(dc.strip_whitespace, columns=['Gender'])
pipeline.add_step(dt.normalize)
pipeline.add_step(fe.create_polynomial_features, degree=2)
pipeline.add_step(fe.apply_pca, n_components=2)

# Monitor and execute the pipeline
# `data` is assumed to be a dataset you have already loaded (for example, a pandas DataFrame)
monitor = PipelineMonitor()
cleaned_data = monitor.monitor(pipeline, data)

# Visualize the results
viz.histogram(cleaned_data)
viz.boxplot(cleaned_data, num_features=['Age', 'MonthlyIncome'], target='Occupation')
```
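The `add_step` pattern above stores a function together with its keyword arguments and applies the steps in order. The following is a minimal sketch of that idea in plain Python, not PyScrub's actual implementation: `MiniPipeline`, its `run` method, and the two row-based helper functions are hypothetical stand-ins that operate on a list of dictionaries.

```python
class MiniPipeline:
    """Hypothetical stand-in for a step-based pipeline; not PyScrub's own class."""

    def __init__(self):
        self.steps = []

    def add_step(self, func, **kwargs):
        # Store the callable with its keyword arguments for later execution.
        self.steps.append((func, kwargs))
        return self

    def run(self, data):
        # Feed the output of each step into the next one, in insertion order.
        for func, kwargs in self.steps:
            data = func(data, **kwargs)
        return data


def strip_whitespace(rows, columns):
    """Trim surrounding whitespace from string values in the given columns."""
    return [
        {k: v.strip() if k in columns and isinstance(v, str) else v
         for k, v in row.items()}
        for row in rows
    ]


def remove_duplicates(rows):
    """Drop rows that repeat an earlier row exactly, keeping the first copy."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"Gender": " F ", "Age": 31},
    {"Gender": "F", "Age": 31},
    {"Gender": "M ", "Age": 45},
]
pipeline = MiniPipeline()
pipeline.add_step(strip_whitespace, columns=["Gender"])
pipeline.add_step(remove_duplicates)
result = pipeline.run(rows)
print(result)
```

Step order matters here: stripping whitespace first is what lets the first two rows be recognized as duplicates.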
Data Cleaning
Use PyScrub's data cleaning functions to handle missing values, remove duplicates, and ensure your data types are correct.
```python
import PyScrub.data_cleaning as dc

# `data` is assumed to be a dataset you have already loaded

# Handling missing values
cleaned_data = dc.handle_missing_values(data, method='mean')

# Removing duplicates
cleaned_data = dc.remove_duplicates(cleaned_data)

# Correcting data types
cleaned_data = dc.correct_data_types(cleaned_data)
```
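The examples in this document pass `method='mean'` and `method='ffill'` when handling missing values. The sketch below shows what these two strategies conventionally mean, using a single column held as a Python list with `None` for missing entries. The helper `fill_missing` is hypothetical, and it is an assumption that PyScrub's methods follow the same conventions.

```python
from statistics import mean


def fill_missing(values, method="mean"):
    """Fill None entries in a column (illustrative helper, not PyScrub's API)."""
    if method == "mean":
        # Replace every gap with the mean of the observed values.
        observed = [v for v in values if v is not None]
        fill = mean(observed)
        return [fill if v is None else v for v in values]
    if method == "ffill":
        # Carry the last observed value forward; leading gaps stay None.
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)
        return filled
    raise ValueError(f"unsupported method: {method!r}")


incomes = [3000, None, 5000, None]
print(fill_missing(incomes, method="mean"))   # [3000, 4000, 5000, 4000]
print(fill_missing(incomes, method="ffill"))  # [3000, 3000, 5000, 5000]
```

Mean imputation ignores row order, whereas forward fill depends on it, so forward fill is generally better suited to ordered data such as time series.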
Feature Engineering
Enhance your dataset with polynomial features, interactions, and dimensionality reduction using PyScrub's feature engineering tools.
```python
import PyScrub.feature_engineering as fe

# Create polynomial features
poly_features = fe.create_polynomial_features(data, degree=3)

# Apply PCA for dimensionality reduction
pca_features = fe.apply_pca(data, n_components=3)
```
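To make the `degree` parameter concrete, here is a minimal sketch of polynomial feature expansion for a single row of numeric inputs. The function `polynomial_features` is hypothetical; it shows the general technique (all products of the inputs up to the given total degree, without a constant term), and PyScrub's own output layout may differ.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree):
    """Return all monomials of the inputs with total degree 1 through `degree`."""
    features = []
    for d in range(1, degree + 1):
        # Each combination with replacement picks d inputs to multiply together.
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))
    return features


# Inputs a=2, b=3 expand to a, b, a*a, a*b, b*b
print(polynomial_features([2, 3], degree=2))  # [2, 3, 4, 6, 9]
```

The number of generated features grows quickly with the degree and the number of input columns, which is one reason a dimensionality reduction step such as PCA often follows.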
Data Visualization
Generate visual insights into your dataset using PyScrub's visualization tools.
```python
import PyScrub.visualization as viz

# Plot missing data
viz.plot_missing(data)

# Create histograms and boxplots
viz.histogram(data)
viz.boxplot(data, num_features=['Age', 'MonthlyIncome'], target='Occupation')
```
License
This project is licensed under the MIT License.
File details
Details for the file PyScrub-0.0.1.tar.gz.
File metadata
- Download URL: PyScrub-0.0.1.tar.gz
- Upload date:
- Size: 12.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 48adbfb99d5154a16b5f0c9b7866da42b7989622aae901f8ee0e749e21549512
MD5 | d29f5583dabeb6f1dd6682010ad4e587
BLAKE2b-256 | e99f084ce2b68614922b60804bd14dbc85269cfb0e087e4744effd5178c18aed
File details
Details for the file PyScrub-0.0.1-py3-none-any.whl.
File metadata
- Download URL: PyScrub-0.0.1-py3-none-any.whl
- Upload date:
- Size: 12.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | df391dbe736627382474e3a92e62d35519c8a107d74db16948bed037745b8d27
MD5 | 15275f0ca8bd1891464de2bb00acc8be
BLAKE2b-256 | 600b1b64efd824984a3777f9cc1ca1005bfbc330df30bb7d4214000208b2f8ad