PyScrub is a Python library for data preprocessing and pipeline automation. It provides tools for data cleaning, transformation, feature engineering, and visualization, integrated into a reproducible pipeline framework.

PyScrub

PyScrub is a library that simplifies data preprocessing, transformation, and visualization workflows. It integrates data cleaning, feature engineering, and visualization into a single automated pipeline, which saves time and keeps results consistent across runs. It is aimed at machine learning, data analysis, and research projects.

Features:

  • Automated data preprocessing pipeline for handling missing values, removing duplicates, correcting data types, and more.
  • Data normalization, standardization, and feature engineering built into the pipeline.
  • Visualization tools (histograms, boxplots, and missing-data plots) for quickly understanding your data.
  • Customizable and modular design, allowing you to extend the pipeline with your own functions.
  • Focus on automation and reproducibility to streamline your data workflow.

Installation

You can install the PyScrub package using pip:

pip install PyScrub

Usage

Pipeline Setup

Set up a data processing pipeline with PyScrub to clean, transform, and visualize your dataset.

from PyScrub.pipeline_integration import DataPipeline, PipelineMonitor
import PyScrub.data_cleaning as dc
import PyScrub.data_transformation as dt
import PyScrub.feature_engineering as fe
import PyScrub.visualization as viz

# Create your pipeline and add steps
pipeline = DataPipeline()
pipeline.add_step(dc.handle_missing_values, method='ffill')
pipeline.add_step(dc.remove_duplicates)
pipeline.add_step(dc.correct_data_types)
pipeline.add_step(dc.strip_whitespace, columns=['Gender'])
pipeline.add_step(dt.normalize)
pipeline.add_step(fe.create_polynomial_features, degree=2)
pipeline.add_step(fe.apply_pca, n_components=2)

# Monitor and execute the pipeline.
# `data` is the input dataset and must be loaded before this point.
monitor = PipelineMonitor()
cleaned_data = monitor.monitor(pipeline, data)

# Visualize the results
viz.histogram(cleaned_data)
viz.boxplot(cleaned_data, num_features=['Age', 'MonthlyIncome'], target='Occupation')
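
The add_step pattern used above can be sketched in plain Python to show how a step function and its keyword arguments are stored and then applied in order. This is an illustration only, not PyScrub's implementation: `MiniPipeline` and the two step functions below are hypothetical stand-ins that work on a list of dictionaries.

```python
from typing import Any, Callable

Row = dict[str, Any]
Step = Callable[..., list[Row]]


class MiniPipeline:
    """Hypothetical stand-in for a step-based pipeline (not PyScrub's code)."""

    def __init__(self) -> None:
        self._steps: list[tuple[Step, dict[str, Any]]] = []

    def add_step(self, func: Step, **kwargs: Any) -> "MiniPipeline":
        # Store the function together with the keyword arguments to pass later.
        self._steps.append((func, kwargs))
        return self

    def run(self, rows: list[Row]) -> list[Row]:
        # Feed the output of each step into the next one, in insertion order.
        for func, kwargs in self._steps:
            rows = func(rows, **kwargs)
        return rows


def strip_whitespace(rows: list[Row], columns: list[str]) -> list[Row]:
    """Trim leading and trailing whitespace in the given string columns."""
    return [
        {k: v.strip() if k in columns and isinstance(v, str) else v
         for k, v in row.items()}
        for row in rows
    ]


def remove_duplicates(rows: list[Row]) -> list[Row]:
    """Drop rows that exactly repeat an earlier row, keeping the first."""
    seen: set[tuple] = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"Gender": " F ", "Age": 31},
    {"Gender": "F", "Age": 31},
    {"Gender": "M", "Age": 45},
]
pipeline = MiniPipeline()
pipeline.add_step(strip_whitespace, columns=["Gender"])
pipeline.add_step(remove_duplicates)
result = pipeline.run(rows)
print(result)
```

In this sketch the order of steps matters: stripping whitespace before removing duplicates lets the first two rows collapse into one, whereas the reverse order would keep both.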

Data Cleaning

Use PyScrub's data cleaning functions to handle missing values, remove duplicates, and ensure your data types are correct.

import PyScrub.data_cleaning as dc

# Handling missing values
cleaned_data = dc.handle_missing_values(data, method='mean')

# Removing duplicates
cleaned_data = dc.remove_duplicates(cleaned_data)

# Correcting data types
cleaned_data = dc.correct_data_types(cleaned_data)
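
The examples in this README pass method='ffill' and method='mean' to handle_missing_values. How PyScrub implements them is not shown here, so the sketch below only illustrates the usual meaning of the two strategies on a single column. The helper names `fill_forward` and `fill_mean` are hypothetical.

```python
from statistics import mean
from typing import Optional


def fill_forward(values: list[Optional[float]]) -> list[Optional[float]]:
    """Replace each None with the most recent non-missing value."""
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def fill_mean(values: list[Optional[float]]) -> list[Optional[float]]:
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average, so the column is returned unchanged.
        return list(values)
    avg = mean(observed)
    return [avg if v is None else v for v in values]


income = [None, 3000.0, None, 5000.0, None]
ffilled = fill_forward(income)
mean_filled = fill_mean(income)
print(ffilled)
print(mean_filled)
```

Forward fill leaves a leading gap unfilled because there is no earlier value to copy, while mean imputation fills every gap with the same number. Which one fits depends on whether the rows have a meaningful order.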

Feature Engineering

Enhance your dataset with polynomial features, interactions, and dimensionality reduction using PyScrub's feature engineering tools.

import PyScrub.feature_engineering as fe

# Create polynomial features
poly_features = fe.create_polynomial_features(data, degree=3)

# Apply PCA for dimensionality reduction
pca_features = fe.apply_pca(data, n_components=3)
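
To make the degree parameter concrete, here is a minimal sketch of polynomial feature expansion for one row of numeric inputs. It is written in plain Python under the assumption that the expansion means "all products of the inputs up to the given total degree", with no constant term. Whether create_polynomial_features follows the same convention is not stated here, and `polynomial_features` is a hypothetical helper.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row: list[float], degree: int) -> list[float]:
    """Return all monomials of the inputs with total degree 1..degree."""
    features = []
    for d in range(1, degree + 1):
        # Each combination picks d inputs (repeats allowed) to multiply together.
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))
    return features


# For inputs a=2 and b=3 with degree 2: a, b, a*a, a*b, b*b
expanded = polynomial_features([2.0, 3.0], degree=2)
print(expanded)
```

The number of generated features grows quickly with both the number of inputs and the degree, which is one reason a dimensionality reduction step such as PCA is often applied afterwards.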

Data Visualization

Plot missing data, histograms, and boxplots with PyScrub's visualization tools.

import PyScrub.visualization as viz

# Plot missing data
viz.plot_missing(data)

# Create histograms and boxplots
viz.histogram(data)
viz.boxplot(data, num_features=['Age', 'MonthlyIncome'], target='Occupation')
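
What plot_missing draws is not documented in this README. Missing-data plots commonly summarize how many values are absent in each column, so the sketch below computes those counts in plain Python and prints a rough text bar chart. The `missing_counts` helper and the sample rows are hypothetical.

```python
from typing import Any


def missing_counts(rows: list[dict[str, Any]], columns: list[str]) -> dict[str, int]:
    """Count rows where each column is absent or None."""
    return {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in columns
    }


rows = [
    {"Age": 31, "MonthlyIncome": None, "Occupation": "Engineer"},
    {"Age": None, "MonthlyIncome": None, "Occupation": "Teacher"},
    {"Age": 45, "MonthlyIncome": 5200, "Occupation": None},
]
counts = missing_counts(rows, ["Age", "MonthlyIncome", "Occupation"])

# One '#' per missing value, as a stand-in for a bar in a real plot.
for col, n in counts.items():
    print(f"{col:<14}{'#' * n} {n}")
```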

License

This project is licensed under the MIT License.
