First Automated Data Preparation library powered by Deep Learning to automatically clean and prepare TBs of data on clusters at scale.

Project description

mltrons-auto-data-prep: a toolkit that automates data preparation

What is it?

Mltrons-auto-data-prep is a Python package that provides a flexible, automated way to prepare raw data of any size. It uses machine learning and deep learning techniques on a PySpark back end to clean and prepare terabytes of data on clusters at scale.

Main Features

Here are just a few of the things that Mltrons-auto-data-prep does well:

  • Reads data from multiple sources, such as an S3 bucket or the local machine

  • Handles data of any size, even terabytes, using PySpark

  • Filters out features that have more null values than a given threshold

  • Filters out features that have the same value in every row

  • Automatically detects the data type of each feature

  • Automatically detects datetime features and splits them into multiple useful features

  • Automatically detects features containing URLs and removes duplications

  • Automatically detects skewed features and minimizes their skewness

Where to get it

The source code is currently hosted on GitHub at: https://github.com/ms8909/mltrons-auto-data-prep

The PyPI project is at: https://pypi.org/project/mltronsAutoDataPrep/

How to install

pip install mltronsAutoDataPrep

Dependencies

How to use

1. Reading data functions

  • address: the path of the file

  • local: whether the file is on the local machine or in an S3 bucket

  • file_format: the format of the file (csv, excel, parquet)

  • s3: the S3 bucket credentials, needed only if the data is in an S3 bucket

from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf

res = rf.read(address="test.csv", local="yes", file_format="csv", s3={})

2. Drop features whose null values exceed a threshold

  • Provide a dataframe and a threshold for null values

  • Returns the list of columns that contain more null values than the threshold

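from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf
from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_null_val import DropNullValueCol

# Read the input file into a dataframe
res = rf.read("test.csv", file_format='csv')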

drop_col = DropNullValueCol()
columns_to_drop = drop_col.delete_var_with_null_more_than(res, threshold=30)
df = res.drop(*columns_to_drop)
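
To make the threshold check concrete, here is a minimal plain-Python sketch of the idea on a small in-memory table. It is not the library's implementation (which works on PySpark dataframes); the helper name `columns_with_nulls_above` is hypothetical, and reading the threshold as a percentage of rows is an assumption based on the `threshold=30` call above.

```python
def columns_with_nulls_above(rows, threshold_percent):
    """Return the columns whose share of None values exceeds threshold_percent.

    `rows` is a list of dicts that all share the same keys.
    Assumption: the threshold is a percentage of the total row count.
    """
    if not rows:
        return []
    total = len(rows)
    to_drop = []
    for column in rows[0]:
        null_count = sum(1 for row in rows if row.get(column) is None)
        if 100.0 * null_count / total > threshold_percent:
            to_drop.append(column)
    return to_drop


rows = [
    {"id": 1, "age": 31, "nickname": None},
    {"id": 2, "age": None, "nickname": None},
    {"id": 3, "age": 45, "nickname": "kit"},
    {"id": 4, "age": 52, "nickname": None},
]

# "age" is 25% null and stays; "nickname" is 75% null and is flagged.
print(columns_with_nulls_above(rows, threshold_percent=30))  # ['nickname']
```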

3. Drop features containing the same value in every row

  • Provide a dataframe

  • Returns the list of columns that contain the same value in every row

from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_same_val import DropSameValueColumn


drop_same_val_col = DropSameValueColumn()
columns_to_drop = drop_same_val_col.delete_same_val_com(res)
df = res.drop(*columns_to_drop)
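
The check behind this step can be sketched in plain Python as follows. This is an illustration only, not the library's code: the helper name `constant_columns` is hypothetical, and the sketch assumes a small in-memory list of dicts with hashable values rather than a PySpark dataframe.

```python
def constant_columns(rows):
    """Return the columns that hold exactly one distinct value across all rows."""
    if not rows:
        return []
    return [
        column
        for column in rows[0]
        if len({row.get(column) for row in rows}) == 1
    ]


rows = [
    {"country": "US", "plan": "free", "score": 0.4},
    {"country": "US", "plan": "pro", "score": 0.9},
    {"country": "US", "plan": "free", "score": 0.1},
]

# Only "country" carries no information, so only it is flagged.
print(constant_columns(rows))  # ['country']
```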

4. Clean URL features

  • Automatically detects features containing URLs

  • Cleans the URLs in a pipeline structure using NLP techniques

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline

etl_pipeline = EtlPipeline()
etl_pipeline.custom_url_transformer(res)
res = etl_pipeline.transform(res)
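
The README does not describe how URL columns are detected or cleaned, so the following is only a rough sketch of one plausible approach using the standard library. Everything here is an assumption for illustration: the helper names, the 80% detection share, and the normalization rules (lowercase host, no trailing slash, no fragment) are hypothetical and are not taken from the library.

```python
from urllib.parse import urlsplit, urlunsplit


def looks_like_url(value):
    """True if value is a string that parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlsplit(value.strip())
    except ValueError:  # raised for malformed hosts such as unbalanced brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_url_column(values, min_share=0.8):
    """Treat a column as a URL feature if most non-null values look like URLs."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    hits = sum(1 for v in non_null if looks_like_url(v))
    return hits / len(non_null) >= min_share


def normalize_url(value):
    """Collapse spelling variants so that duplicate URLs compare equal."""
    parts = urlsplit(value.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


column = ["https://Example.com/a/", "https://example.com/a", None]
if is_url_column(column):
    # Both spellings normalize to the same URL, so one entry remains.
    print({normalize_url(v) for v in column if v is not None})
```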

5. Split datetime features

  • Automatically detects features containing dates and times

  • Splits each datetime into multiple useful features (day, month, year, etc.)

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_date_transformer(res)
res = etl_pipeline.transform(res)
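
As an illustration of what splitting a datetime into features means, here is a small standard-library sketch. It is not the library's transformer: the helper name `split_datetime`, the input format string, and the exact set of output fields are assumptions chosen for the example.

```python
from datetime import datetime


def split_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return its parts as separate features."""
    ts = datetime.strptime(value, fmt)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
    }


print(split_datetime("2020-03-15 08:30:00"))
# {'year': 2020, 'month': 3, 'day': 15, 'hour': 8, 'weekday': 6}
```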

6. Fill missing values

  • Missing values are filled using deep learning techniques

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_filling_missing_val(res)
res = etl_pipeline.transform(res)

7. Remove skewness from features

  • Automatically detects which columns are skewed

  • Minimizes skewness using statistical methods

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_skewness_transformer(res)
res = etl_pipeline.transform(res)
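
The README does not say which statistical methods are used, so the sketch below only illustrates one common option: measure the moment-based skewness of a column and apply a `log1p` transform when it is strongly right-skewed. The helper names, the limit of 1.0, and the choice of `log1p` are all assumptions, not the library's documented behavior.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant column has no skew
    return m3 / m2 ** 1.5


def reduce_skew(values, limit=1.0):
    """Apply log1p to non-negative, strongly right-skewed data; else return a copy."""
    if skewness(values) > limit and min(values) >= 0:
        return [math.log1p(v) for v in values]
    return list(values)


data = [1, 2, 4, 8, 16, 32, 64, 128, 256]
transformed = reduce_skew(data)

# The transformed column should be much closer to symmetric than the original.
print(round(skewness(data), 2), round(skewness(transformed), 2))
```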


Source Distributions

No source distribution files are available for this release.

Built Distribution

mltronsAutoDataPrep-0.0.11-py3-none-any.whl (35.2 kB)

Uploaded Python 3
