The first automated data preparation library powered by deep learning, built to clean and prepare terabytes of data on clusters at scale.
mltrons-auto-data-prep: A toolkit that automates data preparation
What is it?
Mltrons-auto-data-prep is a Python package that provides a flexible, automated way to prepare raw data of any size. It uses machine learning and deep learning techniques on a PySpark back end to clean and prepare terabytes of data on clusters at scale.
Main Features
Here are just a few of the things that Mltrons-auto-data-prep does well:
- Reads data from multiple sources, such as an S3 bucket or the local machine
- Handles data of any size, even terabytes, using PySpark
- Filters out features with more null values than a given threshold
- Filters out features that have the same value in every row
- Automatically detects the data type of each feature
- Automatically detects datetime features and splits them into multiple useful features
- Automatically detects features containing URLs and removes duplicates
- Automatically detects skewed features and minimizes their skewness
Where to get it
The source code is currently hosted on GitHub at: https://github.com/ms8909/mltrons-auto-data-prep
The PyPI project is at: https://pypi.org/project/mltronsAutoDataPrep/
How to install
pip install mltronsAutoDataPrep
Dependencies
How to use
1. Reading data functions
- address: the path of the file
- local: whether the file is on the local machine or in an S3 bucket
- file_format: the format of the file (csv, excel, parquet)
- s3: the S3 bucket credentials, if the data is in an S3 bucket
from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf
res = rf.read(address="test.csv", local="yes", file_format="csv", s3={})
2. Drop features with null values above a threshold
- Provide a dataframe and a threshold for null values
- Returns the list of columns that contain more null values than the threshold
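from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf
from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_null_val import DropNullValueCol

# Read the data, then list the columns whose null values exceed the threshold
res = rf.read("test.csv", file_format='csv')
drop_col = DropNullValueCol()
columns_to_drop = drop_col.delete_var_with_null_more_than(res, threshold=30)
# Drop the listed columns from the dataframe
df = res.drop(*columns_to_drop)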
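To make the thresholding idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `columns_with_nulls_above` is hypothetical, columns are modeled as a plain dict of lists, and the threshold is assumed to be a percentage of rows.

```python
from typing import Dict, List, Optional, Sequence


def columns_with_nulls_above(
    columns: Dict[str, Sequence[Optional[object]]],
    threshold: float,
) -> List[str]:
    """Return the names of columns whose share of None values exceeds
    `threshold`, where `threshold` is a percentage (an assumption)."""
    to_drop = []
    for name, values in columns.items():
        if not values:
            continue  # an empty column has no defined null share
        null_pct = 100.0 * sum(v is None for v in values) / len(values)
        if null_pct > threshold:
            to_drop.append(name)
    return to_drop


data = {
    "age": [31, None, 45, 28],          # 25% null
    "notes": [None, None, None, "ok"],  # 75% null
}
print(columns_with_nulls_above(data, threshold=30))
```

With a threshold of 30, only the `notes` column is reported, since three of its four values are missing.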
3. Drop features containing the same value in every row
- Provide a dataframe
- Returns the list of columns that contain the same value in every row
from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_same_val import DropSameValueColumn
drop_same_val_col = DropSameValueColumn()
columns_to_drop = drop_same_val_col.delete_same_val_com(res)
df = res.drop(*columns_to_drop)
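The check behind this step can be illustrated in plain Python. This is only a sketch under stated assumptions, not the library's code: `constant_columns` is a hypothetical helper, columns are modeled as a dict of lists, and values are assumed to be hashable.

```python
from typing import Dict, Hashable, List, Sequence


def constant_columns(columns: Dict[str, Sequence[Hashable]]) -> List[str]:
    """Return the names of columns that hold at most one distinct value."""
    # A set keeps only distinct values, so a constant column collapses to size 1
    # (an empty column collapses to size 0 and is reported as well).
    return [name for name, values in columns.items() if len(set(values)) <= 1]


data = {
    "country": ["US", "US", "US"],  # same value in every row
    "city": ["NYC", "LA", "NYC"],   # varies between rows
}
print(constant_columns(data))
```

A column like `country` carries no information for a model, which is why such columns are candidates for dropping.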
4. Clean URL features
- Automatically detects features containing URLs
- Uses a pipeline structure to clean the URLs with NLP techniques
from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline
etl_pipeline = EtlPipeline()
etl_pipeline.custom_url_transformer(res)
res = etl_pipeline.transform(res)
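How the library detects URL features is not documented here. As a rough illustration only, the sketch below swaps in a simple standard-library heuristic: a column counts as a URL column when most of its non-null values parse as http(s) URLs. The helper names and the `min_share` cutoff are assumptions made for this example.

```python
from typing import Optional, Sequence
from urllib.parse import urlsplit


def looks_like_url(value: object) -> bool:
    """Return True if `value` is a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        return False  # malformed input, e.g. an unbalanced IPv6 bracket
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_url_column(values: Sequence[Optional[str]], min_share: float = 0.8) -> bool:
    """Return True if at least `min_share` of the non-null values look like URLs."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    hits = sum(looks_like_url(v) for v in non_null)
    return hits / len(non_null) >= min_share


print(is_url_column(["https://example.com/a", "http://example.org", None]))
print(is_url_column(["hello", "world"]))
```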
5. Split Date Time features
- Automatically detects features containing date/time values
- Splits each date/time value into multiple useful features (day, month, year, etc.)
from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline
etl_pipeline = EtlPipeline()
etl_pipeline.custom_date_transformer(res)
res = etl_pipeline.transform(res)
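To show what splitting a date/time value into several features can look like, here is a small sketch using Python's `datetime` module. The helper `split_datetime`, the input format, and the chosen output fields are assumptions for illustration; the library's own output columns may differ.

```python
from datetime import datetime
from typing import Dict


def split_datetime(value: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> Dict[str, int]:
    """Parse `value` with `fmt` and return its components as separate features."""
    dt = datetime.strptime(value, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
    }


features = split_datetime("2020-03-15 14:30:00")
print(features)
```

Each component becomes its own numeric column, which is usually easier for a model to use than a raw timestamp string.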
6. Fill missing values
- Missing values are filled using deep learning techniques
from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline
etl_pipeline = EtlPipeline()
etl_pipeline.custom_filling_missing_val(res)
res = etl_pipeline.transform(res)
7. Removing Skewness from features
- Automatically detects which columns are skewed
- Minimizes skewness using statistical methods
from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline
etl_pipeline = EtlPipeline()
etl_pipeline.custom_skewness_transformer(res)
res = etl_pipeline.transform(res)
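The text above does not say which statistical methods are used. As one common possibility, the sketch below measures skewness with the moment-based coefficient and applies a `log1p` transform to non-negative, right-skewed data. The function names and the `limit` cutoff are assumptions, not the library's API.

```python
import math
from typing import List, Sequence


def skewness(values: Sequence[float]) -> float:
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant column has no skew
    return m3 / m2 ** 1.5


def reduce_skew(values: Sequence[float], limit: float = 1.0) -> List[float]:
    """Apply log1p when the data is non-negative and skewed beyond `limit`."""
    if min(values) >= 0 and skewness(values) > limit:
        return [math.log1p(v) for v in values]
    return list(values)


raw = [1, 2, 4, 8, 16, 32, 64, 128]
transformed = reduce_skew(raw)
print(round(skewness(raw), 2), round(skewness(transformed), 2))
```

The log transform compresses the long right tail, so the transformed values are much closer to symmetric than the raw ones.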
Hashes for mltronsAutoDataPrep-0.0.12-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f76ec10dd30fab626f6882970f7c004a3dba0d5b959eba77c7d0240c25d3b12c
MD5 | 2e7263d7de2ffdcc5ad010574c7ec27c
BLAKE2b-256 | ef701413aa2a4cb51cbfcaad86bce6cb264b0dd2c444cc205b74bf5e91d641bf