
The first automated data-preparation library powered by deep learning, built to automatically clean and prepare terabytes of data on clusters at scale.

Project description

mltrons-auto-data-prep: a toolkit that automates data preparation

What is it?

Mltrons-auto-data-prep is a Python package that provides a flexible, automated way to prepare raw data of any size. It uses machine learning and deep learning techniques on a PySpark back end to clean and prepare terabytes of data on clusters at scale.

Main Features

Here are just a few of the things that Mltrons-auto-data-prep does well:

  • Reads data from multiple sources, such as an S3 bucket or the local machine

  • Handles data of any size, even terabytes, using PySpark

  • Filters out features whose share of null values exceeds a threshold

  • Filters out features that hold the same value in every row

  • Automatically detects the data type of each feature

  • Automatically detects datetime features and splits them into multiple useful features

  • Automatically detects features containing URLs and removes duplicates

  • Automatically detects skewed features and minimizes their skewness

Where to get it

The source code is currently hosted on GitHub at: https://github.com/ms8909/mltrons-auto-data-prep

The PyPI project is at: https://pypi.org/project/mltronsAutoDataPrep/

How to install Java 8 (the only Java version PySpark supports)

On macOS

In your terminal, write:

1. brew cask install adoptopenjdk/openjdk/adoptopenjdk8

Now you need to set Java 8 as your default version. To do this, first run:

2. /usr/libexec/java_home -V

which will output something like:

Matching Java Virtual Machines (3):
    1.8.0_05, x86_64:   "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home
    1.6.0_65-b14-462, x86_64:   "Java SE 6" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_65-b14-462, i386: "Java SE 6" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home

Pick the version you want as the default (for PySpark this must be the 1.8 entry, e.g. 1.8.0_05), then run:

3. export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

On Windows

It is important to replace all paths that include the folder "Program Files" or "Program Files (x86)" as explained below, to avoid problems when running Spark. Even if Java is already installed, you still need to fix the JAVA_HOME and PATH variables:

1. Replace "Program Files" with "Progra~1"

2. Replace "Program Files (x86)" with "Progra~2"

Example: "C:\Program Files\Java\jdk1.8.0_161" --> "C:\Progra~1\Java\jdk1.8.0_161"

Before you start, make sure you have Java 8 installed and the environment variables correctly defined:

3. Download Java JDK 8 from Java's official website

Set the following environment variables:

4. JAVA_HOME = C:\Progra~1\Java\jdk1.8.0_161

5. PATH += C:\Progra~1\Java\jdk1.8.0_161\bin

How to install

pip install mltronsAutoDataPrep

Dependencies

  • pyspark (requires Java 8; see the installation notes above)

How to use

1. Reading data

  • address — the path of the file

  • local — whether the file is on the local machine ("yes") or in an S3 bucket

  • file_format — the format of the file (csv, excel, parquet)

  • s3 — S3 bucket credentials, if the data is in an S3 bucket

from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf

res = rf.read(address="test.csv", local="yes", file_format="csv", s3={})

2. Drop features with more null values than a threshold

  • provide a dataframe and a threshold for null values

  • returns the list of columns whose share of null values exceeds the threshold

from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_null_val import DropNullValueCol

res = rf.read("test.csv", file_format='csv')

drop_col = DropNullValueCol()
columns_to_drop = drop_col.delete_var_with_null_more_than(res, threshold=30)
df = res.drop(*columns_to_drop)
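For intuition, the threshold check can be sketched in plain Python. This is an illustrative stand-in only, not the library's Spark implementation, and the function name `columns_to_drop` is made up for this sketch:

```python
# Illustrative sketch (not the library's implementation): find columns
# whose percentage of null values exceeds a threshold.
def columns_to_drop(rows, threshold_pct):
    """rows: list of dicts mapping column name -> value (None = null)."""
    if not rows:
        return []
    null_counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, val in row.items():
            if val is None:
                null_counts[col] += 1
    return [col for col, n in null_counts.items()
            if 100.0 * n / len(rows) > threshold_pct]

rows = [
    {"age": 34, "city": None},
    {"age": 31, "city": None},
    {"age": 29, "city": "Lahore"},
]
print(columns_to_drop(rows, threshold_pct=30))  # ['city'] — 2/3 of "city" is null
```

With `threshold=30`, a column is dropped only when more than 30% of its rows are null, which is why `age` (no nulls) survives here.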

3. Drop features containing the same value in every row

  • provide a dataframe

  • returns the list of columns that hold the same value in every row

from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_same_val import DropSameValueColumn


drop_same_val_col = DropSameValueColumn()
columns_to_drop = drop_same_val_col.delete_same_val_com(res)
df = res.drop(*columns_to_drop)
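The constant-column check boils down to asking whether a column has exactly one distinct value. A minimal plain-Python sketch of that idea (not the library's Spark implementation; `constant_columns` is a made-up name):

```python
# Illustrative sketch (not the library's implementation): find columns
# where every row holds the same value.
def constant_columns(rows):
    """rows: list of dicts mapping column name -> value."""
    if not rows:
        return []
    return [col for col in rows[0]
            if len({row[col] for row in rows}) == 1]

rows = [
    {"country": "PK", "sales": 10},
    {"country": "PK", "sales": 12},
    {"country": "PK", "sales": 10},
]
print(constant_columns(rows))  # ['country']
```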

4. Clean URL features

  • Automatically detects features containing URLs

  • Cleans the URLs with a pipeline of NLP techniques

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline

etl_pipeline = EtlPipeline()
etl_pipeline.custom_url_transformer(res)
res = etl_pipeline.transform(res)
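One reason URL columns need cleaning is that the same address can be spelled several ways. As a conceptual illustration only (the library's transformer uses its own NLP pipeline), normalizing URLs before deduplicating can be sketched with the standard library:

```python
# Illustrative sketch: normalize URLs so trivially different spellings
# of the same address deduplicate.
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),          # scheme is case-insensitive
        parts.netloc.lower(),          # so is the host
        parts.path.rstrip("/") or "/", # drop a trailing slash
        parts.query,
        "",                            # discard the fragment
    ))

urls = [
    "HTTP://Example.com/a/",
    "http://example.com/a",
    "http://example.com/a#section",
]
print({normalize_url(u) for u in urls})  # a single URL after normalization
```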

5. Split datetime features

  • Automatically detects features containing dates/times

  • Splits each datetime into multiple useful features (day, month, year, etc.)

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_date_transformer(res)
res = etl_pipeline.transform(res)
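The kind of split this transformer performs can be illustrated with the standard library alone. This sketch (function name and format string are assumptions, not the library's API) derives a few typical features from one datetime string:

```python
# Illustrative sketch: split one datetime value into derived features
# such as year, month, day, weekday, and hour.
from datetime import datetime

def split_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    dt = datetime.strptime(value, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "weekday": dt.strftime("%A"),
        "hour": dt.hour,
    }

print(split_datetime("2020-03-15 09:30:00"))
```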

6. Fill missing values

  • Missing values are filled using deep learning techniques

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_filling_missing_val(res)
res = etl_pipeline.transform(res)
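The library fills missing values with deep learning models; as a much simpler conceptual stand-in (explicitly not what the library does), mean imputation for a single numeric column looks like:

```python
# Conceptual stand-in only — the library uses deep learning, not mean
# imputation. Fill missing (None) entries with the column mean.
def fill_with_mean(values):
    present = [v for v in values if v is not None]
    if not present:
        return values
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

print(fill_with_mean([10.0, None, 14.0, None]))  # [10.0, 12.0, 14.0, 12.0]
```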

7. Remove skewness from features

  • Automatically detects which columns are skewed

  • Minimizes skewness using statistical methods

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_skewness_transformer(res)
res = etl_pipeline.transform(res)
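A log transform is one common statistical method for reducing right skew; the sketch below (illustrative only, with a hand-rolled moment-based skewness, not the library's implementation) shows the skewness coefficient dropping after the transform:

```python
# Illustrative sketch: measure moment-based sample skewness, then reduce
# it with a log1p transform.
import math

def skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / std) ** 3 for x in xs) / n

data = [1, 2, 2, 3, 3, 4, 50]            # heavy right tail
logged = [math.log1p(x) for x in data]   # compress the tail
print(skewness(data), skewness(logged))  # skewness drops after the transform
```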


Built distribution

mltronsAutoDataPrep-0.0.14-py3-none-any.whl (36.0 kB), Python 3
