First Automated Data Preparation library powered by Deep Learning to automatically clean and prepare TBs of data on clusters at scale.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

mltrons dptron: Dirty Data in, Clean Data Out!

https://pypi.org/project/mltronsAutoDataPrep/

Introduction

Data is the most important element of data analysis. Real-world data is messy: spelling errors, missing values, formatting issues, skewness, and missing encoding or aggregation make cleaning it the most time-consuming and cumbersome task for analysts and scientists. Because most scientists spend around 80% of their time cleaning and preparing data, we're introducing dptron to make that process much easier and faster!

Dptron is an in-memory platform built for distributed and scalable data cleaning and preparation. It is written in Python and built on PySpark to handle large amounts of data seamlessly. It uses machine learning and deep learning algorithms to perform important data cleaning and preparation steps automatically. Dptron is extensible, so developers, analysts, and scientists can streamline data cleaning and preparation, make better decisions, and become more productive.

Decision making is better and easier if the data is clean; otherwise it's garbage in, garbage out.

Important Features

Supports connection with AWS S3
Supports up to 10 TB of data
Treats spelling mistakes and other inconsistencies in URLs
Detects & treats skewness in data
Feature engineering for time variables
Treats & fills NULL values by using deep learning (next iteration)
Treats spelling mistakes and other inconsistencies in other variables (next iteration)

GETTING STARTED WITH DPTRON - AUTO DATA PREP

Installing on macOS

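Open your terminal and install Java 8, which PySpark requires (newer Homebrew releases use "brew install --cask" instead of "brew cask install"):

brew cask install adoptopenjdk/openjdk/adoptopenjdk8

After installing Java 8, list the installed Java versions so you can set it as your default:

/usr/libexec/java_home -V

This will output something like the following: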

Matching Java Virtual Machines (3):

1.8.0_05, x86_64:   "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home
1.6.0_65-b14-462, x86_64:   "Java SE 6" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
1.6.0_65-b14-462, i386: "Java SE 6" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home

Pick the version you want to be the default (e.g. 1.8.0_05, since dptron needs Java 8), then:

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

After you've successfully installed Java 8, install dptron with the following command:

pip install mltronsAutoDataPrep
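
Before running dptron, it can help to confirm that the Java on your PATH really is Java 8. The sketch below is not part of dptron; `parse_java_major` is a hypothetical helper that extracts the major version from the text printed by `java -version`, assuming the usual `version "..."` format.

```python
import re


def parse_java_major(version_output: str) -> int:
    """Return the major Java version found in `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string")
    first = int(match.group(1))
    second = match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_05").
    if first == 1 and second is not None:
        return int(second)
    return first


sample = 'java version "1.8.0_05"\nJava(TM) SE Runtime Environment (build 1.8.0_05-b13)'
print(parse_java_major(sample))
```

To check a real installation, pass in the text from `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`; `java -version` writes to standard error rather than standard output.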

Installing on Windows

It's important that you replace all the paths that include the folder "Program Files" or "Program Files (x86)" to avoid future problems while running Spark.

If you already have Java installed, you still need to fix the JAVA_HOME and PATH variables. To do that, in those paths:

1. Replace "Program Files" with "Progra~1"

2. Replace "Program Files (x86)" with "Progra~2"

Example: "C:\Program Files\Java\jdk1.8.0_161" --> "C:\Progra~1\Java\jdk1.8.0_161"

Next, make sure you have Java 8 installed and the environment variables correctly defined:

3. Download Java JDK 8 from Java's official website (https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)

After installing JDK 8, set the following environment variables:

4. JAVA_HOME = C:\Progra~1\Java\jdk1.8.0_161

5. PATH += C:\Progra~1\Java\jdk1.8.0_161\bin
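
The path substitution described in steps 1 and 2 can be sketched in a few lines of Python. `to_short_path` is a hypothetical helper, not something dptron provides, and it assumes the short names on your machine are the usual "Progra~1" and "Progra~2" (you can confirm them with `dir /x C:\`).

```python
def to_short_path(path: str) -> str:
    """Rewrite "Program Files" folders in a Windows path to their short names."""
    # Replace the longer name first so "Program Files (x86)" is not
    # partially rewritten by the "Program Files" rule.
    path = path.replace("Program Files (x86)", "Progra~2")
    return path.replace("Program Files", "Progra~1")


print(to_short_path(r"C:\Program Files\Java\jdk1.8.0_161"))
```

The result is the value to use for JAVA_HOME; appending `\bin` to it gives the entry to add to PATH.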

After you've successfully installed and configured Java 8, install dptron with the following command:

pip install mltronsAutoDataPrep

Using dptron

1. Reading data functions

address: path of the file
local: where the file is located (local PC or S3 bucket)
file_format: format of the file (csv, excel, parquet)
s3: S3 bucket credentials (applicable only if the data is in an S3 bucket)

from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf

res = rf.read(address="test.csv", local="yes", file_format="csv", s3={})

2. Drop Features containing Null of certain threshold

Provide a dataframe and a threshold for null values
Returns the list of columns whose null values exceed the threshold

from mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf
from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_null_val import DropNullValueCol

res = rf.read("test.csv", file_format='csv')

drop_col = DropNullValueCol()
columns_to_drop = drop_col.delete_var_with_null_more_than(res, threshold=30)
df = res.drop(*columns_to_drop)
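
To make the idea behind the threshold concrete, here is a minimal plain-Python sketch that works on a list of dicts instead of a Spark dataframe. It is not dptron's implementation: `columns_with_nulls_above` is a hypothetical function, and it assumes the threshold is a percentage of rows, which the description above does not state explicitly.

```python
def columns_with_nulls_above(rows, threshold):
    """Return columns whose share of None values (in percent) exceeds threshold."""
    if not rows:
        return []
    total = len(rows)
    to_drop = []
    for col in rows[0]:
        nulls = sum(1 for row in rows if row.get(col) is None)
        # Assumption: the threshold is a percentage of all rows.
        if 100.0 * nulls / total > threshold:
            to_drop.append(col)
    return to_drop


rows = [
    {"id": 1, "city": None, "age": 30},
    {"id": 2, "city": None, "age": None},
    {"id": 3, "city": "Oslo", "age": 41},
    {"id": 4, "city": None, "age": 25},
]
print(columns_with_nulls_above(rows, threshold=30))
```

Here "city" is null in 75% of the rows and "age" in 25%, so only "city" passes a threshold of 30.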

3. Drop Features containing same values

Provide a dataframe
Returns the list of columns that contain the same value in every row

from mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_same_val import DropSameValueColumn


drop_same_val_col = DropSameValueColumn()
columns_to_drop = drop_same_val_col.delete_same_val_com(res)
df = res.drop(*columns_to_drop)
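
The check itself is simple to picture: a column carries no information if every row holds the same value. The following sketch shows that logic on a list of dicts; `constant_columns` is a hypothetical name, and the code is an illustration rather than what dptron runs on Spark.

```python
def constant_columns(rows):
    """Return the columns whose value is identical in every row."""
    if not rows:
        return []
    # A column is constant when the set of its values has exactly one member.
    return [col for col in rows[0] if len({row.get(col) for row in rows}) == 1]


records = [
    {"id": 1, "country": "US", "plan": "free"},
    {"id": 2, "country": "US", "plan": "pro"},
    {"id": 3, "country": "US", "plan": "free"},
]
print(constant_columns(records))
```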

4. Clean URL features

Automatically detects features containing URLs
Uses a pipeline structure to clean the URLs with NLP techniques

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline

etl_pipeline = EtlPipeline()
etl_pipeline.custom_url_transformer(res)
res = etl_pipeline.transform(res)
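
For a feel of what "inconsistencies in URLs" can mean, the sketch below normalizes URLs with simple rules from Python's standard library. This is a rule-based stand-in, not the NLP approach dptron describes, and `normalize_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Normalize case, a missing scheme, a leading "www." and trailing slashes."""
    raw = raw.strip()
    if "://" not in raw:
        # Assumption: URLs without a scheme are plain HTTP.
        raw = "http://" + raw
    parts = urlsplit(raw)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    # The fragment is dropped because it does not identify a different page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


print(normalize_url("  HTTP://WWW.Example.com/Path/ "))
```

With rules like these, variants such as "example.com", "http://www.example.com/" and "HTTP://EXAMPLE.COM" all collapse to the same value.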

5. Split Date Time features

Automatically detects features containing date/time values
Splits each date/time into multiple useful features (day, month, year, etc.)

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_date_transformer(res)
res = etl_pipeline.transform(res)
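
The kind of features such a split produces can be illustrated with the standard library. This is only a sketch: `split_datetime` is a hypothetical function, the input format is an assumption, and the exact set of columns dptron generates is not documented here.

```python
from datetime import datetime


def split_datetime(value: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> dict:
    """Break one timestamp string into separate numeric features."""
    dt = datetime.strptime(value, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
    }


print(split_datetime("2019-11-28 14:05:00"))
```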

6. Filling Missing Values

Missing values are filled using deep learning techniques

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_filling_missing_val(res)
res = etl_pipeline.transform(res)

7. Removing Skewness from features

Automatically detects which columns are skewed
Minimizes skewness using statistical methods

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_skewness_transformer(res)
res = etl_pipeline.transform(res)
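
As a rough picture of what detecting and reducing skewness involves, the sketch below computes the moment-based skewness of a column and applies a log transform when it exceeds a limit. The functions, the limit of 1.0, and the choice of `log1p` are all assumptions for illustration; the statistical methods dptron actually applies are not specified above.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


def reduce_skew(values, limit=1.0):
    """Apply log1p to non-negative data whose skewness exceeds the limit."""
    if abs(skewness(values)) > limit and min(values) >= 0:
        return [math.log1p(v) for v in values]
    return list(values)


incomes = [1, 2, 2, 3, 3, 4, 5, 100]
print(skewness(incomes), skewness(reduce_skew(incomes)))
```

The single large value pulls the original column strongly to the right; after the transform the skewness is smaller, though not zero.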

8. Remove Spelling mistakes

Provide the list of features that contain spelling mistakes

from mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline


etl_pipeline = EtlPipeline()
etl_pipeline.custom_spell_transformer(res,['col1','col2'])
res2 = etl_pipeline.transform(res)
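
One simple way to think about fixing spelling mistakes in a categorical column is to map each value to its closest match in a list of known-good values. The sketch below does this with `difflib` from the standard library. It is a swapped-in technique for illustration, not dptron's method, and `correct_value` plus the example vocabulary are hypothetical.

```python
from difflib import get_close_matches


def correct_value(value, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the value itself if none is close."""
    matches = get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


cities = ["London", "Paris", "Berlin"]
print([correct_value(v, cities) for v in ["Londn", "Paris", "Tokyo"]])
```

Values with no sufficiently similar entry, such as "Tokyo" here, are left unchanged rather than forced onto a wrong match.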

Dependencies

Java 8
PySpark
NumPy
pandas
python-dateutil
pytz
See the full list of dependencies here

Release history

0.0.15: Nov 28, 2019 (this version)
0.0.14: Nov 14, 2019
0.0.13: Nov 13, 2019
0.0.12: Nov 12, 2019
0.0.11: Nov 10, 2019
0.0.10: Nov 10, 2019
0.0.9: Nov 10, 2019
0.0.8: Nov 9, 2019
0.0.7: Nov 9, 2019
0.0.6: Nov 9, 2019
0.0.5: Nov 9, 2019
0.0.4: Nov 9, 2019
0.0.3: Nov 9, 2019
0.0.2: Nov 9, 2019

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

mltronsAutoDataPrep-0.0.15-py3-none-any.whl (38.9 kB view details)

Uploaded Nov 28, 2019 Python 3

File details

Details for the file mltronsAutoDataPrep-0.0.15-py3-none-any.whl.

File metadata

Download URL: mltronsAutoDataPrep-0.0.15-py3-none-any.whl
Upload date: Nov 28, 2019
Size: 38.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.6.9

File hashes

Hashes for mltronsAutoDataPrep-0.0.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8cfb44c86052db956208271671c5c6eb74f047739383b63bbd355f06725c310`
MD5	`ed492fdac8a1730dbcee21edb73c9654`
BLAKE2b-256	`50309d5ff3a40c48cf1557f1c8bdd5451f2feee3d66d2f5ce3c7722a50feca7f`
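
If you download the wheel manually, you can compare it against the SHA256 digest listed above. The following is a minimal sketch using `hashlib`; the helper names are hypothetical and the wheel is assumed to be in the current directory.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected value, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.lower()


wheel = "mltronsAutoDataPrep-0.0.15-py3-none-any.whl"
expected = "e8cfb44c86052db956208271671c5c6eb74f047739383b63bbd355f06725c310"
# Only runs the check when the wheel has actually been downloaded.
if os.path.exists(wheel):
    print(matches_expected(wheel, expected))
```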
