Performing ETL using Machine Learning

Project description

MAHA

MAHA is an in-progress ETL package that uses machine learning to clean your dataset with a single command. Its features include:

  • Drop all the index columns
  • Drop columns with too many missing values
  • Use regression to predict the missing values in the data and replace them
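
The three features above can be sketched with pandas and scikit-learn directly. This is a minimal illustration under stated assumptions, not MAHA's own code: the function names, the 50% missing-value threshold, the "column equals 0..n-1 or 1..n" test for index columns, and the choice of linear regression are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def drop_index_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop integer columns that merely enumerate the rows (0..n-1 or 1..n)."""
    n = len(df)
    to_drop = []
    for col in df.columns:
        s = df[col]
        if s.isna().any() or not pd.api.types.is_integer_dtype(s):
            continue
        values = s.to_numpy()
        if np.array_equal(values, np.arange(n)) or np.array_equal(values, np.arange(1, n + 1)):
            to_drop.append(col)
    return df.drop(columns=to_drop)


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing (assumed threshold)."""
    missing_fraction = df.isna().mean()
    return df.loc[:, missing_fraction <= max_missing]


def impute_with_regression(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill gaps in `target` with predictions from a linear model on the other columns."""
    out = df.copy()
    # Predictors are all other columns; their own gaps are filled with the
    # column mean for now. A column that is entirely missing would still be
    # NaN here, so drop sparse columns first.
    predictors = out.drop(columns=[target])
    predictors = predictors.fillna(predictors.mean())
    missing = out[target].isna()
    if missing.any() and (~missing).any():
        model = LinearRegression().fit(predictors[~missing], out.loc[~missing, target])
        out.loc[missing, target] = model.predict(predictors[missing])
    return out
```

In this sketch the cleaning steps would run in the order listed: drop index columns, drop sparse columns, then call `impute_with_regression` once for each column that still has gaps.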

Prerequisites

  • The data is in a pandas DataFrame
  • All categorical variables are label encoded
  • All columns already have the data type desired in the output
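
One possible way to meet the label-encoding prerequisite while keeping missing values intact is sketched below. It assumes pandas category codes are an acceptable encoding; `label_encode` is a hypothetical helper, not part of MAHA.

```python
import pandas as pd


def label_encode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Replace each listed column with integer-valued labels, keeping NaN as NaN."""
    out = df.copy()
    for col in columns:
        # Category codes number the distinct values in sorted order;
        # pandas marks missing entries with the code -1.
        codes = out[col].astype("category").cat.codes
        # Convert to float and turn the -1 marker back into NaN so the
        # gap can still be detected and imputed later.
        out[col] = codes.astype("float64").where(codes >= 0)
    return out
```

Remaining columns can be cast to their final type with `DataFrame.astype`, for example `df.astype({"size": "int64"})` for a hypothetical `size` column.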

You can also:

  • Find the mean and mode of every column
  • Fill the NA values with the mean or mode of each column, depending on its data type
  • Fit a model for every column, using all other columns as the independent variables
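
A rough sketch of these helpers follows. The rule used here (mean for float columns, mode for everything else), the use of linear regression, and all function names are assumptions made for illustration; MAHA's actual behaviour may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def column_summary(df: pd.DataFrame) -> dict:
    """Return the mean of each float column and the mode of every other column."""
    summary = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_float_dtype(s):
            summary[col] = s.mean()
        else:
            modes = s.mode(dropna=True)
            # mode() returns all tied values in sorted order; take the first.
            summary[col] = modes.iloc[0] if not modes.empty else None
    return summary


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NA values column by column with the summary value computed above."""
    summary = column_summary(df)
    # Skip columns with no usable summary (for example, entirely missing columns).
    fill_values = {c: v for c, v in summary.items() if v is not None and not pd.isna(v)}
    return df.fillna(value=fill_values)


def fit_column_models(df: pd.DataFrame) -> dict:
    """Fit one linear model per column, using all other columns as predictors."""
    # Only rows without any missing value are used for fitting.
    complete = df.dropna().astype("float64")
    models = {}
    for col in complete.columns:
        predictors = complete.drop(columns=[col])
        models[col] = LinearRegression().fit(predictors, complete[col])
    return models
```

Because a column containing NaN is stored as float by default in pandas, this sketch expects label-encoded columns to use the nullable `Int64` dtype so that they are filled with the mode and not the mean.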

Dependencies

MAHA uses a number of open source projects to work properly:

  • [NumPy] - NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • [Pandas] - Pandas is a software library written for the Python programming language for data manipulation and analysis.
  • [Sklearn] - scikit-learn is a machine learning library that includes various classification, regression and clustering algorithms.

Installation

MAHA requires pandas, numpy and scikit-learn (imported as sklearn).

Use pip to install the packages:

$ pip3 install pandas
$ pip3 install numpy
$ pip3 install scikit-learn
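
To confirm that the dependencies are available before using MAHA, a quick check along these lines may help. It is only a sketch: `missing_dependencies` is a hypothetical helper, and it checks import names, which for scikit-learn is `sklearn`.

```python
import importlib.util

# Import names of the required packages (scikit-learn is imported as sklearn).
REQUIRED = ("pandas", "numpy", "sklearn")


def missing_dependencies(modules=REQUIRED):
    """Return the names of the modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```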

If pip is not installed, first download the installer script:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Then run the following command in the directory where you downloaded get-pip.py:

$ python get-pip.py

Development

Developed by: [Mithesh R], [Arth Akhouri], [Heetansh Jhaveri], [Ayaan Khan]

Want to contribute? Visit our GitHub repository for more information: [MAHA]

License

MIT


Source Distribution

MAHA-1.1.tar.gz (4.7 kB view details)

Uploaded Source

File details

Details for the file MAHA-1.1.tar.gz.

File metadata

  • Download URL: MAHA-1.1.tar.gz
  • Size: 4.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.3

File hashes

Hashes for MAHA-1.1.tar.gz
Algorithm Hash digest
SHA256 7e9edf5d8dfc487b4b3dc0b9b9fb449202eecb67a87c90e1736649f561fec9bb
MD5 7f841a30eaefe8d89060fc3a65981811
BLAKE2b-256 e6a81ac18f48d1e1acbdef248e5a113d4705d004c3ca0a01af71c56a27696229
