Performing ETL using Machine Learning
MAHA
MAHA is an in-progress ETL package that uses machine learning to clean your dataset with a single one-line command. Features of MAHA include:
- Dropping all index columns
- Dropping columns with too many missing values
- Using regression to predict the missing values in the data and then replacing them
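The three steps above can be approximated with plain pandas and scikit-learn. The sketch below is not MAHA's implementation or API. The helper names (`drop_index_columns`, `drop_sparse_columns`, `impute_with_regression`), the 0.5 missing-value threshold, the rule for spotting an index column (an integer column equal to 0..n-1 or 1..n), and the choice of `LinearRegression` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def drop_index_columns(df):
    """Drop integer columns that only count rows (0..n-1 or 1..n). Assumed heuristic."""
    n = len(df)
    to_drop = []
    for col in df.columns:
        s = df[col]
        if s.isna().any() or not pd.api.types.is_integer_dtype(s):
            continue
        values = s.to_numpy()
        if np.array_equal(values, np.arange(n)) or np.array_equal(values, np.arange(1, n + 1)):
            to_drop.append(col)
    return df.drop(columns=to_drop)


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose fraction of missing values exceeds max_missing (assumed threshold)."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]


def impute_with_regression(df, target):
    """Fill NaNs in `target` with predictions from a linear model fitted on the other columns."""
    out = df.copy()
    # scikit-learn needs complete predictors, so gaps in them are patched with column means.
    X = out.drop(columns=[target])
    X = X.fillna(X.mean())
    missing = out[target].isna()
    if missing.any() and (~missing).any():
        model = LinearRegression().fit(X[~missing], out.loc[~missing, target])
        out.loc[missing, target] = model.predict(X[missing])
    return out


# Hypothetical example data: one index column, one mostly empty column, one gap in "y".
raw = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4, 5],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
})
cleaned = impute_with_regression(drop_sparse_columns(drop_index_columns(raw)), "y")
```

In this toy data `y` is exactly twice `x`, so the fitted model fills the gap in `y` with a value very close to 6. Real data will rarely be this clean, and a different regressor may suit it better.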
Prerequisites
- The data is in a pandas DataFrame
- All categorical variables are label-encoded
- Every column already has the data type desired in the output
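One way to bring a DataFrame into the shape these prerequisites describe is sketched below with pandas alone. The helper `label_encode` and the example columns are hypothetical and are not part of MAHA. The sketch assumes every non-numeric column is categorical and keeps missing values as NaN so they can still be imputed later.

```python
import pandas as pd


def label_encode(df):
    """Replace every non-numeric column with integer codes, keeping NaN as NaN."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue
        # Category codes follow the sorted order of the labels; -1 marks a missing value.
        codes = out[col].astype("category").cat.codes
        out[col] = codes.astype("float64").where(codes >= 0)
    return out


# Hypothetical example data.
people = pd.DataFrame({
    "city": ["Pune", "Delhi", None, "Pune"],
    "age": ["31", "45", "28", "39"],  # numbers stored as text
})
people["age"] = people["age"].astype("int64")  # set the desired output dtype first
prepared = label_encode(people)
```

Because a column that contains NaN cannot use a plain integer dtype in pandas, the encoded `city` column ends up as floats here. Casting it to the nullable `Int64` dtype afterwards is an option if integer codes are required.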
You can also:
- Find the mean and mode of every column
- Fill NA values with the mean or mode of each column, depending on its data type
- Fit a model for every column, using all other columns as the independent variables
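The optional steps listed above can be sketched as follows. This is an illustration in plain pandas and scikit-learn, not MAHA's own functions. The names `fill_with_mean_or_mode` and `fit_model_per_column` are hypothetical, and the rule "mean for float columns, mode for everything else" is an assumption about how the data type decides the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fill_with_mean_or_mode(df):
    """Fill NaNs with the mean for float columns and the mode for all other columns."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if not s.isna().any() or s.isna().all():
            continue  # nothing to fill, or nothing to compute a statistic from
        if pd.api.types.is_float_dtype(s):
            fill = s.mean()
        else:
            fill = s.mode().iloc[0]  # first mode if several values tie
        out[col] = s.fillna(fill)
    return out


def fit_model_per_column(df):
    """Fit one linear model per column, using every other column as a predictor."""
    complete = fill_with_mean_or_mode(df).astype("float64")
    models = {}
    for target in complete.columns:
        X = complete.drop(columns=[target])
        models[target] = LinearRegression().fit(X, complete[target])
    return models


# Hypothetical example data: a continuous column and a label-encoded one.
scores = pd.DataFrame({
    "hours": [1.0, 2.0, np.nan, 4.0],
    "grade": pd.array([1, 1, None, 2], dtype="Int64"),
})
filled = fill_with_mean_or_mode(scores)
models = fit_model_per_column(scores)
```

Storing label-encoded columns with the nullable `Int64` dtype, as in the example, keeps them distinguishable from continuous float columns even when they contain missing values. A classifier may be a better fit than `LinearRegression` for such columns.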
Dependencies
MAHA depends on the following open-source projects:
- [NumPy] - NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- [Pandas] - Pandas is a software library written for the Python programming language for data manipulation and analysis.
- [Sklearn] - scikit-learn is a machine learning library for Python that includes various classification, regression, and clustering algorithms.
Installation
MAHA requires pandas, NumPy, and scikit-learn (imported as sklearn).
Use pip to install the packages:
$ pip3 install pandas
$ pip3 install numpy
$ pip3 install scikit-learn
If you have not installed pip, first download the installer script:
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Then run the following command in the directory where you downloaded get-pip.py:
$ python get-pip.py
Development
Developed by: [Mithesh R], [Arth Akhouri], [Heetansh Jhaveri], [Ayaan Khan]
Want to contribute? See our GitHub repository for more information: [MAHA]
License
MIT
Source Distribution: MAHA-1.3.tar.gz (4.8 kB)