Automated data analysis pipelines for data analysts

end2endML package

The end2endML Python package implements all the components required to define automated data analysis pipelines: data preprocessing, data splitting, model selection, model fitting and model evaluation. The pipelines use some of the most commonly used machine learning algorithms.

Installation

Install the end2endML package by running:

pip install end2endML

on the command line on Linux or in the Anaconda Prompt on Windows. If you don't have root privileges, you may need to add --user to the command above; pip then installs the package in your home directory, which doesn't require root privileges.
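To see where a --user install would place packages on your machine, the short sketch below prints the per-user directories that Python reports. It uses only the standard library `site` module; the helper name `user_install_locations` is made up for this illustration and is not part of end2endML.

```python
import site


def user_install_locations():
    """Return the per-user base directory and its site-packages directory.

    These are the locations Python reports for per-user installs
    (the kind selected by `pip install --user`).
    """
    return site.getuserbase(), site.getusersitepackages()


user_base, user_site = user_install_locations()
print("user base:         ", user_base)
print("user site-packages:", user_site)
```

Both paths normally sit under your home directory, which is why writing to them does not need root privileges.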

User guide

The user guide is available at https://end2endml.readthedocs.io/en/latest/.

TODO

  • Add feature extraction to the models.
    • Feature extraction is implemented only for linear models, SVMs and neural networks. It is not yet implemented for tree-based methods.
    • The number of components is treated as a hyperparameter during model selection.
  • Implement a unit test suite so that every update is tested automatically.
  • Currently, if a gradient boosting model is specified for imbalanced classification, both RUSBoost and EasyEnsemble, which differ in how the undersampling is implemented, are selected and trained. Find a way to let the user choose between them.
  • If the model being trained already uses 10 cores, letting the CV procedure use another 10 cores is generally fine. However, it can be a problem for EasyEnsemble models when the data set is large. Fix this by setting the CV procedure's n_jobs to None for the EasyEnsemble model.
  • Add a function to check whether the preprocessed data is available. If it is, there is no need to preprocess the data again. Maybe this is not a good idea, since we may sometimes use different parameters to control the preprocessing, and re-running the preprocessing does not take much time.
  • Bug: the data analysis pipeline should be able to remove infinite values present in X and y.
  • When cat_threshold is set to 2, meaning that variables with a numerical data type but a limited number of unique values are not treated as categorical, y is not converted to the object data type, and the automated data analysis procedure then treats the problem as a regression task.
  • Re-save the preprocessed data sets every time. Currently, if the function detects that the preprocessed data has already been saved, it does not save it again. This can lead to serious issues when the preprocessing parameters change. Saving takes little time, so the preprocessed data should always be saved.
  • For binary classification and regression problems, the saved feature importances should be one-dimensional rather than two-dimensional.
  • Explain why --user is sometimes needed.
  • Keep track of all the preprocessing steps, so that exactly the same steps can be applied to new data.
  • Add Dan and Mengzhe to the author list. They have not yet agreed, so for now they are only included in the credits.
  • Print out time
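
As an illustration of the infinite-values item in the list above, here is a minimal sketch of one way to drop the affected rows before fitting. It is not the package's implementation: the function name `drop_infinite_rows` is hypothetical, and the sketch assumes that X is a 2-D numeric array and y a 1-D numeric array of the same length.

```python
import numpy as np


def drop_infinite_rows(X, y):
    """Return copies of X and y without rows containing +inf or -inf.

    Hypothetical helper for illustration only. Assumes numeric input:
    X has shape (n_samples, n_features) and y has shape (n_samples,).
    NaN values are left untouched; only infinite values trigger removal.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A row is kept only if neither its features nor its target is infinite.
    keep = ~np.isinf(X).any(axis=1) & ~np.isinf(y)
    return X[keep], y[keep]


X = np.array([[1.0, 2.0], [np.inf, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0.5, 1.5, -np.inf, 2.5])

X_clean, y_clean = drop_infinite_rows(X, y)
print(X_clean)  # rows 0 and 3 remain
print(y_clean)
```

Rows are removed from X and y together so that the samples stay aligned. Whether to drop such rows or replace the infinite values instead is a design choice that this sketch does not settle.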
