
JustDataML: Simplified Machine Learning for Everyone

Welcome to JustDataML, a user-friendly tool designed to make machine learning accessible to all, regardless of technical background. JustDataML automates the process of model selection, data preprocessing, and prediction, allowing you to focus on insights rather than coding complexities.

Features

  • Automated Model Selection
  • Streamlined Preprocessing
  • Quick Predictions
  • Flexible Configuration
  • Hyperparameter Tuning
  • Accurate Predictions

Requirements

To run this tool, follow the installation steps below.

Installation

Suggestion: use a virtual environment to run this tool successfully.

  1. Install the application with pip
  pip install JustDataML

Now you can use it as

   JDML --Config <Data Config file> --Train --Predict <test.csv> --Output <Output Name>
  2. Or install by cloning this repository
  git clone https://github.com/jaideepsinhdabhi/JustDataML.git

Go into the repo folder

  cd JustDataML

Install all the required packages

  pip install -r requirements.txt

Usage/Examples CLI Tool

python JDML/JDML.py --Config <Data Config File> --Train --Predict <test.csv> --Output <Output Name>

Arguments

  • -C, --Config: You need to provide a configuration file to obtain all the necessary arguments.

    Note: Make sure you give the proper arguments as mentioned in the documentation.

  • -T, --Train: [Optional] If provided, it initiates model training based on the specified data_config file.

    Note: Ensure that the data files are available in the "Data" folder.

  • -P, --Predict: [Optional] If provided, performs prediction using the trained model. Make sure not to delete or modify the artifact folder to get the results.

  • -O, --Output: [Required with -P (--Predict)] Specifies the output name for the predicted dataframe generated from the test data. Note: it also generates a Model_summary file, and if Hyperparameter is set to Yes in the config file, a stats file as well.

Usage/Examples as a Python library

We can also use this tool as a Python library.

First, import the required libraries:

import pandas as pd  # to read our data
from JDML.JDML import Just_Data_ML  # to train and predict using the JDML tool


# we also import the libraries that we will use in our models

# for normalization (other scalers can be used as well)
from sklearn.preprocessing import StandardScaler

# for this example we will use Random Forest and Linear Regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

We have to create an instance to use JDML:

jdml_in = Just_Data_ML()

Let's read the data.

Here we are using a bioinformatics dataset (KEGG Metabolic Reaction Network (Undirected)), and we will predict edgeCount for our example.

You can find the data here.

data_df = pd.read_csv("Data/Reaction Network (Undirected).csv",index_col=False)

Let's Assign Target_Col and Features for our task

Feature_Kegg = ["Connected Components","Diameter","Radius","Centralization","Shortest Path","Characteristic Path","Avg.num.Neighbours","Density","Heterogeneity","Isolated Nodes","Number of Self Loops","Multi-edge Node Pair","NeighborhoodConnectivity","NumberOfDirectedEdges","Stress","SelfLoops",
"Partner Of MultiEdged NodePairs","Degree","TopologicalCoefficient","BetweennessCentrality","Radiality","Eccentricity","NumberOfUndirectedEdges","ClosenessCentrality","AverageShortestPathLength","ClusteringCoefficient","nodeCount"]


target_col = ['edgeCount']

We have considered all columns as features, but you can also perform feature selection and train the model on the selected features only. For now, we will use all the features and predict edgeCount as the target.
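As an aside, feature selection outside JDML can be done with standard scikit-learn utilities. The sketch below (illustrative only, not part of JDML) keeps the three statistically strongest features of a synthetic regression dataset:

```python
# Illustrative sketch: filter down to the k most informative features with
# scikit-learn's SelectKBest before handing data to a training tool.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic data: 10 features, only 3 of which carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

# indices of the columns that survived the filter
kept = selector.get_support(indices=True)
print(X_selected.shape, kept)
```

The same idea applies to the KEGG features above: score each column against the target and keep the strongest subset.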

Now we also need to assign a few other parameters and arguments for our tool.

Problem_Objective = 'Regression'   # task: choose between Regression and Classification
Normalization = StandardScaler()   # normalization technique for our data

Models_to_train = {'Random Forest': RandomForestRegressor(n_jobs=-1), 'Linear Regression': LinearRegression()}   # the selected models, passed as a dict

For Model Training

jdml_in.Data_Train(Feature_Kegg,data_df,target_col,Models_to_train,    
    Problem_Objective,HyperParamter_Yes_or_No='Yes',  #by default it will be 'No'
    Normalization_tech=Normalization)
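Data_Train handles model selection internally. As a rough illustration of the idea (this is not JDML's actual code), picking the best of several candidate models by cross-validated score might look like this:

```python
# Illustrative sketch only, not JDML internals: train each candidate model
# and keep the one with the best cross-validated R^2 score.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=120, n_features=5, noise=0.1, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_jobs=-1, random_state=0),
    "Linear Regression": LinearRegression(),
}

# mean 3-fold cross-validation score per model (R^2 for regressors)
scores = {name: cross_val_score(est, X, y, cv=3).mean() for name, est in models.items()}

best_name = max(scores, key=scores.get)
best_model = models[best_name].fit(X, y)   # refit the winner on all the data
print(best_name, round(scores[best_name], 3))
```

JDML additionally takes care of preprocessing and (optionally) hyperparameter tuning around this loop.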

Your model is now trained; it's time for prediction.

# importing test csv as DataFrame
test_df = pd.read_csv("test.csv")

Note: you do not need to preprocess the test file; the tool handles it itself.
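The usual reason a tool does this for you: preprocessing statistics must come from the training data only, and the same fitted transform must then be reused on the test data. A minimal sketch of that pattern (illustrative, not necessarily JDML's exact code):

```python
# Illustrative pattern: fit the scaler on training data only, then apply the
# same fitted transform to the test data, as a training tool does internally.
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

scaler = StandardScaler().fit(train)     # statistics from training data only
train_scaled = scaler.transform(train)   # mean 0, std 1 on the training set
test_scaled = scaler.transform(test)     # same mean/std reused for the test set

print(train_scaled.mean().round(6), test_scaled.ravel())
```

Scaling the test file yourself with freshly fitted statistics would silently shift it relative to the training distribution, which is why JDML asks you to pass it in raw.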

Now we can make predictions from our test file.

predictions = jdml_in.Predict_test(Problem_Objective,Feature_Kegg,test_df)

Voilà! You will have the predictions (column Target_Out) in the output DataFrame.
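Assuming the predictions come back as an array-like (a hypothetical sketch, with placeholder values, of what the output frame looks like):

```python
# Hypothetical sketch of the output shape: predictions attached to the test
# frame under a Target_Out column. Values here are placeholders.
import pandas as pd

test_df = pd.DataFrame({"nodeCount": [10, 25, 40]})
predictions = [12.5, 30.1, 48.7]   # placeholder prediction values

out_df = test_df.copy()
out_df["Target_Out"] = predictions
print(out_df)
```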

Data Configuration File Example

This is an example of a data configuration file (Data_Config.csv) used with the JustData_ML (JDML) tool. This file specifies the necessary information for training and predicting with machine learning models.

CSV Structure:

The CSV file contains the following fields:

  • Data_Name: Name of the dataset file with extension and it should be present in Data Folder (Iris.data in this example).

  • Features_Cols: Comma-separated list of feature columns in the dataset (sepal length, sepal width, petal length, petal width in this example).

  • Target_Col: Name of the target column in the dataset (class in this example).

  • Problem_Objective: Objective of the machine learning task (Classification or Regression).

  • Normalization_tech: Normalization technique to be applied (StandardScaler, MinMaxScaler, etc.).

  • Model_to_include: Models to include in the training process (ALL or specific models; see the list of available models below).

  • HyperParamter_Yes_or_No: Indicates whether hyperparameter tuning should be performed (Yes or No).

    A sample CSV file and some data are provided in the Data folder for a demo run.

    Note: logs are also written to a logs folder on every run; check them to get a better idea of what each run did.
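Putting the fields together, a hypothetical Data_Config.csv for the Iris example described above might look like this (field names taken from the list above; the exact layout of the shipped sample may differ):

```csv
Data_Name,Features_Cols,Target_Col,Problem_Objective,Normalization_tech,Model_to_include,HyperParamter_Yes_or_No
Iris.data,"sepal length,sepal width,petal length,petal width",class,Classification,StandardScaler,ALL,No
```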

Available Models for Regression and Classification

Regression Models:

  1. Random Forest

    • Description: Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees.
  2. Decision Tree

    • Description: Decision Tree is a non-parametric supervised learning method used for classification and regression. It works by partitioning the input space into regions and predicting the target variable based on the average of the training instances in the corresponding region.
  3. Gradient Boosting

    • Description: Gradient Boosting is a machine learning technique for regression and classification problems that builds models in a stage-wise manner and tries to fit new models to the residuals of the previous models.
  4. Linear Regression

    • Description: Linear Regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables.
  5. XGBRegressor

    • Description: XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
  6. Support Vector Reg

    • Description: Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) algorithm that is used to predict a continuous variable.
  7. Linear Ridge

    • Description: Ridge Regression is a linear regression technique that is used to analyze multiple regression data that suffer from multicollinearity.
  8. Linear Lasso

    • Description: Lasso regression is a type of linear regression that uses shrinkage. It penalizes the absolute size of the regression coefficients.
  9. ElasticNet

    • Description: ElasticNet is a linear regression model that combines the properties of Ridge Regression and Lasso Regression.
  10. AdaBoost Regressor

    • Description: AdaBoost (Adaptive Boosting) is an ensemble learning method that combines multiple weak learners to create a strong learner.
  11. KNeighborsRegressor

    • Description: KNeighborsRegressor is a simple, non-parametric method used for regression tasks based on the k-nearest neighbors algorithm.

Classification Models:

  1. Logistic Regression

    • Description: Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome.
  2. Ridge Classification

    • Description: Ridge Classifier is a classifier that uses Ridge Regression to classify data points.
  3. GaussianNB

    • Description: Gaussian Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between the features.
  4. KNeighborsClassifier

    • Description: KNeighborsClassifier is a simple, instance-based learning algorithm used for classification tasks based on the k-nearest neighbors algorithm.
  5. Decision Tree Classifier

    • Description: Decision Tree Classifier is a non-parametric supervised learning method used for classification.
  6. Random Forest Classifier

    • Description: Random Forest Classifier is an ensemble learning method for classification that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) of the individual trees.
  7. Support Vector Classifier

    • Description: Support Vector Classifier (SVC) is a type of Support Vector Machine (SVM) algorithm that is used for classification tasks.
  8. AdaBoost Classifier

    • Description: AdaBoost Classifier is an ensemble learning method that combines multiple weak learners to create a strong learner.
  9. Gradient Boosting Classifier

    • Description: Gradient Boosting Classifier is a machine learning technique for classification problems that builds models in a stage-wise manner and tries to fit new models to the residuals of the previous models.
  10. XGBClassifier

    • Description: XGBoost Classifier is an optimized distributed gradient boosting library designed for classification problems.

These are the available regression and classification models supported by the JDML tool. You can use them for training and prediction based on your specific machine learning tasks.

Feedback

If you have any feedback or suggestions, please reach out to us at jaideep.dabhi7603@gmail.com

Acknowledgements

Hi, I'm Jaideepsinh Dabhi (jD)! 👋

🚀 About Me

🚀 Data Scientist | Analytics Enthusiast | Python Aficionado 🐍
I'm a Data Scientist working in the biotech industry.
I am based out of India 🇮🇳
I love to code in Python, Bash, and R.
I have a strong base in statistics and machine learning.
I am passionate about networking and fostering meaningful connections within the tech community. Feel free to reach out if you'd like to discuss Data Science 👨🏻‍💻, Machine Learning 🦾, Chess ♞ or Pens 🖋️

🔗 Links

GitHub · LinkedIn

License

This project is licensed under the Apache License 2.0
