JustDataML: Simplified Machine Learning for Everyone
Welcome to JustDataML, a user-friendly tool designed to make machine learning accessible to all, regardless of technical background. JustDataML automates the process of model selection, data preprocessing, and prediction, allowing you to focus on insights rather than coding complexities.
Features
- Automated Model Selection
- Streamlined Preprocessing
- Quick Predictions
- Flexible Configuration
- Hyperparameter Tuning
- Accurate Predictions
Requirements
To run this tool, you will need the following.
Installation
Suggestion: use a virtual environment to run it successfully.
- Install the application with pip:
pip install JustDataML
Now you can use it as:
JDML --Config <Data Config file> --Train --Predict <test.csv> --Output <Output Name>
- Install by cloning this repository
git clone https://github.com/jaideepsinhdabhi/JustDataML.git
Go into the repo folder:
cd JustDataML
and install the required packages:
pip install -r requirements.txt
Usage/Examples: CLI Tool
python JDML/JDML.py --Config <Data Config File> --Train --Predict <test.csv> --Output <Output Name>
Arguments
- `-C, --Config`: You must provide a configuration file to supply all the necessary arguments. Note: make sure you pass the proper arguments as described in the documentation.
- `-T, --Train`: [Optional] If provided, initiates model training based on the specified data config file. Note: ensure the data files are available in the "Data" folder.
- `-P, --Predict`: [Optional] If provided, performs prediction using the trained model. Make sure not to delete or modify the artifact folder, or the results cannot be produced.
- `-O, --Output`: [Required with -P (--Predict)] Specifies the output name for the predicted DataFrame generated from the test data. Note: a Model_summary file is also generated, and if Hyperparameter is set to Yes in the config file, a stats file is generated as well.
Usage/Examples as a Python library
We can also use this tool as a Python library.
First, import the required libraries:
import pandas as pd  # to read our data
from JDML.JDML import Just_Data_ML  # to train and predict using the JDML tool

# We will also import the libraries used by our models.
# For normalization we use StandardScaler here, but other scalers work too.
from sklearn.preprocessing import StandardScaler

# For this example we will use Random Forest and Linear Regression.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
We create an instance to use JDML:
jdml_in = Just_Data_ML()
Let's read the data.
Here we use a bioinformatics dataset (KEGG Metabolic Reaction Network (Undirected)), and for our example we will predict edgeCount.
You can find the data here.
data_df = pd.read_csv("Data/Reaction Network (Undirected).csv",index_col=False)
Let's Assign Target_Col and Features for our task
Feature_Kegg = ["Connected Components","Diameter","Radius","Centralization","Shortest Path","Characteristic Path","Avg.num.Neighbours","Density","Heterogeneity","Isolated Nodes","Number of Self Loops","Multi-edge Node Pair","NeighborhoodConnectivity","NumberOfDirectedEdges","Stress","SelfLoops",
"Partner Of MultiEdged NodePairs","Degree","TopologicalCoefficient","BetweennessCentrality","Radiality","Eccentricity","NumberOfUndirectedEdges","ClosenessCentrality","AverageShortestPathLength","ClusteringCoefficient","nodeCount"]
target_col = ['edgeCount']
We have considered all columns as features, but you can perform feature selection and train the model on the selected features only. For now, we will use all the features and predict edgeCount as the target.
Now we also need to assign a few other parameters and arguments for our tool:
Problem_Objective = 'Regression'  # task: choose between 'Regression' and 'Classification'
Normalization = StandardScaler()  # normalization technique for our data
Models_to_train = {'Random Forest': RandomForestRegressor(n_jobs=-1), 'Linear Regression': LinearRegression()}  # select the models and pass them as a dict
For Model Training
jdml_in.Data_Train(Feature_Kegg,data_df,target_col,Models_to_train,
Problem_Objective,HyperParamter_Yes_or_No='Yes', #by default it will be 'No'
Normalization_tech=Normalization)
Your model is now trained; it's time for prediction.
# importing test csv as DataFrame
test_df = pd.read_csv("test.csv")
Note: you do not need to preprocess the test file; the tool handles that itself.
Now we run prediction on our test file:
predictions = jdml_in.Predict_test(Problem_Objective,Feature_Kegg,test_df)
Voilà! You will have the predictions (column Target_Out) in the output DataFrame.
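If you want to persist the output, the returned DataFrame can be written to disk with standard pandas calls. A minimal sketch, using a hypothetical stand-in DataFrame in place of the real `Predict_test` output (the Target_Out column name follows the note above; the values are illustrative only):

```python
import pandas as pd

# Hypothetical predictions DataFrame standing in for the JDML output;
# in practice this comes from jdml_in.Predict_test(...) above.
predictions = pd.DataFrame({"Target_Out": [12.5, 7.3, 9.1]})

# Save the predicted values to CSV for downstream use
predictions.to_csv("predictions.csv", index=False)
```

Writing with `index=False` keeps the file limited to the prediction column, which is usually what you want when the row order already matches the test file.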
Data Configuration File Example
This is an example of a data configuration file (Data_Config.csv) used with the JustDataML (JDML) tool. This file specifies the information needed for training and predicting with machine learning models.
CSV Structure:
The CSV file contains the following fields:
- Data_Name: Name of the dataset file, including extension; it must be present in the Data folder (Iris.data in this example).
- Features_Cols: Comma-separated list of feature columns in the dataset (sepal length, sepal width, petal length, petal width in this example).
- Target_Col: Name of the target column in the dataset (class in this example).
- Problem_Objective: Objective of the machine learning task (Classification or Regression).
- Normalization_tech: Normalization technique to be applied (StandardScaler, MinMaxScaler, etc.).
- Model_to_include: Models to include in the training process (ALL or specific models; see the list below).
- HyperParamter_Yes_or_No: Indicates whether hyperparameter tuning should be performed (Yes or No).

A sample CSV file and some data are provided in the Data folder for a demo run.
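Putting the fields above together, a minimal Data_Config.csv for the Iris example might look like the following. This is a sketch under the field definitions above; the exact quoting convention JDML expects for the comma-separated feature list is an assumption:

```csv
Data_Name,Features_Cols,Target_Col,Problem_Objective,Normalization_tech,Model_to_include,HyperParamter_Yes_or_No
Iris.data,"sepal length, sepal width, petal length, petal width",class,Classification,StandardScaler,ALL,No
```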
Note: logs are also generated into a logs folder for every run; check them to learn more about each run.
Available Models for Regression and Classification
Regression Models:
- Random Forest: an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees.
- Decision Tree: a non-parametric supervised learning method used for classification and regression. It partitions the input space into regions and predicts the target variable from the average of the training instances in the corresponding region.
- Gradient Boosting: a machine learning technique for regression and classification problems that builds models in a stage-wise manner and fits new models to the residuals of the previous models.
- Linear Regression: a linear approach to modeling the relationship between a dependent variable and one or more independent variables.
- XGBRegressor: XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
- Support Vector Reg: Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) algorithm used to predict a continuous variable.
- Linear Ridge: Ridge Regression is a linear regression technique used to analyze multiple regression data that suffer from multicollinearity.
- Linear Lasso: Lasso regression is a type of linear regression that uses shrinkage; it penalizes the absolute size of the regression coefficients.
- ElasticNet: a linear regression model that combines the properties of Ridge Regression and Lasso Regression.
- AdaBoost Regressor: AdaBoost (Adaptive Boosting) is an ensemble learning method that combines multiple weak learners to create a strong learner.
- KNeighborsRegressor: a simple, non-parametric method used for regression tasks based on the k-nearest neighbors algorithm.
Classification Models:
- Logistic Regression: a statistical method for analyzing a dataset in which one or more independent variables determine an outcome.
- Ridge Classification: the Ridge Classifier uses Ridge Regression to classify data points.
- GaussianNB: Gaussian Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between the features.
- KNeighborsClassifier: a simple, instance-based learning algorithm used for classification tasks based on the k-nearest neighbors algorithm.
- Decision Tree Classifier: a non-parametric supervised learning method used for classification.
- Random Forest Classifier: an ensemble learning method for classification that constructs multiple decision trees during training and outputs the class that is the mode of the classes of the individual trees.
- Support Vector Classifier: Support Vector Classifier (SVC) is a type of Support Vector Machine (SVM) algorithm used for classification tasks.
- AdaBoost Classifier: an ensemble learning method that combines multiple weak learners to create a strong learner.
- Gradient Boosting Classifier: a machine learning technique for classification problems that builds models in a stage-wise manner and fits new models to the residuals of the previous models.
- XGBClassifier: the XGBoost classifier is an optimized distributed gradient boosting library designed for classification problems.
These are the available regression and classification models supported by the JDML tool. You can use them for training and prediction based on your specific machine learning tasks.
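As a sketch of how the classification side plugs into the library API shown earlier, the Models_to_train dictionary can hold any of the classifiers above. The dictionary keys here mirror the model names listed; whether JDML requires these exact key strings is an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Map display names to scikit-learn estimator instances,
# mirroring the regression example earlier in this README.
Models_to_train = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Random Forest Classifier": RandomForestClassifier(n_jobs=-1),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
}
```

With Problem_Objective set to 'Classification', this dict would then be passed to Data_Train in place of the regression models used in the earlier example.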
Feedback
If you have any feedback or suggestions, please reach out to us at jaideep.dabhi7603@gmail.com
Acknowledgements
- Krish Naik's playlist on an end-to-end machine learning project. This playlist helped a lot; I followed it to the end and coded along.
- Krish Naik's GitHub repo for the above project, used as a reference.
- "How to Build a Complete Python Package Step-by-Step" by ArjanCodes. This video really helped me write a Python package and publish it to PyPI.
Hi, I'm Jaideepsinh Dabhi (jD)! 👋
🚀 About Me
🚀 Data Scientist | Analytics Enthusiast | Python Aficionado 🐍
I'm a Data Scientist working in the biotech industry.
I am based in India 🇮🇳
I love to code in Python, Bash, and R
I have a strong foundation in statistics and machine learning
I am passionate about networking and fostering meaningful connections within the tech community. Feel free to reach out if you'd like to discuss Data Science 👨🏻💻, Machine Learning 🦾, Chess ♞ or Pens 🖋️
License
This project is licensed under the Apache License 2.0
File details
Details for the file JustDataML-0.0.27.tar.gz.
File metadata
- Download URL: JustDataML-0.0.27.tar.gz
- Upload date:
- Size: 18.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.0
File hashes
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 9e439da6660aba127344e47fa235be0020af0b662a57e7e57a259719490ab5fa |
| MD5 | 622ba38796b9d31cb1f530bce025e99f |
| BLAKE2b-256 | 848674fa93ec854c38445e3b4a10a7fe77ae017ed6fba73ea6546d3435d875c8 |
File details
Details for the file JustDataML-0.0.27-py3-none-any.whl.
File metadata
- Download URL: JustDataML-0.0.27-py3-none-any.whl
- Upload date:
- Size: 21.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.0
File hashes
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | fd2521bacdf5281a072683878a71284096c31f60a13d42198c4227d220e17299 |
| MD5 | 4730d9bf673ae6718d87f20794f0cd6c |
| BLAKE2b-256 | 2e751dc2739811a9f2f8b635183a530e43d93b07da10bca67ffb8ac492f73469 |