Library that provides an entropy discretizer and wraps discretizers from mainstream libraries such as scikit-learn
EntroDistroPy
EntroDistro is a library that brings together a collection of binners, or discretizers. It also aims to provide a complete AI pipeline: a CSV file is given as input, then analyzed, processed and converted into a form that a machine learning model can consume. Finally, the pipeline should be able to extract conclusions. The use case driving the library is how to improve the ROI when a product is bought. Another idea under development is tuning: finding out which parameters to modify to improve a chosen column (the ROI in our case), either to obtain the maximum ROI or to draw conclusions about how the parameters behave.
Installation
Source file
- Download the source file from GitHub
- Unzip it and navigate to the folder containing setup.py and the other files
- Run the following command:
python setup.py install
Pip
pip3 install EntroDistroPy
What is entropy discretization? Why should it be used?
Entropy: a measure of how disordered a system is
Discretization or binning: dividing a column of data into bins, each with a minimum and a maximum value
Discretization (binning) by entropy is a state-of-the-art method :telescope::rocket:, which here means that it is new and experimental. The idea is that every column is binned according to its own entropy, so the binning is driven by the column's own characteristics. That lets the algorithm produce a better-fitting binning of the column.
What's done here?
The main idea of this function is to present the data the way a machine learning model expects it. To do so, the data must be preprocessed and adapted. The steps taken here are:
- Standardize -> rescale the data to have a mean of 0 and a standard deviation of 1 (unit variance).
- Normalize -> rescale the values into the range [0, 1].
- Discretize by entropy.
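The first two steps can be sketched with scikit-learn's own scalers. This is only an illustration: the toy column is made up, and applying the scalers one after the other in the listed order is an assumption, not a description of how the library chains them internally.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric column (hypothetical data, not from the library's CSV).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardize: subtract the mean and divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# Normalize: map the smallest value to 0 and the largest to 1.
normalized = MinMaxScaler(feature_range=(0, 1)).fit_transform(standardized)

print(standardized.ravel())
print(normalized.ravel())
```

After the second step every value lies in [0, 1], which is the form the discretizer receives in this sketch.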
How does the entropy discretizer work?
- Check that the target column is in the dataframe that was passed
- Remove the columns that have already been discretized by this algorithm
- Keep only the numeric columns in the dataframe (ints or floats)
- Cast the pandas dataframe to a NumPy array
- Assign X, the NumPy array that will be discretized
- Assign Y, the target variable extracted from the dataframe and cast to a NumPy array
- Obtain the number of features, which equals the number of columns in the dataframe
- Feed the data into the algorithm
- Transform one column of the dataframe at a time
- Save the cuts and bin information from the discretizer
- Save the results of the discretization.
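The steps above can be sketched roughly as follows. This is not the library's code: the function names (entropy, best_cut, find_cuts, discretize) are hypothetical, and the splitting rule is a simplified recursive entropy split with a fixed depth limit, chosen for brevity. The step that skips already discretized columns is left out because the README does not say how those columns are identified.

```python
import numpy as np
import pandas as pd


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def best_cut(x, y):
    """Return the cut point with the lowest weighted entropy, or None."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_point, best_score = None, entropy(y)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no boundary between equal values
        score = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        if score < best_score:
            best_point, best_score = (x[i] + x[i - 1]) / 2, score
    return best_point


def find_cuts(x, y, depth=2):
    """Recursively split x while a cut still reduces the entropy of y."""
    if depth == 0 or len(x) < 2:
        return []
    cut = best_cut(x, y)
    if cut is None:
        return []
    left = x <= cut
    lower = find_cuts(x[left], y[left], depth - 1)
    upper = find_cuts(x[~left], y[~left], depth - 1)
    return lower + [cut] + upper


def discretize(df, target, depth=2):
    """Discretize every numeric column of df against the target column."""
    if target not in df.columns:
        raise ValueError(f"target column {target!r} is not in the dataframe")
    y = df[target].to_numpy()
    X = df.drop(columns=[target]).select_dtypes(include="number")
    cuts, binned = {}, pd.DataFrame(index=df.index)
    for col in X.columns:  # one column at a time
        values = X[col].to_numpy(dtype=float)
        cuts[col] = find_cuts(values, y, depth)
        edges = np.asarray(cuts[col], dtype=float)
        binned[col] = np.searchsorted(edges, values, side="right")
    return binned, cuts


# Made-up data: one numeric column, one categorical column, one target.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "name": list("abcdef"),
    "bought": [0, 0, 0, 1, 1, 1],
})
binned, cuts = discretize(df, target="bought")
print(cuts)
print(binned)
```

In this toy example the only cut that lowers the entropy of the target falls between 3 and 10, so the price column ends up with two bins, and the non-numeric name column is dropped before binning.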
AI section
This section is new: it has just been added and has not been tested. The idea is that the output data is passed to a machine learning algorithm without further modification, because the preprocessing has already been done. The model selected is Naive Bayes. Depending on the data, its purpose, its columns and other factors, some models work better than others, so the function that creates the machine learning model also includes some tests and metrics to check and compare the available models.
How is the Naive Bayes model coded?
- Preprocess the data
- Remove columns that are not ints or floats (categoricals)
- Remove the target from the data
- If some columns were discretized by entropy, keep them
- Remove ROI (it was our use case)
- Extract the target
- Convert the dataframe to NumPy arrays and cast them to floats, just in case
- Split into train, test and validation sets
- Deploy machine learning algorithm
- Train
- Predict
- Obtain confusion matrix
- Obtain reports on how it performed
- Save report
- Save confusion matrix
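A minimal sketch of this flow with scikit-learn is shown below. The data is synthetic, the 60/20/20 split is an arbitrary choice, and the results are kept in a dictionary instead of being written to disk, so none of it should be read as the library's actual implementation.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Synthetic stand-in for discretized features: non-negative bin ids as floats.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 4)).astype(float)
y = (X[:, 0] + X[:, 1] > 4).astype(int)  # hypothetical YES/NO target

# Split into train, validation and test sets (60/20/20, an arbitrary choice).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

reports = {}
for model in (MultinomialNB(), ComplementNB()):
    name = type(model).__name__
    model.fit(X_train, y_train)           # train
    predictions = model.predict(X_test)   # predict
    reports[name] = {
        "validation_accuracy": model.score(X_val, y_val),
        "confusion_matrix": confusion_matrix(y_test, predictions, labels=[0, 1]),
        "report": classification_report(y_test, predictions, zero_division=0),
    }
    print(name, reports[name]["validation_accuracy"])
    print(reports[name]["confusion_matrix"])
    print(reports[name]["report"])
```

Both classifiers require non-negative features, which is why integer bin ids are a natural input for them.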
Which Naive Bayes algorithm has been selected?
Multinomial Naive Bayes and Complement Naive Bayes were selected as the main models for further operations, because they were the Bayesian models with the highest scores on the files we tested.
As we can see in the following image:
What is missing and what will be done in the next release?
The missing part is taking the output of the Naive Bayes model and creating a graph from it. The NetworkX library has been selected for this, and further development of this library will depend on NetworkX.
Notes and information about the variables used
A CSV file was used which cannot be updated. That CSV contained a column with the name and the ROI value of a service or object purchased.
Why was the ROI selected? Why did we want to improve it?
The ROI is the return on investment, so it is one of the most important variables. The objective, among others, is to tag the ROI and divide it into different categories (bins). Once that is done, the idea is to improve every bin, or just one level if desired. The model selected for this is a Naive Bayes classifier from scikit-learn. Since the problem is framed as binary (two options, yes or no), we must ask ourselves a question: does the ROI go up?
- If it goes up: [YES] -> (alto_moderado, medio_alto, medio, medio-moderado)
- If it goes down: [NO] -> (bajo_alto, bajo_moderado, deficiente, neutro)
This whole process happens in the function eval_column, which bins the column according to the number of categories selected for the YES and NO groups. The categories go from 0 to X (X being the number of categories): the lowest category holds the lowest values of the column and, logically, the highest category holds the highest. For the YES/NO decision the threshold is set to 5, but it can be changed if desired.
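The binning and thresholding idea can be illustrated with a small helper. The name tag_yes_no, the equal-width bins, the default of 10 categories and the rule "category >= threshold means YES" are all assumptions made for this sketch. The signature and exact behavior of eval_column are not documented here.

```python
import numpy as np
import pandas as pd


def tag_yes_no(values, n_categories=10, threshold=5):
    """Bin values into ordered categories and tag each one as YES or NO.

    Hypothetical helper: equal-width bins and the `>= threshold` rule are
    assumptions of this sketch, not the documented behavior of eval_column.
    """
    values = pd.Series(values, dtype=float)
    # labels=False returns the integer bin index, 0 for the lowest values.
    categories = pd.cut(values, bins=n_categories, labels=False)
    answers = np.where(categories >= threshold, "YES", "NO")
    return pd.DataFrame(
        {"value": values, "category": categories, "answer": answers}
    )


# Made-up ROI values, one per category.
roi = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
result = tag_yes_no(roi)
print(result)
```

Changing the threshold argument moves the boundary between the NO and YES groups without touching the bins themselves.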
Where those who come after me must go. Recommended readings
When this library was created this method did not exist, but as of version 0.23.2 of scikit-learn it appears to have been implemented. Feature binarization: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization It could also be used, and is recommended, with a Bernoulli distribution in a neural network.
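As a small illustration of feature binarization, the sketch below uses scikit-learn's Binarizer. The data and the threshold of 1.0 are made up, and a Bernoulli Naive Bayes classifier stands in for the neural network mentioned above, simply because it is the shortest model to show that expects binary features.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import Binarizer

# Made-up features and labels.
X = np.array([[0.2, 3.0], [1.5, 0.1], [0.7, 2.2], [2.4, 0.3]])
y = np.array([0, 1, 0, 1])

# Values strictly above the threshold become 1, everything else becomes 0.
X_binary = Binarizer(threshold=1.0).fit_transform(X)
print(X_binary)

# Binary features match the Bernoulli assumption of BernoulliNB.
model = BernoulliNB().fit(X_binary, y)
predictions = model.predict(X_binary)
print(predictions)
```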
What else can you find in this library?
- Folders with old functions, called old_functions
- Folders with old notes in Spanish, called Notas,Aclaraciones y Errores
What's next?
As stated above, the next step is to create a graph using NetworkX. The methods for visualizing results will also be improved, because the evolution of matplotlib and other libraries has left the current ones deprecated.
Python Compatibility
- Python - v3.7
Credits
This library was created with the help of scikit-learn, NetworkX, Matplotlib and NumPy