Skip to main content

Additional Tools for Machine Learning

Project description

XtraMLTools

This package contains Regression and Classification modules to augment the options in sklearn. Additionally, there is a preprocessing module to help with outlier detection.

Regression

The Regression module includes Linear, Quadratic, and Polynomial Regression classes. The Quadratic Regression class allows one to select numerical features in the dataset to model quadratically. It does this by adding to the dataframe additional columns equal to the products of the selected columns, and then applying sklearn's linear Additionally regression algorithm. The Polynomial Regression class is similar. One selects the numerical features to model with a polynomial, inputs the desired degree, and then Polynomial Regression will add the requisite columns to the dataframe and perform a linear regression. For instance, to model features n1, and n2

n1 n2
1 2
3 4
5 6
7 8
9 10
11 12

with 3rd degree polynomial, it will add additional features: n13, n12n2, n1n22, n23,

n1 n2 n13 n12n2 n1n22 n23
1 2 1 2 4 8
3 4 27 36 48 64
5 6 125 150 180 216
7 8 343 392 448 512
9 10 729 810 900 1000
11 12 1331 1452 1584 1728

Additionally, there are Categorial Linear, Categorical Quadratic, and Categorical Polynomial Regression Classes. Say we have two numerical features as before, and additionally two categorical features, c1, with three possible values, and c2, with two. After one-hot-encoding, and dropping the first columns, we might have something like this:

n1 n2 c1b c1c c2b
1 2 1 0 0
3 4 0 0 1
5 6 0 1 1
7 8 1 0 0
9 10 0 1 1
11 12 0 0 0

When we run an ordinary linear regression on this data in sklearn, the regression coefficients/slopes of the n1 and n2 features will be independent of the values of the categorical features.
This may not be desirable. For instance, if our target variable were y = distance, n1 were time, and c2 were gender, then we should generally expect the regression coefficient for n2 to be larger when c2 = male, than when c2 = female. If we wish to allow the numerical features' coefficients to vary with the categorical features' values, we can model the data with one of the Categorical Regression classes. They do this by multiplying the selected numerical features by the selected categorical features, adding these columns to the dataframe, and performing a linear regression.

Classification

The classification module is similar to the regression module. It allows one to add polynomial columns to the purely numeric features, and to also split these by whichever desired categorical features. Then it fits a logistic regression curve through the augmented features. This ought to be able to perfectly classify features that can be separated by any polynomial surface in the feature space of the numerical variables, with coefficients that possibly depend on the categorical features.

Preprocessing

There is as well a preprocessing module which contains a regression outlier removal class. This can be used to help identify outliers. One defines a dictionary of regressor models to feed into the object, fits the object to the data, and it will run a regression outlier removal program, progressively refinining the outlier estimations for each model until these converge. One can then compare the outlier predictions for each of the models employed. One can also look at the predictions of an aggregate model that combines the predictions of all models and uses a user threshold majority vote for making a final prediction on whether a point is an outlier or not.

Installation

so,

pip install XtraMLTools

Usage examples

Later, gator.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

XtraMLTools-0.1.2.tar.gz (23.6 kB view hashes)

Uploaded Source

Built Distribution

XtraMLTools-0.1.2-py3-none-any.whl (21.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page