Additional Tools for Machine Learning
Created on Sun May 5 17:55:14 2024
Author: atdou
XtraMLTools
This package contains Regression and Classification modules to augment the options in sklearn. Additionally, there is a preprocessing module to help with outlier detection.
Regression
The Regression module includes Linear, Quadratic, and Polynomial Regression classes. The Quadratic Regression class allows one to select numerical features in the dataset to model quadratically. It does this by adding columns to the dataframe equal to the products of the selected columns, and then applying sklearn's linear regression algorithm. The Polynomial Regression class is similar: one selects the numerical features to model with a polynomial and inputs the desired degree, and Polynomial Regression adds the requisite columns to the dataframe and performs a linear regression. For instance, to model the features n1 and n2
n1 | n2 |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
7 | 8 |
9 | 10 |
11 | 12 |
with a 3rd-degree polynomial, it will add the features n1^3, n1^2*n2, n1*n2^2, and n2^3:
n1 | n2 | n1^3 | n1^2*n2 | n1*n2^2 | n2^3 |
---|---|---|---|---|---|
1 | 2 | 1 | 2 | 4 | 8 |
3 | 4 | 27 | 36 | 48 | 64 |
5 | 6 | 125 | 150 | 180 | 216 |
7 | 8 | 343 | 392 | 448 | 512 |
9 | 10 | 729 | 810 | 900 | 1000 |
11 | 12 | 1331 | 1452 | 1584 | 1728 |
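The column expansion in the table above can be sketched in plain Python. This is only an illustration of the idea, not the package's own code: the helper name `add_degree_terms` and the `n1*n1*n2` style column names are hypothetical, and a dict of lists stands in for a dataframe.

```python
from itertools import combinations_with_replacement
from math import prod

# The two numerical features from the table above.
columns = {"n1": [1, 3, 5, 7, 9, 11], "n2": [2, 4, 6, 8, 10, 12]}


def add_degree_terms(columns, features, degree):
    """Return a copy of `columns` with every product of exactly `degree`
    factors drawn (with repetition) from `features` added as a new column.

    Hypothetical helper for illustration; not part of XtraMLTools.
    """
    out = dict(columns)
    n_rows = len(columns[features[0]])
    for combo in combinations_with_replacement(features, degree):
        name = "*".join(combo)  # e.g. "n1*n1*n2" stands for n1^2*n2
        out[name] = [prod(columns[f][i] for f in combo) for i in range(n_rows)]
    return out


expanded = add_degree_terms(columns, ["n1", "n2"], 3)
for name, values in expanded.items():
    print(name, values)
```

A linear regression fitted on the expanded columns is then linear in its coefficients while being cubic in n1 and n2.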
Additionally, there are Categorical Linear, Categorical Quadratic, and Categorical Polynomial Regression classes. Say we have two numerical features as before, and additionally two categorical features: c1, with three possible values, and c2, with two. After one-hot encoding and dropping the first column of each, we might have something like this:
n1 | n2 | c1b | c1c | c2b |
---|---|---|---|---|
1 | 2 | 1 | 0 | 0 |
3 | 4 | 0 | 0 | 1 |
5 | 6 | 0 | 1 | 1 |
7 | 8 | 1 | 0 | 0 |
9 | 10 | 0 | 1 | 1 |
11 | 12 | 0 | 0 | 0 |
When we run an ordinary linear regression on this data in sklearn, the regression coefficients (slopes) of the n1 and n2 features will be independent of the values of the categorical features. This may not be desirable. For instance, if our target variable were y = distance, n1 were time, and c2 were gender, then we should generally expect the regression coefficient for n1 to be larger when c2 = male than when c2 = female. If we wish to allow the numerical features' coefficients to vary with the categorical features' values, we can model the data with one of the Categorical Regression classes. They do this by multiplying the selected numerical features by the selected categorical features, adding these product columns to the dataframe, and performing a linear regression.
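As a rough sketch of why the product columns help, the example below adds an n1*c2b interaction column and fits ordinary least squares. It makes several assumptions: the target y is invented for illustration (it is not data from this project), and NumPy's `lstsq` stands in for sklearn's linear regression.

```python
import numpy as np

# n1 and the one-hot column c2b from the table above.
n1 = np.array([1, 3, 5, 7, 9, 11], dtype=float)
c2b = np.array([0, 1, 1, 0, 1, 0], dtype=float)

# Hypothetical target, made up for this sketch:
# the slope in n1 is 2 when c2b = 0 and 2 + 3 = 5 when c2b = 1.
y = 1.0 + 2.0 * n1 + 0.5 * c2b + 3.0 * n1 * c2b

# Design matrix: intercept, n1, c2b, and the interaction column n1*c2b.
X = np.column_stack([np.ones_like(n1), n1, c2b, n1 * c2b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_when_c2b_0 = coef[1]
slope_when_c2b_1 = coef[1] + coef[3]
print(slope_when_c2b_0, slope_when_c2b_1)
```

Without the interaction column, a single n1 coefficient would have to serve both groups; with it, the coefficient on n1*c2b is the difference between the two slopes.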
Classification
The classification module is similar to the regression module. It allows one to add polynomial columns for the purely numeric features, and also to split these by any desired categorical features. It then fits a logistic regression model to the augmented features. In principle, this can separate classes whose boundary is a polynomial surface of the chosen degree in the feature space of the numerical variables, with coefficients that may depend on the categorical features.
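The effect of polynomial columns on a logistic regression can be illustrated with sklearn alone. This sketch does not use the XtraMLTools classes: it assumes sklearn's `PolynomialFeatures` and `LogisticRegression` as stand-ins, uses synthetic data with a circular class boundary, and leaves out the categorical split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for this sketch): points inside a circle are class 1.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)

# Plain logistic regression can only draw a straight-line boundary.
linear_clf = LogisticRegression(max_iter=5000).fit(X, y)

# Adding degree-2 columns lets the same linear model represent a circle.
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(C=1000.0, max_iter=5000),
).fit(X, y)

linear_acc = linear_clf.score(X, y)
poly_acc = poly_clf.score(X, y)
print(f"linear: {linear_acc:.2f}  polynomial: {poly_acc:.2f}")
```

The large `C` value weakens the default regularization so the fit can follow the separable boundary closely; that setting is a choice made for this sketch, not something the package is known to do.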
Preprocessing
There is also a preprocessing module, which contains a regression outlier removal class. This can be used to help identify outliers. One defines a dictionary of regressor models to feed into the object and fits the object to the data. It then runs a regression outlier removal procedure, progressively refining the outlier estimates for each model until they converge. One can then compare the outlier predictions of each of the models employed. One can also look at the predictions of an aggregate model that combines the predictions of all models and uses a majority vote with a user-defined threshold to decide whether a point is an outlier.
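The procedure described above can be sketched roughly as follows. This is not the package's implementation: the residual rule (flag a point when its absolute residual exceeds k standard deviations of the inlier residuals), the function names, and the use of NumPy polynomial fits as the regressors are all assumptions made for illustration.

```python
import numpy as np


def iterative_outlier_mask(fit_predict, x, y, k=3.0, max_iter=50):
    """Refit on the current inliers until the flagged set stops changing.

    Assumed rule: a point is an outlier if |residual| > k * std of inlier residuals.
    """
    outlier = np.zeros(len(y), dtype=bool)
    for _ in range(max_iter):
        inlier = ~outlier
        pred = fit_predict(x[inlier], y[inlier], x)  # train on inliers, predict all
        resid = y - pred
        new_outlier = np.abs(resid) > k * resid[inlier].std()
        if np.array_equal(new_outlier, outlier):
            break  # converged: same points flagged twice in a row
        outlier = new_outlier
    return outlier


def poly_model(degree):
    """Hypothetical stand-in regressor: least-squares polynomial of given degree."""
    def fit_predict(x_train, y_train, x_all):
        return np.polyval(np.polyfit(x_train, y_train, degree), x_all)
    return fit_predict


# Dictionary of models, mirroring the "dictionary of regressor models" idea.
models = {"linear": poly_model(1), "quadratic": poly_model(2)}

# Synthetic data with two planted outliers (invented for this sketch).
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + 0.5 * np.sin(x)
y[5] += 40.0
y[15] -= 20.0

masks = {name: iterative_outlier_mask(m, x, y) for name, m in models.items()}

# Aggregate: flag a point when at least `vote_threshold` of the models flag it.
vote_threshold = 0.5
votes = np.mean([masks[name] for name in models], axis=0)
aggregate = votes >= vote_threshold
print(np.flatnonzero(aggregate))
```

Refitting on the inliers only matters here: the smaller planted outlier is masked by the larger one on the first pass and is only flagged once the larger one has been excluded from the fit.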
Installation
Install with pip:
pip install XtraMLTools
Usage examples
Usage examples will be added later.
Hashes for XtraMLTools-0.1.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 7a4604f0a3a019319c2253d6f67990b9203e2b1831d6dbdc8df4fea91b55dfed |
MD5 | 85b6564a4428a597b67b671122787f0e |
BLAKE2b-256 | 9299431ab600b9b50db02f4fddbb973a2d6ad2c29704bd56faa2644d177b2d94 |