Latent Dimensionality Reduction in Python

LDR stands for Latent Dimensionality Reduction. It is a generic method for interpreting predictive models, distributed here as a Python module.

About

The purpose of LDR is to solve a common data science problem: models with higher predictive accuracy tend to be more complex. Such complex models are aptly referred to as black box models. This is frustrating for many data scientists, who end up with a model that performs well but cannot explain why it does. A model that cannot be explained frequently fails in critical situations that are difficult to test for.

Getting Started

Prerequisites

Python 3.

Note: Some examples require additional dependencies that are not installed by default, such as PyTorch.

Installation

python3 -m pip install ldr --user

Examples

An example analysis of a classification problem can be found here.

Explanation

The examples above generate either 1D or 2D visualizations. These demonstrate the certainty of the model across the sample space it was trained on. Take the 1D visualizations from the classification example, which describe the model's certainty in classifying breast cancer samples from a study as malignant, as a function of their mean area and area error.

1 Dimensional Visualization

1D visualization example

Here it can be seen that if a cancer has a mean area of less than 250, the model is most likely to classify it as benign, whereas if it has a mean area of more than 1200, it is almost certain to classify it as malignant.

With the area error, there is not enough training data between 280 and 480 for the model to make a reliable prediction there, as the total model certainty drops to 0 for both classifications. Between 80 and 210 the model is heavily biased towards a malignant classification.
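
The idea behind these certainty curves can be sketched without the ldr API itself. The snippet below is an illustrative approximation only: it is not the package's interface, and it holds the other features at their medians rather than integrating them out over a density estimate as LDR does. It trains a classifier on the scikit-learn breast cancer dataset and plots its predicted probability of malignancy along the mean area feature.

# Sketch: a 1D "certainty" curve for a single feature.
# Illustrative only; this is not the ldr package's API.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Sweep "mean area" across its observed range, holding all other
# features at their median values.
feature = list(data.feature_names).index("mean area")
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 200)
probe = np.tile(np.median(X, axis=0), (len(grid), 1))
probe[:, feature] = grid

# In this dataset, class 0 is malignant.
certainty = clf.predict_proba(probe)[:, 0]
plt.plot(grid, certainty)
plt.xlabel("mean area")
plt.ylabel("P(malignant)")
plt.show()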

2 Dimensional Visualization

2D visualization example

Interpolating the classification certainties of the two individual features results in the heatmap in the bottom right. What can be seen is that the majority of samples have a low mean area and a low area error. As samples deviate from the bottom left, with increasing mean area and area error, the probability of a malignancy classification increases dramatically.
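
The same simplified approach extends to two dimensions. The sketch below (again illustrative only, not the ldr API, and with the remaining features held at their medians) evaluates the classifier over a grid of mean area and area error values and renders the result as a heatmap:

# Sketch: a 2D "certainty" heatmap over two features.
# Illustrative only; this is not the ldr package's API.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
clf = RandomForestClassifier(random_state=0).fit(X, y)

f1 = list(data.feature_names).index("mean area")
f2 = list(data.feature_names).index("area error")
g1 = np.linspace(X[:, f1].min(), X[:, f1].max(), 100)
g2 = np.linspace(X[:, f2].min(), X[:, f2].max(), 100)
G1, G2 = np.meshgrid(g1, g2)

# Probe the model across the grid, other features held at medians.
probe = np.tile(np.median(X, axis=0), (G1.size, 1))
probe[:, f1] = G1.ravel()
probe[:, f2] = G2.ravel()

heat = clf.predict_proba(probe)[:, 0].reshape(G1.shape)
plt.pcolormesh(G1, G2, heat, shading="auto")
plt.xlabel("mean area")
plt.ylabel("area error")
plt.colorbar(label="P(malignant)")
plt.show()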

n Dimensional Visualizations

Even though only 1 and 2 dimensional visualizations are shown, the entire classifier certainty across the sample space has been mapped. In this study there are actually 30 features, but by using LDR the individual components can be easily extracted. In order for a human to interpret an algorithmic model, they must be able to see it. As Donald Knuth says:

"An algorithm must be seen to be believed, and the best way to learn what an algorithm is all about is to try it."

Note: most successful black box models come from algorithmic techniques, such as neural networks, which is why I draw this comparison.

Why LDR is Better Than Naive Techniques

Naive techniques for analysing black box models consist of plotting individual features (or composite functions of features) directly against the classifications resulting from each sample in the sample space. Because of this, the extrapolation that happens when the model is applied to real-world situations is never tested. Because LDR draws samples randomly from a kernel density estimate of the sample space, extrapolation happens by default.
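
As a concrete contrast, here is a sketch of the two probing strategies (illustrative only, not the ldr API; X_train is assumed to be a min-max scaled training set in [0, 1], and clf any fitted classifier exposing predict_proba):

# Illustrative sketch only; not the ldr package's API.
import numpy as np
from sklearn.neighbors import KernelDensity

def naive_probe(clf, X_train):
    # Evaluates the model only at the points it was trained on.
    return clf.predict_proba(X_train)

def kde_probe(clf, X_train, n=10000, bandwidth=0.05, seed=0):
    # Evaluates the model at new points drawn from a KDE of the
    # training samples, so nearby unseen regions are exercised too.
    kde = KernelDensity(bandwidth=bandwidth).fit(X_train)
    X_new = np.clip(kde.sample(n, random_state=seed), 0.0, 1.0)
    return X_new, clf.predict_proba(X_new)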

The LDR Recipe

As machine learning algorithms go, LDR is pretty simple. It's effectively just a Monte Carlo integration across the sample space, using the VEGAS algorithm as the sampling method.

  1. Min-max scale the data so that it falls into [0, 1] intervals.

  2. Train a predictive model on the scaled data.

  3. Create a kernel density estimate of the training samples.

  4. Sample n new points from the kernel density estimate, using the predictive model to make a prediction at each point.

  5. Bin the samples according to regular intervals. For each dimension, group points with resolution r, reducing the value of the bin to the mean prediction across it.

See the paper for more detail.
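
A minimal end-to-end sketch of these five steps might look like the following. It is an illustration of the recipe rather than the ldr package's own code, and it substitutes plain KDE sampling for the VEGAS-weighted sampling mentioned above:

# Sketch of the LDR recipe (illustrative; not the ldr package's API).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Min-max scale the data into [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# 2. Train a predictive model on the scaled data.
clf = RandomForestClassifier(random_state=0).fit(X_scaled, y)

# 3. Create a kernel density estimate of the training samples.
kde = KernelDensity(bandwidth=0.05).fit(X_scaled)

# 4. Sample n new points from the KDE and predict at each point.
n = 50000
samples = np.clip(kde.sample(n, random_state=0), 0.0, 1.0)
preds = clf.predict_proba(samples)[:, 0]   # P(malignant)

# 5. Bin the samples at resolution r along one dimension, reducing
#    each bin to the mean prediction across it.
r = 50
feature = 3                                # "mean area"
bins = np.minimum((samples[:, feature] * r).astype(int), r - 1)
certainty = np.array([preds[bins == b].mean() if np.any(bins == b)
                      else np.nan for b in range(r)])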

Additional Notes

If you find this package useful, please consider contributing!
