Auto machine learning library based on the minimum nescience principle


# Auto machine learning with the Nescience class

In this tutorial we are going to see how to use the Nescience class to compute the nescience (how much we do not know) of a model and a dataset. We are also going to see how to use the individual terms that compose the nescience: miscoding, inaccuracy and surfeit.

For the details of the theory of nescience and its applications to artificial intelligence, you can download the book for free from http://www.mathematicsunknown.com/nescience.pdf.

## Installation

Download the Nescience directory. Make sure you run your script from the directory that contains the Nescience subdirectory. Alternatively, you can place the Nescience subdirectory in a directory included in your PYTHONPATH.
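
For example, assuming a hypothetical location for the library (adjust the path to your own setup):

```python
# Make the directory that contains the Nescience subdirectory importable.
# "/home/user/libraries" is just a placeholder path.
import sys
sys.path.append("/home/user/libraries")
```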

## Preliminaries

Import the following packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
```

For the examples we are going to use the breast cancer and handwritten digits datasets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
```

We will apply the Nescience class to decision tree classifier models and multilayer perceptron models.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
```

We will also use synthetic datasets to better understand the behaviour of the new metrics.

```python
# In recent versions of scikit-learn make_classification lives in
# sklearn.datasets (older versions exposed it under
# sklearn.datasets.samples_generator).
from sklearn.datasets import make_classification
```

## Miscoding

Miscoding is a measure of how well a dataset X encodes a response variable y. Miscoding can be used for feature selection (to identify the most relevant features) or for model evaluation (to check how well the model is using the dataset).

```python
from Nescience.Nescience import Miscoding
```

### Feature Selection

Feature miscoding measures the effort required to encode the target variable y given the knowledge of an individual feature Xi. The higher this value, the better, since it means the feature contains relevant (and only relevant) information about the target.
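
To build some intuition about what this quantity captures, here is a rough sketch based on compression, in the spirit of the normalized compression distance. This is an illustration under our own assumptions, not the estimator implemented in the Miscoding class (which also rescales its output so that, as we will see below, values close to 1 flag the most relevant features):

```python
# Illustration only: use zlib compressed lengths as a crude stand-in for
# description lengths. A smaller distance between a (discretised) feature
# and the target suggests the feature encodes more information about it.
import zlib
import numpy as np

def ncd(a, b):
    # Normalized compression distance between two byte strings.
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

def feature_vs_target(xi, y):
    # Discretise the feature so the compressor can find regularities.
    bins = np.histogram_bin_edges(xi, bins=10)
    xi_bytes = np.digitize(xi, bins).astype(np.uint8).tobytes()
    y_bytes = np.asarray(y, dtype=np.uint8).tobytes()
    return ncd(xi_bytes, y_bytes)
```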

We will use a synthetic dataset to show how this method works. We will randomly generate four clouds of points classified according to twenty features, among which only four are informative.

```python
n_samples = 1000
n_features = 20
n_informative = 4
n_classes = 4

X, y = make_classification(n_samples=n_samples, n_features=n_features,
                           n_informative=n_informative, n_redundant=0,
                           n_repeated=0, n_classes=n_classes,
                           n_clusters_per_class=1, weights=None, flip_y=0,
                           class_sep=1.0, hypercube=True, shift=0.0,
                           scale=1.0, shuffle=True, random_state=1)
```

We have to initialize the Miscoding class with the dataset we are going to use.

```python
miscoding = Miscoding()
miscoding.fit(X, y)
```

Let's get the target conditional complexity with respect to all the features.

```python
mscd = miscoding.miscoding_features()
```

And plot the results.

```python
plt.bar(x=np.arange(0, 20), height=mscd, tick_label=np.arange(1, 21))
plt.xlabel("Feature")
plt.ylabel("Miscoding")
plt.title("Feature Selection with Miscoding")
plt.show()
```

There are clearly four features that have some "predictive" power over the target variable. A value close to 1 means more predictive power.
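
Since `miscoding_features` returns one value per feature, we can rank the features directly; with this synthetic dataset the top four positions should correspond to the four informative features:

```python
# 1-based indices of the four most relevant features, to match the plot.
top4 = np.argsort(mscd)[::-1][:4] + 1
print(top4)
```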

Let's compare this with a classical correlation approach.

```python
df = pd.DataFrame(X)
df['y'] = y
corr = df.corr()
```

```python
plt.bar(x=np.arange(0, 20), height=abs(corr['y'][:-1].values), tick_label=np.arange(1, 21))
plt.xlabel("Feature")
plt.ylabel("Correlation")
plt.title("Feature Selection with Correlation")
plt.show()
```

In this case, there are three features clearly correlated with the target variable; however, it is not clear whether there is a fourth one.

### Model Miscoding

Now, let's compute the miscoding of a trained model. The miscoding of a model measures the "relevance" of the collection of features used in the model to predict the target variable.

```python
data = load_breast_cancer()
X = data.data
y = data.target
```

```python
miscoding = Miscoding()
miscoding.fit(X, y)
```

```python
tree = DecisionTreeClassifier(min_samples_leaf=5)
tree.fit(X, y)
```

In order to do that, we have to pass the trained model.

```python
msd = miscoding.miscoding_model(tree)
msd
```

```
0.8566936097053427
```

It seems that the model is not using all the relevant attributes. Let's see which attributes are in use.

```python
attr_in_use = np.zeros(X.shape[1], dtype=bool)   # boolean mask over the features
features = set(tree.tree_.feature[tree.tree_.feature >= 0])
for i in features:
    attr_in_use[i] = True
print(np.unique(data.feature_names[attr_in_use]))
```

```
['mean radius' 'mean texture']
```

```python
print(np.unique(data.feature_names))
```

```
['area error' 'compactness error' 'concave points error' 'concavity error'
 'fractal dimension error' 'mean area' 'mean compactness' 'mean concave points'
 'mean concavity' 'mean fractal dimension' 'mean perimeter' 'mean radius'
 'mean smoothness' 'mean symmetry' 'mean texture' 'perimeter error'
 'radius error' 'smoothness error' 'symmetry error' 'texture error'
 'worst area' 'worst compactness' 'worst concave points' 'worst concavity'
 'worst fractal dimension' 'worst perimeter' 'worst radius' 'worst smoothness'
 'worst symmetry' 'worst texture']
```

## Inaccuracy

The inaccuracy of a model, according to the theory of nescience, is the effort, measured as the length of a computer program, to fix the errors made by the model.
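
To get a feeling for this definition, here is a conceptual sketch under our own assumptions (it is not the estimator used by the Inaccuracy class): approximate the "program that fixes the errors" by the compressed size of the mispredicted rows, relative to the compressed size of the whole dataset.

```python
# Illustration only: zlib compressed length as a crude proxy for the
# length of the shortest description.
import zlib
import numpy as np

def inaccuracy_sketch(model, X, y):
    X, y = np.asarray(X), np.asarray(y)
    wrong = model.predict(X) != y
    corrections = np.column_stack([X[wrong], y[wrong]]).tobytes()
    data = np.column_stack([X, y]).tobytes()
    # The cheaper the corrections are to describe, the lower the inaccuracy.
    return len(zlib.compress(corrections)) / len(zlib.compress(data))
```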

```python
from Nescience.Nescience import Inaccuracy
```

```python
data = load_digits()
X = data.data
y = data.target
```

```python
tree = DecisionTreeClassifier(min_samples_leaf=5)
tree.fit(X, y)
```

```python
inacc = Inaccuracy()
inacc.fit(X, y)
```

```python
inacc.inaccuracy_model(tree)
```

```
0.17320124237643914
```

Compare the result with the model score:

```python
1 - tree.score(X, y)
```

```
0.07846410684474125
```

### Adding more errors

Let's see what happens if we make the same error one hundred times.

```python
X2 = X.tolist()
y2 = y.tolist()
pred = tree.predict(X).tolist()

for i in np.arange(100):
    X2.append(X2[0])
    y2.append(y2[0])
    pred.append((y2[0] + 1) % 10)
```

```python
inacc.fit(X2, y2)
inacc.inaccuracy_predictions(pred)
```

```
0.20663657504275057
```

```python
1 - tree.score(X2, y2)
```

```
0.07380073800738007
```

The theory of nescience states that making the same error one hundred times is not that bad. Let's see what happens if we make one hundred different errors.

```python
X3 = X.tolist()
y3 = y.tolist()
pred = tree.predict(X).tolist()

for i in np.arange(100):
    X3.append(X[0])
    y3.append(y[0])
    pred.append(np.random.randint(10))
```

```python
inacc.fit(X3, y3)
inacc.inaccuracy_predictions(pred)
```

```
0.24409238288694385
```

```python
1 - tree.score(X3, y3)
```

```
0.07380073800738007
```

Making one hundred different errors is worse than making the same error one hundred times. The classical score does not take that into account.
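
The compressor intuition behind this result (an illustration, not the library's code) is that a correction that repeats is almost free to describe, whereas one hundred distinct corrections are not:

```python
import zlib
import numpy as np

same_error = bytes([7]) * 100   # the same correction, repeated 100 times
rng = np.random.default_rng(0)
different_errors = rng.integers(0, 10, 100, dtype=np.uint8).tobytes()

print(len(zlib.compress(same_error)))        # a handful of bytes
print(len(zlib.compress(different_errors)))  # close to the raw 100 bytes
```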

### Imbalanced dataset

Let's see the behaviour of inaccuracy when fitting a model to a highly imbalanced dataset. We will use a decision tree classifier for which we require a minimum leaf size. Given the amount of data available, for high values of the minimum leaf size the tree will not be able to fit the data.

```python
depth = list()
score = list()
inaccuracy = list()
```

```python
for i in np.arange(1, 100):
    my_score = list()
    my_inaccuracy = list()
    for k in range(100):
        X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                   n_redundant=0, n_repeated=0, n_classes=2,
                                   n_clusters_per_class=2, class_sep=2,
                                   flip_y=0, weights=[0.95, 0.05])
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        inacc.fit(X_test, y_test)
        tree = DecisionTreeClassifier(min_samples_leaf=i)
        tree.fit(X_train, y_train)
        my_score.append(1 - tree.score(X_test, y_test))
        my_inaccuracy.append(inacc.inaccuracy_model(tree))
    depth.append(i)
    score.append(np.mean(my_score))
    inaccuracy.append(np.mean(my_inaccuracy))
```

```python
plt.plot(depth, score, label="Score")
plt.plot(depth, inaccuracy, label="Inaccuracy")
plt.title("Imbalanced Dataset")
plt.ylabel("Error")
plt.xlabel("Minimum Leaf Size")
plt.legend(loc='best')
```

As we can observe, the score is not able to detect that we have a problem with high values of the minimum leaf size parameter. However, inaccuracy tells us that the model is wrong in those cases.

## Surfeit

Surfeit tells us how far we are from having the shortest possible model for a dataset. It also allows us to compare models with very different assumptions and shapes.
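
One way to build intuition for this measure (a sketch under our own assumptions, not how the Surfeit class works internally) is to compare the length of a model's serialized description with its compressed length: the more compressible the description, the more redundant it is, and so the further we are from the shortest equivalent model.

```python
# Illustration only: redundancy of a model's own description.
import pickle
import zlib

def surfeit_sketch(model):
    raw = pickle.dumps(model)    # the model's literal description
    short = zlib.compress(raw)   # a shorter description of the same model
    return 1 - len(short) / len(raw)
```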

Let's compare a decision tree and a neural network.

```python
from Nescience.Nescience import Surfeit
```

```python
data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=.3)
```

```python
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree.score(X_test, y_test)
```

```
0.8444444444444444
```

```python
nn = MLPClassifier()
nn.fit(X_train, y_train)
nn.score(X_test, y_test)
```

```
0.987037037037037
```

```python
sft = Surfeit()
sft.fit()
```

```python
sft.surfeit_model(tree)
```

```
0.9523382805936712
```

```python
sft.surfeit_model(nn)
```

```
0.6660665113323448
```

In this case, neural networks are a much better model to classify images than decision trees, not only because they have a higher score, but also because their models are closer to the optimal one.

## Nescience

Nescience is a measure of how much we do not know about the problem at hand given a dataset and a model. Nescience is a function of the quantities we have already seen: miscoding, inaccuracy and surfeit. We are looking for a model that minimizes all three components.
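
For intuition only, one possible way to aggregate the three terms is a Euclidean norm; this is an assumption for illustration, and the Nescience class may combine them differently (for instance, with weights):

```python
# Hypothetical aggregation, not the package's actual formula.
def nescience_sketch(miscoding, inaccuracy, surfeit):
    return (miscoding**2 + inaccuracy**2 + surfeit**2) ** 0.5
```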

```python
from Nescience.Nescience import Nescience, Miscoding, Inaccuracy, Surfeit
```

### Hyperparameter Optimization

Let's see how we can apply the concept of nescience to find an optimal value for one of the hyperparameters of decision trees. Please mind that this approach is different from the one used in the NescienceDecisionTreeClassifier algorithm of the Nescience package.

```python
data = load_breast_cancer()
X = data.data
y = data.target
```

```python
nescience = Nescience()
nescience.fit(X, y)
```

```python
lmiscoding = list()
linaccuracy = list()
lredundancy = list()
lnescience = list()

miscoding = Miscoding()
miscoding.fit(X, y)
inaccuracy = Inaccuracy()
inaccuracy.fit(X, y)
surfeit = Surfeit()
surfeit.fit()

for i in range(10, 30):
    tree = DecisionTreeClassifier(min_samples_leaf=i)
    tree.fit(X, y)
    lmiscoding.append(miscoding.miscoding_model(tree))
    linaccuracy.append(inaccuracy.inaccuracy_model(tree))
    lredundancy.append(surfeit.surfeit_model(tree))
    lnescience.append(nescience.nescience(tree))
```

```python
fig, axs = plt.subplots(4, gridspec_kw={'hspace': 0.4, 'wspace': 0})
axs[0].plot(np.arange(10, 30), lmiscoding)
axs[0].set_title('Miscoding')
axs[1].plot(np.arange(10, 30), linaccuracy)
axs[1].set_title('Inaccuracy')
axs[2].plot(np.arange(10, 30), lredundancy)
axs[2].set_title('Redundancy')
axs[3].plot(np.arange(10, 30), lnescience)
axs[3].set_title('Nescience')
```

The minimum nescience achieved is:

```python
np.min(lnescience)
```

```
0.44991785345528007
```

And so, the optimal minimum number of samples per leaf should be:

```python
10 + np.argmin(lnescience)
```

```
16
```

Compare the result with the classical way of doing this kind of thing.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

```python
lscore = list()

for i in np.arange(10, 30):
    tree = DecisionTreeClassifier(min_samples_leaf=i)
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    lscore.append(score)
```

```python
plt.plot(np.arange(10, 30), lscore)
```

```python
max(lscore)
```

```
0.9649122807017544
```

```python
10 + np.argmax(lscore)
```

```
16
```

Both methods provide the same result. The nice point about the Nescience class is that we have reached that conclusion without splitting the data into train and test subsets. That is, nescience avoids overfitting by design.
