

Project description


[image: DenseGrowNet architecture diagram (referenced in the Method section)]

Quick Start

For a basic understanding of the package, take a look at the "Intro" and "Method" sections first.

Three classes are currently available: DenseGrowNetBase, ElasticNetLoss, and CustomLinearLayer.

Example DenseGrowNetBase initialization:

from dense_grownet import DenseGrowNetBase

model = DenseGrowNetBase(input_size, output_size, is_first_model=False)
# is_first_model=True initializes a 1-layer neural network, which is essentially a generalized
# linear model. is_first_model=False initializes a 2-layer neural network with ReLU activation,
# where the first layer is 'looks-linear' initialized and the second layer is zero-initialized.

# Any model initialized with is_first_model=False requires passing prev_output,
# i.e. the previous models' predictions
predictions = model(inputs, prev_output)

# To extract the intermediate features generated by the first layer of a non-first model,
# for concatenation to the next model's inputs:
extracted_features = model.extract_features(inputs, prev_output)

Example ElasticNetLoss initialization:

from dense_grownet import ElasticNetLoss
import torch.nn as nn

criterion = ElasticNetLoss(criterion=nn.CrossEntropyLoss(), l1_lambda=0.01, l2_lambda=0.01)
# Loss is calculated as criterion_loss + l1_lambda * weights_l1_norm + l2_lambda * weights_l2_norm,
# with criterion being any desired loss function
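For reference, here is a minimal sketch of how a loss of this form might be computed. The forward signature (explicitly passing the model whose weights are penalized), the restriction to parameters named "weight", and the use of the squared L2 norm are illustrative assumptions, not the package's actual implementation:

import torch.nn as nn

class ElasticNetLossSketch(nn.Module):
    # Hypothetical sketch: wraps a base criterion and adds L1/L2 penalties on a model's weights
    def __init__(self, criterion, l1_lambda=0.01, l2_lambda=0.01):
        super().__init__()
        self.criterion = criterion
        self.l1_lambda = l1_lambda
        self.l2_lambda = l2_lambda

    def forward(self, outputs, targets, model):
        loss = self.criterion(outputs, targets)
        # Accumulate norms over weight matrices only (biases excluded here by assumption)
        weights = [p for name, p in model.named_parameters() if "weight" in name]
        l1_norm = sum(w.abs().sum() for w in weights)
        l2_norm = sum(w.pow(2).sum() for w in weights)  # squared L2; the package may use the plain norm
        return loss + self.l1_lambda * l1_norm + self.l2_lambda * l2_norm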

Example CustomLinearLayer initialization:

from dense_grownet import CustomLinearLayer

linear_layer_2 = CustomLinearLayer(input_size, output_size, init="looks_linear")
# Two init modes are supported: "zero" and "looks_linear". See the Method section for an example
# of 'looks-linear' initialization
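As an illustration of the 'looks-linear' pattern described in the Method section, the snippet below builds the corresponding first-layer weight matrix by hand. The helper function is hypothetical and is not part of the package:

import torch

def looks_linear_weight(n_features):
    # Builds the 2N x N 'looks-linear' pattern: each feature x_i feeds a (+1, -1) pair of neurons,
    # so the next layer can recover x_i as ReLU(x_i) - ReLU(-x_i)
    weight = torch.zeros(2 * n_features, n_features)
    for i in range(n_features):
        weight[2 * i, i] = 1.0       # neuron 2i passes +x_i
        weight[2 * i + 1, i] = -1.0  # neuron 2i + 1 passes -x_i
    return weight

print(looks_linear_weight(3))
# tensor([[ 1.,  0.,  0.],
#         [-1.,  0.,  0.],
#         [ 0.,  1.,  0.],
#         [ 0., -1.,  0.],
#         [ 0.,  0.,  1.],
#         [ 0.,  0., -1.]])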

Intro

Tree-based models, and particularly gradient-boosted decision trees (GBDTs), have long been and continue to be state-of-the-art on medium-sized tabular data, despite extensive deep-learning research on such data. Certain inductive biases of tree-based models contribute to their strong performance on tabular data: their rotationally variant learning procedure, which extracts information based on the features' orientation, and their robustness against uninformative features, in contrast to MLPs' rotationally invariant learning procedure and susceptibility to uninformative features (1). Gradient boosting's inductive bias can be described as a bias toward explaining the largest proportion of variance through simpler interaction terms, with the contribution to variance decreasing as the order of interaction increases, rather than through a large number of high-order interaction terms that each explain a small amount of variance; see (2) for a fuller explanation. This project aims to improve neural networks' performance on tabular data by investigating and applying the inductive biases of tree-based models, particularly GBDTs, to MLP-like neural nets.

Method

A gradient boosting technique for neural networks is proposed, with superior performance on tabular data and applicability to other forms of data and neural networks. The first, base model is a one-layer zero-initialized neural network, which is a generalized linear model; a dedicated solver can be used for this step, as we are only interested in the raw predictions of the first model. The second model, with the first layer 'looks linear' initialized (3), the second layer zero-initialized, and an activation function from the ReLU family, is then trained to predict the residuals of, or correct the error of, the previous model. 'Looks linear' initialization can be described as initializing the first layer's weights in the following pattern: [[1 0 ... 0 0], [-1 0 ... 0 0], [0 1 ... 0 0], [0 -1 ... 0 0], ..., [0 0 ... 1 0], [0 0 ... -1 0], [0 0 ... 0 1], [0 0 ... 0 -1]], so that there are 2N neurons for N features and the second, final layer can easily replicate linear inputs, since ReLU(x) - ReLU(-x) = x. For further explanation of 'looks linear' initialization, see (4). For each model after the second, use the original features together with the intermediate features generated by all previous models as input features for the current model, and initialize the next k models in the same fashion as the second model; a sketch of the full procedure is given below. For further clarification, see the diagram above. Adjust regularization, learning rate, and number of epochs as appropriate.
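Below is a rough sketch of how this staged procedure might be wired together using the classes from the Quick Start. The synthetic data, the optimizer, the learning rate and epoch counts, the assumption that a first model is called as model(inputs), and the assumption that later models return predictions already incorporating prev_output are all illustrative choices, not the package's prescribed workflow:

import torch
import torch.nn as nn
from dense_grownet import DenseGrowNetBase

torch.manual_seed(0)
n_samples, n_features, n_classes = 256, 10, 3
X = torch.randn(n_samples, n_features)          # synthetic stand-in for tabular features
y = torch.randint(0, n_classes, (n_samples,))   # synthetic class labels

criterion = nn.CrossEntropyLoss()   # ElasticNetLoss could be swapped in here, as discussed below
num_stages, num_epochs = 3, 50      # placeholders; adjust regularization, learning rate, epochs as appropriate

# Stage 1: the base model, a one-layer zero-initialized network (a generalized linear model)
first = DenseGrowNetBase(n_features, n_classes, is_first_model=True)
opt = torch.optim.Adam(first.parameters(), lr=1e-2)
for _ in range(num_epochs):
    opt.zero_grad()
    criterion(first(X), y).backward()
    opt.step()

models, features = [first], X
with torch.no_grad():
    prev_output = first(X)   # raw predictions of the base model

# Stages 2..k: two-layer models that correct the previous models' error, fed the original
# features plus the intermediate features extracted from all previous non-first models
for stage in range(1, num_stages):
    model = DenseGrowNetBase(features.shape[1], n_classes, is_first_model=False)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(num_epochs):
        opt.zero_grad()
        criterion(model(features, prev_output), y).backward()
        opt.step()
    models.append(model)
    with torch.no_grad():
        new_feats = model.extract_features(features, prev_output)   # first-layer features
        prev_output = model(features, prev_output)                  # updated running predictions
        features = torch.cat([features, new_feats], dim=1)          # inputs for the next model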

Explanation

The 'looks linear' initialization preserves the original orientation of the features at initialization. This alleviates the downside of a rotationally invariant learning procedure, where "intuitively, to remove uninformative features, a rotationally invariant algorithm has to first find the original orientation of the features, and then select the least informative ones: the information contained in the orientation of the data is lost" (1). Moreover, it achieves dynamical isometry, where the singular values of the network's input-output Jacobian concentrate near 1, which "has been shown to dramatically speed up learning", avoids vanishing/exploding gradients, and appears to improve generalization performance (5). Gradient boosting for neural networks in which each model is trained only on the original features proved infeasible, as the training loss decreased much more slowly with each subsequent model. To address this, intermediate features from previous models were used along with the original features as input for subsequent models, which significantly improved training speed. In a preliminary experiment, reusing intermediate features also appeared to offer slightly better generalization than training each model solely on the original features.

Additional

Elastic net regularization, i.e. combined L1 + L2 regularization, should be included as part of the training procedure; both L1 and L2 regularization promote smaller weights. L1 regularization corresponds to assuming a Laplace prior on the weights, under which weights tend to be sparse and are penalized proportionally to their size. L2 regularization corresponds to assuming a normal (Gaussian) prior on the weights, under which weights are penalized quadratically with their size. It is known that L1-regularized logistic regression, unlike rotationally invariant algorithms, is robust against uninformative features: its sample complexity grows only logarithmically with the number of irrelevant features (6). One can then see that L1 regularization is also desirable to include as part of the learning procedure, as it mirrors the inductive biases that contribute to tree-based models' strong performance on tabular data.

Additional Notes

Softplus appears to improve generalization and trainability, provided that the curved region of softplus is kept appropriately small, e.g. softplus(4 * x) / 4. This is because softplus otherwise has a tendency to mimic a linear or identity activation: if most pre-activation values are concentrated in a very small range around 0, say from -0.2 to 0.2, softplus looks more like a linear or identity function than a piecewise-linear function such as ReLU; see the example below.

[image: example of softplus behaving nearly linearly around zero]
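As a quick numerical illustration of this point (not taken from the package), compare the activations on inputs concentrated near zero:

import torch
import torch.nn.functional as F

x = torch.linspace(-0.2, 0.2, 5)   # small pre-activations concentrated around zero

print(F.relu(x))               # piecewise linear: exactly zero for x < 0
print(F.softplus(x))           # roughly 0.693 + 0.5 * x here: nearly affine, little nonlinearity
print(F.softplus(4 * x) / 4)   # equivalently F.softplus(x, beta=4): a much sharper, more ReLU-like kink at 0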

Contact: nhatbui@tamu.edu (it would be great to hear from anyone looking to discuss, collaborate, or act as a mentor on this research project :D )

Cool read:

Keywords spam: GrowNet, DenseNet, ResNet, neural networks, deep learning, dynamical isometry, 'looks linear' initialization, ReLU, Softplus, activation function, gradient boosting, inductive bias, decision trees, regularization, L1, L2, elastic net

To do:

References:

  1. Why do tree-based models still outperform deep learning on typical tabular data?

    https://openreview.net/pdf?id=Fp7__phQszn

  2. Tim Goodman's explanation of gradient boosting's inductive bias

    https://stats.stackexchange.com/questions/173390/gradient-boosting-tree-vs-random-forest#comment945015_174020

  3. The Shattered Gradients Problem: If resnets are the answer, then what is the question?

    https://proceedings.mlr.press/v70/balduzzi17b/balduzzi17b.pdf

  4. "looks-linear" initialization explanation

    https://www.reddit.com/r/MachineLearning/comments/5yo30r/comment/desyjot/

  5. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

    https://arxiv.org/pdf/1711.04735

  6. Feature selection, L1 vs. L2 regularization, and rotational invariance

    https://icml.cc/Conferences/2004/proceedings/papers/354.pdf

  7. Smooth Adversarial Training

    https://arxiv.org/abs/2006.14536

  8. Reproducibility in Deep Learning and Smooth Activations

    https://research.google/blog/reproducibility-in-deep-learning-and-smooth-activations/

