
A standard framework for using Deep Learning for tabular data


PyTorch Tabular


PyTorch Tabular aims to make Deep Learning with Tabular data easy and accessible to real-world cases and research alike. The core principles behind the design of the library are:

  • Low Resistance Usability
  • Easy Customization
  • Scalable and Easier to Deploy

It is built on the shoulders of giants like PyTorch (obviously) and PyTorch Lightning.


Installation

Although the installation includes PyTorch, the best and recommended way is to first install PyTorch from pytorch.org, picking the right CUDA version for your machine.

Once you have PyTorch installed, just use:

pip install pytorch_tabular[extra]

to install the complete library with extra dependencies (Weights & Biases and Plotly).

And:

pip install pytorch_tabular

for the bare essentials.

The sources for pytorch_tabular can be downloaded from the GitHub repo.

You can either clone the public repository:

git clone https://github.com/manujosephv/pytorch_tabular.git

Once you have a copy of the source, you can install it with:

pip install .

or

python setup.py install
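
Either way, a quick sanity check of the install (assuming the package exposes a __version__ attribute):

python -c "import pytorch_tabular; print(pytorch_tabular.__version__)"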

Documentation

For complete documentation with tutorials, visit ReadTheDocs.

Available Models

  • FeedForward Network with Category Embedding is a simple FF network, but with embedding layers for the categorical columns.
  • Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data is a model presented at ICLR 2020 that, according to the authors, beats well-tuned Gradient Boosting models on many datasets.
  • TabNet: Attentive Interpretable Tabular Learning is another model coming out of Google Research which uses Sparse Attention in multiple steps of decision making to model the output.
  • Mixture Density Networks is a regression model which uses Gaussian components to approximate the target function and provides a probabilistic prediction out of the box.
  • AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks is a model which tries to learn interactions between the features in an automated way, creates a better representation, and then uses this representation in downstream tasks.
  • TabTransformer is an adaptation of the Transformer model for Tabular Data which creates contextual representations for categorical features.
  • FT Transformer from Revisiting Deep Learning Models for Tabular Data
  • Gated Additive Tree Ensemble is a novel high-performance, parameter- and computationally-efficient deep learning architecture for tabular data. GATE uses a gating mechanism, inspired by GRUs, as a feature-representation learning unit with an in-built feature selection mechanism. It combines this with an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention, to predict the desired output. (A sketch of swapping between model configs follows this list.)
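
All of these models plug into the same training workflow; only the model config changes. A minimal sketch of swapping configs, assuming the class names FTTransformerConfig and TabNetModelConfig match your installed release (CategoryEmbeddingModelConfig is shown in the Usage section below; check the docs for the others):

from pytorch_tabular.models import (
    CategoryEmbeddingModelConfig,
    FTTransformerConfig,
    TabNetModelConfig,
)

# Any one of these can be passed to TabularModel as model_config;
# the rest of the pipeline (DataConfig, TrainerConfig, fit) is unchanged.
model_config = FTTransformerConfig(task="classification")
# model_config = TabNetModelConfig(task="classification")
# model_config = CategoryEmbeddingModelConfig(task="classification")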

Semi-Supervised Learning

  • Denoising AutoEncoder is an autoencoder that learns a robust feature representation to compensate for noise in the dataset. (A conceptual sketch follows.)
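
The denoising objective is simple to state: corrupt the input, then train the network to reconstruct the clean version. Below is a minimal conceptual sketch in plain PyTorch; it illustrates the technique only and is not the library's DenoisingAutoEncoder implementation:

import torch
import torch.nn as nn

class ToyDenoisingAutoEncoder(nn.Module):
    """Tiny denoising autoencoder for a numeric feature matrix."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

x = torch.randn(256, 20)  # clean batch of tabular features
noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input with Gaussian noise
model = ToyDenoisingAutoEncoder(n_features=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(10):  # a few reconstruction steps
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy), x)  # reconstruct the clean input
    loss.backward()
    optimizer.step()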

To implement new models, see the How to implement new models tutorial. It covers basic as well as advanced architectures.

Usage

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
    ExperimentConfig,
)

data_config = DataConfig(
    target=[
        "target"
    ],  # target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,  # list of numerical column names in the DataFrame
    categorical_cols=cat_col_names,  # list of categorical column names in the DataFrame
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=100,
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  # number of nodes in each layer
    activation="LeakyReLU",  # activation between each pair of layers
    learning_rate=1e-3,
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
# train, val, and test are assumed to be pandas DataFrames containing the
# columns declared in DataConfig
tabular_model.fit(train=train, validation=val)
result = tabular_model.evaluate(test)
pred_df = tabular_model.predict(test)  # returns the test DataFrame with prediction columns appended
tabular_model.save_model("examples/basic")
loaded_model = TabularModel.load_from_checkpoint("examples/basic")
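
The import block above also pulls in ExperimentConfig, which the snippet never uses. A minimal sketch of wiring it in for experiment tracking; the parameter names project_name and log_target are assumptions here, so verify them against the documentation for your installed version:

experiment_config = ExperimentConfig(
    project_name="pytorch_tabular_demo",  # hypothetical project name
    log_target="tensorboard",  # or "wandb" if Weights & Biases is installed
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
    experiment_config=experiment_config,  # metrics are logged to the chosen backend
)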


Future Roadmap (Contributions Are Welcome)

  1. Add GaussRank as a feature transformation (see the sketch after this list)
  2. Integrate Optuna Hyperparameter Tuning
  3. Integrate SHAP for interpretability
  4. Add Variable Importance
  5. Add ability to use custom activations in CategoryEmbeddingModel
  6. Add differential dropouts (layer-wise) in CategoryEmbeddingModel
  7. Add Fourier Encoding for cyclic time variables
  8. Add Text and Image Modalities for mixed modal problems
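
For context on item 1: GaussRank (rank-Gauss) reshapes a numeric column into an approximately standard normal distribution by passing ranks through the inverse error function. A minimal NumPy/SciPy sketch of the technique, purely illustrative and not the planned library API:

import numpy as np
from scipy.special import erfinv

def gauss_rank(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map values to an approximately standard normal via their ranks."""
    ranks = x.argsort().argsort()  # dense integer ranks, 0..n-1
    scaled = ranks / (len(x) - 1)  # scale ranks to [0, 1]
    scaled = np.clip(2 * scaled - 1, -1 + eps, 1 - eps)  # into the open interval (-1, 1)
    return np.sqrt(2) * erfinv(scaled)  # inverse normal CDF: sqrt(2) * erfinv(2p - 1)

x = np.random.exponential(size=1000)  # skewed toy feature
x_gauss = gauss_rank(x)  # roughly N(0, 1)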

Contributors

manujosephv (Manu Joseph)
wsad1 (Jinu Sunil)
Borda (Jirka Borovec)
fonnesbeck (Chris Fonnesbeck)
jxtrbtk
JulianRein
krshrimali (Kushashwa Ravi Shrimali)
Actis92 (Luca Actis Grosso)
sgbaird (Sterling G. Baird)
yinyunie (Yinyu Nie)

Citation

If you use PyTorch Tabular for a scientific publication, we would appreciate citations to the published software and the following paper:

@misc{joseph2021pytorch,
      title={PyTorch Tabular: A Framework for Deep Learning with Tabular Data},
      author={Manu Joseph},
      year={2021},
      eprint={2104.13638},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
  • Zenodo Software Citation
@article{manujosephv_2021,
    title={manujosephv/pytorch_tabular: v0.7.0-alpha},
    DOI={10.5281/zenodo.5359010},
    abstractNote={Added a few more SOTA models - TabTransformer, FTTransformer.
        Made improvements in the model save and load capability.
        Made installation less restrictive by unfreezing some dependencies.},
    publisher={Zenodo},
    author={manujosephv},
    year={2021},
    month={May}
}

History

0.0.1 (2021-01-26)

  • First release on PyPI.

0.2.0 (2021-02-07)

  • Fixed an issue with torch.clip and torch version
  • Fixed an issue with gpus parameter in TrainerConfig, by setting default value to None for CPU
  • Added feature to use custom sampler in the training dataloader
  • Updated documentation and added a new tutorial for imbalanced classification

0.3.0 (2021-03-02)

  • Fixed a bug on inference

0.4.0 (2021-03-18)

  • Added AutoInt Model
  • Added Mixture Density Networks
  • Refactored the classes to separate backbones from the head of the models
  • Changed model saving and loading to work with custom parameters passed in fit

0.5.0 (2021-03-18)

  • Added more documentation
  • Added Zenodo citation

0.6.0 (2021-06-21)

  • Upgraded PyTorch Lightning to 1.3.6
  • Changed the way gpus parameter is handled to avoid confusion. None is CPU, -1 is all GPUs, int is number of GPUs
  • Added a few more Trainer Params like deterministic, auto_select_gpus
  • Some bug fixes and changes to docs
  • Added seed_everything to the fit method to ensure reproducibility
  • Refactored data_aware_initialization to be part of the BaseModel. Inherited Models can override the method to implement data aware initialization techniques

0.7.0 (2021-09-01)

  • Implemented TabTransformer and FTTransformer models
  • Included capability to save a model trained on GPU and load it on CPU
  • Made the temp folder PyTorch Tabular-specific to avoid conflicts with other tmp folders.
  • Some bug fixes
  • Edited an error out of Advanced Tutorial in docs

1.0.0 (2023-01-18)

  • Added a new task - Self Supervised Learning (SSL) and a separate training API for it.
  • Added new SOTA model - Gated Additive Tree Ensembles (GATE).
  • Added one SSL model - Denoising AutoEncoder.
  • Added lots of new tutorials and updated the entire documentation.
  • Improved code documentation and type hints.
  • Split models into separate Embedding, Backbone, and Head components.
  • Refactored all models so the Backbone is a native PyTorch model (nn.Module).
  • Refactored commonly used modules (layers, activations, etc.) into a common module.
  • Changed MixedDensityNetworks completely (breaking change). Now MDN is a head you can use with any model.
  • Enabled a low-level API for training models.
  • Enabled saving and loading of datamodule.
  • Added trainer_kwargs to pass any trainer argument PyTorch Lightning supports.
  • Added Early Stopping and Model Checkpoint kwargs to use all the arguments PyTorch Lightning supports.
  • Enabled prediction using GPUs in predict method.
  • Added reset_model to reset model weights to random.
  • Added many save and load functions including ONNX (experimental).
  • Added random seed as a parameter.
  • Switched over completely to Rich progressbars from tqdm.
  • Fixed class-balancing / mu propagation and set default to 1.0.
  • Added PyTorch Profiler for debugging performance issues.
  • Fixed bugs with FTTransformer and TabTransformer.
  • Updated MixedDensityNetworks fixing a bug with lambda_pi.
  • Many CI/CD improvements including complete integration with GitHub Actions.
  • Upgraded all dependencies, including PyTorch Lightning, pandas, to latest versions and added dependabot to manage it going forward.
  • Added pre-commit to ensure code integrity and standardization.
