A PyTorch-based library for self- and semi-supervised learning tabular models.

TabularS3L

Overview | Installation | Available Models with Quick Start | To Do | Contributing

TabularS3L is a PyTorch-based library designed to facilitate self- and semi-supervised learning with tabular data. Although numerous self- and semi-supervised tabular models have been proposed, no comprehensive library yet serves the needs of tabular practitioners. TabularS3L aims to fill this gap by offering a unified PyTorch Lightning-based framework for studying and deploying such models.

Installation

TabularS3L is distributed as the Python package ts3l for users who want to use self- and semi-supervised tabular models.

pip install ts3l
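To confirm that the install succeeded, the package version can be read from the environment's metadata. The sketch below uses only the Python standard library (Python 3.8 or later); the helper name installed_version is hypothetical and not part of ts3l.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string when ts3l is installed in the active environment.
    print("ts3l:", installed_version("ts3l") or "not installed")
```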

Available Models with Quick Start

TabularS3L employs a two-phase learning approach, where the learning strategies differ between phases. Below is an overview of the models available within TabularS3L, highlighting the learning strategies employed in each phase. The abbreviations 'Self-SL', 'Semi-SL', and 'SL' represent self-supervised learning, semi-supervised learning, and supervised learning, respectively.

Model	First Phase	Second Phase
VIME (NeurIPS'20)	Self-SL	Semi-SL or SL
SubTab (NeurIPS'21)	Self-SL	SL
SCARF (ICLR'22)	Self-SL	SL
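The two-phase pattern in the table can be illustrated without TabularS3L itself. The following is a minimal plain-PyTorch sketch under stated assumptions: the data is synthetic, the first-phase pretext task is simple reconstruction (a stand-in, not the objective of VIME, SubTab, or SCARF), and the names encoder, decoder, head, and fit are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-ins for real data (assumption): 8 numeric features, 2 classes.
X_unlabeled = torch.randn(256, 8)
X_labeled = torch.randn(64, 8)
y_labeled = (X_labeled[:, 0] > 0).long()

encoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
decoder = nn.Linear(16, 8)  # used only in the first phase
head = nn.Linear(16, 2)     # used only in the second phase


def fit(params, loss_fn, steps=100, lr=1e-2):
    """Run full-batch Adam steps and return the last computed loss value."""
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn()
        loss.backward()
        optimizer.step()
    return loss.item()


# First phase (Self-SL): reconstruct unlabeled rows; no labels are involved.
recon_loss = fit(
    list(encoder.parameters()) + list(decoder.parameters()),
    lambda: nn.functional.mse_loss(decoder(encoder(X_unlabeled)), X_unlabeled),
)

# Second phase (SL): reuse the pretrained encoder and train a classifier head.
clf_loss = fit(
    list(encoder.parameters()) + list(head.parameters()),
    lambda: nn.functional.cross_entropy(head(encoder(X_labeled)), y_labeled),
)

print(f"reconstruction loss {recon_loss:.3f}, classification loss {clf_loss:.3f}")
```

The point of the split is that the encoder sees all rows in the first phase, while only the labeled rows are needed in the second.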

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

VIME learns from tabular data in two phases. In the first phase, it uses two self-supervised pretext tasks: estimating the mask vector that was applied to corrupt each sample, and reconstructing the original features from the corrupted ones. The second phase applies consistency regularization on unlabeled data.
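The corruption step behind the mask-estimation task, and one common form of the consistency term, can be sketched in NumPy. This is an illustration under assumptions, not the ts3l implementation: the function names are hypothetical, p_m is the masking probability, and replacement values are drawn by shuffling each column.

```python
import numpy as np


def vime_pretext_inputs(x, p_m, rng):
    """Return (mask, corrupted_x) for a batch x of shape (n_samples, n_features)."""
    # Draw a Bernoulli(p_m) mask: 1 marks a cell selected for corruption.
    mask = rng.binomial(1, p_m, size=x.shape)
    # Shuffle every column independently so that replacement values follow
    # each feature's empirical marginal distribution.
    x_bar = np.empty_like(x)
    for j in range(x.shape[1]):
        x_bar[:, j] = x[rng.permutation(x.shape[0]), j]
    # Keep the original value where mask == 0, the shuffled one where mask == 1.
    x_tilde = x * (1 - mask) + x_bar * mask
    # A shuffled value can coincide with the original, so recompute the mask
    # to mark only the cells that actually changed.
    mask_new = (x != x_tilde).astype(x.dtype)
    return mask_new, x_tilde


def consistency_loss(pred_original, preds_corrupted):
    """Mean squared gap between predictions on x and on its K corrupted copies.

    pred_original has shape (n_samples, n_outputs);
    preds_corrupted has shape (K, n_samples, n_outputs).
    """
    return float(np.mean((preds_corrupted - pred_original[None, :, :]) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
mask, x_tilde = vime_pretext_inputs(x, p_m=0.2, rng=rng)
print("corrupted cells:", int(mask.sum()), "of", mask.size)
```

In the first phase, a model would be trained to predict mask and x from x_tilde; in the second phase, the consistency term penalizes predictions that change under corruption.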

Quick Start

# Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

# Prepare the VIMELightning Module
from ts3l.pl_modules import VIMELightning
from ts3l.utils.vime_utils import VIMEDataset
from ts3l.utils import TS3LDataModule
from ts3l.utils.vime_utils import VIMEConfig
from pytorch_lightning import Trainer
from sklearn.model_selection import train_test_split  # needed for the labeled/unlabeled split below

metric = "accuracy_score"
input_dim = X_train.shape[1]
hidden_dim = 1024
output_dim = 2
alpha1 = 2.0
alpha2 = 2.0
beta = 1.0
K = 3
p_m = 0.2

data_hparams = {
    "K": K,
    "p_m": p_m,
}

batch_size = 128

# Keep 10% of the training rows as labeled data; the remaining rows are used as unlabeled data
X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size=0.1, random_state=0, stratify=y_train)

config = VIMEConfig(
    task="classification", loss_fn="CrossEntropyLoss",
    metric=metric, metric_hparams={},
    input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
    alpha1=alpha1, alpha2=alpha2, beta=beta, K=K,
    num_categoricals=len(category_cols), num_continuous=len(continuous_cols),
)

pl_vime = VIMELightning(config)

### First Phase Learning
train_ds = VIMEDataset(X = X_train, unlabeled_data = X_unlabeled, data_hparams=data_hparams, continous_cols = continuous_cols, category_cols = category_cols)
valid_ds = VIMEDataset(X = X_valid, data_hparams=data_hparams, continous_cols = continuous_cols, category_cols = category_cols)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random', n_jobs = 4)

trainer = Trainer(
    accelerator='cpu',
    max_epochs=10,
    num_sanity_val_steps=2,
)

trainer.fit(pl_vime, datamodule)

### Second Phase Learning
from ts3l.utils.vime_utils import VIMESemiSLCollateFN

pl_vime.set_second_phase()

train_ds = VIMEDataset(X_train, y_train.values, data_hparams, unlabeled_data=X_unlabeled, continous_cols=continuous_cols, category_cols=category_cols, is_second_phase=True)
valid_ds = VIMEDataset(X_valid, y_valid.values, data_hparams, continous_cols=continuous_cols, category_cols=category_cols, is_second_phase=True)
        
datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted", train_collate_fn=VIMESemiSLCollateFN())

trainer.fit(pl_vime, datamodule)

# Evaluation
from sklearn.metrics import accuracy_score
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, SequentialSampler

test_ds = VIMEDataset(X_test, category_cols=category_cols, continous_cols=continuous_cols, is_second_phase=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds))

preds = trainer.predict(pl_vime, test_dl)
        
preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(),dim=1)

accuracy = accuracy_score(y_test, preds.argmax(1))

print("Accuracy %.2f" % accuracy)
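The evaluation step above turns model outputs (logits) into class probabilities with softmax, picks the most likely class with argmax, and compares the result with the true labels. A small NumPy sketch with made-up logits shows the same computation; the helper names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def accuracy_from_logits(logits, y_true):
    """Fraction of rows whose most probable class matches the true label."""
    y_pred = softmax(logits).argmax(axis=1)
    return float((y_pred == np.asarray(y_true)).mean())


# Made-up logits for four samples and two classes.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.0, 0.0]])
y_true = [0, 1, 0, 0]
print("Accuracy %.2f" % accuracy_from_logits(logits, y_true))  # Accuracy 0.75
```

Because softmax preserves the ordering of each row, taking argmax of the raw logits yields the same predicted labels; the softmax is only needed when probabilities themselves are of interest.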

SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning

SubTab turns learning from tabular data into a multi-view representation learning problem by dividing the input features into multiple subsets during its first phase. During the second phase, collaborative inference derives a joint representation by aggregating the latent variables of all subsets. This approach improves the model's performance in supervised learning tasks.
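How the feature subsets and the joint representation might be formed can be sketched in plain Python. This is an illustration under assumptions rather than the ts3l code: subsets are contiguous column blocks that borrow columns from a neighbour according to overlap_ratio, aggregation is a simple mean, and both function names are hypothetical.

```python
from typing import List


def subset_column_indices(n_columns: int, n_subsets: int, overlap_ratio: float) -> List[List[int]]:
    """Split column indices into n_subsets contiguous, partially overlapping blocks.

    Columns left over when n_columns is not divisible by n_subsets are ignored.
    """
    base = n_columns // n_subsets          # columns owned by each subset
    n_overlap = int(overlap_ratio * base)  # columns borrowed from a neighbour
    subsets = []
    for i in range(n_subsets):
        if i == 0:
            # The first block has no left neighbour, so it extends to the right.
            start, stop = 0, base + n_overlap
        else:
            # Later blocks reach back into the previous block.
            start, stop = i * base - n_overlap, (i + 1) * base
        subsets.append(list(range(start, stop)))
    return subsets


def aggregate_latents(latents: List[List[float]]) -> List[float]:
    """Average per-subset latent vectors into one joint representation."""
    n = len(latents)
    return [sum(values) / n for values in zip(*latents)]


print(subset_column_indices(n_columns=8, n_subsets=4, overlap_ratio=0.75))
# [[0, 1, 2], [1, 2, 3], [3, 4, 5], [5, 6, 7]]
print(aggregate_latents([[1.0, 2.0], [3.0, 4.0]]))
# [2.0, 3.0]
```

Each subset is encoded separately in the first phase; at inference time the per-subset latents are combined, here by averaging, before the supervised head is applied.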

Quick Start

# Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, categorical_cols, and continuous_cols

# Prepare the SubTabLightning Module
from ts3l.pl_modules import SubTabLightning
from ts3l.utils.subtab_utils import SubTabDataset, SubTabCollateFN
from ts3l.utils import TS3LDataModule
from ts3l.utils.subtab_utils import SubTabConfig
from pytorch_lightning import Trainer
from sklearn.model_selection import train_test_split  # needed for the labeled/unlabeled split below

metric = "accuracy_score"
input_dim = X_train.shape[1]
hidden_dim = 1024
output_dim = 2
tau = 1.0
use_cosine_similarity = True
use_contrastive = True
use_distance = True
n_subsets = 4
overlap_ratio = 0.75

mask_ratio = 0.1
noise_type = "Swap"
noise_level = 0.1

data_hparams = {
    "n_subsets": n_subsets,
    "overlap_ratio": overlap_ratio,
    "mask_ratio": mask_ratio,
    "noise_type": noise_type,
    "noise_level": noise_level,
    "n_column": input_dim,
}

batch_size = 128
max_epochs = 3

# Keep 10% of the training rows as labeled data; the remaining rows are used as unlabeled data
X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size=0.1, random_state=0, stratify=y_train)

config = SubTabConfig(
    task="classification", loss_fn="CrossEntropyLoss",
    metric=metric, metric_hparams={},
    input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
    tau=tau, use_cosine_similarity=use_cosine_similarity,
    use_contrastive=use_contrastive, use_distance=use_distance,
    n_subsets=n_subsets, overlap_ratio=overlap_ratio,
)

pl_subtab = SubTabLightning(config)

### First Phase Learning
train_ds = SubTabDataset(X_train, unlabeled_data=X_unlabeled)
valid_ds = SubTabDataset(X_valid)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random', train_collate_fn=SubTabCollateFN(data_hparams), valid_collate_fn=SubTabCollateFN(data_hparams), n_jobs = 4)

trainer = Trainer(
    accelerator='cpu',
    max_epochs=max_epochs,
    num_sanity_val_steps=2,
)

trainer.fit(pl_subtab, datamodule)

### Second Phase Learning

pl_subtab.set_second_phase()

train_ds = SubTabDataset(X_train, y_train.values)
valid_ds = SubTabDataset(X_valid, y_valid.values)

datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted", train_collate_fn=SubTabCollateFN(data_hparams), valid_collate_fn=SubTabCollateFN(data_hparams))

trainer.fit(pl_subtab, datamodule)

# Evaluation
from sklearn.metrics import accuracy_score
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, SequentialSampler

test_ds = SubTabDataset(X_test)
test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds), num_workers=4, collate_fn=SubTabCollateFN(data_hparams))

preds = trainer.predict(pl_subtab, test_dl)
        
preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(),dim=1)

accuracy = accuracy_score(y_test, preds.argmax(1))

print("Accuracy %.2f" % accuracy)

SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption

SCARF introduces a contrastive learning framework specifically tailored for tabular data. By corrupting random subsets of features, SCARF creates diverse views for self-supervised learning in its first phase. The subsequent phase transitions to supervised learning, utilizing a pretrained encoder to enhance model accuracy and robustness.
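The two ingredients described above, random feature corruption and a contrastive objective, can be sketched in NumPy. This is a simplified illustration, not the ts3l implementation: each cell is corrupted independently with probability corruption_rate (an assumption made for brevity), the loss is a standard InfoNCE form, and both function names are hypothetical.

```python
import numpy as np


def scarf_corrupt(x, corruption_rate, rng):
    """Replace a random subset of cells with values drawn from the same column."""
    n, d = x.shape
    # Choose which cells to corrupt.
    mask = rng.random((n, d)) < corruption_rate
    # For every cell, pick a donor row; the replacement is that row's value
    # in the same column, i.e. a draw from the feature's empirical marginal.
    donor_rows = rng.integers(0, n, size=(n, d))
    replacements = x[donor_rows, np.arange(d)]
    return np.where(mask, replacements, x)


def info_nce(z_anchor, z_view, tau=1.0):
    """InfoNCE loss: row i of z_anchor should match row i of z_view."""
    # L2-normalise so that dot products are cosine similarities.
    a = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    b = z_view / np.linalg.norm(z_view, axis=1, keepdims=True)
    logits = a @ b.T / tau  # shape (n, n); diagonal entries are the positives
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 6))
x_corrupted = scarf_corrupt(x, corruption_rate=0.6, rng=rng)
# In practice both views pass through an encoder first; raw features are
# used here only to keep the sketch short.
print("contrastive loss:", round(info_nce(x, x_corrupted, tau=1.0), 3))
```

After pretraining with such an objective, the encoder is reused in the second phase and trained with labels in the usual supervised way.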


To Do

- Release nn.Module and Dataset of VIME, SubTab, and SCARF
  - VIME
  - SubTab
  - SCARF
- Release LightningModules of VIME, SubTab, and SCARF
  - VIME
  - SubTab
  - SCARF
- Release Denoising AutoEncoder
  - nn.Module
  - LightningModule
- Release SwitchTab
  - nn.Module
  - LightningModule
- Release PTaRL
  - Add Backbones
    - MLP
    - ResNet
    - FT-Transformer
  - LightningModule
- Add example code

Contributing

Contributions to this implementation are highly appreciated. Whether it's suggesting improvements, reporting bugs, or proposing new features, feel free to open an issue or submit a pull request.

Credit

@software{alcoholrithm_2024_10776538,
  author       = {Minwook Kim},
  title        = {TabularS3L},
  month        = mar,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v0.20},
  doi          = {10.5281/zenodo.10776538},
  url          = {https://doi.org/10.5281/zenodo.10776538}
}

Release history

Version	Release date
0.70	Mar 2, 2025
0.60	Nov 29, 2024
0.50	Jun 19, 2024
0.41	May 21, 2024
0.40	May 2, 2024
0.30	Apr 22, 2024
0.21	Mar 12, 2024
0.20 (this version)	Mar 4, 2024
0.10	Feb 15, 2024

