Feature Ordering Module from TabSeq (ICPR 2024)

Project description

TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering

TabSeq is a cutting-edge framework designed to bridge the gap in applying deep learning to tabular datasets, which often have heterogeneous features and no inherent sequential structure. By leveraging feature ordering, TabSeq organizes features to maximize their relevance and interactions, significantly improving a model's ability to learn from tabular data.

The framework incorporates:

  • Clustering to group features with similar characteristics during feature ordering (a minimal sketch of this step follows the list).
  • Multi-Head Attention (MHA) to prioritize essential feature interactions.
  • Denoising Autoencoder (DAE) to reduce redundancy and reconstruct noisy inputs.
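
For a concrete feel of the ordering step, here is a minimal sketch of the clustering-plus-variance ordering described in the default configuration further down (KMeans on the transposed feature matrix, intra-cluster sort by descending variance). It omits the weighted global integration, and order_features is an illustrative helper, not part of the package API:

import numpy as np
from sklearn.cluster import KMeans

def order_features(X, num_clusters=5, seed=0):
    """Cluster the columns of X and order them by descending variance
    within each cluster (a simplified version of TabSeq's ordering)."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X.T)        # cluster features, not samples
    variances = X.var(axis=0)
    order = []
    for c in range(num_clusters):
        members = np.where(labels == c)[0]
        order.extend(members[np.argsort(-variances[members])])  # highest variance first
    return X[:, order], np.array(order)

X = np.random.rand(40, 80)
X_ordered, feature_order = order_features(X)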

TabSeq has demonstrated strong performance across several real-world datasets, outperforming traditional methods. Its modular design and adaptability make it a powerful tool for both binary and multi-class classification tasks, addressing challenges in health informatics, financial modeling, and more.

Explore the potential of TabSeq and see how it transforms deep learning on tabular data.

Files

  • TabSeq_arxiv.pdf: Research paper (pre-print) describing the framework.
  • binary.py: Implementation for binary classification tasks.
  • multiclass.py: Implementation for multi-class classification tasks.

Requirements

  • Python 3.8+
  • numpy, pandas, scikit-learn, tensorflow, networkx

Citation

Al Zadid Sultan Bin Habib, Kesheng Wang, Mary-Anne Hartley, Gianfranco Doretto, and Donald A. Adjeroh. "TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering." In International Conference on Pattern Recognition (ICPR), 2024, pp. 418–434. Springer.

BibTeX:

@inproceedings{habib2024tabseq,
  title={TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering},
  author={Habib, Al Zadid Sultan Bin and Wang, Kesheng and Hartley, Mary-Anne and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle={International Conference on Pattern Recognition},
  pages={418--434},
  year={2024},
  organization={Springer}
}

Installation

You can install TabSeq in multiple ways depending on your use case:


Option 1: Clone the Repository (Recommended for Development)

git clone https://github.com/zadid6pretam/TabSeq.git
cd TabSeq
pip install -r requirements.txt
pip install -e .

Option 2: Install via pip from GitHub (No Cloning Needed)

pip install git+https://github.com/zadid6pretam/TabSeq.git

Option 3: Install in a Virtual Environment

python -m venv tabseq-env
source tabseq-env/bin/activate  # On Windows: tabseq-env\Scripts\activate
git clone https://github.com/zadid6pretam/TabSeq.git
cd TabSeq
pip install -r requirements.txt
pip install -e .

Option 4: Manual Install Using setup.py

git clone https://github.com/zadid6pretam/TabSeq.git
cd TabSeq
pip install .

Option 5: Install from PyPI

pip install TabSeq

Example Usage

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tabseq.binary import train_binary_model
from tabseq.multiclass import train_multiclass_model

# Generate synthetic dataset
X = np.random.rand(40, 80)                   # 40 samples, 80 features
y_binary = np.random.randint(0, 2, 40)       # Binary labels (0 or 1)
y_multiclass = np.random.randint(0, 3, 40)   # Multiclass labels (0, 1, 2)

# Scale features
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X))

# Split into train, valid, test
X_train, X_temp, y_train_b, y_temp_b = train_test_split(X_scaled, y_binary, test_size=0.4, stratify=y_binary)
X_valid, X_test, y_valid_b, y_test_b = train_test_split(X_temp, y_temp_b, test_size=0.5, stratify=y_temp_b)

X_train_m, X_temp_m, y_train_m, y_temp_m = train_test_split(X_scaled, y_multiclass, test_size=0.4, stratify=y_multiclass)
X_valid_m, X_test_m, y_valid_m, y_test_m = train_test_split(X_temp_m, y_temp_m, test_size=0.5, stratify=y_temp_m)

# Run TabSeq for Binary Classification
train_binary_model(X_train, X_valid, X_test, y_train_b, y_valid_b, y_test_b)

# Run TabSeq for Multi-Class Classification (use the multiclass split so features and labels stay aligned)
train_multiclass_model(X_train_m, X_valid_m, X_test_m, y_train_m, y_valid_m, y_test_m, num_classes=3)

Default Parameter Values for Binary Classification

# =======================================================
# TabSeq Default Configuration Parameters (Binary Version)
# =======================================================
# Feature Ordering:
# - num_clusters: 5 (KMeans clustering is applied to the transpose of the feature matrix)
# - Intra-cluster ordering: Features sorted in descending order of variance
# - Global ordering: Integrated from local orderings using variance-based random weights

# Autoencoder (Denoising with Attention):
# - Noise: Gaussian noise with std = 0.1 added before training, clipped to [0, 1]
# - Attention Heads: 4
# - Attention Head Dimension (dk): 64
# - Dropout Rate in Attention: 0.1
# - Epochs: 50
# - Batch Size: 32
# - Loss Function: Mean Squared Error
# - Optimizer: Adam
# - EarlyStopping: patience = 5, monitor = 'val_loss', restore_best_weights = True

# Classifier:
# - Architecture: [Dense(128, relu) → BN → Dropout(0.5) → Dense(64, relu) → BN → Dropout(0.5) → Dense(1, sigmoid)]
# - Epochs: 50
# - Batch Size: 32
# - Loss Function: Binary Crossentropy
# - Metric: Accuracy
# - EarlyStopping: patience = 5, monitor = 'val_loss', restore_best_weights = True
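
As a reference point, the noise injection and the classifier head described above translate directly into Keras. The following is a minimal sketch using the listed defaults (layer sizes, dropout rates, loss, callbacks); the variable names and data shapes are illustrative, not part of the package API:

import numpy as np
from tensorflow.keras import layers, models, callbacks

def build_binary_classifier(input_dim):
    # Dense(128, relu) -> BN -> Dropout(0.5) -> Dense(64, relu) -> BN -> Dropout(0.5) -> Dense(1, sigmoid)
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Gaussian noise (std = 0.1) clipped to [0, 1], as used for the denoising autoencoder input
X_train = np.random.rand(32, 80).astype("float32")
X_noisy = np.clip(X_train + np.random.normal(0.0, 0.1, X_train.shape), 0.0, 1.0)

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
#           epochs=50, batch_size=32, callbacks=[early_stop])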

Default Parameter Values for Multiclass Classification

# ===============================================
# TabSeq Default Configuration (Multiclass Version)
# ===============================================

# Feature Ordering:
# - num_clusters: 5 (KMeans clustering on transposed feature matrix)
# - Intra-cluster ordering: Features sorted by descending variance
# - Global ordering: Weighted integration of local orderings based on random-scaled variances

# Denoising Autoencoder with Multihead Attention:
# - Noise: Gaussian noise with std = 0.1, clipped to [0, 1]
# - Attention Heads: 4
# - Head Dimension (dk): 64
# - Dropout Rate in Attention: 0.1
# - Encoder: Dense(128 → 64), BatchNorm, Dropout(0.2)
# - Decoder: Dense(input_dim, sigmoid)
# - Epochs: 50
# - Batch Size: 32
# - Loss Function: Mean Squared Error
# - Optimizer: Adam
# - EarlyStopping: patience = 5, monitor = 'val_loss'

# Classifier:
# - Architecture: [Dense(128, relu) → BN → Dropout(0.5) → Dense(64, relu) → BN → Dropout(0.5) → Dense(num_classes, softmax)]
# - Loss Function: Categorical Crossentropy
# - Metric: Accuracy
# - Epochs: 50
# - Batch Size: 32
# - EarlyStopping: patience = 5, monitor = 'val_loss'

# Evaluation:
# - AUC: macro-average, using one-vs-rest (ovr)
# - Classification report: includes precision, recall, F1 for each class
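
The evaluation described above maps directly onto scikit-learn. A minimal sketch with synthetic predictions (variable names are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score, classification_report

# y_test: integer class labels; y_prob: softmax probabilities, one column per class
y_test = np.random.randint(0, 3, 30)
y_prob = np.random.dirichlet(np.ones(3), size=30)

# Macro-averaged one-vs-rest AUC, matching the defaults above
auc = roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro")
print(f"macro OvR AUC: {auc:.3f}")

# Per-class precision, recall, and F1
print(classification_report(y_test, y_prob.argmax(axis=1)))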
