Feature Ordering Module from TabSeq (ICPR 2024)
Project description
TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering
TabSeq is a framework designed to bridge the gap in applying deep learning to tabular datasets, which often exhibit feature heterogeneity and sequential characteristics. By leveraging feature ordering, TabSeq organizes features to maximize their relevance and interactions, significantly improving the model's ability to learn from tabular data.
The framework incorporates:
- Clustering to group features with similar characteristics in feature ordering.
- Multi-Head Attention (MHA) to prioritize essential feature interactions.
- Denoising Autoencoder (DAE) to reduce redundancy and reconstruct noisy inputs.
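The ordering step can be sketched as follows. This is a minimal illustration based on the defaults listed later in this page (KMeans with 5 clusters applied to the transposed feature matrix, with features sorted by descending variance within each cluster); `order_features` is a hypothetical helper for illustration, not the package's API:

```python
import numpy as np
from sklearn.cluster import KMeans

def order_features(X, num_clusters=5, seed=0):
    """Cluster the feature columns, then sort each cluster's features
    by descending variance and concatenate the local orderings."""
    # Cluster features (columns) by treating each column as a point.
    labels = KMeans(n_clusters=num_clusters, n_init=10,
                    random_state=seed).fit_predict(X.T)
    variances = X.var(axis=0)
    order = []
    for c in range(num_clusters):
        idx = np.where(labels == c)[0]
        # Within a cluster: highest-variance features first.
        order.extend(idx[np.argsort(-variances[idx])])
    return np.array(order)

X = np.random.rand(40, 80)          # 40 samples, 80 features
order = order_features(X)           # permutation of the 80 column indices
X_ordered = X[:, order]             # features rearranged for the model
```

Note that the full framework additionally integrates the per-cluster orderings into a global ordering with variance-based weights; the sketch above only concatenates them.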
TabSeq has demonstrated remarkable performance across various real-world datasets, outperforming traditional methods. Its modular design and adaptability make it a powerful tool for both binary and multi-class classification tasks, addressing challenges in health informatics, financial modeling, and more.
Explore the potential of TabSeq and see how it transforms deep learning on tabular data.
Files
- TabSeq_arxiv.pdf: Research paper (pre-print) describing the framework.
- binary.py: Implementation for binary classification tasks.
- multiclass.py: Implementation for multi-class classification tasks.
Requirements
- Python 3.8+
- numpy, pandas, scikit-learn, tensorflow, networkx
Citation
Al Zadid Sultan Bin Habib, Kesheng Wang, Mary-Anne Hartley, Gianfranco Doretto, and Donald A. Adjeroh. "TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering." In International Conference on Pattern Recognition (ICPR), 2024, pp. 418–434. Springer.
BibTeX:
@inproceedings{habib2024tabseq,
title={TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering},
author={Habib, Al Zadid Sultan Bin and Wang, Kesheng and Hartley, Mary-Anne and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle={International Conference on Pattern Recognition},
pages={418--434},
year={2024},
organization={Springer}
}
Installation
You can install TabSeq in multiple ways depending on your use case:
Option 1: Clone the Repository (Recommended for Development)
git clone https://github.com/zadid6pretam/TabSeq.git
cd TabSeq
pip install -r requirements.txt
pip install -e .
Option 2: Install via pip from GitHub (No Cloning Needed)
pip install git+https://github.com/zadid6pretam/TabSeq.git
Option 3: Install in a Virtual Environment
python -m venv tabseq-env
source tabseq-env/bin/activate # On Windows: tabseq-env\Scripts\activate
git clone https://github.com/zadid6pretam/TabSeq.git
cd TabSeq
pip install -r requirements.txt
pip install -e .
Option 4: Manual Install Using setup.py
git clone https://github.com/zadid6pretam/TabSeq.git
cd TabSeq
pip install .
Option 5: Install from PyPI
pip install TabSeq
Example Usage
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tabseq.binary import train_binary_model
from tabseq.multiclass import train_multiclass_model
# Generate synthetic dataset
X = np.random.rand(40, 80) # 40 samples, 80 features
y_binary = np.random.randint(0, 2, 40) # Binary labels (0 or 1)
y_multiclass = np.random.randint(0, 3, 40) # Multiclass labels (0, 1, 2)
# Scale features
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X))
# Split into train, valid, test
X_train, X_temp, y_train_b, y_temp_b = train_test_split(X_scaled, y_binary, test_size=0.4, stratify=y_binary)
X_valid, X_test, y_valid_b, y_test_b = train_test_split(X_temp, y_temp_b, test_size=0.5, stratify=y_temp_b)
X_train_m, X_temp_m, y_train_m, y_temp_m = train_test_split(X_scaled, y_multiclass, test_size=0.4, stratify=y_multiclass)
X_valid_m, X_test_m, y_valid_m, y_test_m = train_test_split(X_temp_m, y_temp_m, test_size=0.5, stratify=y_temp_m)
# Run TabSeq for Binary Classification
train_binary_model(X_train, X_valid, X_test, y_train_b, y_valid_b, y_test_b)
# Run TabSeq for Multi-Class Classification
train_multiclass_model(X_train_m, X_valid_m, X_test_m, y_train_m, y_valid_m, y_test_m, num_classes=3)
Default Parameter Values for Binary Classification
# =======================================================
# TabSeq Default Configuration Parameters (Binary Version)
# =======================================================
# Feature Ordering:
# - num_clusters: 5 (KMeans clustering is applied to the transpose of the feature matrix)
# - Intra-cluster ordering: Features sorted in descending order of variance
# - Global ordering: Integrated from local orderings using variance-based random weights
# Autoencoder (Denoising with Attention):
# - Noise: Gaussian noise with std = 0.1 added before training, clipped to [0, 1]
# - Attention Heads: 4
# - Attention Head Dimension (dk): 64
# - Dropout Rate in Attention: 0.1
# - Epochs: 50
# - Batch Size: 32
# - Loss Function: Mean Squared Error
# - Optimizer: Adam
# - EarlyStopping: patience = 5, monitor = 'val_loss', restore_best_weights = True
# Classifier:
# - Architecture: [Dense(128, relu) → BN → Dropout(0.5) → Dense(64, relu) → BN → Dropout(0.5) → Dense(1, sigmoid)]
# - Epochs: 50
# - Batch Size: 32
# - Loss Function: Binary Crossentropy
# - Metric: Accuracy
# - EarlyStopping: patience = 5, monitor = 'val_loss', restore_best_weights = True
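For reference, the classifier head described above can be assembled in Keras roughly as follows. This is a sketch built from the listed defaults, not the package's internal code, and `build_binary_classifier` is an illustrative name:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_binary_classifier(input_dim):
    # Dense(128) -> BN -> Dropout(0.5) -> Dense(64) -> BN -> Dropout(0.5) -> sigmoid
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping matching the listed defaults.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
model = build_binary_classifier(input_dim=80)
```

Pass `callbacks=[early_stop]` to `model.fit(...)` to reproduce the early-stopping behavior above.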
Default Parameter Values for Multiclass Classification
# ===============================================
# TabSeq Default Configuration (Multiclass Version)
# ===============================================
# Feature Ordering:
# - num_clusters: 5 (KMeans clustering on transposed feature matrix)
# - Intra-cluster ordering: Features sorted by descending variance
# - Global ordering: Weighted integration of local orderings based on random-scaled variances
# Denoising Autoencoder with Multihead Attention:
# - Noise: Gaussian noise with std = 0.1, clipped between [0, 1]
# - Attention Heads: 4
# - Head Dimension (dk): 64
# - Dropout Rate in Attention: 0.1
# - Encoder: Dense(128 → 64), BatchNorm, Dropout(0.2)
# - Decoder: Dense(input_dim, sigmoid)
# - Epochs: 50
# - Batch Size: 32
# - Loss Function: Mean Squared Error
# - Optimizer: Adam
# - EarlyStopping: patience = 5, monitor = 'val_loss'
# Classifier:
# - Architecture: [Dense(128, relu) → BN → Dropout(0.5) → Dense(64, relu) → BN → Dropout(0.5) → Dense(num_classes, softmax)]
# - Loss Function: Categorical Crossentropy
# - Metric: Accuracy
# - Epochs: 50
# - Batch Size: 32
# - EarlyStopping: patience = 5, monitor = 'val_loss'
# Evaluation:
# - AUC: macro-average, using one-vs-rest (ovr)
# - Classification report: includes precision, recall, F1 for each class
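The evaluation described above maps directly onto scikit-learn's one-vs-rest, macro-averaged AUC and per-class classification report. A minimal sketch with synthetic predictions (all data here is made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, classification_report

rng = np.random.default_rng(0)
y_true = np.repeat([0, 1, 2], 20)          # 60 samples, 3 classes
y_prob = rng.random((60, 3))
y_prob /= y_prob.sum(axis=1, keepdims=True)  # rows must sum to 1 for multiclass AUC

# Macro-averaged one-vs-rest AUC, as in the defaults above.
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")

# Per-class precision, recall, and F1.
report = classification_report(y_true, y_prob.argmax(axis=1))
```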
Download files
File details
Details for the file tabseq_feature_ordering-0.1.0.tar.gz.
File metadata
- Download URL: tabseq_feature_ordering-0.1.0.tar.gz
- Upload date:
- Size: 6.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `98132d8630ecedd1ed96913a5680cebae67f2b09deec5ac34ef4b0fbf9f5622a` |
| MD5 | `8149a28536d4e94f209a72ad1bf36f0d` |
| BLAKE2b-256 | `8b0229e64d70c7c2e0cf6fa540330f39b8798b87af33225225c5549b714f9b9b` |
File details
Details for the file tabseq_feature_ordering-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tabseq_feature_ordering-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4b9950e280588bde8405b29eb2a1596f7d159858b53aa63b68277fc3fcc8235a` |
| MD5 | `727f896c4a235ac0c6f2e0b385a68180` |
| BLAKE2b-256 | `561196e429012ff29adffc293375aae0caf5ce6c80568be221c6af96a5de5f01` |