TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

TabICL is a tabular foundation model for in-context learning, in the same family as TabPFN. It currently supports classification tasks only.

Architecture

TabICL processes tabular data through three sequential stages:

  1. Column-wise Embedding: Creates distribution-aware embeddings for each feature
  2. Row-wise Interaction: Captures interactions between features within each row
  3. Dataset-wise In-Context Learning: Learns patterns from labeled examples to make predictions
[Figure: The architecture of TabICL]
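
To make the three stages concrete, here is a toy PyTorch sketch of the same flow. This is an illustration only, not TabICL's actual implementation: the module names, dimensions, and pooling are simplified assumptions, and it omits details such as how training labels are injected into the context.

import torch
import torch.nn as nn

class ToyTabICL(nn.Module):
    """Toy three-stage flow; NOT the real TabICL architecture."""

    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        # Stage 1: embed each scalar cell so every feature gets its own vector
        self.col_embed = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Stage 2: attention across the features within each row
        self.row_attn = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        # Stage 3: attention across all rows, i.e. in-context learning
        self.icl = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, X):                        # X: (n_rows, n_features)
        h = self.col_embed(X.unsqueeze(-1))      # (n_rows, n_features, dim)
        h = self.row_attn(h).mean(dim=1)         # (n_rows, dim) row embeddings
        h = self.icl(h.unsqueeze(0)).squeeze(0)  # rows attend to each other
        return self.head(h)                      # per-row class logits

logits = ToyTabICL()(torch.randn(16, 5))         # 16 rows, 5 features
print(logits.shape)                              # torch.Size([16, 10])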

Installation

pip install tabicl

Usage

Basic Usage

from tabicl import TabICLClassifier

clf = TabICLClassifier()
clf.fit(X_train, y_train)  # this is cheap
clf.predict(X_test)  # in-context learning happens here

On first use, the code above automatically downloads the pre-trained checkpoint (~100MB) from the Hugging Face Hub and runs inference on a GPU if one is available.
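
For example, a complete, runnable variant on a scikit-learn toy dataset (the dataset choice and the scikit-learn-style score call are our additions; TabICLClassifier follows the scikit-learn estimator API):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabicl import TabICLClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabICLClassifier()
clf.fit(X_train, y_train)              # cheap: stores the training set, no gradient updates
accuracy = clf.score(X_test, y_test)   # the full forward pass happens here
print(f"Test accuracy: {accuracy:.3f}")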

Advanced Configuration

TabICL offers a set of parameters to customize its behavior. The following example shows all available parameters with their default values and brief descriptions:

from tabicl import TabICLClassifier

clf = TabICLClassifier(
  n_estimators=32,                  # number of ensemble members
  norm_methods=["none", "power"],   # normalization methods to try
  feat_shuffle_method="latin",      # feature permutation strategy
  class_shift=True,                 # whether to apply cyclic shifts to class labels
  outlier_threshold=4.0,            # z-score threshold for outlier detection and clipping
  softmax_temperature=0.9,          # controls prediction confidence
  average_logits=True,              # whether ensemble averaging is done on logits or probabilities
  use_hierarchical=True,            # enable hierarchical classification for datasets with many classes
  batch_size=8,                     # process this many ensemble members together (reduce RAM usage)
  use_amp=True,                     # use automatic mixed precision for faster inference
  model_path=None,                  # where the model checkpoint is stored
  allow_auto_download=True,         # whether automatic download to the specified path is allowed
  device=None,                      # specify device for inference
  random_state=42,                  # random seed for reproducibility
  n_jobs=None,                      # number of threads to use for PyTorch
  verbose=False,                    # print detailed information during inference
  inference_config=None,            # inference configuration for fine-grained control
)
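
For example, on a memory-constrained GPU you might shrink the ensemble and the per-batch work. The values below are illustrative assumptions, not tuned recommendations:

clf = TabICLClassifier(
    n_estimators=8,   # smaller ensemble: less compute and memory, possibly lower accuracy
    batch_size=2,     # process fewer ensemble members at once
    use_amp=True,     # mixed precision reduces memory on supported GPUs
    device="cuda",    # or "cpu" to avoid GPU memory limits entirely
)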

Memory-Efficient Inference

TabICL includes memory management to handle large datasets:

  • Memory Profiling: Built-in memory estimators for different components of the model
  • Batch Size Estimation: Dynamically determines optimal batch sizes based on available GPU memory
  • CPU Offloading: Automatically offloads intermediate results to CPU when beneficial
  • OOM Recovery: Recovers gracefully from out-of-memory errors by reducing batch size
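
TabICL performs this recovery internally; the standalone sketch below only illustrates the batch-halving idea and is not the library's actual code:

import torch

def predict_with_oom_recovery(run_batch, items, batch_size):
    """Retry with a halved batch size whenever CUDA runs out of memory."""
    while batch_size >= 1:
        try:
            return [run_batch(items[i:i + batch_size])
                    for i in range(0, len(items), batch_size)]
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("out of memory even with batch_size=1")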

Preprocessing

Simple built-in preprocessing

If the input X to TabICL is a pandas DataFrame, TabICL will automatically:

  • Detect and ordinal encode categorical columns (including string, object, category, and boolean types)
  • Create a separate category for missing values in categorical features
  • Perform mean imputation for missing numerical values (encoded as NaN)

If the input X is a numpy array, TabICL assumes that ordinal encoding and missing value imputation have already been performed.
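
If you need to prepare such a numpy array yourself, one option is scikit-learn (a sketch with made-up toy columns; encoded_missing_value requires scikit-learn >= 1.1):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"color": ["red", None, "blue", "red"],
                   "size": [1.0, np.nan, 3.0, 2.0]})

prep = ColumnTransformer([
    # Ordinal-encode categoricals; missing values get their own code (-1)
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1, encoded_missing_value=-1), ["color"]),
    # Mean-impute missing numerical values
    ("num", SimpleImputer(strategy="mean"), ["size"]),
])
X = prep.fit_transform(df)  # plain numpy array, ready for TabICL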

For both input types, TabICL applies additional preprocessing:

  • Outlier detection and removal
  • Feature scaling and normalization
  • Feature shuffling for ensemble diversity

Advanced data preprocessing with skrub

Real-world datasets often contain complex heterogeneous data that benefits from more sophisticated preprocessing. For these scenarios, we recommend skrub, a powerful library designed specifically for advanced tabular data preparation.

Why use skrub?

  • Handles diverse data types (numerical, categorical, text, datetime, etc.)
  • Provides robust preprocessing for dirty data
  • Offers sophisticated feature engineering capabilities
  • Supports multi-table integration and joins

Installation

pip install skrub -U

Basic Integration

Use skrub's TableVectorizer to transform your raw data before passing it to TabICLClassifier:

from skrub import TableVectorizer
from tabicl import TabICLClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TableVectorizer(),  # Automatically handles various data types
    TabICLClassifier()
)

pipeline.fit(X_train, y_train)  # X should be a DataFrame
predictions = pipeline.predict(X_test)
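
Because make_pipeline returns a standard scikit-learn estimator, the usual tooling also applies, e.g. cross-validation (a sketch assuming a labeled DataFrame df with target column "y"):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, df.drop(columns="y"), df["y"], cv=5)
print(scores.mean())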

Key Features and Considerations

  • Number of samples:
    • TabICL is pretrained on datasets with up to 60K samples.
    • Thanks to memory-efficient inference, TabICL can handle datasets beyond 100K samples.
    • TabPFN (v2) is on average better than TabICL on small datasets (<10K samples), while TabICL is better on larger datasets.
    • Classical methods may catch up with TabICL at around 40K samples, but they are much slower because they require extensive hyperparameter tuning.

    [Figure: Ranking vs. number of samples]

  • Number of features:
    • TabICL is pretrained on datasets with up to 100 features.
    • In principle, TabICL can accommodate any number of features.
  • Number of classes:
    • TabICL is pretrained on datasets with up to 10 classes, so it natively supports at most 10 classes.
    • However, TabICL can handle any number of classes thanks to its built-in hierarchical classification.
  • Inference speed:
    • Like TabPFN, fit() does minimal work, while predict() runs the full model.
    • At the same n_estimators, TabICL is usually 1x-5x faster than TabPFN.
    • TabICL benefits more from larger n_estimators, hence the default of 32.
    • Automatic mixed precision (AMP) provides further speedups on compatible GPUs.
  • No tuning required: TabICL produces good predictions without hyperparameter tuning, unlike classical methods that need extensive tuning for optimal performance.

Performance

TabICL has achieved excellent results on the TALENT benchmark.

[Figure: Performance on TALENT]

Code Availability

This repository currently contains only the inference code for TabICL. The pretraining code will probably be released in the future.

Citation

If you use TabICL for research purposes, please cite our paper:

@article{qu2025tabicl,
  title={TabICL: A Tabular Foundation Model for In-Context Learning on Large Data},
  author={Qu, Jingang and Holzm{\"u}ller, David and Varoquaux, Ga{\"e}l and Morvan, Marine Le},
  journal={arXiv preprint arXiv:2502.05564},
  year={2025}
}
