
CARTE-AI: Context Aware Representation of Table Entries for AI

Project description

CARTE: Pretraining and Transfer for Tabular Learning

[Figure: outline of the CARTE model]

This repository contains the implementation of the paper CARTE: Pretraining and Transfer for Tabular Learning.

CARTE is a pretrained model for tabular data: it treats each table row as a star graph and trains a graph transformer on top of this representation.
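
To make the star-graph idea concrete, here is an illustration-only sketch in plain Python (not CARTE code): the row itself becomes the center node, and each (column, value) pair becomes a leaf node attached to it, with the column name labeling the edge.

# Illustration only: a table row rendered as a star graph.
row = {"name": "Chardonnay", "country": "Poland", "price": 12.5}

center = "row_0"  # the center node stands for the whole row
for column, value in row.items():
    # each cell becomes a leaf node; the column name labels the edge
    print(f"{center} --[{column}]--> {value!r}")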

Colab examples (give them a try; a minimal local usage sketch follows the list):


  • CARTERegressor on the Wine Poland dataset
  • CARTEClassifier on the Spotify dataset
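
To get a feel for the interface before opening the notebooks, here is a hedged sketch of fitting CARTERegressor on a toy table. The scikit-learn-style fit/predict calls and the Table2GraphTransformer preprocessor (and its parameters) are assumptions based on the example notebooks; the Colab notebooks above are the authoritative usage.

import numpy as np
import pandas as pd
# Assumed import path; adjust if the package layout differs.
from carte_ai import CARTERegressor, Table2GraphTransformer

# A tiny synthetic table standing in for, e.g., the Wine Poland data.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "name": ["wine A", "wine B"] * 10,
    "country": ["Poland", "France"] * 10,
    "vintage": rng.integers(2000, 2020, size=20),
})
y = pd.Series(rng.normal(size=20))

# CARTE consumes graphs, not raw tables: each row becomes a star graph.
# (Assumed preprocessing step; it may also need the FastText embeddings.)
graphs = Table2GraphTransformer().fit_transform(X, y=y)

# Assumed scikit-learn-style estimator.
model = CARTERegressor(device="cpu")
model.fit(graphs, y)
predictions = model.predict(graphs)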

Installation

Required dependencies

CARTE works with PyTorch and Python >= 3.10. Create a new environment with Python 3.10 and install the appropriate PyTorch version for your machine. Then, install the dependencies from the requirements.txt file in your environment:

pip install -r requirements.txt

In requirements.txt, the package torch_scatter depends on the specific PyTorch version. It is recommended to install the matching build by changing the first line ('--find-links') to the wheel index for your version, as listed at https://data.pyg.org/whl/.
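
For example, assuming PyTorch 2.1.0 with CUDA 12.1 wheels (substitute your own PyTorch/CUDA combination), the first line of requirements.txt would read:

--find-links https://data.pyg.org/whl/torch-2.1.0+cu121.html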

To reproduce the results presented in our paper, install the additional requirements from the requirements-optional.txt file in your environment:

pip install -r requirements-optional.txt

Downloading data

The required data (FastText embeddings, datasets, etc.) can be downloaded by running

python scripts/download_data.py -op <option for datasets> -ir <include raw data> -ik <include KEN data>

or by changing the options in the bash script and running it with

bash scripts/download_data.sh
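
The options block of the script might look like the following hypothetical sketch (variable names invented for illustration; edit the actual script rather than copying this):

# hypothetical options block in scripts/download_data.sh
OPTION="basic_examples"   # -op
INCLUDE_RAW="False"       # -ir
INCLUDE_KEN="False"       # -ik
python scripts/download_data.py -op "$OPTION" -ir "$INCLUDE_RAW" -ik "$INCLUDE_KEN"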

Note that the code will download the FastText embeddings if they are not present under the data/etc folder. If the embeddings are stored in a different directory, change config_directory["fasttext"] in configs/directory.py.
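
For instance, a sketch of the override in configs/directory.py (the dictionary name comes from the note above; the path is a placeholder for your own copy of the embeddings):

# configs/directory.py -- point the FastText entry at your local copy,
# e.g. the standard English vectors cc.en.300.bin
config_directory["fasttext"] = "/path/to/embeddings/cc.en.300.bin"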

The variables are:

  • options (-op): Options to download preprocessed datasets used in our paper.
    Stored under data/data_singletable.

    • "carte" : No downloadings of datasets.
    • "basic_examples" : Download 4 preprocessed datasets for running examples.
    • "full_examples" : Download all 51 preprocessed datasets without the LLM features.
    • "full_benchmark" : Download all 51 preprocessed datasets including the LLM features.
  • include_raw (-ir) : Benchmark raw datasets.
    The original datasets without any preprocessing: "True" to download all 51 datasets, "False" otherwise. Stored under data/data_raw. See scripts/preprocess_raw.py for details on the preprocessing.

  • include_ken (-ik) : KEN (YAGO knowledge graph) embeddings.
    The KEN embeddings, which are knowledge-graph embeddings of YAGO entities: "True" to download the embeddings, "False" otherwise. Stored under data/etc.

Example (in the prepared environment) that downloads the FastText embeddings and the 4 example datasets for running CARTE:

python scripts/download_data.py -op "basic_examples" -ir "False" -ik "False"

The datasets can also be found at https://huggingface.co/datasets/inria-soda/carte-benchmark.
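
They can also be fetched programmatically with the standard huggingface_hub client; a minimal sketch (this downloads the whole dataset repository, after which the files still need to be placed where the configs expect them):

from huggingface_hub import snapshot_download

# Download (or reuse from the cache) the benchmark dataset repository.
local_dir = snapshot_download(repo_id="inria-soda/carte-benchmark",
                              repo_type="dataset")
print(local_dir)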

Getting started

The best way to get familiar with CARTE is through the examples. After setting up the datasets, work through the following notebooks.

Running CARTE on single tables:
follow examples/1. carte_single_tables.ipynb

Running CARTE on multitables:
follow examples/2. carte_joint_learning.ipynb

Note: To run the examples, it is recommended to have at least 64 GB of RAM for single tables and 128 GB for multitables. We are currently working on reducing the memory consumption.

Reproducing results of CARTE paper

Currently, we provide code for generating the singletable results. Code for reproducing the multitable results will follow.

To generate results for singletables, run:

python scripts/evaluate_singletable.py -dn <data name> -nt <train size> -m <method to evaluate> -rs <random state values> -b <include bagging> -dv <device to run>

The variables are:

  • data_name (-dn): Name of the dataset.
    A specific name under the data/data_singletable folder, or "all" to run all datasets.

  • num_train (-nt) : Train size to evaluate.
    "all" to run train sizes of {32, 64, 128, 256, 512, 1024, 2048}.

  • method (-m) : Method to evaluate (see carte_singletable_baselines in configs/carte_configs).

    • "full" : the full list of baselines (see carte_singletable_baselines['full']).
    • "reduced" : the reduced list of baselines used in the CARTE paper (see carte_singletable_baselines['reduced']).
    • "f-r" : the baselines in the full list but not in the reduced list.
    • any other method name : a single method from carte_singletable_baselines['full'].
  • random_state (-rs) : Random state value.
    "all" to run random states of {1, 2, 3, ..., 10}.

  • bagging (-b) : Whether to include the bagging strategy.
    "True" to include the bagging strategy in the analysis. Note that neural-network based models run the bagging strategy even when this is set to "False".

  • device (-dv) : Device to run on.
    "cpu" to run on CPUs or "cuda" to run on GPUs. Running on GPUs requires some additional setup.

Example running the 'wina_pl' (Wine Poland) dataset with a train size of 128 and random state 1:

python scripts/evaluate_singletable.py -dn "wina_pl" -nt "128" -m "reduced" -rs "1" -b "False" -dv "cpu"

Running this will create a folder results/singletable/wina_pl, in which the results of each baseline are stored as CSV files.

After obtaining the results under the results/singletable folder, run scripts/compile_results_singletable.py to compile the results into a single dataframe, saved as a CSV file named 'results_carte_baseline_singletable.csv' in the results/compiled_results folder.
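
For example (assuming the script takes no required arguments):

python scripts/compile_results_singletable.py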

Then, follow examples/3. carte_singletable_visualization to visualize the results.

The script does not run the random search used in the CARTE paper. To ease computation and visualization, we provide the parameters found for each baseline by the random search. However, running the full comparison may take a long time, and it is recommended to run it on a parallel computing machine (e.g., a cluster). The evaluation script only serves as a guideline for reproducing the results; modifications for parallelization should be made to suit each use case. For visualization purposes, we also provide the compiled results.

Our paper

@article{kim2024carte,
  title={CARTE: pretraining and transfer for tabular learning},
  author={Kim, Myung Jun and Grinsztajn, L{\'e}o and Varoquaux, Ga{\"e}l},
  journal={arXiv preprint arXiv:2402.16785},
  year={2024}
}
