Synthesize diverse multi-tabular relational databases using Structural Causal Models, enabling scaling laws for Relational Foundation Models.

PluRel

Synthetic Data unlocks Scaling Laws for Relational Foundation Models

[Figure: scaling law plot]

This repository provides a reference implementation for the paper PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models.

The architecture and training code are an improved version of the original implementation for the ICLR 2026 paper Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data.

Overview

PluRel is a framework for synthesizing diverse multi-tabular relational databases using Structural Causal Models (SCMs). This repository provides:

  • Scalable generation of synthetic relational data (from scratch or SQL schemas) compatible with relbench.
  • High-performance context sampling via a Rust-based sampler (rustler).
  • Pretraining of relational transformers on synthetic data.
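
As a rough illustration of the SCM idea behind the generator, the toy sketch below samples rows column by column from a small DAG: each column is a nonlinear function of its parent columns plus noise. It is not PluRel's implementation; the function name sample_scm_rows, the tanh mechanism, and the noise scales are assumptions chosen for brevity.

```python
import math
import random


def sample_scm_rows(parents, num_rows, seed=0):
    """Sample rows from a toy structural causal model.

    `parents` maps each column name to the list of its parent columns and
    must be given in topological order (parents listed before children).
    """
    rng = random.Random(seed)
    # One fixed random weight per edge; together they define the mechanisms.
    weights = {
        (parent, child): rng.uniform(-1.0, 1.0)
        for child, parent_list in parents.items()
        for parent in parent_list
    }
    rows = []
    for _ in range(num_rows):
        row = {}
        for col, parent_list in parents.items():
            if not parent_list:
                # Root columns are pure exogenous noise.
                row[col] = rng.gauss(0.0, 1.0)
            else:
                # Child columns squash a weighted sum of their parents
                # and add a small amount of noise.
                signal = sum(weights[(p, col)] * row[p] for p in parent_list)
                row[col] = math.tanh(signal) + rng.gauss(0.0, 0.1)
        rows.append(row)
    return rows
```

Calling sample_scm_rows({"a": [], "b": ["a"], "c": ["a", "b"]}, num_rows=5) returns five dictionaries with keys a, b, and c; the same seed always yields the same rows.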

Framework Design

PluRel Logo

Setup

Set up the development and testing environment with pixi.

# Set up the pixi environment
$ pixi install

# Compile and install the rust sampler
$ cd rustler && pixi run maturin develop --uv --release && cd ..

# Run tests
$ pixi run pytest

# Lint and format code
$ pixi run ruff check .
$ pixi run ruff format .

# Install pre-commit hooks
$ pixi run pre-commit install

# Link the relbench cache directory into ~/scratch
$ mkdir -p ~/scratch
$ ln -s ~/.cache/relbench ~/scratch/relbench

Synthesize Relational Data from Scratch

  • The SyntheticDataset class creates relbench-compatible dataset objects.
  • It requires only a seed and a Config object that holds the database-, SCM-, and DAG-level sampling parameters, as in the example below.

from plurel import SyntheticDataset, Config

# create relbench compatible dataset
dataset = SyntheticDataset(seed=0, config=Config())

# create database which can be cached via relbench APIs
db = dataset.make_db()
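
To inspect what was generated, a small helper along the following lines may be useful. It is a sketch that assumes make_db() returns a relbench-style Database, that is, an object with a table_dict whose tables expose df, pkey_col, time_col, and fkey_col_to_pkey_table; the helper name summarize_db is hypothetical.

```python
def summarize_db(db):
    """Return {table_name: summary} for a relbench-style Database object."""
    summary = {}
    for name, table in db.table_dict.items():
        summary[name] = {
            "num_rows": len(table.df),
            "columns": list(table.df.columns),
            "pkey_col": table.pkey_col,
            "time_col": table.time_col,
            # Maps each foreign-key column to the table it references.
            "fkeys": dict(table.fkey_col_to_pkey_table),
        }
    return summary
```

With the objects from the example above, summarize_db(db) would list every synthetic table together with its row count, columns, and foreign-key links.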

Configuration

The Config class controls all aspects of synthetic database generation through three parameter groups:

| Parameters | Description |
|---|---|
| DatabaseParams | Table layout (BarabasiAlbert, ReverseRandomTree, WattsStrogatz), number of tables, row counts, column counts, and timestamp ranges. |
| SCMParams | SCM graph layouts, column types, MLP initialization, activation functions, noise distributions, and time-series trend/cycle parameters. |
| DAGParams | DAG-specific parameters like edge dropout, in-degree limits, and rewiring probabilities for different graph types. |

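# Assumption: Choices is exported from the top-level plurel package alongside
# Config and DatabaseParams; adjust the import if it lives in a submodule.
from plurel import Choices, Config, DatabaseParams

config = Config(
    database_params=DatabaseParams(num_tables_choices=Choices(kind="range", value=[5, 10])),
    schema_file="path/to/schema.sql",  # optional: generate from SQL schema
    cache_dir="~/.cache/relbench",  # optional: cache generated databases
)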
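
For intuition about what a BarabasiAlbert table layout means, the sketch below grows a schema by preferential attachment: each new table adds foreign keys to m existing tables, favoring tables that are already well connected. This is a generic illustration of the graph model, not PluRel's code; the function name, the (child, parent) edge convention, and the seeding of the first m tables are assumptions.

```python
import random


def barabasi_albert_schema(num_tables, m=1, seed=0):
    """Return (child, parent) foreign-key edges over tables 0..num_tables-1."""
    if not 1 <= m < num_tables:
        raise ValueError("need 1 <= m < num_tables")
    rng = random.Random(seed)
    edges = []
    # One entry per edge endpoint, plus one initial entry for each of the
    # first m tables, so a uniform draw favors high-degree tables.
    endpoints = list(range(m))
    for child in range(m, num_tables):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for parent in sorted(chosen):
            edges.append((child, parent))
            endpoints.extend([child, parent])
    return edges
```

Every table after the first m contributes exactly m edges, and each edge points from a newer table to an older one, so the resulting foreign-key graph is acyclic.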

Scalable Generation

We also provide a multiprocessing-based script to generate databases in parallel.

$ pixi run python scripts/synthetic_gen.py \
    --seed_offset 0 \
    --num_dbs 1000 \
    --num_proc 16 \
    --preprocess

| Argument | Description |
|---|---|
| --seed_offset | Seed offset for database generation. DBs will be named rel-synthetic-<seed>. |
| --num_dbs | Number of databases to generate. |
| --num_proc | Number of parallel processes (default: number of CPU cores). |
| --preprocess | Run preprocessing and embedding steps. Omit to skip. |
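
The sketch below shows one way the seeds implied by these arguments could be divided among worker processes. Only the rel-synthetic-<seed> naming comes from the description above; the plan_generation helper, the assumption that seeds run from seed_offset to seed_offset + num_dbs - 1, and the round-robin assignment are illustrative and may differ from what scripts/synthetic_gen.py does.

```python
def plan_generation(seed_offset, num_dbs, num_proc):
    """Assign database names to workers in round-robin order.

    Returns a list with one entry per worker; each entry is the list of
    database names that worker would generate.
    """
    if num_proc < 1:
        raise ValueError("num_proc must be at least 1")
    seeds = range(seed_offset, seed_offset + num_dbs)
    # Worker w takes seeds w, w + num_proc, w + 2 * num_proc, ...
    return [
        [f"rel-synthetic-{seed}" for seed in seeds[worker::num_proc]]
        for worker in range(num_proc)
    ]
```

Because each database depends only on its own seed, a plan like this lets workers run independently and makes a run reproducible from the seed offset and database count alone.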

Note: See the notebooks in examples/ for synthesizing databases from SQL schemas.

Download Preprocessed Data

The preprocessed synthetic data is available on the Hugging Face Hub at kvignesh1420/plurel.

  1. Install the Hugging Face CLI (if not present)
pixi add huggingface_hub
  2. Create the destination directory
mkdir -p ~/scratch/pre
  3. Download the repository contents into ~/scratch/pre
pixi run hf download kvignesh1420/plurel \
    --repo-type dataset \
    --local-dir ~/scratch/pre

The preprocessed relbench data is available on the Hugging Face Hub at hvag976/relational-transformer.

pixi run hf download hvag976/relational-transformer \
    --repo-type dataset \
    --local-dir ~/scratch/pre
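
The same downloads can also be scripted from Python with huggingface_hub.snapshot_download, which accepts the repo_id, repo_type, and local_dir arguments used below. This is a minimal sketch; the helper names snapshot_kwargs and download are hypothetical.

```python
import os


def snapshot_kwargs(repo_id, repo_type, local_dir):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "repo_type": repo_type,
        # Expand "~" explicitly so the files land in the user's home directory.
        "local_dir": os.path.expanduser(local_dir),
    }


def download(repo_id, repo_type, local_dir):
    """Download a Hub repository and return the local path."""
    # Imported lazily so snapshot_kwargs works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**snapshot_kwargs(repo_id, repo_type, local_dir))
```

For example, download("kvignesh1420/plurel", "dataset", "~/scratch/pre") mirrors the first CLI command above.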

Download Synthetic Pretrained Checkpoints

The synthetic pretrained model checkpoints are hosted on the Hugging Face Hub at kvignesh1420/relational-transformer-plurel.

$ mkdir -p ~/scratch/rt_hf_ckpts

$ pixi run hf download kvignesh1420/relational-transformer-plurel \
    --repo-type model \
    --local-dir ~/scratch/rt_hf_ckpts

One of the downloaded checkpoints will be listed as:

$ ls ~/scratch/rt_hf_ckpts

# model pretrained on a dataset of size 4B tokens curated from 1024 synthetic RDBs
synthetic-pretrain_rdb_1024_size_4b.pt
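
When several checkpoints are present, the file name can be parsed to recover the pretraining setup. The sketch below assumes that every checkpoint follows the synthetic-pretrain_rdb_<N>_size_<S>.pt pattern of the example above; only that single name is confirmed here, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Pattern inferred from the one file name shown above.
_CKPT_RE = re.compile(r"^synthetic-pretrain_rdb_(\d+)_size_(\w+)\.pt$")


def parse_ckpt_name(path):
    """Return {"num_rdbs": int, "size": str}, or None if the name does not match."""
    match = _CKPT_RE.match(Path(path).name)
    if match is None:
        return None
    return {"num_rdbs": int(match.group(1)), "size": match.group(2)}


def list_ckpts(ckpt_dir):
    """Map each matching checkpoint file name in ckpt_dir to its parsed fields."""
    found = {}
    for path in sorted(Path(ckpt_dir).expanduser().glob("*.pt")):
        info = parse_ckpt_name(path)
        if info is not None:
            found[path.name] = info
    return found
```

Files that do not match the pattern are skipped rather than guessed at, so list_ckpts("~/scratch/rt_hf_ckpts") only reports checkpoints whose setup can be read from the name.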

Pretraining Experiments

  • Baseline (real-world) pretraining on relbench datasets with a randomly initialized relational-transformer (RT) model.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/baseline_pretrain.py
  • Synthetic pretraining on a varying number of databases and dataset sizes with a randomly initialized RT model.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/synthetic_pretrain.py
  • Continued pretraining on relbench datasets, starting from the synthetic pretrained models. For faster experimentation, the checkpoints downloaded from Hugging Face (stored in ~/scratch/rt_hf_ckpts) can be passed to the load_ckpt_path argument of the training script.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/cntd_pretrain.py

Citation

If you find this work useful, please cite our paper:

@article{kothapalli2026plurel,
  title={{PluRel:} Synthetic Data unlocks Scaling Laws for Relational Foundation Models},
  author={Kothapalli, Vignesh and Ranjan, Rishabh and Hudovernik, Valter and Dwivedi, Vijay Prakash and Hoffart, Johannes and Guestrin, Carlos and Leskovec, Jure},
  journal={arXiv preprint arXiv:2602.04029},
  year={2026}
}

If you use the architecture, training loop or sampler code, please also cite the Relational Transformer paper:

@inproceedings{ranjan2026relationaltransformer,
    title={{Relational Transformer:} Toward Zero-Shot Foundation Models for Relational Data}, 
    author={Rishabh Ranjan and Valter Hudovernik and Mark Znidar and Charilaos Kanatsoulis and Roshan Upendra and Mahmoud Mohammadi and Joe Meyer and Tom Palczewski and Carlos Guestrin and Jure Leskovec},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026}
}
