Transform tabular event data into sequences ready for Transformer and Sequential models: Life2Vec, BEHRT and more.
tab2seq
tab2seq turns multi-source tabular event data (registries, EHR, financial records) into tokenized sequences ready for Transformer-based models: it generalizes the data processing pipeline from the Life2Vec paper to arbitrary domains.
[!WARNING] This is a beta package. The core pipeline (Sources → Cohort → Vocabulary → EventDataset) is functional, but the API is not yet stable and the documentation is incomplete. Pin to a specific version if you depend on current behaviour. See the Roadmap for what is implemented at this point.
Why tab2seq?
Building a Life2Vec-style pipeline from scratch requires solving the same problems every time: multi-source schema alignment, leakage-safe vocabulary fitting, deterministic splits, and efficient Parquet-backed sequence iteration. tab2seq handles all of this so you can focus on modeling:
- Work with multiple longitudinal data sources (registries, databases)
- Define and filter cohorts based on inclusion criteria
- Create deterministic train/val/test splits with static context
- Fit a vocabulary on training data only (no leakage)
- Produce tokenized, model-ready event sequences with time features
- Generate realistic synthetic data for development and testing
Requires: Python ≥ 3.11, NumPy ≥ 2.0, Polars ≥ 1.38, Pydantic v2.
Documentation: See Documentation for additional information.
Pipeline
Sources → Cohort → Vocabulary → Tokenizer → EventDataset → Model-ready Parquet
| Step | Class | What it does |
|---|---|---|
| 1 | Source / SourceCollection | Schema declaration for each event table (categorical, continuous, temporal columns) |
| 2 | Cohort | Entity universe + inclusion criteria + deterministic train/val/test splits |
| 3 | Vocabulary / Tokenizer | Token mappings and bin edges fitted on train split only |
| 4 | EventDataset | Vectorized token-ID encoding, relative-date features, Parquet persistence |
Installation
pip install tab2seq
Quick Start
The full pipeline from raw data to model-ready sequences takes five steps; a sixth step shows how to load the result back.
1. Generate Synthetic Data
from tab2seq.datasets import generate_synthetic_data
import polars as pl
data_paths = generate_synthetic_data(
output_dir="synthetic_data",
n_entities=10_000,
seed=742,
registries=["health", "labour", "survey", "income"],
)
pl.read_parquet(data_paths["health"]).head(3)
shape: (3, 7)
┌───────────┬────────────┬───────────┬───────────┬──────────────────┬─────────┬────────────────┐
│ entity_id ┆ date ┆ diagnosis ┆ procedure ┆ department ┆ cost ┆ length_of_stay │
│ str ┆ date ┆ str ┆ str ┆ str ┆ f64 ┆ i64 │
╞═══════════╪════════════╪═══════════╪═══════════╪══════════════════╪═════════╪════════════════╡
│ E00001 ┆ 2016-09-15 ┆ J18.1 ┆ CABG ┆ gastroenterology ┆ 7306.17 ┆ 2 │
│ E00001 ┆ 2017-05-25 ┆ E78.0 ┆ XRAY ┆ neurology ┆ 138.65 ┆ 1 │
│ E00001 ┆ 2018-01-18 ┆ E78.0 ┆ MRI ┆ general_surgery ┆ 6704.59 ┆ 10 │
└───────────┴────────────┴───────────┴───────────┴──────────────────┴─────────┴────────────────┘
2. Define Sources
Each Source describes one event table: its file path, ID column, timestamp, and feature columns.
from tab2seq.source import (
Source, SourceCollection, SourceConfig,
CategoricalColConfig, ContinuousColConfig, TemporalColConfig,
)
configs = [
SourceConfig(
name="health",
filepath="synthetic_data/health.parquet",
id_col="entity_id",
categorical_cols=[
CategoricalColConfig(col_name="diagnosis", prefix="DIAG"),
CategoricalColConfig(col_name="procedure", prefix="PROC"),
CategoricalColConfig(col_name="department", prefix="DEPT"),
],
continuous_cols=[
ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20, strategy="quantile"),
ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=10, strategy="quantile"),
],
temporal_cols=[
TemporalColConfig(col_name="date", is_primary=True, drop_na=True, col_type="datetime"),
],
),
SourceConfig(
name="labour",
filepath="synthetic_data/labour.parquet",
id_col="entity_id",
categorical_cols=[
CategoricalColConfig(col_name="status", prefix="STATUS"),
CategoricalColConfig(col_name="occupation", prefix="OCC"),
CategoricalColConfig(col_name="residence_region", prefix="REGION"),
CategoricalColConfig(col_name="native_language", prefix="LANG", static=True),
],
continuous_cols=[
ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS", n_bins=10, strategy="uniform"),
],
temporal_cols=[
TemporalColConfig(col_name="date", is_primary=True, drop_na=True, col_type="datetime"),
TemporalColConfig(col_name="birthday", static=True, drop_na=True, col_type="datetime"),
],
),
]
collection = SourceCollection.from_configs(configs)
for source in collection:
print(f"{source.name}: {len(source.get_entity_ids())} entities")
Columns marked static=True are carried through to the cohort split table as entity-level attributes (e.g. birthday, native language).
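As a rough illustration of why each column config carries a prefix: the same raw value can appear in several columns, and a prefix keeps the resulting tokens distinct. The sketch below is not tab2seq's implementation. The `PREFIX_value` token format and the `make_token` helper are assumptions made purely for illustration; the library may format token strings differently.

```python
def make_token(prefix: str, value: object) -> str:
    """Hypothetical helper: join a column prefix and a raw value into one token string."""
    return f"{prefix}_{value}"


# One event row taken from the synthetic health table shown earlier.
event = {"diagnosis": "J18.1", "procedure": "CABG", "department": "gastroenterology"}

# Column -> prefix mapping, mirroring the CategoricalColConfig entries above.
prefixes = {"diagnosis": "DIAG", "procedure": "PROC", "department": "DEPT"}

# Assumed token format: "<PREFIX>_<value>".
tokens = [make_token(prefixes[col], event[col]) for col in prefixes]
print(tokens)
```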
3. Build a Cohort and Splits
A Cohort resolves one consistent entity universe across all sources, applies inclusion criteria, and generates deterministic train/val/test splits.
from tab2seq.cohort import Cohort, CohortConfig, EntityInclusionCriteria
criteria = [
EntityInclusionCriteria(source_name="health", required=False),
EntityInclusionCriteria(source_name="labour", required=True, min_events=1),
]
cohort = Cohort(
name="my_cohort",
sources=collection,
inclusion_criteria=criteria,
cache_dir="data/cohorts",
)
cohort.build_entities_table(force_recompute=True)
split_cfg = CohortConfig(train_frac=0.7, val_frac=0.15, test_frac=0.15, seed=42)
cohort.build_or_load_splits(split_cfg)
print(f"Cohort size: {len(cohort)} entities")
The split table contains one row per entity with the split label and all static columns.
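To make "deterministic splits" concrete, here is a minimal sketch of one common approach: sort the entity IDs, shuffle them with a seeded local RNG, and cut the list by fraction. This is an assumption about the general technique, not a description of how Cohort does it internally; `assign_splits` is a hypothetical helper.

```python
import random


def assign_splits(entity_ids, train_frac=0.7, val_frac=0.15, seed=42):
    """Hypothetical helper: map each entity ID to 'train', 'val', or 'test'."""
    ids = sorted(set(entity_ids))   # sort first so the input order does not matter
    rng = random.Random(seed)       # local RNG; the global random state is untouched
    rng.shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    labels = {}
    for i, entity_id in enumerate(ids):
        if i < n_train:
            labels[entity_id] = "train"
        elif i < n_train + n_val:
            labels[entity_id] = "val"
        else:
            labels[entity_id] = "test"   # whatever remains goes to test
    return labels


ids = [f"E{i:05d}" for i in range(1, 101)]
splits_a = assign_splits(ids, seed=42)
splits_b = assign_splits(list(reversed(ids)), seed=42)
print(splits_a == splits_b)  # True: same seed and same ID set give the same split
```

Sorting before shuffling is the design choice that matters here: without it, the result would depend on the order in which entity IDs happen to be read from disk.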
4. Fit a Vocabulary (Train Split Only)
The vocabulary maps categorical values to token strings and bins continuous features—fitted exclusively on training entities to prevent leakage.
from tab2seq.tokenization import Tokenizer, Vocabulary, VocabularyConfig
vocab = Vocabulary(
config=VocabularyConfig(
max_vocab_size=50_000,
min_token_count=5,
# [PAD]=0 [UNK]=1 [CLS]=2 [SEP]=3 [MASK]=4 are always reserved.
# Add domain-specific tokens that should always appear:
extra_tokens=["[DEATH]", "[RETIRED]"],
)
)
vocab_df = vocab.fit_from_cohort_train(cohort=cohort, split_config=split_cfg)
print(f"Vocabulary size: {vocab_df.height}")
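The idea of fitting on the train split only can be sketched for a continuous column with plain NumPy: bin edges are computed from training values, then reused unchanged for every other split. This is an illustrative sketch of quantile binning under that assumption, not the library's code; the data is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
train_cost = rng.lognormal(mean=6.0, sigma=1.0, size=1_000)  # stand-in train values
test_cost = np.array([50.0, 400.0, 1e6])                     # stand-in unseen values

n_bins = 4
# Interior cut points only: n_bins - 1 edges define n_bins bins.
quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
edges = np.quantile(train_cost, quantiles)   # fitted on train data only

# Apply the same edges everywhere; out-of-range values land in the first or last bin.
train_bins = np.searchsorted(edges, train_cost, side="right")
test_bins = np.searchsorted(edges, test_cost, side="right")

print(edges)
print(np.bincount(train_bins, minlength=n_bins))
print(test_bins)
```

Because the edges never see validation or test values, an extreme test value such as `1e6` is simply assigned to the last bin rather than shifting the bin boundaries.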
VocabularyConfig.count_mode controls how token frequency is computed for min_token_count filtering:
- overall: counts every token occurrence across all train events.
- entity_unique: counts each token at most once per entity.
Use entity_unique to reduce dominance from very prolific entities.
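The difference between the two counting modes can be shown with a small standalone sketch. The token strings and the `(entity_id, token)` pair layout below are made up for illustration and do not reflect tab2seq's internal data structures.

```python
from collections import Counter

# Hypothetical train events as (entity_id, token) pairs.
train_events = [
    ("E1", "DIAG_A"), ("E1", "DIAG_A"), ("E1", "DIAG_A"), ("E1", "DIAG_B"),
    ("E2", "DIAG_A"), ("E2", "DIAG_C"),
    ("E3", "DIAG_C"),
]

# "overall": every occurrence counts.
overall = Counter(token for _, token in train_events)

# "entity_unique": deduplicate (entity, token) pairs first, so each entity
# contributes at most one count per token.
entity_unique = Counter(token for _, token in set(train_events))

min_token_count = 3
kept_overall = sorted(t for t, c in overall.items() if c >= min_token_count)
kept_entity_unique = sorted(t for t, c in entity_unique.items() if c >= min_token_count)

print(kept_overall)        # ['DIAG_A']
print(kept_entity_unique)  # []
```

Here `DIAG_A` clears the threshold under overall counting only because entity `E1` repeats it three times; under entity_unique counting it is seen in just two entities and is dropped.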
Two helpers are useful for inspecting a fitted vocabulary before encoding:
# Column → prefix mapping per source
print(vocab.column_prefixes("health"))
# {'cost': 'COST', 'length_of_stay': 'LOS', 'diagnosis': 'DIAG', ...}
# Bin edges for a continuous column (fitted on train data only)
print(vocab.bin_edges_for("health", "cost"))
5. Build and Persist Tokenized Event Datasets
EventDataset produces one row per event with integer token IDs, time features, and optional derived columns.
from tab2seq.datasets import EventDataset, EventDatasetConfig, RelativeDateRule
dataset = EventDataset(
cohort=cohort,
tokenizer=Tokenizer(vocab),
dataset_config=EventDatasetConfig(
reference_date="1970-01-01",
threshold_date="2021-01-01",
include_after_threshold=True,
include_token_str=True,
embed_static_in_events=False, # keep static features in a separate file
relative_date_features=[
RelativeDateRule(
source_static_column="labour__birthday",
output_column="age_years",
unit="years",
floor_int=True,
),
],
),
)
artifacts = dataset.write_parquet(dataset_name="my_dataset_v1", force_write=True)
print(artifacts.dataset_dir)
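The two kinds of time feature configured above (days since reference_date, and a relative-date feature such as age_years with floor_int=True) can be sketched with the standard library. The helpers below are hypothetical, and the calendar-based definition of a completed year is an assumption; the library may instead divide a day count by a fixed year length. The birthday is an invented value.

```python
from datetime import date


def days_since(reference: date, event_date: date) -> int:
    """Whole days elapsed between the reference date and the event."""
    return (event_date - reference).days


def age_in_years(birthday: date, event_date: date) -> int:
    """Completed calendar years between birthday and event_date (floored)."""
    had_birthday = (event_date.month, event_date.day) >= (birthday.month, birthday.day)
    return event_date.year - birthday.year - (0 if had_birthday else 1)


reference = date(1970, 1, 1)
birthday = date(1986, 6, 15)     # hypothetical static value (labour__birthday)
event_date = date(2015, 1, 1)

print(days_since(reference, event_date))   # 16436
print(age_in_years(birthday, event_date))  # 28
```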
6. Load and Read Records
You can reload a saved dataset without rebuilding sources, cohort, or tokenizer.
dataset_loaded = EventDataset.from_name(
name="my_dataset_v1",
registry_dir=cohort.cache_dir / "datasets",
)
Four access patterns are available on any EventDataset:
# Fetch a specific entity by ID (returns None if not in that split)
record = dataset_loaded.get_entity_record("E00003", split="train")
# Random sample
record = dataset_loaded.sample_entity_record(split="train", seed=7)
# Full iterator sweep
for record in dataset_loaded.iter_entity_records(split="train", shuffle=True, seed=42):
# record = {"entity_id": ..., "split": ..., "static": {...}, "events": [...]}
pass
# Stateful one-at-a-time — remembers position across calls, returns None when exhausted
record = dataset_loaded.next_entity_record(split="val", shuffle=True, seed=0, reset=True)
while record is not None:
record = dataset_loaded.next_entity_record(split="val", shuffle=True, seed=0)
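The stateful pattern above (a cursor that remembers its position, returns None when exhausted, and restarts on reset=True) can be sketched in isolation. `EntityCursor` is a hypothetical class written for this illustration; it only mimics the call shape of next_entity_record and says nothing about how EventDataset implements it.

```python
import random


class EntityCursor:
    """Hypothetical stateful cursor over entity IDs; not part of tab2seq."""

    def __init__(self, entity_ids):
        self._ids = sorted(entity_ids)
        self._order = None   # iteration order, built lazily on first call or reset
        self._pos = 0

    def next(self, shuffle=False, seed=0, reset=False):
        if reset or self._order is None:
            self._order = list(self._ids)
            if shuffle:
                random.Random(seed).shuffle(self._order)  # seeded, so reproducible
            self._pos = 0
        if self._pos >= len(self._order):
            return None      # exhausted until the caller passes reset=True
        entity_id = self._order[self._pos]
        self._pos += 1
        return entity_id


cursor = EntityCursor(["E3", "E1", "E2"])
seen = []
entity_id = cursor.next(shuffle=True, seed=0, reset=True)
while entity_id is not None:
    seen.append(entity_id)
    entity_id = cursor.next(shuffle=True, seed=0)
print(seen)
```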
All four methods accept a format parameter:
| Format | Returns | Best for |
|---|---|---|
| "raw" | Python dicts (one dict per event) | inspection, custom collation |
| "frame" | Polars DataFrames | filtering, feature analysis |
| "tensor" | Flat NumPy arrays + event lengths | custom PyTorch/JAX collation |
| "padded_tensor" | 2-D padded NumPy matrix + attention mask | direct DataLoader use |
raw (default)
record = dataset_loaded.sample_entity_record("train", seed=42, format="raw")
# record["entity_id"] → str
# record["split"] → "train" | "val" | "test"
# record["static"] → {"entity_id": ..., "labour__birthday": ..., "token_ids": [...], ...}
# record["events"] → list of dicts, one per event:
# event["primary_timestamp"] → "2015-01-01"
# event["source_name"] → "labour"
# event["token_ids"] → [105, 86, 98, 110, 3]
# event["age_years"] → 28 # relative-date feature
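As an example of the custom collation the raw format allows, the sketch below flattens a list of event dicts into a single token sequence while keeping only the most recent whole events that fit a length budget. `flatten_recent_events` is a hypothetical helper, the events are assumed to be in chronological order, and all token IDs after the first event are invented.

```python
def flatten_recent_events(events, max_len):
    """Hypothetical helper: keep the newest whole events that fit in max_len tokens."""
    kept = []
    budget = max_len
    for event in reversed(events):        # walk from newest to oldest
        n_tokens = len(event["token_ids"])
        if n_tokens > budget:
            break                         # never split an event in the middle
        kept.append(event)
        budget -= n_tokens
    kept.reverse()                        # restore chronological order
    return [tok for event in kept for tok in event["token_ids"]]


# Stand-in for record["events"]; assumed to be sorted by primary_timestamp.
events = [
    {"primary_timestamp": "2015-01-01", "token_ids": [105, 86, 98, 110, 3]},
    {"primary_timestamp": "2016-03-10", "token_ids": [120, 87, 3]},
    {"primary_timestamp": "2017-07-22", "token_ids": [131, 3]},
]

print(flatten_recent_events(events, max_len=6))  # [120, 87, 3, 131, 3]
```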
frame
Returns Polars DataFrames — avoids to_dicts() overhead for downstream filtering.
record = dataset_loaded.sample_entity_record("train", seed=7, format="frame")
# record["entity_id"] → str
# record["static_token_ids"] → list[int]
# record["events"] → polars.DataFrame with columns:
# primary_timestamp, source_name, token_ids (list[i64]), age_years, ...
tensor
Returns flat NumPy arrays. token_ids concatenates all events into a single 1-D array;
use event_lengths to split them back per event. temporal stacks time and any
relative-date features into a [num_events, T] float array.
Pass include_cls=True to prepend a [CLS] token to the sequence and include_sep=True
to insert a [SEP] token between events.
record = dataset_loaded.sample_entity_record(
"train", seed=7, format="tensor", include_cls=True, include_sep=True
)
# record["token_ids"] → ndarray shape (total_tokens,) — all events concatenated
# record["event_lengths"] → ndarray shape (num_events,) — tokens per event
# record["time"] → ndarray shape (num_events,) — days since reference_date
# record["temporal"] → ndarray shape (num_events, T) — time + rel-date features
# record["static_token_ids"] → list[int]
# Reconstruct per-event token lists
import numpy as np
per_event = np.split(record["token_ids"], np.cumsum(record["event_lengths"])[:-1])
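A related trick with the flat layout is to expand per-event values, such as time, to one value per token so they line up with token_ids. The sketch below uses a hand-built dict that mirrors the documented keys; the values are illustrative, and it assumes event_lengths sums to the total number of tokens, as the np.split example implies.

```python
import numpy as np

# Hand-built stand-in for a format="tensor" record (values are illustrative).
record = {
    "token_ids": np.array([105, 86, 98, 120, 87, 131]),
    "event_lengths": np.array([3, 2, 1]),
    "time": np.array([16436.0, 16870.0, 17369.0]),
}

# Index of the owning event for every token.
num_events = len(record["event_lengths"])
event_index = np.repeat(np.arange(num_events), record["event_lengths"])

# Per-token timestamps aligned one-to-one with token_ids.
token_time = record["time"][event_index]

print(event_index)  # [0 0 0 1 1 2]
print(token_time)
```

The same `event_index` array can serve as a segment ID input for models that embed event position separately from token position.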
padded_tensor
Like tensor but produces a 2-D [num_events, max_event_len] matrix padded with
pad_id. Drops directly into a PyTorch DataLoader without further collation.
record = dataset_loaded.sample_entity_record(
"train", seed=7, format="padded_tensor", pad_id=0
)
# record["token_ids"] → ndarray shape (num_events, max_event_len)
# record["attention_mask"] → bool ndarray shape (num_events, max_event_len)
# record["time"] → ndarray shape (num_events,)
# record["static_token_ids"] → list[int]
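The relationship between the flat tensor layout and the padded layout can be sketched with NumPy alone. `pad_events` is a hypothetical helper showing one way to go from flat tokens plus event lengths to a padded matrix and a boolean mask; it is not the library's implementation, and the token IDs are illustrative.

```python
import numpy as np


def pad_events(token_ids, event_lengths, pad_id=0):
    """Hypothetical helper: flat tokens + lengths -> (padded matrix, boolean mask)."""
    num_events = len(event_lengths)
    max_len = int(event_lengths.max()) if num_events else 0
    # mask[i, j] is True where event i has a real token at position j.
    mask = np.arange(max_len)[None, :] < event_lengths[:, None]
    padded = np.full((num_events, max_len), pad_id, dtype=token_ids.dtype)
    # Boolean assignment fills True cells in row-major order, which matches
    # the order in which events were concatenated.
    padded[mask] = token_ids
    return padded, mask


token_ids = np.array([105, 86, 98, 120, 87, 131])
event_lengths = np.array([3, 2, 1])
padded, mask = pad_events(token_ids, event_lengths, pad_id=0)

print(padded)
print(mask)
```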
Synthetic Registries
generate_synthetic_data / generate_synthetic_collections create four registry-style tables with realistic temporal patterns, missing data, and cross-field correlations:
| Registry | Key columns |
|---|---|
| health | diagnosis, procedure, department, cost, length_of_stay |
| income | income_type, sector, income_amount |
| labour | status, occupation, weekly_hours, residence_region, birthday |
| survey | education_level, marital_status, self_rated_health, satisfaction_score |
Development
pip install -e ".[dev]"
pytest # run tests
pytest --cov=tab2seq # with coverage
black src/tab2seq tests # format
ruff check src/tab2seq tests # lint
Roadmap
- Synthetic datasets
- Source / SourceCollection
- Cohort + splits
- Vocabulary (leakage-safe)
- Tokenizer / EventDataset
- Parquet persistence + caching
- Full Life2Vec / Life2Vec-Light preprocessing parity
- Subsetting cohorts for fine-tuning
- Example covering tokenization and Transformer training
- Documentation site
Citation
If you use tab2seq, please cite:
@software{tab2seq2026,
author = {Savcisens, Germans},
title = {tab2seq: Scalable Tabular to Sequential Data Processing},
year = {2026},
url = {https://github.com/carlomarxdk/tab2seq}
}
And the original Life2Vec paper that inspired this work:
@article{savcisens2024using,
title={Using sequences of life-events to predict human lives},
author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
journal={Nature Computational Science},
volume={4},
number={1},
pages={43--56},
year={2024},
publisher={Nature Publishing Group US New York}
}
Acknowledgments
- Inspired by the data processing pipeline from Life2Vec and Life2Vec-Light
- Built with Polars and Pydantic.
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
License
MIT License - see LICENSE file for details.
Support
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions