Transform tabular event data into sequences ready for Transformer and sequential models: Life2Vec, BEHRT, and more.

tab2seq

tab2seq adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into tokenized sequences ready for Transformer and sequential deep learning models. The package reimplements the data-preprocessing steps of the life2vec and life2vec-light repos.

Note: This is a BETA version of the package.

About

This package extracts and generalizes the data processing patterns from the Life2Vec project, making them reusable for similar research projects that need to:

  • Work with multiple longitudinal data sources (registries, databases)
  • Define and filter cohorts based on inclusion criteria
  • Create deterministic train/val/test splits with static context
  • Fit a vocabulary on training data only (no leakage)
  • Produce tokenized, model-ready event sequences with time features
  • Generate realistic synthetic data for development and testing

Whether you're working with healthcare data, financial records, or any time-stamped event data, tab2seq provides the building blocks for preparing data for Life2Vec-style sequential models.

Pipeline Overview

Sources → Cohort → Vocabulary → EventDataset → Model-ready Parquet
  1. Sources – Define one SourceConfig per event table (health visits, labour records, income, etc.). Each config declares which columns are categorical, continuous, or timestamps.
  2. Cohort – Unite sources into a single entity universe, apply inclusion criteria, and split into train/val/test with deterministic seeds.
  3. Vocabulary – Fit token mappings and continuous-feature bin edges on the train split only to prevent leakage.
  4. EventDataset – Build tokenized event rows per split, derive relative-date features (e.g. age), and persist to Parquet with metadata.

Features

  • Multi-Source Data Management: Handle multiple data sources (registries) with unified schema
  • Cohort Construction: Entity-level inclusion criteria across sources, deterministic splits, static-attribute propagation
  • Train-Only Vocabulary: Token and bin-edge fitting restricted to training entities
  • Tokenized Event Datasets: Vectorized token-ID encoding, relative-date features, Parquet persistence
  • Entity Record Access: Iterator, random sample, and stateful next() retrieval patterns for downstream training loops
  • Type-Safe Configuration: Pydantic-based configuration with YAML support
  • Synthetic Data Generation: Generate realistic dummy registry data for testing and exploration
  • Memory-Efficient Loading: Chunked iteration and lazy loading with Polars (see the sketch after this list)
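
To give a feel for the Polars-backed loading, here is a minimal sketch of lazily scanning one of the Parquet files the pipeline emits. The file path is hypothetical, and the column names mirror those used in the Quick Start below; in practice, use the paths returned by write_parquet().

import polars as pl

# Lazily scan an emitted split file (path is hypothetical; take the real
# paths from dataset.write_parquet() as shown in the Quick Start).
lf = pl.scan_parquet("data/events/train.parquet")

# Nothing is read until .collect(); Polars pushes the filter and the
# column projection down into the Parquet scan.
sample = (
    lf.filter(pl.col("source_name") == "health")
    .select(["entity_id", "primary_timestamp", "token_ids"])
    .head(1_000)
    .collect()
)

Because the scan is lazy, only the row groups and columns the query needs are actually read from disk.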

Installation

pip install tab2seq

Quick Start

The full pipeline from raw data to model-ready sequences in five steps.

1. Generate Synthetic Data

from tab2seq.datasets import generate_synthetic_data
import polars as pl

data_paths = generate_synthetic_data(
    output_dir="synthetic_data",
    n_entities=10_000,
    seed=742,
    registries=["health", "labour"],
)
pl.read_parquet(data_paths["health"]).head()

2. Define Sources

Each Source describes one event table: its file path, ID column, timestamp, and feature columns.

from tab2seq.source import (
    Source, SourceCollection, SourceConfig,
    CategoricalColConfig, ContinuousColConfig, TimestampColConfig,
)

configs = [
    SourceConfig(
        name="health",
        filepath="synthetic_data/health.parquet",
        id_col="entity_id",
        categorical_cols=[
            CategoricalColConfig(col_name="diagnosis", prefix="DIAG"),
            CategoricalColConfig(col_name="procedure", prefix="PROC"),
            CategoricalColConfig(col_name="department", prefix="DEPT"),
        ],
        continuous_cols=[
            ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20),
            ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=10),
        ],
        timestamp_cols=[
            TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
        ],
    ),
    SourceConfig(
        name="labour",
        filepath="synthetic_data/labour.parquet",
        id_col="entity_id",
        categorical_cols=[
            CategoricalColConfig(col_name="status", prefix="STATUS"),
            CategoricalColConfig(col_name="occupation", prefix="OCC"),
            CategoricalColConfig(col_name="residence_region", prefix="REGION"),
            CategoricalColConfig(col_name="native_language", prefix="LANG", static=True),
        ],
        continuous_cols=[
            ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS", n_bins=10),
        ],
        timestamp_cols=[
            TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
            TimestampColConfig(col_name="birthday", static=True, drop_na=True),
        ],
    ),
]

collection = SourceCollection.from_configs(configs)

for source in collection:
    print(f"{source.name}: {len(source.get_entity_ids())} entities")

Columns marked static=True are carried through to the cohort split table as entity-level attributes (e.g. birthday, native language).

3. Build a Cohort

A Cohort resolves one consistent entity universe across all sources, applies inclusion criteria, and generates deterministic train/val/test splits.

from tab2seq.cohort import Cohort, CohortConfig, EntityInclusionCriteria

criteria = [
    EntityInclusionCriteria(source_name="health", required=False),
    EntityInclusionCriteria(source_name="labour", required=True, min_events=1),
]

cohort = Cohort(
    name="my_cohort",
    sources=collection,
    inclusion_criteria=criteria,
    cache_dir="data/cohorts",
)

entities_df = cohort.build_entities_table(force_recompute=True)
print(f"Cohort size: {len(cohort)} entities")

split_cfg = CohortConfig(train_frac=0.7, val_frac=0.15, test_frac=0.15, seed=42)
split_df = cohort.build_or_load_splits(split_cfg, force_recompute=True)
split_df.head()

The split table contains one row per entity with the split label and all static columns.
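
As a quick sanity check, you can count entities per split and confirm that static attributes came through. This is a sketch: the split-label column name ("split") and the labour__birthday column follow the source__column naming used later in the Quick Start, and are assumptions rather than documented API.

# Entities per split (assumes the split label is stored in a "split" column)
print(split_df.group_by("split").len())

# Static attributes propagated from the sources (column name assumed to
# follow the source__column convention, e.g. labour__birthday)
print(split_df.select(["entity_id", "split", "labour__birthday"]).head())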

4. Fit a Vocabulary (Train Only)

The vocabulary maps categorical values to token strings and bins continuous features. Both mappings are fitted exclusively on training entities to prevent leakage.

from tab2seq.config import TokenizerConfig
from tab2seq.tokenization import Vocabulary

tok_cfg = TokenizerConfig()
tok_cfg.vocabulary.min_token_count = 1
tok_cfg.vocabulary.max_vocab_size = 50_000

vocab = Vocabulary(tok_cfg.vocabulary)
vocab_df = vocab.fit_from_cohort_train(
    cohort=cohort,
    split_config=split_cfg,
    force_recompute=True,
)
print(f"Vocabulary size: {vocab_df.height}")

5. Build Tokenized Event Datasets

EventDataset produces one row per event with integer token IDs, time features, and optional derived columns.

from tab2seq.datasets import EventDataset, EventDatasetConfig, RelativeDateRule

dataset_cfg = EventDatasetConfig(
    reference_date="1970-01-01",
    threshold_date="2021-01-01",
    include_after_threshold=True,
    include_token_str=True,
    relative_date_features=[
        RelativeDateRule(
            source_static_column="labour__birthday",
            output_column="age_years",
            unit="years",
        ),
    ],
)

dataset = EventDataset(
    cohort=cohort,
    vocabulary=vocab,
    split_config=split_cfg,
    dataset_config=dataset_cfg,
)

# Inspect one split in memory
train_events = dataset.build_split("train", force_recompute_splits=True)
print(train_events.select(
    ["entity_id", "source_name", "primary_timestamp", "token_ids", "age_years"]
).head(5))

# Persist all splits + static table + metadata to Parquet
artifacts = dataset.write_parquet(force_recompute_splits=True)
print(artifacts.split_paths)

Retrieving Entity Records

Three patterns for feeding records into a training loop:

# Full iterator sweep
for record in dataset.iter_entity_records(split="train", shuffle=True, seed=42):
    # record = {"entity_id": ..., "split": ..., "static": {...}, "events": [...]}
    pass

# Random sample
record = dataset.sample_entity_record(split="train", seed=7)

# Stateful next() — remembers position across calls
record = dataset.next_entity_record(split="train", shuffle=True, seed=0, reset=True)
while record is not None:
    record = dataset.next_entity_record(split="train", shuffle=True, seed=0)
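
To feed these records into PyTorch, one option is to wrap the iterator in an IterableDataset. The following is a minimal sketch, not part of tab2seq: it assumes each item in record["events"] exposes its integer token IDs under a "token_ids" key, mirroring the token_ids column from step 5.

import torch
from torch.utils.data import DataLoader, IterableDataset

class EntityRecordStream(IterableDataset):
    """Streams one tensor of token IDs per entity from an EventDataset."""

    def __init__(self, dataset, split="train", seed=42):
        self.dataset = dataset
        self.split = split
        self.seed = seed

    def __iter__(self):
        for record in self.dataset.iter_entity_records(
            split=self.split, shuffle=True, seed=self.seed
        ):
            # Flatten per-event token IDs into one sequence per entity
            # (the "token_ids" key is an assumption, not documented API).
            ids = [t for event in record["events"] for t in event["token_ids"]]
            yield torch.tensor(ids, dtype=torch.long)

# batch_size=None yields one variable-length sequence at a time;
# padding and batching would live in a custom collate_fn.
loader = DataLoader(EntityRecordStream(dataset), batch_size=None)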

Synthetic Registries

generate_synthetic_data / generate_synthetic_collections create four registry-style tables with realistic temporal patterns, missing data, and cross-field correlations:

  • health: diagnosis, procedure, department, cost, length_of_stay
  • income: income_type, sector, income_amount
  • labour: status, occupation, weekly_hours, residence_region, birthday
  • survey: education_level, marital_status, self_rated_health, satisfaction_score
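
Generating all four registries uses the same call as in the Quick Start; only the registries argument changes:

from tab2seq.datasets import generate_synthetic_data

paths = generate_synthetic_data(
    output_dir="synthetic_data",
    n_entities=10_000,
    seed=742,
    registries=["health", "income", "labour", "survey"],
)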

Use Cases

  • Healthcare Research: Transform electronic health records (EHR) into sequences for predictive modeling
  • Registry Data Processing: Work with multiple event-based registries (health, income, labour, surveys)
  • Sequential Modeling: Prepare multi-source data for Life2Vec, BEHRT, or other transformer-based models
  • Data Pipeline Development: Use synthetic data to develop and test processing pipelines before working with sensitive real data

TODOs

  • Synthetic Datasets
  • Source implementation
  • Cohort implementation
  • Cohort and data splits
  • Tokenization implementation
  • Vocabulary implementation
  • EventDataset builder
  • Caching and chunking
  • Documentation

Citation

If you use this package in your research, please cite:

@software{tab2seq2026,
  author = {Savcisens, Germans},
  title = {tab2seq: Scalable Tabular to Sequential Data Processing},
  year = {2026},
  url = {https://github.com/carlomarxdk/tab2seq}
}

And the original Life2Vec paper that inspired this work:

@article{savcisens2024using,
  title={Using sequences of life-events to predict human lives},
  author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
  journal={Nature Computational Science},
  volume={4},
  number={1},
  pages={43--56},
  year={2024},
  publisher={Nature Publishing Group US New York}
}

Acknowledgments

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

MIT License: see LICENSE file for details.

Support
