
Transform tabular event data into sequences ready for Transformer and Sequential models: Life2Vec, BEHRT and more.


tab2seq


tab2seq adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into formats ready for Transformer and sequential deep learning models.

[!WARNING] This is an alpha package. The beta release will reimplement all of the data-preprocessing steps from the life2vec and life2vec-light repos. See the TODOs section for what is implemented so far.

About

This package extracts and generalizes the data processing patterns from the Life2Vec project, making them reusable for similar research projects that need to:

  • Work with multiple longitudinal data sources (registries, databases)
  • Define and filter cohorts based on complex criteria
  • Generate realistic synthetic data for development and testing
  • Process large-scale tabular event data efficiently

Whether you're working with healthcare data, financial records, or any time-stamped event data, tab2seq provides the building blocks for preparing data for Life2Vec-style sequential models.

Features

  • Multi-Source Data Management: Handle multiple data sources (registries) with unified schema
  • Type-Safe Configuration: Pydantic-based configuration with YAML support (see the sketch after this list)
  • Synthetic Data Generation: Generate realistic dummy registry data for testing and exploration
  • Memory-Efficient Loading: Chunked iteration and lazy loading with Polars
  • Schema Validation: Automatic validation of entity IDs, timestamps, and column types
  • Cross-Source Operations: Unified access and operations across multiple data sources
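
The feature list mentions YAML support; as a rough sketch (assuming a standard PyYAML load and a YAML layout that simply mirrors the SourceConfig fields used in the Quick Start below), a source definition could be kept on disk and validated through the Pydantic model directly:

import yaml  # assumption: PyYAML is available; the package may also ship its own loader

from tab2seq.source import SourceConfig

yaml_text = """
name: health
filepath: synthetic_data/health.parquet
id_col: entity_id
categorical_cols:
  - {col_name: diagnosis, prefix: DIAG}
  - {col_name: procedure, prefix: PROC}
continuous_cols:
  - {col_name: cost, prefix: COST, n_bins: 20, strategy: quantile}
timestamp_cols:
  - {col_name: date, is_primary: true, drop_na: true}
output_format: parquet
"""

# Pydantic validates field names, types, and nested column configs on construction
config = SourceConfig(**yaml.safe_load(yaml_text))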

Installation

# Basic installation
pip install tab2seq

Quick Start

Working with a Single Source

from tab2seq.source import (
    Source,
    SourceConfig,
    CategoricalColConfig,
    ContinuousColConfig,
    TimestampColConfig,
)

config = SourceConfig(
    name="health",
    filepath="synthetic_data/health.parquet",
    id_col="entity_id",
    categorical_cols=[
        CategoricalColConfig(col_name="diagnosis", prefix="DIAG"),
        CategoricalColConfig(col_name="procedure", prefix="PROC"),
        CategoricalColConfig(col_name="department", prefix="DEPT"),
    ],
    continuous_cols=[
        ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20, strategy="quantile"),
        ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=20, strategy="quantile"),
    ],
    output_format="parquet",
    timestamp_cols=[
        TimestampColConfig(col_name="date", is_primary=True, drop_na=True)
    ]
)

source = Source(config=config)

print("Number of unique IDs:", len(source.get_entity_ids()))

# Process and tokenize the columns
lf_health = source.process(cache=True)
lf_health.head()

Working with Multiple Sources

from tab2seq.source import SourceCollection, SourceConfig, CategoricalColConfig, ContinuousColConfig, TimestampColConfig

# Define your data sources
configs = [
    SourceConfig(
        name="health",
        filepath="synthetic_data/health.parquet",
        id_col="entity_id",
        categorical_cols=[
            CategoricalColConfig(col_name="diagnosis", prefix="DIAG"),
            CategoricalColConfig(col_name="procedure", prefix="PROC"),
            CategoricalColConfig(col_name="department", prefix="DEPT"),
        ],
        continuous_cols=[
            ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20, strategy="quantile"),
            ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=20, strategy="quantile"),
        ],
        output_format="parquet",
        timestamp_cols=[
            TimestampColConfig(col_name="date", is_primary=True, drop_na=True)
        ]
    ),
    SourceConfig(
        name="labour",
        filepath="synthetic_data/labour.parquet",
        id_col="entity_id",
        categorical_cols=[
            CategoricalColConfig(col_name="status", prefix="STATUS"),
            CategoricalColConfig(col_name="occupation", prefix="OCC"),
            CategoricalColConfig(col_name="residence_region", prefix="REGION"),
        ],
        continuous_cols=[
            ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS")
        ],
        output_format="parquet",
        timestamp_cols=[
            TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
            TimestampColConfig(col_name="birthday", is_primary=False, drop_na=True),
        ],
    ),
]

# Create a source collection
collection = SourceCollection.from_configs(configs)

# Access individual sources
health = collection["health"]
df = health.read_all()

# Or iterate over all sources
for source in collection:
    print(f"{source.name}: {len(source.get_entity_ids())} entities")

# Cross-source operations
all_entity_ids = collection.get_all_entity_ids()
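
To make the cross-source calls above a little more concrete, here is a small sketch that finds the entities present in every source; it assumes get_entity_ids() and get_all_entity_ids() return plain collections of IDs, as the earlier len() calls suggest:

# Entities present in *every* source: intersect the per-source ID sets
per_source_ids = [set(source.get_entity_ids()) for source in collection]
common_ids = set.intersection(*per_source_ids)
print(f"{len(common_ids)} of {len(set(all_entity_ids))} entities appear in all sources")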

Generating Synthetic Data

from tab2seq.datasets import generate_synthetic_data
import polars as pl

# Generate synthetic registry data
data_paths = generate_synthetic_data(
    output_dir="synthetic_data",
    n_entities=10000,
    seed=742,
    registries=["health", "labour", "survey", "income"],
    file_format="parquet",
)

df_health = pl.read_parquet(data_paths["health"])
df_health.head()
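
Since generate_synthetic_data returns a mapping from registry name to file path, a quick sanity check is to loop over everything it produced; the snippet below assumes the return value behaves like a plain dict and uses only Polars calls already shown above:

# Inspect every generated registry: row count and column names
for registry, path in data_paths.items():
    df = pl.read_parquet(path)
    print(f"{registry}: {df.height} rows, columns: {df.columns}")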

Architecture

[!WARNING] Work in progress!

Available Registries:

  • health: Medical events with diagnoses (ICD codes), procedures, departments, costs, and length of stay
  • income: Yearly income records with income type, sector, and amounts
  • labour: Quarterly labour status with occupation, employment status, and residence
  • survey: Periodic survey responses with education level, marital status, and satisfaction scores

All synthetic data includes realistic temporal patterns, missing data, and correlations between fields to mimic real-world registry data.
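
As an illustration, one of these registries can be plugged into the same Source API shown in the Quick Start. The column names below (income_type, sector, amount, date) are hypothetical placeholders chosen to match the description above, not a documented schema, so check the generated file (for example with the inspection loop shown earlier) before relying on them:

from tab2seq.source import Source, SourceConfig, CategoricalColConfig, ContinuousColConfig, TimestampColConfig

# Hypothetical column names -- verify them against the generated parquet file first
income_config = SourceConfig(
    name="income",
    filepath="synthetic_data/income.parquet",
    id_col="entity_id",
    categorical_cols=[
        CategoricalColConfig(col_name="income_type", prefix="INC_TYPE"),
        CategoricalColConfig(col_name="sector", prefix="SECTOR"),
    ],
    continuous_cols=[
        ContinuousColConfig(col_name="amount", prefix="AMOUNT", n_bins=20, strategy="quantile"),
    ],
    timestamp_cols=[
        TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
    ],
    output_format="parquet",
)
income = Source(config=income_config)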

Use Cases

  • Healthcare Research: Transform electronic health records (EHR) into sequences for predictive modeling
  • Registry Data Processing: Work with multiple event-based registries (health, income, labour, surveys)
  • Sequential Modeling: Prepare multi-source data for Life2Vec, BEHRT, or other transformer-based models
  • Data Pipeline Development: Use synthetic data to develop and test processing pipelines before working with sensitive real data
  • Multi-Source Analysis: Combine and analyze data from multiple longitudinal sources with unified tooling

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tab2seq --cov-report=html

# Format code
black src/tab2seq tests

# Lint code
ruff check src/tab2seq tests

TODOs

  • Synthetic Datasets
  • Source implementation
  • Cohort implementation
  • Cohort and data splits
  • Tokenization implementation
  • Vocabulary implementation
  • Caching and chunking
  • Documentation

Citation

If you use this package in your research, please cite:

@software{tab2seq2026,
  author = {Savcisens, Germans},
  title = {tab2seq: Scalable Tabular to Sequential Data Processing},
  year = {2026},
  url = {https://github.com/carlomarxdk/tab2seq}
}

And the original Life2Vec paper that inspired this work:

@article{savcisens2024using,
  title={Using sequences of life-events to predict human lives},
  author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
  journal={Nature Computational Science},
  volume={4},
  number={1},
  pages={43--56},
  year={2024},
  publisher={Nature Publishing Group US New York}
}

Acknowledgments

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

MIT License - see LICENSE file for details.

Support

