Transform tabular event data into sequences ready for Transformer-based and sequential models such as Life2Vec and BEHRT.
tab2seq
tab2seq adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into formats ready for Transformer and sequential deep learning models.
[!WARNING] This is an alpha package. The beta version will reimplement all of the data-preprocessing steps from the life2vec and life2vec-light repositories. See the TODOs section for what is implemented so far.
About
This package extracts and generalizes the data processing patterns from the Life2Vec project, making them reusable for similar research projects that need to:
- Work with multiple longitudinal data sources (registries, databases)
- Define and filter cohorts based on complex criteria
- Generate realistic synthetic data for development and testing
- Process large-scale tabular event data efficiently
Whether you're working with healthcare data, financial records, or any time-stamped event data, tab2seq provides the building blocks for preparing data for Life2Vec-style sequential models.
Features
- Multi-Source Data Management: Handle multiple data sources (registries) with unified schema
- Type-Safe Configuration: Pydantic-based configuration with YAML support
- Synthetic Data Generation: Generate realistic dummy registry data for testing and exploration
- Memory-Efficient Loading: Chunked iteration and lazy loading with Polars
- Schema Validation: Automatic validation of entity IDs, timestamps, and column types
- Cross-Source Operations: Unified access and operations across multiple data sources
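The schema validation mentioned above can be illustrated with a small stand-alone check. This is a sketch in plain Python, independent of tab2seq's actual implementation; the function name and rules are illustrative only:

```python
from datetime import datetime

def validate_event_rows(rows, entity_id_col, timestamp_cols):
    """Illustrative check: every row must carry a non-empty entity ID
    and parseable ISO timestamps in the declared timestamp columns."""
    errors = []
    for i, row in enumerate(rows):
        if not row.get(entity_id_col):
            errors.append(f"row {i}: missing {entity_id_col}")
        for col in timestamp_cols:
            value = row.get(col)
            try:
                datetime.fromisoformat(str(value))
            except (TypeError, ValueError):
                errors.append(f"row {i}: bad timestamp in {col!r}: {value!r}")
    return errors

rows = [
    {"patient_id": "p1", "date": "2020-01-15"},
    {"patient_id": "", "date": "not-a-date"},
]
print(validate_event_rows(rows, "patient_id", ["date"]))
```

Collecting all errors rather than raising on the first one makes it easier to report every schema problem in a large registry file in a single pass.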
Installation
```bash
# Basic installation
pip install tab2seq

# Development installation (from a source checkout)
pip install -e ".[dev]"
```
Quick Start
Working with Multiple Data Sources
```python
from tab2seq.source import Source, SourceCollection, SourceConfig

# Define your data sources
configs = [
    SourceConfig(
        name="health",
        filepath="data/health.parquet",
        entity_id_col="patient_id",
        timestamp_cols=["date"],
        categorical_cols=["diagnosis", "procedure", "department"],
        continuous_cols=["cost", "length_of_stay"],
    ),
    SourceConfig(
        name="income",
        filepath="data/income.parquet",
        entity_id_col="person_id",
        timestamp_cols=["year"],
        categorical_cols=["income_type", "sector"],
        continuous_cols=["income_amount"],
    ),
]

# Create a source collection
collection = SourceCollection.from_configs(configs)

# Access individual sources
health = collection["health"]
df = health.read_all()

# Or iterate over all sources
for source in collection:
    print(f"{source.name}: {len(source.get_entity_ids())} entities")

# Cross-source operations
all_entity_ids = collection.get_all_entity_ids()
```
Generating Synthetic Data
```python
from tab2seq.datasets import generate_synthetic_collections

# Generate synthetic registry data for testing
collection = generate_synthetic_collections(
    output_dir="data/dummy",
    n_entities=1000,
    seed=42,
)

# Returns a ready-to-use SourceCollection
health = collection["health"]
print(health.read_all().head())
```
Architecture
[!WARNING] Work in progress!
Available Registries:
- health: Medical events with diagnoses (ICD codes), procedures, departments, costs, and length of stay
- income: Yearly income records with income type, sector, and amounts
- labour: Quarterly labour status with occupation, employment status, and residence
- survey: Periodic survey responses with education level, marital status, and satisfaction scores
All synthetic data includes realistic temporal patterns, missing data, and correlations between fields to mimic real-world registry data.
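A minimal version of this kind of generator can be sketched with the standard library alone. The field names, codes, and rates below are illustrative, not tab2seq's actual schema; the point is the forward-moving per-entity timestamps and the injected missing values:

```python
import random
from datetime import date, timedelta

def generate_health_events(n_entities, seed=42, missing_rate=0.1):
    """Generate toy medical events with increasing timestamps per entity
    and occasional missing diagnoses, mimicking real registry quirks."""
    rng = random.Random(seed)
    events = []
    for entity in range(n_entities):
        current = date(2015, 1, 1) + timedelta(days=rng.randint(0, 365))
        for _ in range(rng.randint(1, 5)):
            # Timestamps move strictly forward within each entity's history
            current += timedelta(days=rng.randint(7, 400))
            diagnosis = None if rng.random() < missing_rate else rng.choice(
                ["I10", "E11", "J45", "M54"]  # a few ICD-10 codes
            )
            events.append({
                "patient_id": f"p{entity}",
                "date": current.isoformat(),
                "diagnosis": diagnosis,
                "cost": round(rng.uniform(50, 2000), 2),
            })
    return events

events = generate_health_events(n_entities=3)
print(len(events), events[0])
```

Seeding the generator makes the output reproducible, which matters when synthetic data is used in tests that assert on downstream pipeline behaviour.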
Use Cases
- Healthcare Research: Transform electronic health records (EHR) into sequences for predictive modeling
- Registry Data Processing: Work with multiple event-based registries (health, income, labour, surveys)
- Sequential Modeling: Prepare multi-source data for Life2Vec, BEHRT, or other transformer-based models
- Data Pipeline Development: Use synthetic data to develop and test processing pipelines before working with sensitive real data
- Multi-Source Analysis: Combine and analyze data from multiple longitudinal sources with unified tooling
Development
```bash
# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tab2seq --cov-report=html

# Format code
black src/tab2seq tests

# Lint code
ruff check src/tab2seq tests
```
TODOs
- Synthetic Datasets
- Source implementation
- Cohort implementation
- Cohort and data splits
- Tokenization implementation
- Vocabulary implementation
- Caching and chunking
Citation
If you use this package in your research, please cite:
```bibtex
@software{tab2seq2024,
  author = {Savcisens, Germans},
  title = {tab2seq: Scalable Tabular to Sequential Data Processing},
  year = {2024},
  url = {https://github.com/carlomarxdk/tab2seq}
}
```
And the original Life2Vec paper that inspired this work:
```bibtex
@article{savcisens2024using,
  title={Using sequences of life-events to predict human lives},
  author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
  journal={Nature Computational Science},
  volume={4},
  number={1},
  pages={43--56},
  year={2024},
  publisher={Nature Publishing Group US New York}
}
```
Acknowledgments
- Inspired by the data processing pipeline from Life2Vec and Life2Vec-Light
- Built with Polars, PyArrow, Pydantic, and Joblib
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
License
MIT License - see LICENSE file for details.
Support
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions