Generate synthetic data for data science, machine learning, and statistics.
Biomedical Data Generator
Generate reproducible synthetic biomedical datasets with known ground truth for teaching, benchmarking, and method development in high-dimensional machine learning settings.
Why This Package?
Biomedical machine learning operates in challenging p >> n regimes (thousands of features, dozens of samples). This generator creates synthetic datasets that mimic real-world complexity while providing complete ground truth:
- Teaching: Demonstrate cross-validation pitfalls, feature selection stability, and batch effect impacts
- Benchmarking: Compare feature selection methods with known informative features
- Research: Develop and validate new algorithms with controlled data properties
- Reproducibility: Deterministic generation for consistent educational materials
Compared to generic ML generators such as sklearn.datasets.make_classification, this package adds biomedical-specific structure: class-specific correlations, explicit batch effects, and rich metadata that records the full generative process (informative features, noise, correlated clusters, batch labels, configuration).
Key Features
✅ Class-specific correlations – Simulate pathway activation only in disease states
✅ Batch effects – Model technical variation with controllable confounding
✅ Correlated feature clusters – Equicorrelated and Toeplitz structures
✅ Flexible class balance – Exact sample counts per class
✅ Ground-truth metadata – Complete generative process documentation
✅ scikit-learn compatible – Seamless integration with ML pipelines
Installation
pip install biomedical-data-generator
Minimum Requirements: Python 3.11+
Quick Start
Basic Dataset
from biomedical_data_generator import DatasetConfig, ClassConfig, generate_dataset
cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=50, label="healthy"),
        ClassConfig(n_samples=50, label="diseased"),
    ],
    class_sep=1.5,
    random_state=42,
)
X, y, meta = generate_dataset(cfg)
print(f"Dataset shape: {X.shape}")
print(f"True informative features: {len(meta.informative_idx)}")
Here, y contains integer-encoded class labels (0, 1, ...).
If you provide human-readable labels via ClassConfig(label=...), these are stored in the metadata for later interpretation.
Class-Specific Correlations
Simulate biomarkers that only correlate in diseased patients:
from biomedical_data_generator import DatasetConfig, ClassConfig, CorrClusterConfig, generate_dataset
cfg = DatasetConfig(
    n_informative=3,
    n_noise=5,
    class_configs=[
        ClassConfig(n_samples=100, label="healthy"),
        ClassConfig(n_samples=100, label="diseased"),
    ],
    corr_clusters=[
        CorrClusterConfig(
            n_cluster_features=6,
            correlation=0.2,  # baseline correlation
            class_correlation={1: 0.9},  # strong correlation in diseased class
            structure="equicorrelated",
            anchor_role="informative",
            anchor_effect_size="medium",
        )
    ],
    random_state=42,
)
X, y, meta = generate_dataset(cfg)
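To make the two named correlation structures concrete, the sketch below builds small target correlation matrices in plain Python. It is an illustration only, not the package's internal code: the helper names `equicorrelated` and `toeplitz` are hypothetical, and the Toeplitz variant assumes the common convention that correlation decays as rho ** |i - j|, which the package may or may not follow.

```python
from __future__ import annotations


def equicorrelated(n_features: int, rho: float) -> list[list[float]]:
    """Correlation matrix with 1.0 on the diagonal and rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n_features)] for i in range(n_features)]


def toeplitz(n_features: int, rho: float) -> list[list[float]]:
    """Correlation matrix whose entries decay as rho ** |i - j| (assumed convention)."""
    return [[rho ** abs(i - j) for j in range(n_features)] for i in range(n_features)]


# Mirrors the configuration above: weak baseline correlation, strong in class 1.
baseline = equicorrelated(6, 0.2)
diseased = equicorrelated(6, 0.9)

# In a Toeplitz structure, neighbouring features correlate most strongly.
decaying = toeplitz(4, 0.5)
print(decaying[0])  # [1.0, 0.5, 0.25, 0.125]
```

With an equicorrelated cluster every pair of features shares the same correlation, whereas the Toeplitz form lets correlation fade with distance inside the cluster.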
Batch Effects
Model recruitment bias and technical variation:
from biomedical_data_generator import DatasetConfig, ClassConfig, BatchEffectsConfig, generate_dataset
cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=100, label="control"),
        ClassConfig(n_samples=100, label="disease"),
    ],
    batch_effects=BatchEffectsConfig(
        n_batches=3,
        effect_type="additive",
        effect_strength=0.5,
        confounding_with_class=0.7,  # recruitment bias
    ),
    random_state=42,
)
X, y, meta = generate_dataset(cfg)
print(f"Batch labels: {meta.batch_labels}")
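The idea of a batch effect that is confounded with class can be sketched without the package. The code below is a rough illustration under stated assumptions, not the package's algorithm: the helpers `assign_batches` and `add_batch_shift` are hypothetical, confounding is modelled as the probability that a sample lands in a batch preferred by its class, and the additive effect is a single Gaussian offset per batch.

```python
import random


def assign_batches(y, n_batches, confounding, rng):
    """With probability `confounding`, use the class's preferred batch; else pick uniformly."""
    batches = []
    for label in y:
        if rng.random() < confounding:
            batches.append(label % n_batches)  # preferred batch (assumed rule)
        else:
            batches.append(rng.randrange(n_batches))
    return batches


def add_batch_shift(X, batches, n_batches, strength, rng):
    """Add one Gaussian offset per batch to every feature of the samples in that batch."""
    shifts = [rng.gauss(0.0, strength) for _ in range(n_batches)]
    shifted = [[value + shifts[b] for value in row] for row, b in zip(X, batches)]
    return shifted, shifts


rng = random.Random(42)
y = [0] * 100 + [1] * 100
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in y]

batches = assign_batches(y, n_batches=3, confounding=0.7, rng=rng)
X_shifted, shifts = add_batch_shift(X, batches, n_batches=3, strength=0.5, rng=rng)

# Share of class-0 samples that ended up in batch 0, their preferred batch.
share = sum(1 for label, b in zip(y, batches) if label == 0 and b == 0) / 100
print(f"class 0 in batch 0: {share:.2f}")
```

Because most class-0 samples share a batch, a classifier can partly separate the classes from the batch offset alone, which is the recruitment-bias scenario the configuration above is meant to model.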
Documentation
📖 Full documentation: https://sigrun-may.github.io/biomedical-data-generator/
Use Cases
Educational Applications
Ideal for teaching machine learning in biomedical contexts:
- Feature selection stability across resampling splits
- Cross-validation pitfalls in p >> n settings
- Batch effect impacts on model generalization
- Correlated features and interpretability challenges
The package is complemented by Jupyter-based teaching materials (OER) that guide learners through dataset generation, visualization, and evaluation.
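Feature selection stability, the first topic in the list above, is often quantified as the mean pairwise Jaccard index between the feature sets selected on different resampling splits. The sketch below shows that metric on hand-written index sets; the function names are hypothetical and the choice of Jaccard is one common option, not something prescribed by the package.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: set[int], b: set[int]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selections: list[set[int]]) -> float:
    """Mean pairwise Jaccard index over all pairs of selected feature sets."""
    if len(selections) < 2:
        raise ValueError("need at least two selections to compare")
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Feature indices selected on three resampling splits (made-up values).
selected_per_split = [{0, 1, 2, 7}, {0, 1, 3, 7}, {0, 2, 3, 9}]
stability = selection_stability(selected_per_split)
print(f"stability: {stability:.3f}")  # stability: 0.422
```

A value of 1.0 means every split selected the same features; values near 0 mean the selections barely overlap.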
Research & Benchmarking
Systematic method comparison with known ground truth:
- Feature selection algorithm evaluation
- Model performance under varying signal-to-noise ratios
- Robustness testing with correlated features
- Batch correction method validation
Scientific Context
Biomedical datasets present unique challenges:
- High dimensionality: p >> n creates overfitting risks
- Correlated features: Biological pathways create feature clusters
- Batch effects: Multi-site and multi-batch studies introduce technical variation
- Class imbalance: Disease prevalence varies widely
This generator provides realistic synthetic data that captures these properties while maintaining complete ground truth for validation. This is particularly useful when real datasets are too small, protected, or lack clear ground truth about causal vs. non-causal structure.
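The overfitting risk of the p >> n regime can be seen with pure noise. The following sketch, which uses only the standard library and is independent of the package, draws 2000 noise features for 20 samples and reports the strongest correlation any of them has with the class labels by chance. The sample and feature counts are arbitrary illustration values.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
n_samples, n_features = 20, 2000
y = [0] * 10 + [1] * 10  # balanced binary labels

# Every feature is independent Gaussian noise with no relation to y.
max_abs_r = max(
    abs(pearson([rng.gauss(0.0, 1.0) for _ in range(n_samples)], y))
    for _ in range(n_features)
)
print(f"largest |r| between a noise feature and the labels: {max_abs_r:.2f}")
```

With so few samples, some noise feature almost always looks strongly associated with the labels, which is why known ground truth is valuable when judging a feature selection method.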
Architecture
The generator is implemented as a six-phase pipeline with single-responsibility modules:
- Label generation → Exact class counts (DatasetConfig.class_configs)
- Informative features → Class-separated signals
- Correlated clusters → Pathway-like structures with configurable correlation patterns
- Noise features → Independent distractors
- Assembly → Concatenation of all feature blocks into a single matrix
- Batch effects (optional) → Additive or multiplicative technical overlays, optionally confounded with class
Internally, the code is organized into dedicated modules for configuration, feature generation (informative, correlated, noise), batch effects, and metadata. A single random number generator drives the complete pipeline to ensure reproducibility.
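The phase ordering and the single-generator design can be sketched as a toy pipeline. This is a simplified stand-in written with the standard library, not the package's implementation: the function `generate` and its dictionary metadata are hypothetical, the correlated-cluster and batch-effect phases are omitted, and shifting the class mean by `label * class_sep` is an assumption about how class separation could work.

```python
import random


def generate(class_counts, n_informative, n_noise, class_sep, seed):
    """Toy pipeline: labels, informative features, noise features, assembly."""
    rng = random.Random(seed)  # the only source of randomness in the pipeline

    # Phase 1: labels with exact per-class counts.
    y = [label for label, count in enumerate(class_counts) for _ in range(count)]

    # Phase 2: informative features, class mean shifted by label * class_sep (assumed).
    informative = [
        [rng.gauss(label * class_sep, 1.0) for _ in range(n_informative)] for label in y
    ]

    # Phase 3 (correlated clusters) is omitted in this sketch.

    # Phase 4: independent noise features.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(n_noise)] for _ in y]

    # Phase 5: assembly into one matrix, plus ground-truth bookkeeping.
    X = [inf_row + noise_row for inf_row, noise_row in zip(informative, noise)]
    meta = {
        "informative_idx": list(range(n_informative)),
        "noise_idx": list(range(n_informative, n_informative + n_noise)),
        "seed": seed,
    }
    return X, y, meta


X_a, y_a, meta_a = generate([50, 50], n_informative=5, n_noise=10, class_sep=1.5, seed=42)
X_b, y_b, _ = generate([50, 50], n_informative=5, n_noise=10, class_sep=1.5, seed=42)
print(X_a == X_b and y_a == y_b)  # True: same seed, same dataset
```

Routing every random draw through one seeded generator is what makes the whole dataset a deterministic function of the configuration and the seed.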
The returned DatasetMeta object provides:
- Indices of informative features (e.g. meta.informative_idx)
- Indices of pure-noise features
- Indices or groupings of correlated feature clusters
- Class and batch labels
- A structured record of the configuration and random seeds used
This enables precise validation of feature selection and model behavior.
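As a small illustration of such validation, the sketch below scores a set of selected feature indices against the known informative indices using precision and recall. The helper `selection_scores` is hypothetical, and the index lists are made-up stand-ins for `meta.informative_idx` and for the output of a feature selection method.

```python
def selection_scores(selected, informative):
    """Precision and recall of selected feature indices against the ground truth."""
    selected, informative = set(selected), set(informative)
    true_positives = len(selected & informative)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(informative) if informative else 0.0
    return precision, recall


informative_idx = [0, 1, 2, 3, 4]  # stand-in for meta.informative_idx
selected = [0, 1, 4, 9, 12, 13]    # stand-in for a selector's output

precision, recall = selection_scores(selected, informative_idx)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.60
```

Precision penalises noise features that were selected, while recall penalises informative features that were missed.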
Examples
The examples/ directory contains complete demonstrations:
- 01_basic_usage.py – Simple dataset generation
- 02_batch_effects.py – Technical variation simulation
- 03_class_specific_correlations.py – Disease-specific pathway activation
- 04_feature_selection_stability.py – Benchmarking feature selection methods
Run any example:
python examples/01_basic_usage.py
Command-Line Interface
Generate datasets from YAML configuration:
bdg --config my_config.yaml --out dataset.csv
Example my_config.yaml:
n_informative: 5
n_noise: 10
class_configs:
  - n_samples: 50
    label: "control"
  - n_samples: 50
    label: "disease"
class_sep: 1.5
random_state: 42
Run bdg --help to see all available options.
Testing & Quality
The project includes a pytest-based test suite that covers:
- Informative feature generation
- Correlated feature clusters and target correlation structures
- Batch effect configurations and label generation
- The scikit-learn compatible interface
Tests are designed to ensure numerical stability, reproducibility, and consistency of the public API across releases.
Citation
If you use this package in scientific work, please cite:
@software{biomedical_data_generator,
  author = {May, Sigrun},
  title = {biomedical-data-generator: Synthetic biomedical data
           generator for benchmarking and teaching},
  year = {2025},
  url = {https://github.com/sigrun-may/biomedical-data-generator},
  version = {1.0.0}
}
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Add tests for new functionality
- Ensure all tests pass (pytest)
- Submit a pull request
See CONTRIBUTING.md for detailed guidelines.
License
This project is licensed under the MIT License – see LICENSE for details.
Acknowledgments
Developed at TU Braunschweig, TU Clausthal, and Ostfalia University with support from BMBF and the State of Lower Saxony.
The project fills gaps in existing synthetic data generators by providing:
- A unified framework for class-specific correlations
- Integrated batch effect simulation
- An educational focus with extensive documentation
- Complete ground truth metadata for validation
Links
- Documentation: https://sigrun-may.github.io/biomedical-data-generator/
- PyPI Package: https://pypi.org/project/biomedical-data-generator/
- Issue Tracker: https://github.com/sigrun-may/biomedical-data-generator/issues