Generate synthetic data for data science, machine learning, and statistics.

Project description

Biomedical Data Generator

Generate reproducible synthetic biomedical datasets with known ground truth for teaching, benchmarking, and method development in high-dimensional machine learning settings.

Why This Package?

Biomedical machine learning often operates in p >> n regimes, with thousands of features but only dozens of samples. This generator creates synthetic datasets that mimic real-world complexity while providing complete ground truth:

Teaching: Demonstrate cross-validation pitfalls, feature selection stability, and batch effect impacts
Benchmarking: Compare feature selection methods with known informative features
Research: Develop and validate new algorithms with controlled data properties
Reproducibility: Deterministic generation for consistent educational materials

Compared to generic ML generators such as sklearn.datasets.make_classification, this package adds biomedical-specific structure: class-specific correlations, explicit batch effects, and rich metadata that records the full generative process (informative features, noise, correlated clusters, batch labels, configuration).

Key Features

✅ Class-specific correlations – Simulate pathway activation only in disease states
✅ Batch effects – Model technical variation with controllable confounding
✅ Correlated feature clusters – Equicorrelated and Toeplitz structures
✅ Flexible class balance – Exact sample counts per class
✅ Ground-truth metadata – Complete generative process documentation
✅ scikit-learn compatible – Seamless integration with ML pipelines
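
The two cluster structures named above can be illustrated with plain correlation matrices. The following is a minimal sketch, not the package's implementation: it assumes that "equicorrelated" means one shared off-diagonal value and that "Toeplitz" means the common decaying form rho^|i-j|. Both function names are hypothetical.

```python
def equicorrelated(p: int, rho: float) -> list[list[float]]:
    """p x p matrix with 1.0 on the diagonal and rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def toeplitz_decay(p: int, rho: float) -> list[list[float]]:
    """p x p matrix whose (i, j) entry is rho ** |i - j| (assumed form)."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


# Print both structures for a small cluster of four features.
for row in equicorrelated(4, 0.5):
    print(row)
for row in toeplitz_decay(4, 0.5):
    print(row)
```

In the equicorrelated case every pair of features is equally related, while in the decaying Toeplitz case neighbouring features are related more strongly than distant ones.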

Installation

pip install biomedical-data-generator

Minimum Requirements: Python 3.11+

Quick Start

Basic Dataset

from biomedical_data_generator import DatasetConfig, ClassConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=50, label="healthy"),
        ClassConfig(n_samples=50, label="diseased"),
    ],
    class_sep=1.5,
    random_state=42,
)

X, y, meta = generate_dataset(cfg)
print(f"Dataset shape: {X.shape}")
print(f"True informative features: {len(meta.informative_idx)}")

Here, y contains integer-encoded class labels (0, 1, ...).
If you provide human-readable labels via ClassConfig(label=...), these are stored in the metadata for later interpretation.
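
Mapping the integer codes back to readable names can be sketched as follows. This assumes that the codes follow the order of class_configs (0 for the first entry); the list of names is written out by hand here because the metadata attribute that stores the labels is not named above.

```python
# Assumption: integer codes follow the order of class_configs (0 = first entry).
label_names = ["healthy", "diseased"]

# Stand-in for the y returned by generate_dataset; any sequence of ints works.
y = [0, 0, 1, 1, 0]

# Translate each integer code into its human-readable label.
decoded = [label_names[code] for code in y]
print(decoded)
```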

Class-Specific Correlations

Simulate biomarkers that are strongly correlated only in diseased patients:

from biomedical_data_generator import DatasetConfig, ClassConfig, CorrClusterConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=3,
    n_noise=5,
    class_configs=[
        ClassConfig(n_samples=100, label="healthy"),
        ClassConfig(n_samples=100, label="diseased"),
    ],
    corr_clusters=[
        CorrClusterConfig(
            n_cluster_features=6,
            correlation=0.2,            # baseline correlation
            class_correlation={1: 0.9}, # strong correlation in diseased class
            structure="equicorrelated",
            anchor_role="informative",
            anchor_effect_size="medium",
        )
    ],
    random_state=42,
)

X, y, meta = generate_dataset(cfg)
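
To see what a class-specific correlation means numerically, the sketch below draws equicorrelated features with a one-factor construction and compares the empirical correlation in two classes. It is a standalone illustration using only the standard library and does not claim to reproduce how the package generates its clusters; the helper names are hypothetical.

```python
import math
import random


def equicorrelated_samples(n, p, rho, rng):
    """Draw n rows of p standard-normal features with pairwise correlation rho.

    One-factor construction: x_j = sqrt(rho) * z + sqrt(1 - rho) * e_j,
    where z is shared across the row and e_j is feature-specific noise.
    """
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        rows.append([
            math.sqrt(rho) * z + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ])
    return rows


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(42)
healthy = equicorrelated_samples(2000, 6, 0.2, rng)   # baseline correlation
diseased = equicorrelated_samples(2000, 6, 0.9, rng)  # class-specific correlation

# Compare the first two cluster features within each class.
r_healthy = pearson([row[0] for row in healthy], [row[1] for row in healthy])
r_diseased = pearson([row[0] for row in diseased], [row[1] for row in diseased])
print(f"healthy: {r_healthy:.2f}, diseased: {r_diseased:.2f}")
```

The same per-class check can be run on a generated dataset by splitting the rows of X by y and correlating two columns of a cluster.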

Batch Effects

Model recruitment bias and technical variation:

from biomedical_data_generator import DatasetConfig, ClassConfig, BatchEffectsConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=100, label="control"),
        ClassConfig(n_samples=100, label="disease"),
    ],
    batch_effects=BatchEffectsConfig(
        n_batches=3,
        effect_type="additive",
        effect_strength=0.5,
        confounding_with_class=0.7,  # recruitment bias
    ),
    random_state=42,
)

X, y, meta = generate_dataset(cfg)
print(f"Batch labels: {meta.batch_labels}")
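
One way to picture an additive batch effect that is confounded with class is sketched below. The assignment rule (preferred batch with probability `confounding`, otherwise uniform) and the per-batch offsets are assumptions made for illustration; the package's exact interpretation of effect_strength and confounding_with_class may differ.

```python
import random


def assign_batches(y, n_batches, confounding, rng):
    """Assign a batch to each sample.

    With probability `confounding` a sample lands in its class's preferred
    batch (class index modulo n_batches); otherwise the batch is uniform.
    """
    batches = []
    for label in y:
        if rng.random() < confounding:
            batches.append(label % n_batches)
        else:
            batches.append(rng.randrange(n_batches))
    return batches


def add_batch_offsets(X, batches, offsets):
    """Return a copy of X with offsets[b] added to every feature of batch b."""
    return [[value + offsets[b] for value in row] for row, b in zip(X, batches)]


rng = random.Random(42)
y = [0] * 100 + [1] * 100
batches = assign_batches(y, n_batches=3, confounding=0.7, rng=rng)

# One additive offset per batch, drawn with standard deviation 0.5.
offsets = [rng.gauss(0.0, 0.5) for _ in range(3)]
X = [[0.0, 0.0] for _ in y]  # placeholder feature matrix
X_shifted = add_batch_offsets(X, batches, offsets)

# Fraction of samples that ended up in their class's preferred batch.
share = sum(b == label for b, label in zip(batches, y)) / len(y)
print(f"share of samples in their class's preferred batch: {share:.2f}")
```

Because batch membership now carries information about the class, a model can pick up the batch offset instead of the biological signal, which is the confounding problem this configuration is meant to expose.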

Documentation

📖 Full documentation: https://sigrun-may.github.io/biomedical-data-generator/

Use Cases

Educational Applications

Ideal for teaching machine learning in biomedical contexts:

Feature selection stability across resampling splits
Cross-validation pitfalls in p >> n settings
Batch effect impacts on model generalization
Correlated features and interpretability challenges

The package is complemented by Jupyter-based teaching materials, provided as open educational resources (OER), which guide learners through dataset generation, visualization, and evaluation.
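
Feature selection stability, the first topic listed above, is often summarized as the mean pairwise Jaccard index of the feature sets selected on different resampling splits. The sketch below shows that one common measure; it is an illustration rather than part of the package, and the selected indices are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two feature-index sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard index over the selections from all splits."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature indices selected on three resampling splits.
selections = [{0, 1, 2, 7}, {0, 1, 3, 7}, {0, 2, 7, 9}]
print(f"stability: {selection_stability(selections):.2f}")
```

A value of 1.0 means every split selected the same features; values near 0 indicate that the selection changes almost completely between splits.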

Research & Benchmarking

Systematic method comparison with known ground truth:

Feature selection algorithm evaluation
Model performance under varying signal-to-noise ratios
Robustness testing with correlated features
Batch correction method validation
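
With known informative features, a feature selection method can be scored directly against the ground truth. The sketch below computes precision and recall over feature indices; the helper name is hypothetical and the index lists are stand-ins for meta.informative_idx and for the output of the method under evaluation.

```python
def selection_scores(selected, informative):
    """Precision and recall of selected feature indices against ground truth."""
    selected, informative = set(selected), set(informative)
    hits = len(selected & informative)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(informative) if informative else 0.0
    return precision, recall


# Stand-ins: in practice `informative` would come from meta.informative_idx
# and `selected` from the feature selection method under evaluation.
informative = [0, 1, 2, 3, 4]
selected = [0, 1, 4, 9]

precision, recall = selection_scores(selected, informative)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```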

Scientific Context

Biomedical datasets present unique challenges:

High dimensionality: p >> n creates overfitting risks
Correlated features: Biological pathways create feature clusters
Batch effects: Multi-site and multi-batch studies introduce technical variation
Class imbalance: Disease prevalence varies widely

This generator provides realistic synthetic data that captures these properties while maintaining complete ground truth for validation. This is particularly useful when real datasets are too small, protected, or lack clear ground truth about causal vs. non-causal structure.

Architecture

The generator is implemented as a six-phase pipeline with single-responsibility modules:

1. Label generation → Exact class counts (DatasetConfig.class_configs)
2. Informative features → Class-separated signals
3. Correlated clusters → Pathway-like structures with configurable correlation patterns
4. Noise features → Independent distractors
5. Assembly → Concatenation of all feature blocks into a single matrix
6. Batch effects (optional) → Additive or multiplicative technical overlays, optionally confounded with class

Internally, the code is organized into dedicated modules for configuration, feature generation (informative, correlated, noise), batch effects, and metadata. A single random number generator drives the complete pipeline to ensure reproducibility.
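
The idea of a phased pipeline driven by a single random number generator can be sketched in a few lines. The toy function below is hypothetical and covers only four of the phases (labels, informative features, noise features, assembly); correlated clusters and batch effects are left out for brevity, and placing the informative columns first is an assumption of this sketch, not a statement about the package's column order.

```python
import random


def generate_toy_dataset(class_counts, n_informative, n_noise, class_sep, seed):
    """Toy pipeline: labels -> informative block -> noise block -> assembly."""
    rng = random.Random(seed)  # the only source of randomness

    # Phase 1: labels with exact per-class counts.
    y = [label for label, count in enumerate(class_counts) for _ in range(count)]

    # Phase 2: informative features, shifted by class_sep per class index.
    informative = [
        [rng.gauss(class_sep * label, 1.0) for _ in range(n_informative)]
        for label in y
    ]

    # Phase 3: independent noise features with no class signal.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(n_noise)] for _ in y]

    # Phase 4: assembly; informative columns come first in this sketch.
    X = [inf_row + noise_row for inf_row, noise_row in zip(informative, noise)]
    informative_idx = list(range(n_informative))
    return X, y, informative_idx


# The same seed yields the same dataset on every call.
X1, y1, idx1 = generate_toy_dataset([50, 50], 5, 10, class_sep=1.5, seed=42)
X2, y2, idx2 = generate_toy_dataset([50, 50], 5, 10, class_sep=1.5, seed=42)
print(len(X1), len(X1[0]), X1 == X2)
```

Because every phase draws from the same generator in a fixed order, the seed alone determines the whole dataset, which is what makes the ground truth reproducible.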

The returned DatasetMeta object provides:

Indices of informative features (e.g. meta.informative_idx)
Indices of pure-noise features
Indices or groupings of correlated feature clusters
Class and batch labels
A structured record of the configuration and random seeds used

This enables precise validation of feature selection and model behavior.

Examples

The examples/ directory contains complete demonstrations:

01_basic_usage.py – Simple dataset generation
02_batch_effects.py – Technical variation simulation
03_class_specific_correlations.py – Disease-specific pathway activation
04_feature_selection_stability.py – Benchmarking feature selection methods

Run any example:

python examples/01_basic_usage.py

Command-Line Interface

Generate datasets from a YAML configuration file:

bdg --config my_config.yaml --out dataset.csv

Example my_config.yaml:

n_informative: 5
n_noise: 10
class_configs:
  - n_samples: 50
    label: "control"
  - n_samples: 50
    label: "disease"
class_sep: 1.5
random_state: 42

Run bdg --help to see all available options.

Testing & Quality

The project includes a pytest-based test suite that covers:

Informative feature generation
Correlated feature clusters and target correlation structures
Batch effect configurations and label generation
The scikit-learn compatible interface

Tests are designed to ensure numerical stability, reproducibility, and consistency of the public API across releases.

Citation

If you use this package in scientific work, please cite:

@software{biomedical_data_generator,
  author       = {May, Sigrun},
  title        = {biomedical-data-generator: Synthetic biomedical data
                  generator for benchmarking and teaching},
  year         = {2025},
  url          = {https://github.com/sigrun-may/biomedical-data-generator},
  version      = {1.0.0}
}

Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Add tests for new functionality
4. Ensure all tests pass (pytest)
5. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

License

This project is licensed under the MIT License – see LICENSE for details.

Acknowledgments

Developed at TU Braunschweig, TU Clausthal, and Ostfalia University with support from BMBF and the State of Lower Saxony.

The project fills gaps in existing synthetic data generators by providing:

A unified framework for class-specific correlations
Integrated batch effect simulation
An educational focus with extensive documentation
Complete ground truth metadata for validation

Project details

Maintainers

Sigrun-May

Release history

1.0.0 – Nov 30, 2025 (this version)
0.1.4 – Oct 6, 2025
0.1.3 – Sep 30, 2025

Download files

Source Distribution

biomedical_data_generator-1.0.0.tar.gz (48.3 kB)

Uploaded Nov 30, 2025 Source

Built Distribution

biomedical_data_generator-1.0.0-py3-none-any.whl (55.1 kB)

Uploaded Nov 30, 2025 Python 3

File details

Details for the file biomedical_data_generator-1.0.0.tar.gz.

File metadata

Download URL: biomedical_data_generator-1.0.0.tar.gz
Upload date: Nov 30, 2025
Size: 48.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for biomedical_data_generator-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a249953a2dead01096f7935334e2cfd7ec312d5f0359274d19d461a015442f2d`
MD5	`39873dc1a01376d0466f46634e6aafdc`
BLAKE2b-256	`27e4e9f2a72ce19597537618264820479af2cf82beb10d4bf30b3fbce7025a2c`

Provenance

The following attestation bundles were made for biomedical_data_generator-1.0.0.tar.gz:

Publisher: pypi_upload.yml on sigrun-may/biomedical-data-generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: biomedical_data_generator-1.0.0.tar.gz
- Subject digest: a249953a2dead01096f7935334e2cfd7ec312d5f0359274d19d461a015442f2d
- Sigstore transparency entry: 731934379
- Sigstore integration time: Nov 30, 2025
Source repository:
- Permalink: sigrun-may/biomedical-data-generator@27323730d64fa561cc490d01b5b3a117d1f65b6a
- Branch / Tag: refs/tags/1.0.0
- Owner: https://github.com/sigrun-may
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_upload.yml@27323730d64fa561cc490d01b5b3a117d1f65b6a
- Trigger Event: push

File details

Details for the file biomedical_data_generator-1.0.0-py3-none-any.whl.

File metadata

Download URL: biomedical_data_generator-1.0.0-py3-none-any.whl
Upload date: Nov 30, 2025
Size: 55.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for biomedical_data_generator-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`866350838af293e1085d29af1df0ec497dbe6679c4d2786b54fccfe0f70125b7`
MD5	`44a792b58178d957294c89ea31c44562`
BLAKE2b-256	`7d6ae10e77262e4101500d18574e833d6743c76a623c7d423572f40c025d2092`

Provenance

The following attestation bundles were made for biomedical_data_generator-1.0.0-py3-none-any.whl:

Publisher: pypi_upload.yml on sigrun-may/biomedical-data-generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: biomedical_data_generator-1.0.0-py3-none-any.whl
- Subject digest: 866350838af293e1085d29af1df0ec497dbe6679c4d2786b54fccfe0f70125b7
- Sigstore transparency entry: 731934380
- Sigstore integration time: Nov 30, 2025
Source repository:
- Permalink: sigrun-may/biomedical-data-generator@27323730d64fa561cc490d01b5b3a117d1f65b6a
- Branch / Tag: refs/tags/1.0.0
- Owner: https://github.com/sigrun-may
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_upload.yml@27323730d64fa561cc490d01b5b3a117d1f65b6a
- Trigger Event: push
