Generate synthetic data for data science, machine learning, and statistics.

Biomedical Data Generator

Generate reproducible synthetic biomedical datasets with known ground truth for teaching, benchmarking, and method development in high-dimensional machine learning settings.


Why This Package?

Biomedical machine learning frequently operates in p >> n regimes, where the number of features p (thousands) far exceeds the number of samples n (dozens). This generator creates synthetic datasets that mimic real-world complexity while providing complete ground truth:

  • Teaching: Demonstrate cross-validation pitfalls, feature selection stability, and batch effect impacts
  • Benchmarking: Compare feature selection methods with known informative features
  • Research: Develop and validate new algorithms with controlled data properties
  • Reproducibility: Deterministic generation for consistent educational materials

Compared to generic ML generators such as sklearn.datasets.make_classification, this package adds biomedical-specific structure: class-specific correlations, explicit batch effects, and rich metadata that records the full generative process (informative features, noise, correlated clusters, batch labels, configuration).


Key Features

Class-specific correlations – Simulate pathway activation only in disease states
Batch effects – Model technical variation with controllable confounding
Correlated feature clusters – Equicorrelated and Toeplitz structures
Flexible class balance – Exact sample counts per class
Ground-truth metadata – Complete generative process documentation
scikit-learn compatible – Seamless integration with ML pipelines
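
To make the two cluster structures named above concrete, here is a minimal NumPy sketch of their standard textbook definitions, written independently of this package. In an equicorrelated matrix every pair of features shares the same correlation rho; in a Toeplitz (AR(1)-style) matrix the correlation decays as rho ** |i - j|. The function names are illustrative, and the exact parameterization used by the package may differ.

```python
import numpy as np


def equicorrelated(n_features: int, rho: float) -> np.ndarray:
    """Correlation matrix with 1 on the diagonal and rho everywhere else."""
    corr = np.full((n_features, n_features), rho, dtype=float)
    np.fill_diagonal(corr, 1.0)
    return corr


def toeplitz_ar1(n_features: int, rho: float) -> np.ndarray:
    """Correlation matrix whose (i, j) entry is rho ** |i - j|."""
    idx = np.arange(n_features)
    return rho ** np.abs(idx[:, None] - idx[None, :])


if __name__ == "__main__":
    print(equicorrelated(4, 0.5))
    print(toeplitz_ar1(4, 0.5))
```

In the Toeplitz case neighbouring features are strongly related while distant ones are nearly independent, which is a common stand-in for ordered measurements.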


Installation

pip install biomedical-data-generator

Minimum Requirements: Python 3.11+


Quick Start

Basic Dataset

from biomedical_data_generator import DatasetConfig, ClassConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=50, label="healthy"),
        ClassConfig(n_samples=50, label="diseased"),
    ],
    class_sep=1.5,
    random_state=42,
)

X, y, meta = generate_dataset(cfg)
print(f"Dataset shape: {X.shape}")
print(f"True informative features: {len(meta.informative_idx)}")

Here, y contains integer-encoded class labels (0, 1, ...).
If you provide human-readable labels via ClassConfig(label=...), these are stored in the metadata for later interpretation.

Class-Specific Correlations

Simulate biomarkers that only correlate in diseased patients:

from biomedical_data_generator import DatasetConfig, ClassConfig, CorrClusterConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=3,
    n_noise=5,
    class_configs=[
        ClassConfig(n_samples=100, label="healthy"),
        ClassConfig(n_samples=100, label="diseased"),
    ],
    corr_clusters=[
        CorrClusterConfig(
            n_cluster_features=6,
            correlation=0.2,            # baseline correlation
            class_correlation={1: 0.9}, # strong correlation in diseased class
            structure="equicorrelated",
            anchor_role="informative",
            anchor_effect_size="medium",
        )
    ],
    random_state=42,
)

X, y, meta = generate_dataset(cfg)
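
The idea behind class-specific correlations can be illustrated without the package. The following NumPy sketch is an assumption about the general technique, not the package's implementation: it draws one equicorrelated cluster per class via a Cholesky factor, using the baseline correlation (0.2) for one class and the override (0.9) for the other, and then checks the empirical correlations. The helper names `sample_cluster` and `mean_offdiag_corr` are hypothetical.

```python
import numpy as np


def sample_cluster(n_samples: int, n_features: int, rho: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Draw standard-normal features with pairwise correlation rho."""
    corr = np.full((n_features, n_features), rho, dtype=float)
    np.fill_diagonal(corr, 1.0)
    chol = np.linalg.cholesky(corr)  # corr == chol @ chol.T
    z = rng.standard_normal((n_samples, n_features))
    return z @ chol.T  # rows now have covariance `corr`


def mean_offdiag_corr(x: np.ndarray) -> float:
    """Average empirical correlation over all off-diagonal feature pairs."""
    c = np.corrcoef(x, rowvar=False)
    mask = ~np.eye(c.shape[0], dtype=bool)
    return float(c[mask].mean())


rng = np.random.default_rng(42)
healthy = sample_cluster(2000, 6, rho=0.2, rng=rng)   # baseline correlation
diseased = sample_cluster(2000, 6, rho=0.9, rng=rng)  # class-specific override

print(f"healthy:  {mean_offdiag_corr(healthy):.2f}")
print(f"diseased: {mean_offdiag_corr(diseased):.2f}")
```

The sample size is deliberately large here so the empirical correlations land close to their targets; with the 100 samples per class used above, estimates will scatter more.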

Batch Effects

Model recruitment bias and technical variation:

from biomedical_data_generator import DatasetConfig, ClassConfig, BatchEffectsConfig, generate_dataset

cfg = DatasetConfig(
    n_informative=5,
    n_noise=10,
    class_configs=[
        ClassConfig(n_samples=100, label="control"),
        ClassConfig(n_samples=100, label="disease"),
    ],
    batch_effects=BatchEffectsConfig(
        n_batches=3,
        effect_type="additive",
        effect_strength=0.5,
        confounding_with_class=0.7,  # recruitment bias
    ),
    random_state=42,
)

X, y, meta = generate_dataset(cfg)
print(f"Batch labels: {meta.batch_labels}")
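
One way to picture confounded batch effects is sketched below in plain NumPy. This is a simplified model under stated assumptions, not the package's algorithm: with probability `confounding` a sample lands in the "preferred" batch of its class, otherwise in a uniformly random batch, and each batch then adds its own per-feature offset scaled by `strength`. The functions `assign_batches` and `add_batch_shift` are hypothetical, and the package may define `confounding_with_class` and `effect_strength` differently.

```python
import numpy as np


def assign_batches(y, n_batches, confounding, rng):
    """With probability `confounding` use the class's preferred batch, else a uniform draw."""
    y = np.asarray(y)
    preferred = y % n_batches
    random_batch = rng.integers(0, n_batches, size=y.shape[0])
    use_preferred = rng.random(y.shape[0]) < confounding
    return np.where(use_preferred, preferred, random_batch)


def add_batch_shift(x, batches, n_batches, strength, rng):
    """Add one random offset per (batch, feature), scaled by `strength`."""
    offsets = strength * rng.standard_normal((n_batches, x.shape[1]))
    return x + offsets[batches]


rng = np.random.default_rng(42)
y = np.repeat([0, 1], 100)
x = rng.standard_normal((200, 15))
batches = assign_batches(y, n_batches=3, confounding=0.7, rng=rng)
x_shifted = add_batch_shift(x, batches, n_batches=3, strength=0.5, rng=rng)

# Cross-tabulate class (rows) against batch (columns) to see the imbalance.
table = np.zeros((2, 3), dtype=int)
np.add.at(table, (y, batches), 1)
print(table)
```

When the table is far from uniform, a classifier can partly predict the class from the batch offset alone, which is the pitfall this kind of configuration is meant to expose.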

Documentation

📖 Full documentation: https://sigrun-may.github.io/biomedical-data-generator/


Use Cases

Educational Applications

Ideal for teaching machine learning in biomedical contexts:

  • Feature selection stability across resampling splits
  • Cross-validation pitfalls in p >> n settings
  • Batch effect impacts on model generalization
  • Correlated features and interpretability challenges
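
Feature selection stability, the first topic in the list above, is often quantified as the mean pairwise Jaccard index of the feature sets chosen in different resampling splits. The sketch below uses only the standard library; the helper names and the example index sets are hypothetical and are not part of the package or its teaching materials.

```python
from itertools import combinations


def jaccard(a: set[int], b: set[int]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selections: list[set[int]]) -> float:
    """Mean pairwise Jaccard index over all pairs of selected-feature sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Feature indices chosen in three hypothetical resampling splits.
splits = [{0, 1, 2, 7}, {0, 1, 3, 7}, {0, 2, 5, 9}]
print(f"stability = {selection_stability(splits):.3f}")
```

A value of 1 means every split selected the same features; values near 0 indicate that the selection depends heavily on which samples happened to be drawn.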

The package is complemented by Jupyter-based teaching materials (OER) that guide learners through dataset generation, visualization, and evaluation.

Research & Benchmarking

Systematic method comparison with known ground truth:

  • Feature selection algorithm evaluation
  • Model performance under varying signal-to-noise ratios
  • Robustness testing with correlated features
  • Batch correction method validation

Scientific Context

Biomedical datasets present unique challenges:

  • High dimensionality: p >> n creates overfitting risks
  • Correlated features: Biological pathways create feature clusters
  • Batch effects: Multi-site and multi-batch studies introduce technical variation
  • Class imbalance: Disease prevalence varies widely

This generator provides realistic synthetic data that captures these properties while maintaining complete ground truth for validation. This is particularly useful when real datasets are too small, protected, or lack clear ground truth about causal vs. non-causal structure.


Architecture

The generator is implemented as a six-phase pipeline with single-responsibility modules:

  1. Label generation → Exact class counts (DatasetConfig.class_configs)
  2. Informative features → Class-separated signals
  3. Correlated clusters → Pathway-like structures with configurable correlation patterns
  4. Noise features → Independent distractors
  5. Assembly → Concatenation of all feature blocks into a single matrix
  6. Batch effects (optional) → Additive or multiplicative technical overlays, optionally confounded with class

Internally, the code is organized into dedicated modules for configuration, feature generation (informative, correlated, noise), batch effects, and metadata. A single random number generator drives the complete pipeline to ensure reproducibility.
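
The role of the single random number generator can be shown with a toy pipeline. The sketch below is not the package's code: it is a simplified stand-in that covers only label generation, informative features, noise features, and assembly, with a hypothetical function name and a plain dict in place of the metadata object. It illustrates why threading one `numpy.random.Generator` through every phase makes the output a deterministic function of the seed.

```python
import numpy as np


def toy_pipeline(class_counts, n_informative, n_noise, class_sep, seed):
    """Toy version of phases 1, 2, 4 and 5; clusters and batch effects are omitted."""
    rng = np.random.default_rng(seed)  # the only source of randomness

    # Phase 1: labels with exact class counts.
    y = np.repeat(np.arange(len(class_counts)), class_counts)

    # Phase 2: informative features, mean shifted by class index * class_sep.
    informative = rng.standard_normal((y.size, n_informative))
    informative += class_sep * y[:, None]

    # Phase 4: independent noise features.
    noise = rng.standard_normal((y.size, n_noise))

    # Phase 5: assemble the blocks and record which columns carry signal.
    x = np.hstack([informative, noise])
    meta = {
        "informative_idx": list(range(n_informative)),
        "noise_idx": list(range(n_informative, n_informative + n_noise)),
        "seed": seed,
    }
    return x, y, meta


x1, y1, meta1 = toy_pipeline([50, 50], n_informative=5, n_noise=10, class_sep=1.5, seed=42)
x2, _, _ = toy_pipeline([50, 50], n_informative=5, n_noise=10, class_sep=1.5, seed=42)
print(x1.shape, np.array_equal(x1, x2))
```

Because each phase consumes random numbers in a fixed order, inserting or reordering a phase changes the draws of every later phase, so a fixed phase order is part of what makes results reproducible.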

The returned DatasetMeta object provides:

  • Indices of informative features (e.g. meta.informative_idx)
  • Indices of pure-noise features
  • Indices or groupings of correlated feature clusters
  • Class and batch labels
  • A structured record of the configuration and random seeds used

This enables precise validation of feature selection and model behavior.
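
As a small illustration of such validation, the following standard-library sketch scores a set of selected feature indices against the known informative indices. The function `selection_scores` and the example values are hypothetical; in practice the ground truth would come from `meta.informative_idx` and the selection from the method under test.

```python
def selection_scores(selected, informative):
    """Return (precision, recall) of selected feature indices against the ground truth."""
    selected, informative = set(selected), set(informative)
    hits = len(selected & informative)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(informative) if informative else 0.0
    return precision, recall


# Hypothetical values standing in for meta.informative_idx and a selector's output.
informative = [0, 1, 2, 3, 4]
selected = [0, 1, 2, 9, 12, 14]
precision, recall = selection_scores(selected, informative)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision penalizes noise features that were selected by mistake, while recall penalizes informative features that were missed; reporting both avoids rewarding a method that simply selects everything.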


Examples

The examples/ directory contains complete demonstrations:

  • 01_basic_usage.py – Simple dataset generation
  • 02_batch_effects.py – Technical variation simulation
  • 03_class_specific_correlations.py – Disease-specific pathway activation
  • 04_feature_selection_stability.py – Benchmarking feature selection methods

Run any example:

python examples/01_basic_usage.py

Command-Line Interface

Generate datasets from YAML configuration:

bdg --config my_config.yaml --out dataset.csv

Example my_config.yaml:

n_informative: 5
n_noise: 10
class_configs:
  - n_samples: 50
    label: "control"
  - n_samples: 50
    label: "disease"
class_sep: 1.5
random_state: 42

Run bdg --help to see all available options.


Testing & Quality

The project includes a pytest-based test suite that covers:

  • Informative feature generation
  • Correlated feature clusters and target correlation structures
  • Batch effect configurations and label generation
  • The scikit-learn compatible interface

Tests are designed to ensure numerical stability, reproducibility, and consistency of the public API across releases.


Citation

If you use this package in scientific work, please cite:

@software{biomedical_data_generator,
  author       = {May, Sigrun},
  title        = {biomedical-data-generator: Synthetic biomedical data
                  generator for benchmarking and teaching},
  year         = {2025},
  url          = {https://github.com/sigrun-may/biomedical-data-generator},
  version      = {1.0.0}
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add tests for new functionality
  4. Ensure all tests pass (pytest)
  5. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.


License

This project is licensed under the MIT License – see LICENSE for details.


Acknowledgments

Developed at TU Braunschweig, TU Clausthal, and Ostfalia University with support from BMBF and the State of Lower Saxony.

The project fills gaps in existing synthetic data generators by providing:

  • A unified framework for class-specific correlations
  • Integrated batch effect simulation
  • An educational focus with extensive documentation
  • Complete ground truth metadata for validation
