MedSynth: Medical Synthetic Data Generator with Privacy-Preserving Synthesis

Python 3.8+ | License: MIT | Code style: black

MedSynth is a Python package for generating synthetic medical images with configurable privacy protection. It creates realistic CT scans using statistical methods without requiring machine learning or GPU resources.

Key Features

Three Generation Modes:

  1. Pure Synthetic - Generate CT scans from scratch using procedural methods
  2. Augmentation - Standard augmentation of real CTs (rotation, scaling, noise)
  3. Privacy-Preserving Synthesis - Template-based synthesis with statistical privacy protection using Multi-Scale Statistical Texture Synthesis (MS-STS)

🔒 Privacy Protection (Mode 3 only):

  • Low mutual information (target < 1.8 bits; 1.08 bits measured on the NSCLC-Radiomics test case)
  • Synthetic intensity remapping via gradient-preserving transformations
  • Designed to reduce re-identification risk

📊 Multiple Output Formats:

  • DICOM (medical imaging standard)
  • NRRD (3D visualization)
  • OMOP CDM (healthcare data standard)

Computational Efficiency:

  • No GPU or deep learning required
  • Runs on standard workstations
  • Approximately 67 seconds per 178-slice CT volume (hardware-dependent)

Performance Benchmarks

Evaluated on the NSCLC-Radiomics dataset (November 26, 2025):

Privacy-Preserving Synthesis Mode (Optimized Parameters)

| Metric | Value | Interpretation |
| --- | --- | --- |
| SSIM (body region) | 0.7880 | Moderate structural similarity |
| SSIM (lung region) | 0.9527 | High structural similarity |
| Mutual Information | 1.08 bits | Low information leakage |
| Generation Speed | 67 sec/volume | Single-threaded on M-series Mac |
| Voxel Remapping | 100% | All intensities transformed |

Optimized Parameters (15-iteration empirical search):

  • Frequency domain cutoff: 0.5456
  • Point spread function blur: 0.40
  • Texture noise standard deviation: 5.5449
  • Edge enhancement strength: 0.25

Note: Performance may vary across datasets, scanners, and protocols. These benchmarks come from parameter optimization on a single case and should be validated on your own data.


Installation

From PyPI

pip install medsynth

From Source

git clone https://github.com/ankurlohachab/medsynth.git
cd medsynth
pip install -e .

Development Installation

pip install -e ".[dev]"
pytest tests/

Quick Start

Command Line Interface

1. Pure Synthetic Generation

medsynth \
  --num-subjects 10 \
  --output-dir ./output/pure_synthetic/

2. Augmentation Mode

medsynth \
  --augment ./path/to/real/ct/dicom_folder/ \
  --num-subjects 5 \
  --output-dir ./output/augmented/

⚠️ Note: Augmentation mode does NOT provide privacy protection - use for data augmentation only.

3. Privacy-Preserving Synthesis Mode

medsynth \
  --privacy-synthesis \
  --augment ./path/to/real/ct/dicom_folder/ \
  --num-subjects 10 \
  --output-dir ./output/privacy_synthesis/

Python API

from medsynth.config import Config
from medsynth.pipeline import SyntheticCTPipeline

# Configure privacy-synthesis
config = Config(
    num_subjects=10,
    privacy_synth_mode=True,  # Privacy-synthesis using MS-STS
    augmentation_input="./path/to/real/ct",
    output_root="./output/privacy_synthesis"
)

# Generate dataset
pipeline = SyntheticCTPipeline(config)
pipeline.generate_dataset()

Generation Modes Comparison

1. Pure Synthetic

Purpose: Generate CT scans from procedural noise without real data input.

Characteristics:

  • No real patient data required
  • Full control over anatomical features and pathology
  • Quality suitable for algorithm development and testing
  • No privacy concerns (no patient data involved)

Limitations:

  • May lack some real-world anatomical variations
  • Texture patterns are synthetic

2. Augmentation

Purpose: Standard data augmentation for machine learning training.

Characteristics:

  • Applies geometric transformations (rotation, scaling)
  • Adds controlled noise
  • High fidelity to original (SSIM typically > 0.95)
  • Fast processing (~10 seconds per volume)
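The transformations listed above can be sketched with NumPy and SciPy. This is an illustration only, not MedSynth's augmentation code: the function name augment_volume, its parameters, and the default values are hypothetical, and the volume is assumed to be indexed as (slice, row, column).

```python
import numpy as np
from scipy import ndimage


def augment_volume(volume, angle_deg=5.0, scale=1.05, noise_std=10.0, seed=0):
    """Rotate in-plane, scale in-plane, and add Gaussian noise (illustrative only)."""
    rng = np.random.default_rng(seed)
    volume = np.asarray(volume, dtype=np.float64)
    fill = float(volume.min())  # assumed background value for voxels moved in from outside

    # Rotate every axial slice by the same angle; the output shape is kept.
    rotated = ndimage.rotate(
        volume, angle_deg, axes=(1, 2), reshape=False,
        order=1, mode="constant", cval=fill,
    )

    # Scale about the volume center. affine_transform maps output coordinates
    # to input coordinates, so magnifying by `scale` uses 1/scale in the matrix.
    center = (np.array(volume.shape) - 1) / 2.0
    matrix = np.diag([1.0, 1.0 / scale, 1.0 / scale])
    offset = center - matrix @ center
    scaled = ndimage.affine_transform(
        rotated, matrix, offset=offset, order=1, mode="constant", cval=fill,
    )

    # Additive Gaussian noise with a fixed seed for reproducibility.
    return scaled + rng.normal(0.0, noise_std, size=volume.shape)


# Toy input standing in for a CT volume.
toy_ct = np.random.default_rng(1).normal(-500.0, 200.0, size=(4, 16, 16))
augmented = augment_volume(toy_ct)
```

Because the output is a smooth geometric transform of the input plus small noise, the original intensity patterns remain largely recoverable, which is why this mode carries the privacy warning below.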

⚠️ Privacy Warning:

  • Does NOT provide privacy protection
  • Retains original voxel intensity patterns
  • Should only be used when privacy is not a concern
  • Not suitable for data sharing under HIPAA/GDPR

3. Privacy-Preserving Synthesis (Recommended for Sharing)

Purpose: Generate privacy-protected synthetic versions of real CT scans.

Method: Multi-Scale Statistical Texture Synthesis (MS-STS)

Process:

  1. Separates low-frequency anatomy from high-frequency texture
  2. Applies gradient-preserving intensity remapping to all tissue types
  3. Replaces texture with synthetic scanner-realistic noise
  4. Preserves anatomical structure while transforming intensities
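The steps above can be illustrated with a toy sketch. It is not the MS-STS implementation shipped with MedSynth: the function name toy_texture_resynthesis, the Gaussian low-pass split, the power-law remap (standing in for the gradient-preserving transformation), and all default values are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage


def toy_texture_resynthesis(volume, sigma=2.0, gamma=0.9, noise_std=5.5, seed=0):
    """Toy frequency split, monotonic remap, and texture replacement."""
    rng = np.random.default_rng(seed)
    volume = np.asarray(volume, dtype=np.float64)

    # 1. Low-frequency "anatomy": Gaussian low-pass of the input.
    #    The high-frequency residual (volume - anatomy) is discarded.
    anatomy = ndimage.gaussian_filter(volume, sigma=sigma)

    # 2. Monotonic intensity remap: a power law on the normalized range keeps
    #    the ordering of intensities while changing every interior value.
    lo, hi = anatomy.min(), anatomy.max()
    span = hi - lo if hi > lo else 1.0
    remapped = lo + ((anatomy - lo) / span) ** gamma * span

    # 3. Synthetic texture: white noise, lightly blurred so that neighboring
    #    voxels are correlated (the blur also lowers the effective noise std).
    noise = ndimage.gaussian_filter(
        rng.normal(0.0, noise_std, size=volume.shape), sigma=0.4
    )

    # 4. Anatomy with transformed intensities plus new texture.
    return remapped + noise


toy_input = np.random.default_rng(1).normal(0.0, 100.0, size=(6, 16, 16))
toy_output = toy_texture_resynthesis(toy_input)
```

The sketch only shows the shape of the pipeline; it makes no claim about the privacy or fidelity figures reported for the real method.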

Measured Performance (NSCLC-Radiomics test case):

  • Body region SSIM: 0.7880 (moderate similarity)
  • Lung region SSIM: 0.9527 (high similarity in diagnostic regions)
  • Mutual Information: 1.08 bits (below the 1.8-bit target)

Privacy Considerations:

  • Aims to reduce mutual information below 1.8 bits
  • All voxel intensities undergo statistical transformation
  • Preserves anatomical topology
  • Trade-off between privacy protection and diagnostic utility
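Mutual information between an original and a synthetic volume can be estimated from a joint intensity histogram. The following is a minimal sketch using NumPy only; the function name mutual_information_bits and the bin count are hypothetical, and because histogram estimates depend on binning and sample size, its values are not directly comparable to the figures reported by medsynth.metrics.

```python
import numpy as np


def mutual_information_bits(a, b, bins=64):
    """Histogram-based estimate of mutual information, in bits."""
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    pxy = joint / joint.sum()              # joint probability table
    px = pxy.sum(axis=1, keepdims=True)    # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of b, shape (1, bins)
    nonzero = pxy > 0                      # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))


rng = np.random.default_rng(0)
original = rng.normal(size=(8, 32, 32))
unrelated = rng.normal(size=(8, 32, 32))

mi_self = mutual_information_bits(original, original)        # equals the binned entropy
mi_unrelated = mutual_information_bits(original, unrelated)  # small, but biased above zero
```

A volume compared with itself gives the highest value the estimator can return for that binning, while statistically independent volumes give a value near zero; a privacy-oriented synthesis should land much closer to the latter.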

Limitations:

  • Privacy protection is empirical, not cryptographic
  • May not prevent all re-identification attacks
  • Should be combined with other de-identification methods
  • Requires validation for your specific use case

Output Formats

DICOM

Standard medical imaging format compatible with PACS systems.

medsynth --generate-dicom --output-dir ./output/

NRRD

3D volumetric format for visualization tools (3D Slicer, ITK-SNAP).

medsynth --generate-nrrd --output-dir ./output/

OMOP CDM

Healthcare data standard for multi-institutional studies.

medsynth --generate-omop --output-dir ./output/

Configuration

Custom Parameters

from medsynth.config import Config, VolumeConfig

config = Config(
    num_subjects=50,
    random_seed=42,
    privacy_synth_mode=True,
    augmentation_input="./input/ct/",
    volume=VolumeConfig(
        volume_shape=(178, 512, 512),
        spacing=(5.0, 0.976, 0.976),
        hu_range=(-1024, 3071),
        # MS-STS parameters (optimized values)
        privacy_synth_freq_cutoff=0.5456,
        privacy_synth_psf_blur_sigma=0.40,
        privacy_synth_texture_noise_std=5.5449,
        privacy_synth_edge_enhancement_strength=0.25,
    )
)
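As a quick sanity check on the shape and spacing values above, the physical extent and memory footprint of one volume can be worked out directly. This sketch assumes the axis order is (slice, row, column), that spacing is in millimetres, and that voxels are stored as 16-bit integers; none of these is stated explicitly by the configuration.

```python
import numpy as np

volume_shape = (178, 512, 512)     # values from the configuration example
spacing_mm = (5.0, 0.976, 0.976)   # assumed to be millimetres per voxel

# Physical extent along each axis: number of voxels times voxel spacing.
extent_mm = tuple(n * s for n, s in zip(volume_shape, spacing_mm))

# Voxel count and in-memory size for an assumed int16 array.
n_voxels = int(np.prod(volume_shape))
size_mib = n_voxels * np.dtype(np.int16).itemsize / 2**20

print(f"extent (mm): {extent_mm}")
print(f"voxels: {n_voxels}, int16 size: {size_mib:.1f} MiB")
```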

Quality Metrics

from medsynth.metrics import evaluate_synthetic_ct

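# real_ct_volume, synthetic_ct_volume, body_mask, and lung_mask are placeholders
# for arrays you have already loaded (assumed here to share one shape).
results = evaluate_synthetic_ct(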
    original=real_ct_volume,
    synthetic=synthetic_ct_volume,
    body_mask=body_mask,
    lung_mask=lung_mask,
    spacing=(5.0, 0.976, 0.976)
)

print(f"SSIM (body): {results['image_quality']['ssim_body']:.4f}")
print(f"PSNR (body): {results['image_quality']['psnr_body']:.2f} dB")
print(f"Mutual Information: {results['privacy']['mutual_information_body']:.3f} bits")

Available Metrics:

  • Image Quality: SSIM, PSNR, MS-SSIM, NMSE
  • Clinical Utility: SNR, CNR, Edge Sharpness
  • Privacy Analysis: Mutual Information, Histogram Distance, Texture Analysis
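For reference, two of the image quality metrics listed above have short closed forms. The sketch below uses the standard definitions of PSNR and NMSE with NumPy; the function names are hypothetical and this is not the code in medsynth.metrics, which may differ in masking and normalization.

```python
import numpy as np


def psnr_db(reference, test, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    diff = np.asarray(reference, dtype=np.float64) - np.asarray(test, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(data_range ** 2 / mse))


def nmse(reference, test):
    """Normalized mean squared error: sum of squared error over reference energy."""
    reference = np.asarray(reference, dtype=np.float64)
    diff = reference - np.asarray(test, dtype=np.float64)
    return float(np.sum(diff ** 2) / np.sum(reference ** 2))


# Toy example: a constant image with a constant offset of 10.
ref = np.full((4, 4), 100.0)
shifted = ref + 10.0

# data_range of 4095 corresponds to an HU range of (-1024, 3071).
toy_psnr = psnr_db(ref, shifted, data_range=4095.0)
toy_nmse = nmse(ref, shifted)
```

Restricting either metric to a region of interest amounts to indexing both arrays with the same boolean mask before calling the function.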

Examples

See the examples/ directory:

  • example_pure_synthetic.py - Generate from scratch
  • example_augmentation.py - Augment existing CTs
  • example_privacy_synthesis.py - Privacy-preserving synthesis

Testing

pytest tests/ -v
pytest tests/ --cov=medsynth --cov-report=html

Citations & Data Usage

Dataset Citation (Required)

This package was developed and tested using the NSCLC-Radiomics dataset. Users of this data must abide by the TCIA Data Usage Policy and include the following citation:

Aerts, H. J. W. L., Wee, L., Rios Velazquez, E., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M. M., Leemans, C. R., Dekker, A., Quackenbush, J., Gillies, R. J., Lambin, P. (2014). Data From NSCLC-Radiomics (version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI

Software Citation

If you use MedSynth in your research, please cite:

@software{medsynth2025,
  title = {MedSynth: Medical Synthetic Data Generator with Privacy-Preserving Synthesis},
  author = {Lohachab, Ankur},
  year = {2025},
  month = {11},
  version = {1.0.0},
  url = {https://github.com/ankurlohachab/medsynth}
}

Privacy & Security Disclaimer

Important: This software provides empirical privacy protection through statistical methods, NOT cryptographic guarantees.

  • Privacy-preserving synthesis aims to reduce mutual information and re-identification risk
  • Protection level depends on data characteristics, parameters, and adversary capabilities
  • Should be combined with other de-identification methods (e.g., metadata removal, expert review)
  • Not a substitute for proper de-identification workflows
  • Users are responsible for compliance with applicable regulations (HIPAA, GDPR, etc.)
  • Validation recommended for each specific use case and dataset

Not approved for clinical use. For research purposes only.


Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.21
  • SciPy ≥ 1.7
  • SimpleITK ≥ 2.1
  • PyDICOM ≥ 2.3
  • scikit-image ≥ 0.19
  • Pandas ≥ 1.3
  • Pydantic ≥ 2.0
  • pynrrd ≥ 1.0

Roadmap

  • Web interface
  • Cloud deployment

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Development Transparency

AI-Assisted Development Disclosure: AI-assisted development tools were used exclusively for routine tasks such as syntactic error correction, formatting, and generating descriptive comments.


License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Optimization Dataset: NSCLC-Radiomics Collection from The Cancer Imaging Archive (TCIA)
  • Method: Multi-Scale Statistical Texture Synthesis (MS-STS) with gradient-preserving remapping
  • Optimization: 15-iteration parameter search conducted November 2025

Author: Ankur Lohachab

Affiliation: Department of Advanced Computing Sciences, Maastricht University

Contact: ankur.lohachab@maastrichtuniversity.nl

Date: November 26, 2025

Version: 1.0.0
