MedSynth: Medical Synthetic Data Generator with Privacy-Preserving Synthesis
MedSynth is a Python package for generating synthetic medical images with configurable privacy protection. It creates realistic CT scans using statistical methods without requiring machine learning or GPU resources.
Key Features
✨ Three Generation Modes:
- Pure Synthetic - Generate CT scans from scratch using procedural methods
- Augmentation - Standard augmentation of real CTs (rotation, scaling, noise)
- Privacy-Preserving Synthesis - Template-based synthesis with statistical privacy protection using Multi-Scale Statistical Texture Synthesis (MS-STS)
🔒 Privacy Protection (Mode 3 only):
- Low mutual information (empirically < 1.8 bits on tested datasets)
- Synthetic intensity remapping via gradient-preserving transformations
- Designed to reduce re-identification risk
📊 Multiple Output Formats:
- DICOM (medical imaging standard)
- NRRD (3D visualization)
- OMOP CDM (healthcare data standard)
⚡ Computational Efficiency:
- No GPU or deep learning required
- Runs on standard workstations
- Approximately 67 seconds per 178-slice CT volume (hardware-dependent)
Performance Benchmarks
Evaluated on a single case from the NSCLC-Radiomics dataset (November 26, 2025):
Privacy-Preserving Synthesis Mode (Optimized Parameters)
| Metric | Value | Interpretation |
|---|---|---|
| SSIM (body region) | 0.7880 | Moderate structural similarity |
| SSIM (lung region) | 0.9527 | High structural similarity |
| Mutual Information | 1.08 bits | Low information leakage |
| Generation Speed | 67 sec/volume | Single-threaded on M-series Mac |
| Voxel Remapping | 100% | All intensities transformed |
Optimized Parameters (15-iteration empirical search):
- Frequency domain cutoff: 0.5456
- Point spread function blur: 0.40
- Texture noise standard deviation: 5.5449
- Edge enhancement strength: 0.25
Note: Performance may vary on different datasets, scanners, and protocols. These benchmarks represent single-case optimization and should be validated on your specific data.
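The search procedure behind these values is not described here. As a rough illustration only, a 15-iteration random search over the four parameters might look like the sketch below. The search ranges, the `toy_score` objective, and the function names are hypothetical and are not taken from MedSynth's code.

```python
import random

# Hypothetical search ranges; the ranges actually used are not documented here.
SEARCH_SPACE = {
    "freq_cutoff": (0.1, 0.9),
    "psf_blur_sigma": (0.1, 1.0),
    "texture_noise_std": (1.0, 10.0),
    "edge_enhancement_strength": (0.0, 0.5),
}

def sample_params(rng):
    """Draw one parameter set uniformly from SEARCH_SPACE."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}

def random_search(score, n_iter=15, seed=0):
    """Return the best (params, score) pair found in n_iter random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        value = score(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score

def toy_score(params):
    """Placeholder objective that peaks at an arbitrary point.

    A real objective would generate a volume and combine a utility metric
    (such as SSIM) with a privacy metric (such as mutual information).
    """
    return -((params["freq_cutoff"] - 0.5) ** 2
             + (params["psf_blur_sigma"] - 0.4) ** 2)

best, value = random_search(toy_score)
print(best, value)
```

With only 15 draws in four dimensions, the result is a coarse estimate, which is consistent with the advice above to re-validate on your own data.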
Installation
From PyPI
pip install medsynth
From Source
git clone https://github.com/ankurlohachab/medsynth.git
cd medsynth
pip install -e .
Development Installation
pip install -e ".[dev]"
pytest tests/
Quick Start
Command Line Interface
1. Pure Synthetic Generation
medsynth \
    --num-subjects 10 \
    --output-dir ./output/pure_synthetic/
2. Augmentation Mode
medsynth \
    --augment ./path/to/real/ct/dicom_folder/ \
    --num-subjects 5 \
    --output-dir ./output/augmented/
⚠️ Note: Augmentation mode does NOT provide privacy protection - use for data augmentation only.
3. Privacy-Preserving Synthesis Mode
medsynth \
    --privacy-synthesis \
    --augment ./path/to/real/ct/dicom_folder/ \
    --num-subjects 10 \
    --output-dir ./output/privacy_synthesis/
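For batch runs, the three invocations above can be assembled from Python. The helper below is a hypothetical convenience wrapper, not part of MedSynth; it only uses the flags shown in the examples above and assumes that privacy synthesis always needs an input folder passed via `--augment`, as in example 3.

```python
import shlex

def build_medsynth_command(output_dir, num_subjects, augment=None,
                           privacy_synthesis=False):
    """Assemble a medsynth argument list from the flags shown above."""
    if privacy_synthesis and augment is None:
        # Assumption: mode 3 is template-based and needs a real CT folder.
        raise ValueError("privacy synthesis needs an input CT folder (--augment)")
    cmd = ["medsynth"]
    if privacy_synthesis:
        cmd.append("--privacy-synthesis")
    if augment is not None:
        cmd += ["--augment", str(augment)]
    cmd += ["--num-subjects", str(num_subjects), "--output-dir", str(output_dir)]
    return cmd

cmd = build_medsynth_command(
    "./output/privacy_synthesis/", 10,
    augment="./path/to/real/ct/dicom_folder/",
    privacy_synthesis=True,
)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when paths contain spaces.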
Python API
from medsynth.config import Config
from medsynth.pipeline import SyntheticCTPipeline

# Configure privacy-synthesis
config = Config(
    num_subjects=10,
    privacy_synth_mode=True,  # Privacy-synthesis using MS-STS
    augmentation_input="./path/to/real/ct",
    output_root="./output/privacy_synthesis"
)

# Generate dataset
pipeline = SyntheticCTPipeline(config)
pipeline.generate_dataset()
Generation Modes Comparison
1. Pure Synthetic
Purpose: Generate CT scans from procedural noise without real data input.
Characteristics:
- No real patient data required
- Full control over anatomical features and pathology
- Quality suitable for algorithm development and testing
- No privacy concerns (no patient data involved)
Limitations:
- May lack some real-world anatomical variations
- Texture patterns are synthetic
2. Augmentation
Purpose: Standard data augmentation for machine learning training.
Characteristics:
- Applies geometric transformations (rotation, scaling)
- Adds controlled noise
- High fidelity to original (SSIM typically > 0.95)
- Fast processing (~10 seconds per volume)
⚠️ Privacy Warning:
- Does NOT provide privacy protection
- Retains original voxel intensity patterns
- Should only be used when privacy is not a concern
- Not suitable for data sharing under HIPAA/GDPR
3. Privacy-Preserving Synthesis (Recommended for Sharing)
Purpose: Generate privacy-protected synthetic versions of real CT scans.
Method: Multi-Scale Statistical Texture Synthesis (MS-STS)
Process:
- Separates low-frequency anatomy from high-frequency texture
- Applies gradient-preserving intensity remapping to all tissue types
- Replaces texture with synthetic scanner-realistic noise
- Preserves anatomical structure while transforming intensities
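The steps above can be sketched on a single 2D slice with NumPy. This is a simplified illustration under stated assumptions, not the MS-STS implementation: the cutoff is interpreted as a fraction of the Nyquist frequency, "gradient-preserving" is read as a strictly increasing intensity map, white Gaussian noise stands in for scanner-realistic texture, and the PSF blur and edge enhancement steps are omitted. All function names are hypothetical.

```python
import numpy as np

def split_frequencies(slice_2d, cutoff=0.5):
    """Split a slice into low- and high-frequency parts with an ideal low-pass."""
    fy = np.fft.fftfreq(slice_2d.shape[0])[:, None]
    fx = np.fft.fftfreq(slice_2d.shape[1])[None, :]
    radius = np.sqrt(fx**2 + fy**2) / 0.5  # 1.0 corresponds to Nyquist on an axis
    mask = radius <= cutoff
    low = np.fft.ifft2(np.fft.fft2(slice_2d) * mask).real
    return low, slice_2d - low

def monotone_remap(values, hu_range=(-1024.0, 3071.0), n_knots=8, seed=0):
    """Remap intensities with a random, strictly increasing piecewise-linear curve.

    A monotone map keeps the ordering of intensities, so edges keep their
    direction even though every value changes.
    """
    rng = np.random.default_rng(seed)
    lo, hi = hu_range
    knots_in = np.linspace(lo, hi, n_knots)
    steps = rng.uniform(0.5, 1.5, size=n_knots - 1)  # positive increments
    cumulative = np.concatenate([[0.0], np.cumsum(steps)])
    knots_out = lo + cumulative * (hi - lo) / steps.sum()
    return np.interp(values, knots_in, knots_out)

def synthesize_slice(slice_2d, cutoff=0.5, noise_std=5.0, seed=0):
    """Keep remapped low-frequency anatomy and replace texture with noise."""
    rng = np.random.default_rng(seed)
    low, _discarded_texture = split_frequencies(slice_2d, cutoff)
    remapped = monotone_remap(low, seed=seed)
    return remapped + rng.normal(0.0, noise_std, size=slice_2d.shape)

# Toy slice: air background with a soft-tissue square plus fine texture.
rng = np.random.default_rng(1)
img = np.full((64, 64), -1000.0)
img[16:48, 16:48] = 40.0
img += rng.normal(0.0, 20.0, size=img.shape)
out = synthesize_slice(img)
print(out.shape)
```

The original high-frequency component is discarded rather than transformed, so none of the original texture values are carried into the output; the anatomy survives through the low-frequency part only.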
Measured Performance (NSCLC-Radiomics test case):
- Body region SSIM: 0.7880 (moderate similarity)
- Lung region SSIM: 0.9527 (high similarity in diagnostic regions)
- Mutual Information: 1.08 bits (below the 1.8-bit target)
Privacy Considerations:
- Aims to reduce mutual information below 1.8 bits
- All voxel intensities undergo statistical transformation
- Preserves anatomical topology
- Trade-off between privacy protection and diagnostic utility
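For readers unfamiliar with the metric, mutual information between an original and a synthetic volume can be estimated from a joint intensity histogram. The sketch below shows one common estimator; the function name and bin count are assumptions, and MedSynth's own estimator may differ, so values from this sketch are not directly comparable to the figures quoted here.

```python
import numpy as np

def mutual_information_bits(a, b, bins=64):
    """Histogram estimate of mutual information (in bits) between two arrays."""
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    pxy = joint / joint.sum()              # joint probability table
    px = pxy.sum(axis=1, keepdims=True)    # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of b, shape (1, bins)
    nonzero = pxy > 0                      # skip empty cells to avoid log(0)
    ratio = pxy[nonzero] / (px @ py)[nonzero]
    return float(np.sum(pxy[nonzero] * np.log2(ratio)))

rng = np.random.default_rng(0)
x = rng.uniform(size=100_000)
y = rng.uniform(size=100_000)
print(mutual_information_bits(x, x, bins=16))  # identical inputs: high MI
print(mutual_information_bits(x, y, bins=16))  # independent inputs: near zero
```

The estimate depends on the bin count and sample size, so a threshold such as 1.8 bits is only meaningful when the same binning is used for every comparison.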
Limitations:
- Privacy protection is empirical, not cryptographic
- May not prevent all re-identification attacks
- Should be combined with other de-identification methods
- Requires validation for your specific use case
Output Formats
DICOM
Standard medical imaging format compatible with PACS systems.
medsynth --generate-dicom --output-dir ./output/
NRRD
3D volumetric format for visualization tools (3D Slicer, ITK-SNAP).
medsynth --generate-nrrd --output-dir ./output/
OMOP CDM
Healthcare data standard for multi-institutional studies.
medsynth --generate-omop --output-dir ./output/
Configuration
Custom Parameters
from medsynth.config import Config, VolumeConfig

config = Config(
    num_subjects=50,
    random_seed=42,
    privacy_synth_mode=True,
    augmentation_input="./input/ct/",
    volume=VolumeConfig(
        volume_shape=(178, 512, 512),
        spacing=(5.0, 0.976, 0.976),
        hu_range=(-1024, 3071),
        # MS-STS parameters (optimized values)
        privacy_synth_freq_cutoff=0.5456,
        privacy_synth_psf_blur_sigma=0.40,
        privacy_synth_texture_noise_std=5.5449,
        privacy_synth_edge_enhancement_strength=0.25,
    )
)
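As a quick sanity check on the geometry settings, the physical extent and memory footprint of one volume follow directly from `volume_shape` and `spacing`. The sketch assumes both tuples are in (z, y, x) order with spacing in millimetres, and that voxels are stored as 16-bit integers; none of this is confirmed by the configuration itself.

```python
import math

volume_shape = (178, 512, 512)   # voxels, assumed (z, y, x) order
spacing = (5.0, 0.976, 0.976)    # mm per voxel, same axis order (assumed)
bytes_per_voxel = 2              # assumption: int16 storage

extent_mm = tuple(n * s for n, s in zip(volume_shape, spacing))
n_voxels = math.prod(volume_shape)
size_mib = n_voxels * bytes_per_voxel / 2**20

print(extent_mm)   # physical size along each axis in mm
print(n_voxels)    # 46661632
print(size_mib)    # 89.0
```

Under these assumptions, a 50-subject run produces roughly 4.3 GiB of raw voxel data before any format overhead.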
Quality Metrics
from medsynth.metrics import evaluate_synthetic_ct

# real_ct_volume, synthetic_ct_volume, body_mask, and lung_mask are arrays
# you load or compute beforehand; they are not defined in this snippet.
results = evaluate_synthetic_ct(
    original=real_ct_volume,
    synthetic=synthetic_ct_volume,
    body_mask=body_mask,
    lung_mask=lung_mask,
    spacing=(5.0, 0.976, 0.976)
)

print(f"SSIM (body): {results['image_quality']['ssim_body']:.4f}")
print(f"PSNR: {results['image_quality']['psnr_body']:.2f} dB")
print(f"Mutual Information: {results['privacy']['mutual_information_body']:.3f} bits")
Available Metrics:
- Image Quality: SSIM, PSNR, MS-SSIM, NMSE
- Clinical Utility: SNR, CNR, Edge Sharpness
- Privacy Analysis: Mutual Information, Histogram Distance, Texture Analysis
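Two of the image quality metrics are simple enough to write out. The sketch below computes PSNR and NMSE restricted to a mask, using standard definitions; the function names are hypothetical, and the default `data_range` of 4095 is derived from the `hu_range=(-1024, 3071)` setting shown under Configuration. MedSynth's own implementations may use different conventions.

```python
import numpy as np

def masked_psnr(original, synthetic, mask, data_range=4095.0):
    """PSNR in dB over the voxels where mask is True."""
    diff = original[mask].astype(np.float64) - synthetic[mask].astype(np.float64)
    mse = np.mean(diff**2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(data_range**2 / mse))

def masked_nmse(original, synthetic, mask):
    """Squared error inside the mask, normalized by the original's energy."""
    ref = original[mask].astype(np.float64)
    diff = ref - synthetic[mask].astype(np.float64)
    return float(np.sum(diff**2) / np.sum(ref**2))

# Toy volumes: a constant offset of 10 HU inside a full mask.
original = np.full((2, 8, 8), 100.0)
synthetic = original + 10.0
mask = np.ones(original.shape, dtype=bool)
print(masked_psnr(original, synthetic, mask))
print(masked_nmse(original, synthetic, mask))
```

Restricting the computation to a body or lung mask keeps the large, uniform air background from dominating the score.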
Examples
See examples/ directory:
- example_pure_synthetic.py - Generate from scratch
- example_augmentation.py - Augment existing CTs
- example_privacy_synthesis.py - Privacy-preserving synthesis
Testing
pytest tests/ -v
pytest tests/ --cov=medsynth --cov-report=html
Citations & Data Usage
Dataset Citation (Required)
This package was developed and tested using the NSCLC-Radiomics dataset. Users of this data must abide by the TCIA Data Usage Policy and include the following citation:
Aerts, H. J. W. L., Wee, L., Rios Velazquez, E., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M. M., Leemans, C. R., Dekker, A., Quackenbush, J., Gillies, R. J., Lambin, P. (2014). Data From NSCLC-Radiomics (version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI
Software Citation
If you use MedSynth in your research, please cite:
@software{medsynth2025,
  title = {MedSynth: Medical Synthetic Data Generator with Privacy-Preserving Synthesis},
  author = {Lohachab, Ankur},
  year = {2025},
  month = {11},
  version = {1.0.0},
  url = {https://github.com/ankurlohachab/medsynth}
}
Privacy & Security Disclaimer
Important: This software provides empirical privacy protection through statistical methods, NOT cryptographic guarantees.
- Privacy-preserving synthesis aims to reduce mutual information and re-identification risk
- Protection level depends on data characteristics, parameters, and adversary capabilities
- Should be combined with other de-identification methods (e.g., metadata removal, expert review)
- Not a substitute for proper de-identification workflows
- Users are responsible for compliance with applicable regulations (HIPAA, GDPR, etc.)
- Validation recommended for each specific use case and dataset
Not approved for clinical use. For research purposes only.
Requirements
- Python ≥ 3.8
- NumPy ≥ 1.21
- SciPy ≥ 1.7
- SimpleITK ≥ 2.1
- PyDICOM ≥ 2.3
- scikit-image ≥ 0.19
- Pandas ≥ 1.3
- Pydantic ≥ 2.0
- pynrrd ≥ 1.0
Roadmap
- Web interface
- Cloud deployment
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Development Transparency
AI-Assisted Development Disclosure: AI-assisted development tools were used exclusively for routine tasks such as syntactic error correction, formatting, and generating descriptive comments.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Acknowledgments
- Optimization Dataset: NSCLC-Radiomics Collection from The Cancer Imaging Archive (TCIA)
- Method: Multi-Scale Statistical Texture Synthesis (MS-STS) with gradient-preserving remapping
- Optimization: 15-iteration parameter search conducted November 2025
Author: Ankur Lohachab
Affiliation: Department of Advanced Computing Sciences, Maastricht University
Contact: ankur.lohachab@maastrichtuniversity.nl
Date: November 26, 2025
Version: 1.0.0