Synthetic patient record generator (Synthea-inspired) trained on pristine-healthy episode data

🩺 syntha

A Synthea-inspired hybrid synthetic patient record generator. Learns the joint distribution of real anonymized Turkish-cohort EHR episodes with a Gaussian copula, then layers Synthea-style clinical pathways on top to emit fully-coded FHIR R4 bundles in Turkish.

License: Apache 2.0 | Python 3.10+ | FHIR R4 | Locale: tr-TR


What it is

syntha is a Python library, command-line tool, and signed cross-platform desktop app for generating realistic synthetic patient records (flat CSVs and FHIR R4 transaction Bundles) that match the statistical structure of an anonymized Turkish-cohort EHR while staying physiologically valid and clinically coded.

The pipeline is hybrid:

  1. Gaussian copula fitted on real anonymized episodes: preserves marginal distributions (age, labs, vitals, comorbidity prevalence) and their joint correlation structure.
  2. Physiologic filter: rejects samples that violate pulse-pressure, Friedewald lipid coherence, or eGFR ↔ creatinine constraints.
  3. Synthea-style clinical modules: nine condition-specific state activations that emit Encounters, MedicationRequests (RxNorm-coded), Procedures, and CarePlans matching each patient's comorbidity profile.
  4. FHIR R4 export: Patient + Observation + Condition + Encounter + MedicationRequest + Procedure + CarePlan + DiagnosticReport + RiskAssessment + FamilyMemberHistory, dual-coded LOINC / SNOMED CT / ICD-10 / RxNorm, Turkish locale (names, addresses, language code, display text).
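Step 1 can be illustrated with a minimal Gaussian copula for continuous columns only. This is a sketch under stated assumptions, not syntha's implementation: the function names are hypothetical, the toy columns (age, systolic BP) are invented, and the mixed-type handling of binary flags that the project describes is left out.

```python
import numpy as np
from scipy import stats


def fit_gaussian_copula(data: np.ndarray) -> np.ndarray:
    """Estimate the copula correlation matrix from continuous columns."""
    n = data.shape[0]
    # Rank-transform each column into (0, 1), then map to standard-normal scores.
    u = stats.rankdata(data, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, seed=0):
    """Draw correlated normals, then map back through empirical quantiles."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = stats.norm.cdf(z)
    # The inverse empirical CDF per column preserves each marginal's shape.
    cols = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(cols)


# Toy source data (invented): systolic BP rises with age.
_rng = np.random.default_rng(42)
age = _rng.normal(45, 12, 2000)
sbp = 90 + 0.6 * age + _rng.normal(0, 8, 2000)
source = np.column_stack([age, sbp])

corr = fit_gaussian_copula(source)
synthetic = sample_gaussian_copula(source, corr, n_samples=1000, seed=1)
```

Because samples are pulled back through the empirical quantiles, synthetic values never leave the observed range of each column, which is one reason a copula of this kind reproduces the bulk of a distribution better than its tails.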

Desktop app

Download macOS Apple Silicon (.dmg) | Download Windows installer (.exe) | Download Linux AppImage

A Tauri 2 app bundling the trained Gaussian copula. Pick a cohort, n, seed, and constraints; the app samples synthetic patients fully client-side (no Python at runtime) and downloads a CSV. The macOS DMG is Developer-ID signed, notarized, and stapled. The Windows installer is code-signed. All three OSes ship a minisign-signed auto-updater, so existing installs get an in-app upgrade banner on next launch.

Install URLs auto-resolve to the latest release via releases/latest/download/…, so no per-version link maintenance is needed.

Install

# PyPI
pip install syntha-ehr

# Or from source
git clone https://github.com/ArioMoniri/syntha
cd syntha
pip install -e ".[dev]"

# Or Docker
docker pull ghcr.io/ariomoniri/syntha:latest

Quick start

# Generate 1 000 synthetic episodes + FHIR bundles + model card + validation report
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant \
  --n 1000 --cohort tolerant

# Longitudinal: multiple encounters per patient with shared HASTA_ID
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant_long \
  --n 2000 --cohort tolerant \
  --longitudinal --encounters-per-patient 4 --years-of-history 3

# Validate a synthetic CSV against the source it was trained on
syntha validate \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/validation.json

# Run a privacy audit (MIA + AIA)
syntha audit \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/privacy.json

By default the CSV writer drops 29 source-pipeline curation flags (pristine_*, berturk_*, drug-safety filters, rf_*). Those are training metadata, not clinical observations, and most are degenerate (constant 0 or 1) in the pristine cohort. Pass --curation-flags to keep them for QA work.
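The default drop can be pictured as a prefix filter over column names. The sketch below is an assumption about the mechanism, not the project's code: `drop_curation_flags` is a hypothetical helper, only the three prefixes named above are covered (the drug-safety filter columns have no stated prefix), and the example column names are invented.

```python
import pandas as pd

# Prefixes taken from the text above; drug-safety filter columns are not covered.
CURATION_PREFIXES = ("pristine_", "berturk_", "rf_")


def drop_curation_flags(df: pd.DataFrame, keep: bool = False) -> pd.DataFrame:
    """Remove training-metadata columns unless keep=True (mirrors --curation-flags)."""
    if keep:
        return df
    flagged = [c for c in df.columns if c.startswith(CURATION_PREFIXES)]
    return df.drop(columns=flagged)


# Invented example frame with one clinical column and three curation flags.
frame = pd.DataFrame(
    {"age": [41, 57], "pristine_strict": [1, 1], "berturk_score": [0, 0], "rf_kanser": [0, 1]}
)
clinical_only = drop_curation_flags(frame)
```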

What it produces

For every synthetic patient, syntha emits a FHIR R4 transaction Bundle:

| Resource | Coding | What |
|---|---|---|
| 👤 Patient | none | Turkish HumanName + Address (ISO 3166-2:TR province), communication.language = tr, derived birthDate |
| 🧪 Observation ×~12 | LOINC | Labs (glucose, full lipid panel, CBC, LFTs, eGFR/creatinine, ferritin, B12) + vitals (BP) |
| 🩺 Condition ×N | SNOMED CT + ICD-10 | Every active comorbidity, dual-coded, with English + clinical-Turkish display |
| 🏥 Encounter ×M | SNOMED CT | One per active condition, fired by the relevant module |
| 💊 MedicationRequest ×P | RxNorm | First-line therapy per condition, with dosage |
| 🔬 Procedure ×Q | SNOMED CT | HbA1c, lipid panel, ECG, spirometry, etc. |
| 📋 CarePlan ×R | SNOMED CT | Disease-specific lifestyle + monitoring plans |
| 📊 DiagnosticReport | LOINC | Lipid, CBC, CMP, iron, BP panels grouping their constituent Observations |
| 🎯 RiskAssessment | SNOMED CT | Charlson Comorbidity Index |
| 👪 FamilyMemberHistory | SNOMED CT | When rf_kanser / rf_kronik_hastalik are set |

…plus a flat CSV matching the input schema (minus the 29 dropped curation flags) for drop-in use as training data, a JSON model card with the source_sha256 and marginals, and a validation report.
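To check what a generated Bundle actually contains, it is enough to tally `resourceType` across its entries. The sketch below assumes only the standard FHIR Bundle layout (`entry[].resource`); the helper names and the tiny inline bundle are illustrative, not real syntha output.

```python
import json
from collections import Counter


def resource_counts(bundle: dict) -> Counter:
    """Count resources by type inside one FHIR Bundle."""
    return Counter(e["resource"]["resourceType"] for e in bundle.get("entry", []))


def read_ndjson(text: str) -> list:
    """Parse NDJSON: one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative stand-in for one line of an NDJSON bundle file.
ndjson_text = json.dumps(
    {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {"resource": {"resourceType": "Patient"}},
            {"resource": {"resourceType": "Observation"}},
            {"resource": {"resourceType": "Observation"}},
            {"resource": {"resourceType": "Condition"}},
        ],
    }
)

bundles = read_ndjson(ndjson_text + "\n")
counts = resource_counts(bundles[0])
```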

Distribution fidelity

A 100-episode sample of tolerant vs the full 135 569-row source:

| Metric | Value |
|---|---|
| n source / synthetic | 135 569 / 100 |
| Max Kolmogorov–Smirnov (KS) statistic across continuous columns | 0.14 |
| Mean KS | 0.07 |
| Max binary-prevalence error | 0.025 (has_rx_data) |
| Disease-prevalence error (HTN / DM / hyperlipidemia) | 0.015 / 0.004 / 0.010 |
| Spearman correlation Frobenius diff | 2.94 |
| Fraction of synthetic patients with all labs in reference range | reported per cohort in validation_report.json |
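For readers who want to reproduce metrics of this kind on their own data, the sketch below computes per-column two-sample KS statistics and the Frobenius norm of the difference between Spearman correlation matrices. It is an independent illustration using SciPy and NumPy; how `syntha validate` computes its numbers internally is not shown in this document, and `fidelity_report` is a hypothetical name.

```python
import numpy as np
from scipy import stats


def spearman_matrix(x: np.ndarray) -> np.ndarray:
    """Spearman correlation equals Pearson correlation of per-column ranks."""
    return np.corrcoef(stats.rankdata(x, axis=0), rowvar=False)


def fidelity_report(source: np.ndarray, synthetic: np.ndarray) -> dict:
    """Summarize marginal (KS) and dependence (Spearman) agreement."""
    ks = [
        stats.ks_2samp(source[:, j], synthetic[:, j]).statistic
        for j in range(source.shape[1])
    ]
    diff = spearman_matrix(source) - spearman_matrix(synthetic)
    return {
        "max_ks": float(max(ks)),
        "mean_ks": float(np.mean(ks)),
        "spearman_frobenius": float(np.linalg.norm(diff, ord="fro")),
    }
```

Note that the Frobenius norm sums squared differences over every matrix cell, so its magnitude grows with the number of columns and is best compared between runs on the same schema.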

Marginals

[Figure: marginal distributions, source vs synthetic]

Spearman correlation structure

[Figure: Spearman correlations, source vs synthetic vs diff]

Disease prevalence

[Figure: comorbidity prevalence, source vs synthetic]

FHIR endpoints

# Spin up a local read-only FHIR R4 server
syntha serve --bundles examples/sample_output/sample_bundles.ndjson --port 8080

# Then:
curl http://127.0.0.1:8080/metadata           # CapabilityStatement
curl http://127.0.0.1:8080/Patient            # search-set Bundle
curl http://127.0.0.1:8080/Patient/{id}
curl http://127.0.0.1:8080/\$export           # Bulk Data NDJSON

scripts/post_to_fhir.sh posts every transaction Bundle in an NDJSON file to any FHIR R4 endpoint (default: the public HAPI test server).
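In FHIR, a transaction Bundle is submitted by POSTing it to the server's base URL. The sketch below builds one such request per NDJSON line with the standard library; it is not a port of scripts/post_to_fhir.sh, whose contents are not shown here, and the base URL is a placeholder.

```python
import json
import urllib.request


def build_transaction_requests(ndjson_lines, base_url):
    """Yield one POST request per transaction Bundle found in the NDJSON lines."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        bundle = json.loads(line)
        if bundle.get("resourceType") != "Bundle" or bundle.get("type") != "transaction":
            raise ValueError("expected a FHIR transaction Bundle")
        yield urllib.request.Request(
            base_url.rstrip("/"),  # transactions go to the server base URL
            data=json.dumps(bundle).encode("utf-8"),
            headers={"Content-Type": "application/fhir+json"},
            method="POST",
        )


# Placeholder endpoint; sending is left to the caller:
# for req in build_transaction_requests(open("sample_bundles.ndjson"), "https://fhir.example.org/baseR4"):
#     with urllib.request.urlopen(req) as resp:
#         print(resp.status)
```

Separating request construction from sending keeps the validation step testable without touching a network.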

Turkish cohort + Turkish output

The trained models bundled with the desktop app and the example output come from pristine_strict_episodes.csv and pristine_tolerant_episodes.csv, which hold anonymized retrospective EHR episodes from a Turkish patient cohort selected to represent clinically pristine adults. The source CSVs themselves are gitignored and never redistributed.

The output is Turkish-localized:

  • Patient names sampled from Turkish given-name and family-name distributions (src/syntha/locale/turkish.py).
  • Addresses use Turkish cities weighted by approximate population with ISO 3166-2:TR province codes.
  • Every Condition carries both an English SNOMED display and a clinical-Turkish translation in Condition.code.text.
  • Patient.communication.language is tr.

All clinical terminology used (LOINC, SNOMED CT, ICD-10, RxNorm) comes from open international standards. No licensed terminology content is embedded.
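A dual-coded Condition of the shape described above might look like the following. This is a hand-written sketch: `make_condition` is a hypothetical helper, the resource omits elements a full export would carry, and the example uses the hypertension codes (SNOMED CT 38341003, ICD-10 I10) purely for illustration.

```python
def make_condition(patient_id, snomed, icd10, display_en, text_tr):
    """Build a minimal FHIR R4 Condition with SNOMED CT + ICD-10 codings."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [
                {"system": "http://snomed.info/sct", "code": snomed, "display": display_en},
                {"system": "http://hl7.org/fhir/sid/icd-10", "code": icd10},
            ],
            # Clinical-Turkish translation lives in code.text, as described above.
            "text": text_tr,
        },
    }


condition = make_condition(
    patient_id="example-1",  # placeholder id
    snomed="38341003",
    icd10="I10",
    display_en="Hypertensive disorder",
    text_tr="Hipertansiyon",
)
```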

Synthea-style clinical modules

Nine modules ship out of the box (src/syntha/modules/); each fires on its corresponding comorbidity flag.

| Module | Source flag(s) | Emits |
|---|---|---|
| 🫀 Hypertension | Hipertansiyon | Encounter, 1–2 antihypertensives (stage 2 → dual), CarePlan |
| 🍬 Diabetes | DM_Tum, DM_Komplikasyonlu | Encounter, HbA1c, metformin (+ insulin if severe), CarePlan |
| 🧀 Hyperlipidemia | Hiperlipidemi | Encounter, lipid panel, statin (high-intensity if LDL ≥ 190) |
| 🦋 Thyroid | Tiroid | Encounter, TSH, levothyroxine |
| 😔 Depression | Depresyon | Psych encounter, sertraline, CBT CarePlan |
| 😰 Anxiety | Anksiyete | Psych encounter, escitalopram (or buspirone if already on an SSRI) |
| ❤️ Ischemic heart disease | Iskemik_Kalp | Cardiology encounter, ECG, aspirin + β-blocker + statin |
| 🌬️ Asthma | Astim | Resp encounter, spirometry, SABA + ICS |
| 🚭 COPD | COPD | Resp encounter, spirometry, LABA + SABA |

Module authoring guide: docs/MODULES.md.
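As a rough picture of what "fires on its comorbidity flag" means, here is a toy hyperlipidemia module. The real module interface is documented in docs/MODULES.md and is not reproduced here; the function signature, the `ldl_mg_dl` key, and the specific statin doses are assumptions for illustration, and the resource stubs omit coding and other required elements.

```python
def hyperlipidemia_module(patient: dict) -> list[dict]:
    """Return FHIR-like resource stubs for one patient (hypothetical interface)."""
    if not patient.get("Hiperlipidemi"):
        return []  # the module only fires when its comorbidity flag is set
    ref = {"reference": f"Patient/{patient['id']}"}
    # Threshold from the table above: high-intensity statin if LDL >= 190.
    high_intensity = patient.get("ldl_mg_dl", 0.0) >= 190
    statin = "atorvastatin 80 mg" if high_intensity else "atorvastatin 20 mg"
    return [
        {"resourceType": "Encounter", "status": "finished", "subject": ref},
        {"resourceType": "Procedure", "status": "completed", "subject": ref,
         "code": {"text": "Lipid panel"}},
        {"resourceType": "MedicationRequest", "status": "active", "intent": "order",
         "subject": ref, "medicationCodeableConcept": {"text": statin}},
    ]


resources = hyperlipidemia_module({"id": "p1", "Hiperlipidemi": 1, "ldl_mg_dl": 204.0})
```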

Architecture

┌──────────────┐    ┌──────────────────┐    ┌──────────────────────┐
│  Source CSV  │───▶│ Gaussian copula  │───▶│ Physiologic filter   │
│ (Turkish     │    │ (mixed-type ρ;   │    │ (BP, Friedewald,     │
│  pristine)   │    │ nearest-PSD)     │    │  eGFR ↔ creatinine)  │
└──────────────┘    └──────────────────┘    └─────────┬────────────┘
                                                      │
                                  ┌───────────────────┴─────────────────────┐
                                  │                                         │
                                  ▼                                         ▼
                       ┌──────────────────┐                  ┌──────────────────────────┐
                       │ Longitudinal     │   (optional)     │  Single-encounter CSV +  │
                       │ expansion        │ ────────────────▶│  FHIR R4 export with     │
                       │ (drift, Poisson) │                  │  module activation       │
                       └─────────┬────────┘                  └──────────────────────────┘
                                 │
                                 ▼
                         (same FHIR export)
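The physiologic-filter stage can be sketched as a boolean mask over sampled rows. Two of the three named checks are shown: a pulse-pressure range and agreement between reported LDL and the Friedewald estimate (LDL ≈ total cholesterol − HDL − triglycerides / 5, in mg/dL). The column names, the 20 to 100 mmHg window, and the 15 mg/dL tolerance are assumptions for illustration, and the eGFR ↔ creatinine check is omitted.

```python
import pandas as pd


def physiologic_mask(df: pd.DataFrame, ldl_tol: float = 15.0) -> pd.Series:
    """True for rows that pass both coherence checks (thresholds are assumed)."""
    # Pulse pressure = systolic minus diastolic; must be positive and plausible.
    pulse_pressure = df["sbp"] - df["dbp"]
    bp_ok = pulse_pressure.between(20, 100)
    # Friedewald estimate of LDL in mg/dL.
    friedewald_ldl = df["total_chol"] - df["hdl"] - df["triglycerides"] / 5.0
    lipid_ok = (df["ldl"] - friedewald_ldl).abs() <= ldl_tol
    return bp_ok & lipid_ok


# Invented rows: coherent, inverted BP, and incoherent LDL.
samples = pd.DataFrame(
    {
        "sbp": [120, 90, 130],
        "dbp": [80, 95, 85],
        "total_chol": [200, 190, 180],
        "hdl": [50, 55, 60],
        "triglycerides": [150, 100, 100],
        "ldl": [118, 115, 160],
    }
)
accepted = samples[physiologic_mask(samples)]
```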

Full math (mixed-type correlation, nearest-PSD projection, conditional missingness, AR(1) lab drift): docs/ARCHITECTURE.md.
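A correlation matrix assembled pairwise from mixed-type columns is not guaranteed to be positive semidefinite, which is why a projection step is needed before sampling. The sketch below uses simple eigenvalue clipping followed by rescaling to a unit diagonal. It is one common repair, named plainly as such; it is not necessarily the projection described in docs/ARCHITECTURE.md, and the example matrix is invented.

```python
import numpy as np


def clip_to_psd_correlation(corr: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Repair an indefinite correlation matrix by clipping negative eigenvalues."""
    sym = (corr + corr.T) / 2.0  # enforce exact symmetry first
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, eps, None)
    psd = (eigvecs * clipped) @ eigvecs.T
    # Rescale so the diagonal is exactly 1 again (still PSD under this congruence).
    d = np.sqrt(np.diag(psd))
    return psd / np.outer(d, d)


# Invented, mutually inconsistent pairwise correlations (one eigenvalue is -0.8).
bad = np.array([[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]])
fixed = clip_to_psd_correlation(bad)
```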

CLI reference

| Command | What |
|---|---|
| syntha generate | End-to-end: train copula → sample → modules → CSV + FHIR + model card + validation |
| syntha fit | Fit and persist a copula in a registry without sampling |
| syntha sample | Raw sampling from a registered model |
| syntha sample-conditional | AST-validated rejection sampling against a pandas filter expression |
| syntha fhir | Convert an existing synthetic CSV to FHIR R4 bundles |
| syntha validate | KS / Wasserstein / correlation diff + reference-range coverage |
| syntha audit | Privacy audit (membership-inference + attribute-inference) |
| syntha serve | Read-only FHIR R4 demo server |
| syntha export-model | Export a registered copula to v2 JSON for the desktop app |
| syntha list-models, show-card | Inspect the registry |

Run syntha <cmd> --help for full option lists.
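The idea behind AST-validated rejection sampling can be sketched as follows: parse the filter expression, allow only comparisons and boolean logic over known column names, then keep drawing batches until enough rows satisfy it. The whitelist, the function names, and the toy sampler are assumptions; the options of `syntha sample-conditional` itself are not shown here.

```python
import ast

import numpy as np
import pandas as pd

# Node types permitted in a filter expression (an assumed, conservative whitelist).
ALLOWED = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)


def validate_filter(expr: str, columns) -> str:
    """Reject anything beyond comparisons and boolean logic on known columns."""
    for node in ast.walk(ast.parse(expr, mode="eval")):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in columns:
            raise ValueError(f"unknown column: {node.id}")
    return expr


def sample_conditional(sampler, expr, n, batch=500, max_batches=100):
    """Draw batches from sampler(batch) and keep rows matching expr until n are found."""
    kept, total = [], 0
    for _ in range(max_batches):
        df = sampler(batch)
        validate_filter(expr, set(df.columns))
        hits = df.query(expr)
        kept.append(hits)
        total += len(hits)
        if total >= n:
            return pd.concat(kept, ignore_index=True).head(n)
    raise RuntimeError("filter too restrictive: not enough matching samples")


_rng = np.random.default_rng(0)


def toy_sampler(size):
    """Stand-in for a fitted model; columns and distributions are invented."""
    return pd.DataFrame(
        {"age": _rng.integers(18, 90, size), "ldl": _rng.normal(120, 30, size)}
    )


subset = sample_conditional(toy_sampler, "age >= 40 and ldl < 130", n=50)
```

Validating the expression before it reaches `DataFrame.query` blocks function calls and attribute access, so a string such as `__import__('os')` is refused instead of evaluated.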

Example output

A pretty-printed sample Bundle, a 100-episode synthetic CSV, the model card, and the validation report all live under examples/sample_output/ and are tracked in git.

| File | What |
|---|---|
| sample_bundle_pretty.json | One pretty-printed transaction Bundle |
| sample_bundles.ndjson | 100 Bundles, one per line (Bulk-FHIR style) |
| sample_episodes.csv | 100 synthetic episodes matching the input schema |
| sample_model_card.json | source_sha256, n_train, marginals, top correlations |
| sample_validation_report.json | KS / Wasserstein / correlation-Frobenius per column |

For FHIR-aware rendering: drop the Bundle onto simplifier.net or the HL7 Clinical FHIR Renderer.

What it is not

  • Not privacy-proof. Gaussian copulas are not differentially private. Run syntha audit before sharing any synthetic dataset trained on a small or sensitive cohort.
  • Not a substitute for real PHI when validity hinges on rare events: the copula reproduces the bulk of the joint distribution, not the long tails.
  • Not a population-representative Turkish cohort by default: the source is selected for clinically-pristine adults, so synthetic disease prevalence is lower than TÜİK national figures. Calibration to TÜİK is a curation task; see ROADMAP.md and COLLABORATE.md for how to help.
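One simple signal behind membership-inference checks is the distance from a real record to its closest synthetic record: if training rows sit systematically closer to the synthetic data than held-out rows do, the generator is leaking. The sketch below computes that distance with NumPy. It is a generic illustration on invented numbers; the method `syntha audit` uses is not described in this document.

```python
import numpy as np


def distance_to_closest_record(candidates: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Euclidean distance from each candidate row to its nearest synthetic row."""
    # Broadcast to shape (n_candidates, n_synthetic, n_features).
    diffs = candidates[:, None, :] - synthetic[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


# Invented toy data: the first candidate is an exact copy of a synthetic row.
synthetic_rows = np.array([[0.0, 0.0], [10.0, 10.0]])
candidate_rows = np.array([[0.0, 0.0], [3.0, 4.0]])
dcr = distance_to_closest_record(candidate_rows, synthetic_rows)
```

In practice the features would be standardized first so that no single column dominates the distance, and the distributions for training and held-out records would be compared rather than individual values.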

Contributing + collaboration

Open-source, Apache 2.0, contributions welcome from clinicians, data scientists, and software engineers alike. Three places to start:

  • 🧑‍⚕️ Clinicians: see COLLABORATE.md for the live list of tasks needing clinical-Turkish guidance (drug calibration, ICD specificity, new modules), plus the in-app Collaborate panel that surfaces the same list with one-click "claim" via your GitHub handle.
  • 💻 Developers: CONTRIBUTING.md for dev setup, commit conventions, and the test matrix.
  • 🗺️ Project direction: ROADMAP.md for the staged plan, what's shipped, and what's queued.

License + citation

Apache 2.0 © 2026 Ariorad Moniri; see LICENSE. If you use syntha in academic work, please cite:

Moniri, A. (2026). syntha: hybrid synthetic patient record generator
trained on Turkish pristine-healthy EHR cohorts.
https://github.com/ArioMoniri/syntha

Acknowledgements

  • 🩺 Synthea: inspiration for the clinical-module layer and FHIR output format.
  • LOINC, SNOMED CT, ICD-10, RxNorm: open clinical terminologies.
  • 📊 The anonymized Turkish-cohort EHR data steward (de-identified upstream; never redistributed here).

Contributors

Ariorad Moniri

💻 🎨 📖 🚧 🤔 👀 🚇 ⚠️

all-contributors: comment "@all-contributors please add @username for code,doc" on any issue or PR to nominate someone.
