
Synthetic patient record generator (Synthea-inspired) trained on pristine-healthy episode data


🩺 syntha

A Synthea-inspired hybrid synthetic patient record generator. Learns the joint distribution of real anonymized Turkish-cohort EHR episodes with a Gaussian copula, then layers Synthea-style clinical pathways on top to emit fully-coded FHIR R4 bundles in Turkish.

License: Apache 2.0 · Python 3.10+ · FHIR R4 · Locale: tr-TR


What it is

syntha is a Python library, command-line tool, and signed cross-platform desktop app for generating realistic synthetic patient records (flat CSVs and FHIR R4 transaction Bundles) that match the statistical structure of an anonymized Turkish-cohort EHR while staying physiologically valid and clinically coded.

The pipeline is hybrid:

  1. Gaussian copula fitted on real anonymized episodes: preserves marginal distributions (age, labs, vitals, comorbidity prevalence) and their joint correlation structure.
  2. Physiologic filter: rejects samples that violate pulse-pressure, Friedewald lipid coherence, or eGFR ↔ creatinine constraints.
  3. Synthea-style clinical modules: nine condition-specific state activations that emit Encounters, MedicationRequests (RxNorm-coded), Procedures, and CarePlans matching each patient's comorbidity profile.
  4. FHIR R4 export: Patient + Observation + Condition + Encounter + MedicationRequest + Procedure + CarePlan + DiagnosticReport + RiskAssessment + FamilyMemberHistory, dual-coded LOINC / SNOMED CT / ICD-10 / RxNorm, Turkish locale (names, addresses, language code, display text).

Desktop app

Download macOS Apple Silicon (.dmg) · Download Windows installer (.exe) · Download Linux AppImage

A Tauri 2 app bundling the trained Gaussian copula. Pick a cohort, n, seed, and constraints; the app samples synthetic patients fully client-side (no Python at runtime) and downloads a CSV. The macOS DMG is Developer-ID signed, notarized, and stapled. The Windows installer is code-signed. All three OSes ship a minisign-signed auto-updater, so existing installs get an in-app upgrade banner on next launch.

Install URLs auto-resolve to the latest release via releases/latest/download/…, so no per-version link maintenance is needed.

Install

# PyPI
pip install syntha-ehr

# Or from source
git clone https://github.com/ArioMoniri/syntha
cd syntha
pip install -e ".[dev]"

# Or Docker
docker pull ghcr.io/ariomoniri/syntha:latest

Quick start

# Generate 1 000 synthetic episodes + FHIR bundles + model card + validation report
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant \
  --n 1000 --cohort tolerant

# Longitudinal: multiple encounters per patient with shared HASTA_ID
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant_long \
  --n 2000 --cohort tolerant \
  --longitudinal --encounters-per-patient 4 --years-of-history 3

# Validate a synthetic CSV against the source it was trained on
syntha validate \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/validation.json

# Run a privacy audit (MIA + AIA)
syntha audit \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/privacy.json

By default the CSV writer drops 29 source-pipeline curation flags (pristine_*, berturk_*, drug-safety filters, rf_*). Those are training metadata, not clinical observations, and most are degenerate (constant 0 or 1) in the pristine cohort. Pass --curation-flags to keep them for QA work.
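The prefix-based drop can be pictured with the sketch below. It is an illustration only: the three prefixes come from the paragraph above, the drug-safety filter column names are not given there and so are not handled, and the function names and the sample columns other than rf_kanser are hypothetical.

```python
import csv
import io

# Prefixes named in the paragraph above; the drug-safety filter columns
# are not listed in the README, so this sketch leaves them alone.
CURATION_PREFIXES = ("pristine_", "berturk_", "rf_")

def clinical_columns(header, keep_curation_flags=False):
    """Return the columns to write, optionally keeping curation flags."""
    if keep_curation_flags:
        return list(header)
    return [col for col in header if not col.startswith(CURATION_PREFIXES)]

def strip_curation_flags(csv_text, keep_curation_flags=False):
    """Rewrite CSV text with curation-flag columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = clinical_columns(reader.fieldnames, keep_curation_flags)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()

# Hypothetical two-line CSV; only rf_kanser is a column name from the README.
raw = "age,glucose,pristine_strict,berturk_flag,rf_kanser\n54,92,1,0,0\n"
print(strip_curation_flags(raw))
```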

What it produces

For every synthetic patient, syntha emits a FHIR R4 transaction Bundle:

| Resource | Coding | What |
|---|---|---|
| 👤 Patient | none | Turkish HumanName + Address (ISO 3166-2:TR province), communication.language = tr, derived birthDate |
| 🧪 Observation ×~12 | LOINC | Labs (glucose, full lipid panel, CBC, LFTs, eGFR/creatinine, ferritin, B12) + vitals (BP) |
| 🩺 Condition ×N | SNOMED CT + ICD-10 | Every active comorbidity, dual-coded, with English + clinical-Turkish display |
| 🏥 Encounter ×M | SNOMED CT | One per active condition, fired by the relevant module |
| 💊 MedicationRequest ×P | RxNorm | First-line therapy per condition, with dosage |
| 🔬 Procedure ×Q | SNOMED CT | HbA1c, lipid panel, ECG, spirometry, etc. |
| 📋 CarePlan ×R | SNOMED CT | Disease-specific lifestyle + monitoring plans |
| 📊 DiagnosticReport | LOINC | Lipid, CBC, CMP, iron, BP panels grouping their constituent Observations |
| 🎯 RiskAssessment | SNOMED CT | Charlson Comorbidity Index |
| 👪 FamilyMemberHistory | SNOMED CT | When rf_kanser / rf_kronik_hastalik are set |

In addition, syntha emits a flat CSV matching the input schema (minus the 29 dropped curation flags) for drop-in use as training data, a JSON model card with the source_sha256 and marginals, and a validation report.
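For readers new to FHIR, the shape of a transaction Bundle can be sketched as follows. This builds a generic R4 transaction Bundle by hand; it does not reproduce syntha's exporter, and the patient name, the glucose value, and the helper names are made up for illustration. LOINC 2345-7 is the standard code for serum/plasma glucose.

```python
import json
import uuid
from collections import Counter

def transaction_bundle(resources):
    """Wrap resources in a FHIR R4 transaction Bundle (one POST per entry)."""
    entries = []
    for resource in resources:
        entries.append({
            "fullUrl": f"urn:uuid:{uuid.uuid4()}",
            "resource": resource,
            "request": {"method": "POST", "url": resource["resourceType"]},
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}

def resource_counts(bundle):
    """Count entries per resourceType, e.g. to sanity-check an export."""
    return Counter(e["resource"]["resourceType"] for e in bundle["entry"])

# Illustrative resources; the name and lab value are invented.
patient = {
    "resourceType": "Patient",
    "name": [{"family": "Yılmaz", "given": ["Ayşe"]}],
    "communication": [
        {"language": {"coding": [{"system": "urn:ietf:bcp:47", "code": "tr"}]}}
    ],
}
glucose = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "2345-7"}]},
    "valueQuantity": {"value": 92, "unit": "mg/dL",
                      "system": "http://unitsofmeasure.org", "code": "mg/dL"},
}

bundle = transaction_bundle([patient, glucose])
print(json.dumps(resource_counts(bundle)))
```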

Distribution fidelity

A 100-episode synthetic sample from the tolerant cohort vs the full 135 569-row source:

| Metric | Value |
|---|---|
| n source / synthetic | 135 569 / 100 |
| Max Kolmogorov–Smirnov across continuous columns | 0.14 |
| Mean KS | 0.07 |
| Max binary-prevalence error | 0.025 (has_rx_data) |
| Disease-prevalence error (HTN / DM / hyperlipidemia) | 0.015 / 0.004 / 0.010 |
| Spearman correlation Frobenius diff | 2.94 |
| Fraction of synthetic patients with all labs in reference range | reported per cohort in validation_report.json |
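As a reminder of what the KS rows measure, the two-sample Kolmogorov–Smirnov statistic is the largest gap between the two empirical CDFs. A minimal standard-library sketch (not syntha's validation code; the function name is hypothetical):

```python
from bisect import bisect_right

def ks_statistic(source, synthetic):
    """Two-sample KS statistic: max |F_source(x) - F_synthetic(x)|."""
    a, b = sorted(source), sorted(synthetic)
    largest_gap = 0.0
    # Both empirical CDFs are step functions, so the supremum is attained
    # at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value of 0 means the two samples have identical empirical distributions; 1 means they do not overlap at all.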

Marginals

Marginal distributions โ€” source vs synthetic

Spearman correlation structure

Spearman correlations โ€” source vs synthetic vs diff

Disease prevalence

Comorbidity prevalence โ€” source vs synthetic

FHIR endpoints

# Spin up a local read-only FHIR R4 server
syntha serve --bundles examples/sample_output/sample_bundles.ndjson --port 8080

# Then:
curl http://127.0.0.1:8080/metadata           # CapabilityStatement
curl http://127.0.0.1:8080/Patient            # search-set Bundle
curl http://127.0.0.1:8080/Patient/{id}
curl http://127.0.0.1:8080/\$export           # Bulk Data NDJSON

scripts/post_to_fhir.sh posts every transaction Bundle in an NDJSON file to any FHIR R4 endpoint (default: the public HAPI test server).
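The contents of scripts/post_to_fhir.sh are not shown here, but the idea of posting one transaction Bundle per NDJSON line can be sketched in Python. The base URL below is a placeholder and the function name is hypothetical; a FHIR transaction is submitted as a POST to the server's base URL.

```python
import json
from urllib.request import Request

def bundle_requests(ndjson_text, base_url):
    """Yield one POST request per non-empty NDJSON line (one Bundle each)."""
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        bundle = json.loads(line)  # fails early on malformed lines
        yield Request(
            base_url,
            data=json.dumps(bundle).encode("utf-8"),
            headers={"Content-Type": "application/fhir+json"},
            method="POST",
        )

# Two tiny stand-in Bundles; real lines would come from sample_bundles.ndjson.
ndjson = "\n".join(
    json.dumps({"resourceType": "Bundle", "type": "transaction", "entry": []})
    for _ in range(2)
)
BASE_URL = "http://localhost:8080/fhir"  # placeholder endpoint, an assumption
requests_to_send = list(bundle_requests(ndjson, BASE_URL))
print(len(requests_to_send), "requests prepared")
# To actually send them: urllib.request.urlopen(request) for each request.
```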

Turkish cohort + Turkish output

The trained models bundled with the desktop app and the example output come from pristine_strict_episodes.csv and pristine_tolerant_episodes.csv: anonymized retrospective EHR episodes from a Turkish patient cohort selected to represent clinically pristine adults. The source CSVs themselves are gitignored and never redistributed.

The output is Turkish-localized:

  • Patient names sampled from Turkish given-name and family-name distributions (src/syntha/locale/turkish.py).
  • Addresses use Turkish cities weighted by approximate population with ISO 3166-2:TR province codes.
  • Every Condition carries both an English SNOMED display and a clinical-Turkish translation in Condition.code.text.
  • Patient.communication.language is tr.

All clinical terminology used (LOINC, SNOMED CT, ICD-10, RxNorm) comes from open international standards. No licensed terminology content is embedded.
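A dual-coded Condition with a Turkish display text might look like the sketch below. The field layout follows FHIR R4; the helper name is hypothetical, and the specific codes (SNOMED CT 38341003 and ICD-10 I10 for hypertension) are used as a familiar example, not as a statement of what syntha emits.

```python
def dual_coded_condition(patient_ref, snomed, icd10, display_en, text_tr):
    """Build a Condition carrying SNOMED CT + ICD-10 codings and Turkish text."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": patient_ref},
        "code": {
            "coding": [
                {"system": "http://snomed.info/sct",
                 "code": snomed, "display": display_en},
                {"system": "http://hl7.org/fhir/sid/icd-10", "code": icd10},
            ],
            # Clinical-Turkish label goes in Condition.code.text.
            "text": text_tr,
        },
    }

condition = dual_coded_condition(
    patient_ref="Patient/example",   # placeholder reference
    snomed="38341003",
    icd10="I10",
    display_en="Hypertensive disorder",
    text_tr="Hipertansiyon",
)
print(condition["code"]["text"])
```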

Synthea-style clinical modules

Nine modules ship out of the box (src/syntha/modules/); each fires on its corresponding comorbidity flag.

| Module | Source flag(s) | Emits |
|---|---|---|
| 🫀 Hypertension | Hipertansiyon | Encounter, 1–2 antihypertensives (stage 2 → dual), CarePlan |
| 🍬 Diabetes | DM_Tum, DM_Komplikasyonlu | Encounter, HbA1c, metformin (+ insulin if severe), CarePlan |
| 🧀 Hyperlipidemia | Hiperlipidemi | Encounter, lipid panel, statin (high-intensity if LDL ≥ 190) |
| 🦋 Thyroid | Tiroid | Encounter, TSH, levothyroxine |
| 😔 Depression | Depresyon | Psych encounter, sertraline, CBT CarePlan |
| 😰 Anxiety | Anksiyete | Psych encounter, escitalopram (or buspirone if already on an SSRI) |
| ❤️ Ischemic heart disease | Iskemik_Kalp | Cardiology encounter, ECG, aspirin + β-blocker + statin |
| 🌬️ Asthma | Astim | Resp encounter, spirometry, SABA + ICS |
| 🚭 COPD | COPD | Resp encounter, spirometry, LABA + SABA |

Module authoring guide: docs/MODULES.md.
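To make the "fires on its comorbidity flag" idea concrete, here is a toy module in the spirit of the Hyperlipidemia row. The real module interface is documented in docs/MODULES.md and is not reproduced here: the function signature, the "ldl" key, the event dictionaries, and the "moderate" fallback intensity are all assumptions. Only the Hiperlipidemi flag and the LDL ≥ 190 rule come from the table.

```python
def hyperlipidemia_module(patient):
    """Return the clinical events a hyperlipidemia module might emit."""
    if not patient.get("Hiperlipidemi"):
        return []  # flag not set: the module does not fire

    # High-intensity statin at LDL >= 190 (from the table above);
    # the "moderate" default is an assumption of this sketch.
    intensity = "high" if patient.get("ldl", 0) >= 190 else "moderate"
    return [
        {"resourceType": "Encounter", "reason": "hyperlipidemia"},
        {"resourceType": "Procedure", "what": "lipid panel"},
        {"resourceType": "MedicationRequest", "drug_class": "statin",
         "intensity": intensity},
    ]

events = hyperlipidemia_module({"Hiperlipidemi": 1, "ldl": 204})
for event in events:
    print(event)
```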

Architecture

┌──────────────┐    ┌───────────────────┐    ┌──────────────────────┐
│  Source CSV  │───▶│  Gaussian copula  │───▶│  Physiologic filter  │
│  (Turkish    │    │  (mixed-type ρ;   │    │  (BP, Friedewald,    │
│  pristine)   │    │  nearest-PSD)     │    │  eGFR ↔ creatinine)  │
└──────────────┘    └───────────────────┘    └──────────┬───────────┘
                                                        │
                    ┌───────────────────────────────────┴─────┐
                    │                                         │
                    ▼                                         ▼
          ┌──────────────────┐                  ┌──────────────────────────┐
          │  Longitudinal    │   (optional)     │  Single-encounter CSV +  │
          │  expansion       │─────────────────▶│  FHIR R4 export with     │
          │ (drift, Poisson) │                  │  module activation       │
          └─────────┬────────┘                  └──────────────────────────┘
                    │
                    ▼
           (same FHIR export)

Full math (mixed-type correlation, nearest-PSD projection, conditional missingness, AR(1) lab drift): docs/ARCHITECTURE.md.
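As a rough picture of AR(1) lab drift, each visit's value can be modeled as the patient's baseline plus a decaying fraction of the previous deviation plus noise. The parameterization syntha actually uses lives in docs/ARCHITECTURE.md; the function name, phi, sigma, and the example numbers below are assumptions.

```python
import random

def ar1_lab_series(baseline, n_visits, phi, sigma, rng):
    """Simulate x_t = baseline + phi * (x_{t-1} - baseline) + noise."""
    values = [baseline]
    for _ in range(n_visits - 1):
        previous_deviation = values[-1] - baseline
        noise = rng.gauss(0, sigma)
        values.append(baseline + phi * previous_deviation + noise)
    return values

# Hypothetical fasting-glucose trajectory over four encounters.
rng = random.Random(7)
series = ar1_lab_series(baseline=92.0, n_visits=4, phi=0.6, sigma=4.0, rng=rng)
print([round(v, 1) for v in series])
```

With |phi| < 1 the series keeps reverting toward the baseline, so a patient's labs wander between encounters without drifting away indefinitely.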

CLI reference

| Command | What |
|---|---|
| syntha generate | End-to-end: train copula → sample → modules → CSV + FHIR + model card + validation |
| syntha fit | Fit and persist a copula in a registry without sampling |
| syntha sample | Raw sampling from a registered model |
| syntha sample-conditional | AST-validated rejection sampling against a pandas filter expression |
| syntha fhir | Convert an existing synthetic CSV to FHIR R4 bundles |
| syntha validate | KS / Wasserstein / correlation diff + reference-range coverage |
| syntha audit | Privacy audit (membership-inference + attribute-inference) |
| syntha serve | Read-only FHIR R4 demo server |
| syntha export-model | Export a registered copula to v2 JSON for the desktop app |
| syntha list-models, show-card | Inspect the registry |

Run syntha <cmd> --help for full option lists.
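The idea behind sample-conditional (validate a filter expression's syntax tree, then keep only draws that satisfy it) can be sketched without pandas. This is not syntha's code: it evaluates plain Python expressions over dict rows as a stand-in for a pandas filter expression, and the whitelist, function names, and toy sampler are assumptions.

```python
import ast
import random

# Only comparisons, boolean logic, column names, and literals are allowed.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)

def compile_filter(expression, columns):
    """Parse and whitelist-check an expression before compiling it."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in columns:
            raise ValueError(f"unknown column: {node.id}")
    return compile(tree, "<filter>", "eval")

def rejection_sample(draw, code, n, max_tries=10_000):
    """Keep drawing rows until n of them satisfy the compiled filter."""
    kept = []
    for _ in range(max_tries):
        row = draw()
        if eval(code, {"__builtins__": {}}, row):
            kept.append(row)
            if len(kept) == n:
                break
    return kept

# Toy sampler standing in for a fitted model.
rng = random.Random(1)
def draw_row():
    return {"age": rng.randint(18, 90), "ldl": rng.gauss(120, 30)}

code = compile_filter("age >= 40 and ldl < 190", columns={"age", "ldl"})
rows = rejection_sample(draw_row, code, n=5)
print(len(rows), "rows accepted")
```

Rejecting anything outside the whitelist (function calls, attribute access, subscripts) before evaluation is what keeps an expression such as `__import__('os')` from ever running.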

Example output

A pretty-printed sample Bundle, a 100-episode synthetic CSV, the model card, and the validation report all live under examples/sample_output/ and are tracked in git.

| File | What |
|---|---|
| sample_bundle_pretty.json | One pretty-printed transaction Bundle |
| sample_bundles.ndjson | 100 Bundles, one per line (Bulk-FHIR style) |
| sample_episodes.csv | 100 synthetic episodes matching the input schema |
| sample_model_card.json | source_sha256, n_train, marginals, top correlations |
| sample_validation_report.json | KS / Wasserstein / correlation-Frobenius per column |

For FHIR-aware rendering: drop the Bundle onto simplifier.net or the HL7 Clinical FHIR Renderer.
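The source_sha256 field in the model card lets you check which source file a model was trained on. Assuming it is the SHA-256 of the raw source CSV bytes (the README does not spell this out), a local file can be hashed for comparison like this; the function name and the temporary demo file are illustrative only.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large CSVs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file; in practice, point this at your source CSV and
# compare the result with source_sha256 from the model card JSON.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"age,glucose\n54,92\n")
    demo_path = tmp.name

demo_digest = file_sha256(demo_path)
os.remove(demo_path)
print(demo_digest)
```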

What it is not

  • Not privacy-proof. Gaussian copulas are not differentially private. Run syntha audit before sharing any synthetic dataset trained on a small or sensitive cohort.
  • Not a substitute for real PHI when validity hinges on rare events: the copula reproduces the bulk of the joint distribution, not the long tails.
  • Not a population-representative Turkish cohort by default: the source is selected for clinically pristine adults, so synthetic disease prevalence is lower than TÜİK national figures. Calibration to TÜİK is a curation task; see ROADMAP.md and COLLABORATE.md for how to help.

Contributing + collaboration

Open-source, Apache 2.0, contributions welcome from clinicians, data scientists, and software engineers alike. Three places to start:

  • 🧑‍⚕️ Clinicians: see COLLABORATE.md for the live list of tasks needing clinical-Turkish guidance (drug calibration, ICD specificity, new modules), plus the in-app Collaborate panel that surfaces the same list with one-click "claim" via your GitHub handle.
  • 💻 Developers: CONTRIBUTING.md for dev setup, commit conventions, and the test matrix.
  • 🗺️ Project direction: ROADMAP.md for the staged plan, what's shipped, and what's queued.

License + citation

Apache 2.0 © 2026 Ariorad Moniri; see LICENSE. If you use syntha in academic work, please cite:

Moniri, A. (2026). syntha: hybrid synthetic patient record generator
trained on Turkish pristine-healthy EHR cohorts.
https://github.com/ArioMoniri/syntha

Acknowledgements

| Project | What it gives us |
|---|---|
| 🩺 Synthea | Inspiration for the clinical-module layer and FHIR output format |
| 🧪 LOINC | Lab and observation codes |
| 🧬 SNOMED CT | Condition, procedure, encounter, and care-plan terminology |
| 📑 ICD-10 | Diagnosis coding alongside SNOMED |
| 💊 RxNorm | Medication coding |
| 📊 Turkish-cohort EHR data steward | De-identified retrospective episodes (anonymized upstream; never redistributed by this repo) |

Contributors

Want to be on this list? See COLLABORATE.md or pick a card in the in-app Collaborate panel.

Ariorad Moniri

🧑‍💼 💻 🎨 📖 🚧 🤔 👀 🚇 ⚠️

Powered by all-contributors. To nominate someone, comment @all-contributors please add @username for code,doc on any issue or PR.

Community

  • 💬 Discussions: open questions, "is this the right tool for X?", show-and-tell
  • 🐛 Issues: bug reports + feature requests + clinical curation
  • 🤝 Collaborate: live list of clinician + dev + data tasks · also surfaced in the desktop app
  • 📖 Contributing: dev setup, commit conventions, test matrix
  • 🗺️ Roadmap: shipped + queued + what needs a clinician
  • 📋 Changelog: Semver, Keep-a-Changelog, generated by release-please

