Synthetic patient record generator (Synthea-inspired) trained on pristine-healthy episode data
syntha
A Synthea-inspired hybrid synthetic patient record generator. Learns the joint distribution of real anonymized Turkish-cohort EHR episodes with a Gaussian copula, then layers Synthea-style clinical pathways on top to emit fully-coded FHIR R4 bundles in Turkish.
What it is
syntha is a Python library, command-line tool, and signed cross-platform desktop app for generating realistic synthetic patient records (flat CSVs and FHIR R4 transaction Bundles) that match the statistical structure of an anonymized Turkish-cohort EHR while staying physiologically valid and clinically coded.
The pipeline is hybrid:
- Gaussian copula fitted on real anonymized episodes: preserves marginal distributions (age, labs, vitals, comorbidity prevalence) and their joint correlation structure.
- Physiologic filter: rejects samples that violate pulse-pressure, Friedewald lipid coherence, or eGFR ↔ creatinine constraints.
- Synthea-style clinical modules: nine condition-specific state activations that emit Encounters, MedicationRequests (RxNorm-coded), Procedures, and CarePlans matching each patient's comorbidity profile.
- FHIR R4 export: Patient + Observation + Condition + Encounter + MedicationRequest + Procedure + CarePlan + DiagnosticReport + RiskAssessment + FamilyMemberHistory, dual-coded LOINC / SNOMED CT / ICD-10 / RxNorm, Turkish locale (names, addresses, language code, display text).
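To make the first stage concrete, here is a minimal two-column Gaussian copula sketch in pure standard-library Python. It is not syntha's implementation: the function names, the toy `ages`/`sbp` data, and the use of empirical quantiles for the marginals are all assumptions made for illustration, and it ignores the mixed-type correlation and nearest-PSD steps mentioned later in this README.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Map a column to standard-normal scores via its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the order statistic at probability u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def fit_and_sample(col_a, col_b, n_samples, seed=0):
    """Fit a two-column Gaussian copula and draw n_samples synthetic rows."""
    rho = correlation(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Cholesky factor of [[1, rho], [rho, 1]] applied to independent normals.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_a, u1),
                     empirical_quantile(sorted_b, u2)))
    return rho, rows

# Toy "source" data (made up for this example, not real patient data).
toy_rng = random.Random(42)
ages = [toy_rng.uniform(20, 80) for _ in range(500)]
sbp = [100 + 0.5 * a + toy_rng.gauss(0, 5) for a in ages]
rho, synthetic = fit_and_sample(ages, sbp, n_samples=200, seed=1)
```

Because the marginals are resampled from the sorted source values, every synthetic value in this sketch is one that occurred in the source column; only the pairing between columns is generated by the copula.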
Desktop app
A Tauri 2 app bundling the trained Gaussian copula. You pick a cohort, n, seed, and constraints; the app samples synthetic patients fully client-side (no Python at runtime) and downloads a CSV. The macOS DMG is Developer-ID signed, notarized, and stapled. The Windows installer is code-signed. All three OSes ship a minisign-signed auto-updater, so existing installs get an in-app upgrade banner on next launch.
Install URLs auto-resolve to the latest release via releases/latest/download/…, so there is no per-version link maintenance.
Install
# PyPI
pip install syntha-ehr
# Or from source
git clone https://github.com/ArioMoniri/syntha
cd syntha
pip install -e ".[dev]"
# Or Docker
docker pull ghcr.io/ariomoniri/syntha:latest
Quick start
# Generate 1 000 synthetic episodes + FHIR bundles + model card + validation report
syntha generate \
--input data/raw/pristine_tolerant_episodes.csv \
--output output/tolerant \
--n 1000 --cohort tolerant
# Longitudinal: multiple encounters per patient with shared HASTA_ID
syntha generate \
--input data/raw/pristine_tolerant_episodes.csv \
--output output/tolerant_long \
--n 2000 --cohort tolerant \
--longitudinal --encounters-per-patient 4 --years-of-history 3
# Validate a synthetic CSV against the source it was trained on
syntha validate \
--source data/raw/pristine_tolerant_episodes.csv \
--synthetic output/tolerant/synthetic_tolerant_episodes.csv \
--output output/tolerant/validation.json
# Run a privacy audit (MIA + AIA)
syntha audit \
--source data/raw/pristine_tolerant_episodes.csv \
--synthetic output/tolerant/synthetic_tolerant_episodes.csv \
--output output/tolerant/privacy.json
By default the CSV writer drops 29 source-pipeline curation flags (pristine_*, berturk_*, drug-safety filters, rf_*). Those are training metadata, not clinical observations, and most are degenerate (constant 0 or 1) in the pristine cohort. Pass --curation-flags to keep them for QA work.
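The column-dropping behaviour can be pictured with a small sketch. This is an assumption-laden illustration, not syntha's writer: `CURATION_PATTERNS` only covers the prefixes named above (the actual list of 29 columns is not reproduced here), `drop_curation_columns` is a hypothetical helper, and the column `pristine_strict` in the sample is made up.

```python
import csv
import io
from fnmatch import fnmatch

# Hypothetical patterns based on the prefixes named in this README.
CURATION_PATTERNS = ("pristine_*", "berturk_*", "rf_*")

def drop_curation_columns(csv_text, keep_flags=False):
    """Return csv_text without curation-flag columns unless keep_flags is set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    kept = [
        c for c in fields
        if keep_flags or not any(fnmatch(c, p) for p in CURATION_PATTERNS)
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # extra (dropped) keys are ignored
    return out.getvalue()

sample = "age,pristine_strict,rf_kanser,Hipertansiyon\n54,1,0,1\n"
cleaned = drop_curation_columns(sample)
```

Passing `keep_flags=True` mirrors the intent of `--curation-flags`: the output keeps every input column.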
What it produces
For every synthetic patient, syntha emits a FHIR R4 transaction Bundle:
| Resource | Coding | What |
|---|---|---|
| Patient | n/a | Turkish HumanName + Address (ISO 3166-2:TR province), communication.language = tr, derived birthDate |
| Observation ×~12 | LOINC | Labs (glucose, full lipid panel, CBC, LFTs, eGFR/creatinine, ferritin, B12) + vitals (BP) |
| Condition ×N | SNOMED CT + ICD-10 | Every active comorbidity, dual-coded, with English + clinical-Turkish display |
| Encounter ×M | SNOMED CT | One per active condition, fired by the relevant module |
| MedicationRequest ×P | RxNorm | First-line therapy per condition, with dosage |
| Procedure ×Q | SNOMED CT | HbA1c, lipid panel, ECG, spirometry, etc. |
| CarePlan ×R | SNOMED CT | Disease-specific lifestyle + monitoring plans |
| DiagnosticReport | LOINC | Lipid, CBC, CMP, iron, BP panels grouping their constituent Observations |
| RiskAssessment | SNOMED CT | Charlson Comorbidity Index |
| FamilyMemberHistory | SNOMED CT | When rf_kanser / rf_kronik_hastalik are set |
…plus a flat CSV matching the input schema (minus the 29 dropped curation flags) for drop-in use as training data, a JSON model card with the source_sha256 and marginals, and a validation report.
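For readers new to FHIR, the shape of a transaction Bundle can be sketched with plain dictionaries. This is a hand-written minimal example, not output copied from syntha: `make_bundle` is a hypothetical helper, the name and value are placeholders, and only a Patient plus one systolic blood pressure Observation (LOINC 8480-6) are included.

```python
import json
import uuid

def make_bundle(given, family, systolic_mmhg):
    """Build a minimal FHIR R4 transaction Bundle: one Patient, one Observation."""
    patient_url = f"urn:uuid:{uuid.uuid4()}"
    patient = {
        "resourceType": "Patient",
        "name": [{"given": [given], "family": family}],
        "communication": [
            {"language": {"coding": [{"system": "urn:ietf:bcp:47", "code": "tr"}]}}
        ],
    }
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": "8480-6",
                             "display": "Systolic blood pressure"}]},
        # The urn:uuid reference is resolved by the server when the
        # transaction is processed.
        "subject": {"reference": patient_url},
        "valueQuantity": {"value": systolic_mmhg, "unit": "mmHg",
                          "system": "http://unitsofmeasure.org", "code": "mm[Hg]"},
    }
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {"fullUrl": patient_url, "resource": patient,
             "request": {"method": "POST", "url": "Patient"}},
            {"fullUrl": f"urn:uuid:{uuid.uuid4()}", "resource": observation,
             "request": {"method": "POST", "url": "Observation"}},
        ],
    }

bundle = make_bundle("Ayşe", "Yılmaz", 118)
bundle_json = json.dumps(bundle, ensure_ascii=False)
```

Each entry carries its own `request`, which is what distinguishes a transaction Bundle from a plain collection.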
Distribution fidelity
A 100-episode sample of tolerant vs the full 135 569-row source:
| Metric | Value |
|---|---|
| n source / synthetic | 135 569 / 100 |
| Max Kolmogorov–Smirnov across continuous columns | 0.14 |
| Mean KS | 0.07 |
| Max binary-prevalence error | 0.025 (has_rx_data) |
| Disease-prevalence error (HTN / DM / hyperlipidemia) | 0.015 / 0.004 / 0.010 |
| Spearman correlation Frobenius diff | 2.94 |
| Fraction of synthetic patients with all labs in reference range | reported per cohort in validation_report.json |
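As a reminder of what the KS rows measure, the two-sample Kolmogorov–Smirnov statistic is the largest gap between the two empirical CDFs. The sketch below computes it with the standard library only; `ks_statistic` is an illustrative helper, not the function `syntha validate` uses.

```python
from bisect import bisect_right

def ks_statistic(source, synthetic):
    """Two-sample KS statistic: max |F_source(x) - F_synthetic(x)| over all x."""
    a, b = sorted(source), sorted(synthetic)
    d = 0.0
    # Both empirical CDFs are step functions that only change at data points,
    # so checking every observed value is enough to find the maximum gap.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value of 0 means the empirical distributions coincide and 1 means they do not overlap at all, so the reported maximum of 0.14 sits near the low end of that range.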
Figures (images not reproduced here): Marginals; Spearman correlation structure; Disease prevalence.
FHIR endpoints
# Spin up a local read-only FHIR R4 server
syntha serve --bundles examples/sample_output/sample_bundles.ndjson --port 8080
# Then:
curl http://127.0.0.1:8080/metadata # CapabilityStatement
curl http://127.0.0.1:8080/Patient # search-set Bundle
curl http://127.0.0.1:8080/Patient/{id}
curl http://127.0.0.1:8080/\$export # Bulk Data NDJSON
scripts/post_to_fhir.sh posts every transaction Bundle in an NDJSON file to any FHIR R4 endpoint (default: the public HAPI test server).
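Before posting an NDJSON file anywhere, it can help to check what it contains. The sketch below tallies resource types across Bundles, assuming one JSON Bundle per line as described for sample_bundles.ndjson; `count_resources` and the inline sample are illustrative, not part of syntha.

```python
import json
from collections import Counter

def count_resources(ndjson_text):
    """Count resource types across all Bundles in an NDJSON string."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        bundle = json.loads(line)
        for entry in bundle.get("entry", []):
            counts[entry["resource"]["resourceType"]] += 1
    return counts

# Two tiny hand-written Bundles standing in for a real file.
sample_ndjson = "\n".join([
    json.dumps({"resourceType": "Bundle", "type": "transaction", "entry": [
        {"resource": {"resourceType": "Patient"}},
        {"resource": {"resourceType": "Observation"}},
    ]}),
    json.dumps({"resourceType": "Bundle", "type": "transaction", "entry": [
        {"resource": {"resourceType": "Patient"}},
    ]}),
])
resource_counts = count_resources(sample_ndjson)
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the same function.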
Turkish cohort + Turkish output
The trained models bundled with the desktop app and the example output come from pristine_strict_episodes.csv and pristine_tolerant_episodes.csv, anonymized retrospective EHR episodes from a Turkish patient cohort selected to represent clinically pristine adults. The source CSVs themselves are gitignored and never redistributed.
The output is Turkish-localized:
- Patient names sampled from Turkish given-name and family-name distributions (src/syntha/locale/turkish.py).
- Addresses use Turkish cities weighted by approximate population with ISO 3166-2:TR province codes.
- Every Condition carries both an English SNOMED display and a clinical-Turkish translation in Condition.code.text.
- Patient.communication.language is tr.
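The localization steps above amount to weighted sampling from lookup tables. A minimal sketch, under loud assumptions: the name lists and the relative city weights below are illustrative placeholders, not the contents of src/syntha/locale/turkish.py, and `sample_identity` is a hypothetical helper. Only the ISO 3166-2:TR codes are real.

```python
import random

# Placeholder lists for illustration only.
GIVEN_NAMES = ["Ayşe", "Mehmet", "Elif"]
FAMILY_NAMES = ["Yılmaz", "Kaya", "Demir"]
# (city, ISO 3166-2:TR code, rough relative weight; weights are illustrative)
CITIES = [("İstanbul", "TR-34", 15.0), ("Ankara", "TR-06", 5.0), ("İzmir", "TR-35", 4.0)]

def sample_identity(rng):
    """Draw a name and a population-weighted city for one synthetic patient."""
    city, province, _ = rng.choices(
        CITIES, weights=[w for _, _, w in CITIES], k=1
    )[0]
    return {
        "given": rng.choice(GIVEN_NAMES),
        "family": rng.choice(FAMILY_NAMES),
        "city": city,
        "province": province,
        "language": "tr",
    }

identity = sample_identity(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the draw reproducible for a given seed.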
All clinical terminology used (LOINC, SNOMED CT, ICD-10, RxNorm) comes from open international standards. No licensed terminology content is embedded.
Synthea-style clinical modules
Nine modules ship out of the box (src/syntha/modules/); each fires on its corresponding comorbidity flag.
| Module | Source flag(s) | Emits |
|---|---|---|
| Hypertension | Hipertansiyon | Encounter, 1–2 antihypertensives (stage 2 → dual), CarePlan |
| Diabetes | DM_Tum, DM_Komplikasyonlu | Encounter, HbA1c, metformin (+ insulin if severe), CarePlan |
| Hyperlipidemia | Hiperlipidemi | Encounter, lipid panel, statin (high-intensity if LDL ≥ 190) |
| Thyroid | Tiroid | Encounter, TSH, levothyroxine |
| Depression | Depresyon | Psych encounter, sertraline, CBT CarePlan |
| Anxiety | Anksiyete | Psych encounter, escitalopram (or buspirone if already on an SSRI) |
| Ischemic heart disease | Iskemik_Kalp | Cardiology encounter, ECG, aspirin + β-blocker + statin |
| Asthma | Astim | Resp encounter, spirometry, SABA + ICS |
| COPD | COPD | Resp encounter, spirometry, LABA + SABA |
Module authoring guide: docs/MODULES.md.
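To illustrate the "fires on its flag, then emits resources" pattern, here is a toy version of the hyperlipidemia row from the table. It is a sketch, not the module API documented in docs/MODULES.md: the function name, the `LDL` column name, the "standard" intensity label, and the simplified resource stubs are all assumptions.

```python
def hyperlipidemia_module(patient):
    """Return resource stubs for a patient row, or [] if the flag is not set."""
    if not patient.get("Hiperlipidemi"):
        return []
    # Threshold from the table above: high-intensity statin if LDL >= 190.
    # "LDL" as a column name and "standard" as the fallback are assumptions.
    intensity = "high" if patient.get("LDL", 0) >= 190 else "standard"
    return [
        {"resourceType": "Encounter", "reason": "hyperlipidemia"},
        {"resourceType": "Procedure", "name": "lipid panel"},
        {"resourceType": "MedicationRequest", "drug_class": "statin",
         "intensity": intensity},
    ]

inactive = hyperlipidemia_module({"Hiperlipidemi": 0, "LDL": 210})
severe = hyperlipidemia_module({"Hiperlipidemi": 1, "LDL": 210})
mild = hyperlipidemia_module({"Hiperlipidemi": 1, "LDL": 140})
```

A real module would attach SNOMED CT and RxNorm codings to each stub; they are left out here to avoid guessing at codes.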
Architecture
+--------------+    +------------------+    +----------------------+
| Source CSV   |--->| Gaussian copula  |--->| Physiologic filter   |
| (Turkish     |    | (mixed-type ρ;   |    | (BP, Friedewald,     |
| pristine)    |    | nearest-PSD)     |    | eGFR ↔ creatinine)   |
+--------------+    +------------------+    +----------+-----------+
                                                       |
                           +---------------------------+-------------+
                           |                                         |
                           v                                         v
                  +------------------+                  +--------------------------+
                  | Longitudinal     |    (optional)    | Single-encounter CSV +   |
                  | expansion        |----------------->| FHIR R4 export with      |
                  | (drift, Poisson) |                  | module activation        |
                  +--------+---------+                  +--------------------------+
                           |
                           v
                  (same FHIR export)
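The physiologic filter stage can be sketched as a per-row predicate. The Friedewald relation itself is standard (LDL ≈ total cholesterol − HDL − triglycerides / 5, all in mg/dL, and generally used only when triglycerides are below 400 mg/dL), but everything else here is an assumption: the column names, the 10 mg/dL tolerance, and the omission of the eGFR ↔ creatinine check.

```python
def friedewald_ldl(total_chol, hdl, triglycerides):
    """Friedewald estimate of LDL cholesterol; all values in mg/dL."""
    return total_chol - hdl - triglycerides / 5.0

def passes_physiologic_filter(row, ldl_tolerance=10.0):
    """Reject rows with non-positive pulse pressure or incoherent lipids."""
    pulse_pressure = row["sbp"] - row["dbp"]
    if pulse_pressure <= 0:
        return False
    expected_ldl = friedewald_ldl(row["total_chol"], row["hdl"], row["tg"])
    # ldl_tolerance is a hypothetical threshold chosen for this example.
    return abs(row["ldl"] - expected_ldl) <= ldl_tolerance

coherent = {"sbp": 120, "dbp": 80, "total_chol": 200, "hdl": 50, "tg": 150, "ldl": 120}
inverted_bp = dict(coherent, dbp=130)
bad_lipids = dict(coherent, ldl=60)
results = [passes_physiologic_filter(r) for r in (coherent, inverted_bp, bad_lipids)]
```

In a rejection-style pipeline, rows that fail the predicate would simply be discarded and resampled.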
Full math (mixed-type correlation, nearest-PSD projection, conditional missingness, AR(1) lab drift): docs/ARCHITECTURE.md.
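As a rough picture of AR(1) lab drift, each follow-up value is pulled back toward the patient's baseline and perturbed by Gaussian noise: x_t = μ + φ (x_{t−1} − μ) + ε_t. The sketch below is a generic AR(1) generator; `phi`, `sigma`, and the function name are illustrative assumptions, and the parameterization in docs/ARCHITECTURE.md may differ.

```python
import random

def ar1_series(baseline, n_visits, phi=0.8, sigma=2.0, seed=0):
    """Generate n_visits lab values that drift around baseline as an AR(1) process."""
    rng = random.Random(seed)
    values = [baseline]
    for _ in range(n_visits - 1):
        prev = values[-1]
        # Mean-reverting step: keep a fraction phi of the previous deviation,
        # then add fresh noise with standard deviation sigma.
        values.append(baseline + phi * (prev - baseline) + rng.gauss(0.0, sigma))
    return values

glucose_visits = ar1_series(baseline=92.0, n_visits=4, seed=7)
flat_visits = ar1_series(baseline=92.0, n_visits=4, sigma=0.0)
```

With `phi` close to 1 consecutive visits stay strongly correlated; with `phi` at 0 each visit is an independent draw around the baseline.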
CLI reference
| Command | What |
|---|---|
| syntha generate | End-to-end: train copula → sample → modules → CSV + FHIR + model card + validation |
| syntha fit | Fit and persist a copula in a registry without sampling |
| syntha sample | Raw sampling from a registered model |
| syntha sample-conditional | AST-validated rejection sampling against a pandas filter expression |
| syntha fhir | Convert an existing synthetic CSV to FHIR R4 bundles |
| syntha validate | KS / Wasserstein / correlation diff + reference-range coverage |
| syntha audit | Privacy audit (membership-inference + attribute-inference) |
| syntha serve | Read-only FHIR R4 demo server |
| syntha export-model | Export a registered copula to v2 JSON for the desktop app |
| syntha list-models, show-card | Inspect the registry |
Run syntha <cmd> --help for full option lists.
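The idea behind AST-validated rejection sampling can be shown with the standard `ast` module: parse the filter, allow only a small whitelist of node types and known column names, then keep drawing rows until enough satisfy it. This is a simplified sketch over plain dict rows rather than pandas, and `validate_filter`, `rejection_sample`, and `ALLOWED_NODES` are hypothetical names, not syntha's internals.

```python
import ast
import random

# Whitelist: boolean logic, comparisons, column names, and literals only.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)

def validate_filter(expr, allowed_columns):
    """Parse expr and raise ValueError on any non-whitelisted syntax or name."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in allowed_columns:
            raise ValueError(f"unknown column: {node.id}")
    return tree

def rejection_sample(draw_row, expr, allowed_columns, n, max_tries=10_000):
    """Draw rows until n of them satisfy the validated filter (or tries run out)."""
    code = compile(validate_filter(expr, allowed_columns), "<filter>", "eval")
    accepted = []
    for _ in range(max_tries):
        row = draw_row()
        # The row dict serves as the local namespace; builtins are disabled.
        if eval(code, {"__builtins__": {}}, row):
            accepted.append(row)
            if len(accepted) == n:
                break
    return accepted

rng = random.Random(3)
columns = {"age", "Hipertansiyon"}
rows = rejection_sample(
    lambda: {"age": rng.randint(18, 90), "Hipertansiyon": rng.randint(0, 1)},
    "age >= 40 and Hipertansiyon == 1", columns, n=5,
)
```

Function calls, attribute access, and subscripts are rejected before anything is evaluated, which is the point of validating the tree first.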
Example output
A pretty-printed sample Bundle, a 100-episode synthetic CSV, the model card, and the validation report all live under examples/sample_output/ and are tracked in git.
| File | What |
|---|---|
| sample_bundle_pretty.json | One pretty-printed transaction Bundle |
| sample_bundles.ndjson | 100 Bundles, one per line (Bulk-FHIR style) |
| sample_episodes.csv | 100 synthetic episodes matching the input schema |
| sample_model_card.json | source_sha256, n_train, marginals, top correlations |
| sample_validation_report.json | KS / Wasserstein / correlation-Frobenius per column |
For FHIR-aware rendering: drop the Bundle onto simplifier.net or the HL7 Clinical FHIR Renderer.
What it is not
- Not privacy-proof. Gaussian copulas are not differentially private. Run syntha audit before sharing any synthetic dataset trained on a small or sensitive cohort.
- Not a substitute for real PHI when validity hinges on rare events: the copula reproduces the bulk of the joint distribution, not the long tails.
- Not a population-representative Turkish cohort by default: the source is selected for clinically-pristine adults, so synthetic disease prevalence is lower than TÜİK national figures. Calibration to TÜİK is a curation task; see ROADMAP.md and COLLABORATE.md for how to help.
Contributing + collaboration
Open-source, Apache 2.0, contributions welcome from clinicians, data scientists, and software engineers alike. Three places to start:
- Clinicians: see COLLABORATE.md for the live list of tasks needing clinical-Turkish guidance (drug calibration, ICD specificity, new modules), plus the in-app Collaborate panel that surfaces the same list with one-click "claim" via your GitHub handle.
- Developers: CONTRIBUTING.md for dev setup, commit conventions, and the test matrix.
- Project direction: ROADMAP.md for the staged plan, what's shipped, and what's queued.
License + citation
Apache 2.0 © 2026 Ariorad Moniri. See LICENSE. If you use syntha in academic work, please cite:
Moniri, A. (2026). syntha: hybrid synthetic patient record generator
trained on Turkish pristine-healthy EHR cohorts.
https://github.com/ArioMoniri/syntha
Acknowledgements
| Project | What it gives us |
|---|---|
| Synthea | Inspiration for the clinical-module layer and FHIR output format |
| LOINC | Lab and observation codes |
| SNOMED CT | Condition, procedure, encounter, and care-plan terminology |
| ICD-10 | Diagnosis coding alongside SNOMED |
| RxNorm | Medication coding |
| Turkish-cohort EHR data steward | De-identified retrospective episodes (anonymized upstream; never redistributed by this repo) |
Contributors
Want to be on this list? See COLLABORATE.md or pick a card in the in-app Collaborate panel.
Ariorad Moniri
Powered by all-contributors: comment @all-contributors please add @username for code,doc on any issue or PR to nominate someone.
Community
| Where | What |
|---|---|
| Discussions | Open questions, "is this the right tool for X?", show-and-tell |
| Issues | Bug reports + feature requests + clinical curation |
| Collaborate | Live list of clinician + dev + data tasks; also surfaced in the desktop app |
| Contributing | Dev setup, commit conventions, test matrix |
| Roadmap | Shipped + queued + what needs a clinician |
| Changelog | Semver, Keep-a-Changelog, generated by release-please |