Synthetic medical record generator with realistic schema variance across locales
Project description
MedSynth
Multi-lingual synthetic healthcare data generator. Produces realistic medical records with intentional OCR artifacts and schema variance — simulating real-world messy healthcare data.
The Problem
Healthcare AI development is bottlenecked by data access.
Real patient records are legally restricted (HIPAA, GDPR, Uruguay's Ley 18.331), expensive to anonymize, and nearly impossible to share across borders. Researchers spend months navigating data access before writing a single line of AI code.
Meanwhile, most synthetic data generators produce clean, English-only records that look nothing like actual hospital data — which is scanned paper, multi-lingual, inconsistently formatted, and full of OCR errors.
MedSynth generates data that looks like the real thing — including the mess.
What Makes This Different
| Feature | MedSynth | Typical Generators |
|---|---|---|
| Languages | 6 locales (Hebrew, Arabic, Spanish) | English only |
| OCR artifacts | Realistic scan errors per script | Clean text |
| Schema variance | Different formats per facility | Single schema |
| ID systems | Country-specific (Teudat Zehut, CURP, DNI) | Generic |
| Privacy | Zero real patient data | Often derived from real records |
OCR Realism
Real medical records are scanned paper. MedSynth simulates actual scanning artifacts:
- Arabic: Dot-group confusions (ب↔ت↔ث), tashkeel stripping
- Hebrew: Shape-based confusions (ר↔ד, ח↔כ)
- Latin: rn→m merges, diacritic loss (ñ→n), 0↔O swaps
Schema Variance
Different hospitals format records differently. MedSynth produces variant schemas across facilities so AI systems learn to handle real-world inconsistency — not just clean demos.
Installation
pip install e2llm-medsynth
Quick Start
pip install e2llm-medsynth
# Structured data only (no LLM needed)
medsynth --locale he_IL --num-patients 10 --skip-freetext -v
# With free text via Ollama (default — no API key needed)
ollama pull llama4:maverick
medsynth --locale he_IL --num-patients 10 -v
Free text generation uses any OpenAI-compatible API. Default: Ollama + Llama 4 Maverick (local). No API key needed for basic generation or local Ollama.
Output Format
MedSynth outputs NDJSON files — one per facility × document type:
output/
├── medical_alon_discharge.ndjson
├── medical_alon_lab.ndjson
├── medical_alon_referral.ndjson
├── medical_hadarim_discharge.ndjson
├── medical_hadarim_visit.ndjson
├── ...
Example: The Mess
Alon hospital — digital, English field names:
{"patient_id": "165667015", "patient_name": "משה אזולאי", "patient_age": 77, "gender": "male", "document_date": "2023-07-01", "facility_name": "בית חולים האלון", "conditions": ["השמנת יתר", "דיכאון", "COPD"], "smoking_status": true, "department": "אורולוגיה", "primary_diagnosis": "השמנת יתר", "doc_type": "discharge"}
Hadarim hospital — OCR source, Hebrew field names, different ID type:
{"מספר_זהות": 161559406, "שם_מטופל": "יעל גולן", "גיל": 31, "מין": "female", "תאריך": "29/04/2024", "מוסד_רפואי": "מרכז רפואי הדרים", "מחלות_רקע": ["סוכרת סוג 2", "אי ספיקת כליות כרונית"], "מחלקה": "פנימית א", "אבחנה_ראשית": "סוכרת סוג 2", "doc_type": "discharge"}
Different field names (patient_id → מספר_זהות), different date format (2023-07-01 → 29/04/2024), ID as integer instead of string.
Saudi Arabia — Arabic fields, age as range string:
{"رقم_الهوية": 1496965326, "الاسم": "عبدالرحمن بن راشد الأحمدي", "العمر": "50-60", "الجنس": "male", "التاريخ": "2023-06", "المركز": "مركز الرعاية الصحية الأولية", "الأمراض": ["فرط شحميات الدم"], "التشخيص": "فرط شحميات الدم", "doc_type": "discharge"}
Age stored as range string ("50-60" not 57), date truncated to month ("2023-06").
Mexico — CURP national ID, Spanish field names:
{"patient_id": "AULJ460528MDFGPN03", "patient_name": "Juana Aguilar Figueroa", "patient_age": 77, "gender": "female", "document_date": "2024-01-10", "facility_name": "Hospital Nacional del Norte", "conditions": ["insuficiencia renal crónica", "obesidad", "gota"], "department": "oncología", "doc_type": "discharge"}
18-character CURP encodes name, DOB, gender, and state — completely different from Israeli 9-digit Luhn IDs.
CLI Usage
# Default: Ollama + Llama 4 Maverick (local, no API key)
medsynth --locale he_IL --num-patients 500 --seed 42 -v
# Structured data only — no LLM needed
medsynth --locale es_MX --num-patients 50 --seed 42 --skip-freetext -v
# OpenAI GPT-4o
export LLM_API_KEY="sk-..."
medsynth --api-base https://api.openai.com/v1 --model gpt-4o -v
# Moonshot Kimi K2
export LLM_API_KEY="your-moonshot-key"
medsynth --api-base https://api.moonshot.ai/v1 --model kimi-k2-0711-preview -v
# Anthropic Claude Haiku (via LiteLLM or any OpenAI-compatible proxy)
medsynth --api-base http://localhost:4000/v1 --model claude-haiku-4-5 -v
Options
| Flag | Default | Description |
|---|---|---|
--locale |
he_IL |
Locale code |
--num-patients |
500 |
Number of patients to generate |
--seed |
42 |
Random seed for reproducibility |
--output-dir |
output |
Output directory for NDJSON files |
--model |
llama4:maverick |
LLM model name |
--api-base |
http://localhost:11434/v1 |
API base URL (any OpenAI-compatible endpoint) |
--api-key |
— | API key (or set LLM_API_KEY / OPENAI_API_KEY env var) |
--skip-freetext |
off | Skip LLM calls for free text |
--force |
off | Overwrite existing output files |
-v / --verbose |
off | Verbose output |
Python API
from medsynth import generate_documents, load_locale
# Generate documents (default: Ollama + Llama 4 Maverick)
counts = generate_documents(
num_patients=50,
seed=42,
output_dir="output",
locale_code="es_ES",
skip_freetext=True, # set False to generate free text via LLM
verbose=True,
)
# Use a different provider
counts = generate_documents(
num_patients=50,
seed=42,
output_dir="output",
model="gpt-4o",
api_base="https://api.openai.com/v1",
api_key="sk-...",
locale_code="es_ES",
)
# Load a locale directly
locale = load_locale("ar_SA")
print(locale.code, len(locale.facilities))
Supported Locales
| Code | Region | Script | Facilities |
|---|---|---|---|
he_IL |
Israel | Hebrew | Alon, Hadarim, Shaked, Ofek |
ar_SA |
Saudi Arabia | Arabic | Riyadh Medical City, Royal Military, PHC, Al Hayat Labs |
ar_EG |
Egypt | Arabic | Nile Central, Delta University, Tahrir, Al Mokhtabar |
es_ES |
Spain | Latin | Reina Ficticia, San Rafael, Atencion Primaria, Iberia Labs |
es_MX |
Mexico | Latin | Nacional del Norte, Federal del Centro, Centro de Salud, Azteca Labs |
es_AR |
Argentina | Latin | Hospital del Plata, San Martin, CAPS, Austral Labs |
Sample Data
Pre-generated sample data (50 patients, seed 42) ships with the package:
from importlib.resources import files
sample_dir = files("medsynth") / "sample_data" / "he_IL"
Tests
pip install -e ".[dev]"
pytest tests/ -v
Use Cases
- Healthcare NLP testing — validate extraction pipelines against known-correct synthetic records
- AI agent development — train/test agents that query unstructured medical text
- OCR pipeline validation — test document understanding against realistic scan artifacts
- Cross-border healthcare IT — test systems handling multiple languages/formats
- Compliance testing — validate anonymization systems with synthetic ground truth
- Education — teach healthcare informatics without privacy concerns
Who We Are
e2llm — healthcare data intelligence.
We build systems that make unstructured medical data queryable: document understanding (OCR → structured), semantic search (natural language → patient cohorts), and multi-lingual medical NLP. Working with healthcare organizations across MENA and Latin America.
Contact
- Email: info@e2llm.com
- For: Custom locale development, integration with production pipelines, air-gapped deployment consulting, enterprise support
Contributing
PRs welcome. See issues for open tasks.
Disclaimer
MedSynth is an independent open-source project by e2llm. It is not affiliated with, endorsed by, or related to any company or entity operating under the same or a similar name. Any resemblance in naming is purely coincidental.
This tool generates entirely synthetic data for software testing, demos, and research. No real patient information is used or produced. Facility names are fictional — inspired by real institutions for realism, but all generated records are entirely synthetic.
This is not medical software and must not be used for clinical decisions.
Free text generation calls an LLM API. The default (Ollama) runs locally at no cost. When using cloud providers (OpenAI, Moonshot, Anthropic), review their usage policies and be aware of associated costs.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file e2llm_medsynth-0.1.0.tar.gz.
File metadata
- Download URL: e2llm_medsynth-0.1.0.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
29daf830e69fb8e8bd6625bd7dea94045b5eeecd9c86baeb19ee8ab56ab9e0b5
|
|
| MD5 |
787893c91f5eb38b8c53884cb8626bf4
|
|
| BLAKE2b-256 |
23ca955150820782df7441c183dc9a88ede7e7f3f93057f5c78e2309d8af79bf
|
Provenance
The following attestation bundles were made for e2llm_medsynth-0.1.0.tar.gz:
Publisher:
publish.yml on e2llm/medsynth
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
e2llm_medsynth-0.1.0.tar.gz -
Subject digest:
29daf830e69fb8e8bd6625bd7dea94045b5eeecd9c86baeb19ee8ab56ab9e0b5 - Sigstore transparency entry: 975019923
- Sigstore integration time:
-
Permalink:
e2llm/medsynth@f1379e597aeb07297eb97d87df328f73c317803b -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/e2llm
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f1379e597aeb07297eb97d87df328f73c317803b -
Trigger Event:
release
-
Statement type:
File details
Details for the file e2llm_medsynth-0.1.0-py3-none-any.whl.
File metadata
- Download URL: e2llm_medsynth-0.1.0-py3-none-any.whl
- Upload date:
- Size: 1.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
22125c4f37eaa66e5dba3900a8240d6fc4030c8a6b6e617430a5976ddc225244
|
|
| MD5 |
496557018b0eaf8a90c93a62cfa79782
|
|
| BLAKE2b-256 |
31c171fe9505004fc1163adad272a9611eefdc958a64f4ab8f70862d517005bc
|
Provenance
The following attestation bundles were made for e2llm_medsynth-0.1.0-py3-none-any.whl:
Publisher:
publish.yml on e2llm/medsynth
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
e2llm_medsynth-0.1.0-py3-none-any.whl -
Subject digest:
22125c4f37eaa66e5dba3900a8240d6fc4030c8a6b6e617430a5976ddc225244 - Sigstore transparency entry: 975019927
- Sigstore integration time:
-
Permalink:
e2llm/medsynth@f1379e597aeb07297eb97d87df328f73c317803b -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/e2llm
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f1379e597aeb07297eb97d87df328f73c317803b -
Trigger Event:
release
-
Statement type: