Synthetic medical record generator with realistic schema variance across locales

Maintainers

Alechko375

Project description

MedSynth

Multi-lingual synthetic healthcare data generator. Produces realistic medical records with intentional OCR artifacts and schema variance — simulating real-world messy healthcare data.

The Problem

Healthcare AI development is bottlenecked by data access.

Real patient records are legally restricted (HIPAA, GDPR, Uruguay's Ley 18.331), expensive to anonymize, and nearly impossible to share across borders. Researchers spend months navigating data access before writing a single line of AI code.

Meanwhile, most synthetic data generators produce clean, English-only records that look nothing like actual hospital data — which is scanned paper, multi-lingual, inconsistently formatted, and full of OCR errors.

MedSynth generates data that looks like the real thing — including the mess.

What Makes This Different

Feature	MedSynth	Typical Generators
Languages	6 locales across 3 languages (Hebrew, Arabic, Spanish)	English only
OCR artifacts	Realistic scan errors per script	Clean text
Schema variance	Different formats per facility	Single schema
ID systems	Country-specific (Teudat Zehut, CURP, DNI)	Generic
Privacy	Zero real patient data	Often derived from real records

OCR Realism

Many real medical records start life as scanned paper. MedSynth simulates the artifacts that scanning and OCR introduce:

Arabic: Dot-group confusions (ب↔ت↔ث), tashkeel stripping
Hebrew: Shape-based confusions (ר↔ד, ח↔כ)
Latin: rn→m merges, diacritic loss (ñ→n), 0↔O swaps
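
The Latin-script rules above can be sketched in a few lines. This is an illustrative approximation, not MedSynth's actual implementation: the function name `add_latin_ocr_noise`, the `rate` parameter, and the swap table are all assumptions made for the example.

```python
import random
import unicodedata

# Hypothetical swap table covering only the 0<->O confusion named above.
LATIN_SWAPS = {"0": "O", "O": "0"}


def strip_diacritics(ch: str) -> str:
    """Drop combining marks from a character, e.g. 'ñ' becomes 'n'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def add_latin_ocr_noise(text: str, rate: float = 0.3, seed: int = 42) -> str:
    """Apply rn->m merges, diacritic loss, and 0<->O swaps with probability `rate`."""
    rng = random.Random(seed)  # seeded so the same input gives the same noise
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        hit = rng.random() < rate
        if hit and text[i:i + 2] == "rn":
            out.append("m")  # two narrow glyphs read as one wide glyph
            i += 2
            continue
        if hit and ch in LATIN_SWAPS:
            out.append(LATIN_SWAPS[ch])
        elif hit:
            out.append(strip_diacritics(ch))  # no-op for plain ASCII letters
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(add_latin_ocr_noise("niño moderno 10", rate=1.0))  # nino modemo 1O
```

With `rate=1.0` every eligible position is corrupted, which is useful for checking the rules; a realistic run would use a much lower rate.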

Schema Variance

Different hospitals format records differently. MedSynth produces variant schemas across facilities so AI systems learn to handle real-world inconsistency — not just clean demos.

Installation

pip install e2llm-medsynth

Quick Start

pip install e2llm-medsynth

# Structured data only (no LLM needed)
medsynth --locale he_IL --num-patients 10 --skip-freetext -v

# With free text via Ollama (default — no API key needed)
ollama pull llama4:maverick
medsynth --locale he_IL --num-patients 10 -v

Free-text generation works with any OpenAI-compatible API. The default is a local Ollama server running Llama 4 Maverick. No API key is needed for structured-only generation (`--skip-freetext`) or for local Ollama.

Output Format

MedSynth outputs NDJSON files — one per facility × document type:

output/
├── medical_alon_discharge.ndjson
├── medical_alon_lab.ndjson
├── medical_alon_referral.ndjson
├── medical_hadarim_discharge.ndjson
├── medical_hadarim_visit.ndjson
├── ...
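
A minimal sketch for loading such a directory back into Python follows. It assumes the `medical_<facility>_<doctype>.ndjson` naming pattern inferred from the listing above; the helper names `parse_output_name`, `read_ndjson`, and `load_output` are hypothetical and not part of the MedSynth API.

```python
import json
from pathlib import Path


def parse_output_name(path: Path) -> tuple[str, str]:
    """Split 'medical_<facility>_<doctype>.ndjson' into (facility, doc_type)."""
    stem = path.stem.removeprefix("medical_")
    facility, doc_type = stem.rsplit("_", 1)
    return facility, doc_type


def read_ndjson(path: Path):
    """Yield one dict per non-empty line of an NDJSON file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_output(output_dir: str) -> dict[tuple[str, str], list[dict]]:
    """Group all records by (facility, doc_type), keyed from the file names."""
    records = {}
    for path in sorted(Path(output_dir).glob("medical_*.ndjson")):
        records[parse_output_name(path)] = list(read_ndjson(path))
    return records
```

Calling `load_output("output")` would then return one list of records per facility and document type, for example under the key `("alon", "discharge")`.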

Example: The Mess

Alon hospital — digital, English field names:

{"patient_id": "165667015", "patient_name": "משה אזולאי", "patient_age": 77, "gender": "male", "document_date": "2023-07-01", "facility_name": "בית חולים האלון", "conditions": ["השמנת יתר", "דיכאון", "COPD"], "smoking_status": true, "department": "אורולוגיה", "primary_diagnosis": "השמנת יתר", "doc_type": "discharge"}

Hadarim hospital — OCR source, Hebrew field names, different ID type:

{"מספר_זהות": 161559406, "שם_מטופל": "יעל גולן", "גיל": 31, "מין": "female", "תאריך": "29/04/2024", "מוסד_רפואי": "מרכז רפואי הדרים", "מחלות_רקע": ["סוכרת סוג 2", "אי ספיקת כליות כרונית"], "מחלקה": "פנימית א", "אבחנה_ראשית": "סוכרת סוג 2", "doc_type": "discharge"}

Different field names (patient_id → מספר_זהות), different date format (2023-07-01 → 29/04/2024), ID as integer instead of string.
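
A downstream consumer would typically reconcile these differences before analysis. The sketch below maps the Hadarim record onto the Alon field names; the mapping table and the `normalize_hadarim` function are assumptions built from the two examples above, not something MedSynth ships. Padding the ID to 9 digits is also an assumption, made because an integer field cannot preserve leading zeros.

```python
from datetime import datetime

# Hypothetical mapping derived from the two example records above.
HADARIM_TO_CANONICAL = {
    "מספר_זהות": "patient_id",
    "שם_מטופל": "patient_name",
    "גיל": "patient_age",
    "מין": "gender",
    "תאריך": "document_date",
    "מוסד_רפואי": "facility_name",
    "מחלות_רקע": "conditions",
    "מחלקה": "department",
    "אבחנה_ראשית": "primary_diagnosis",
}


def normalize_hadarim(record: dict) -> dict:
    """Rename Hebrew keys, stringify the ID, and convert DD/MM/YYYY to ISO 8601."""
    out = {HADARIM_TO_CANONICAL.get(key, key): value for key, value in record.items()}
    # Integer IDs drop leading zeros, so pad back to 9 digits (assumed width).
    out["patient_id"] = str(out["patient_id"]).zfill(9)
    parsed = datetime.strptime(out["document_date"], "%d/%m/%Y")
    out["document_date"] = parsed.date().isoformat()
    return out


example = {"מספר_זהות": 161559406, "תאריך": "29/04/2024", "doc_type": "discharge"}
print(normalize_hadarim(example))
```

Keys that are not in the mapping, such as `doc_type`, pass through unchanged.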

Saudi Arabia — Arabic fields, age as range string:

{"رقم_الهوية": 1496965326, "الاسم": "عبدالرحمن بن راشد الأحمدي", "العمر": "50-60", "الجنس": "male", "التاريخ": "2023-06", "المركز": "مركز الرعاية الصحية الأولية", "الأمراض": ["فرط شحميات الدم"], "التشخيص": "فرط شحميات الدم", "doc_type": "discharge"}

Age stored as range string ("50-60" not 57), date truncated to month ("2023-06").
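
Code that consumes both exact and coarsened values needs to keep track of the lost precision rather than guess. One possible approach, with hypothetical helper names `parse_age` and `date_precision`, is to represent every age as a pair of bounds and to label each date with its granularity:

```python
import re


def parse_age(value) -> tuple[int, int]:
    """Return (low, high) bounds: 77 -> (77, 77), "50-60" -> (50, 60)."""
    if isinstance(value, int):
        return value, value
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized age value: {value!r}")
    return int(match.group(1)), int(match.group(2))


def date_precision(value: str) -> str:
    """Classify an ISO-style date string as 'day' or 'month' precision."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "day"
    if re.fullmatch(r"\d{4}-\d{2}", value):
        return "month"
    raise ValueError(f"unrecognized date value: {value!r}")


print(parse_age("50-60"), date_precision("2023-06"))  # (50, 60) month
```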

Mexico (CURP national ID, English field names, Spanish values):

{"patient_id": "AULJ460528MDFGPN03", "patient_name": "Juana Aguilar Figueroa", "patient_age": 77, "gender": "female", "document_date": "2024-01-10", "facility_name": "Hospital Nacional del Norte", "conditions": ["insuficiencia renal crónica", "obesidad", "gota"], "department": "oncología", "doc_type": "discharge"}

The 18-character CURP encodes name initials, date of birth, sex, and state of birth, which makes it structurally unrelated to the 9-digit Israeli ID with its Luhn-style check digit.
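
The contrast between the two ID systems can be made concrete with a short sketch. The function names `split_curp` and `israeli_id_is_valid` are hypothetical. The CURP helper only checks the layout and splits the fields; it does not verify the final check digit. The Israeli helper applies the alternating 1,2 weighting with digit-summed products that is commonly used for Teudat Zehut numbers.

```python
import re

# Structural layout of a CURP: 4 name letters, YYMMDD, sex, 2-letter state,
# 3 internal consonants, 1 differentiator character, 1 check digit.
CURP_RE = re.compile(
    r"(?P<initials>[A-Z]{4})(?P<dob>\d{6})(?P<sex>[HM])(?P<state>[A-Z]{2})"
    r"(?P<consonants>[A-Z]{3})(?P<differentiator>[A-Z0-9])(?P<check>\d)"
)


def split_curp(curp: str) -> dict:
    """Split a CURP into named fields; raises ValueError on a bad layout."""
    match = CURP_RE.fullmatch(curp)
    if match is None:
        raise ValueError(f"not a structurally valid CURP: {curp!r}")
    return match.groupdict()


def israeli_id_is_valid(id_number) -> bool:
    """Check a 9-digit Israeli ID using alternating weights 1,2 (Luhn-style)."""
    digits = str(id_number).zfill(9)
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for index, ch in enumerate(digits):
        product = int(ch) * (1 if index % 2 == 0 else 2)
        total += product if product < 10 else product - 9  # sum of the two digits
    return total % 10 == 0


print(split_curp("AULJ460528MDFGPN03")["sex"])   # M
print(israeli_id_is_valid("165667015"))          # True
```

Both Israeli IDs in the examples above (`165667015` and `161559406`) pass this check, and the sex field `M` in the CURP matches the `female` gender in the Mexican record.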

CLI Usage

# Default: Ollama + Llama 4 Maverick (local, no API key)
medsynth --locale he_IL --num-patients 500 --seed 42 -v

# Structured data only — no LLM needed
medsynth --locale es_MX --num-patients 50 --seed 42 --skip-freetext -v

# OpenAI GPT-4o
export LLM_API_KEY="sk-..."
medsynth --api-base https://api.openai.com/v1 --model gpt-4o -v

# Moonshot Kimi K2
export LLM_API_KEY="your-moonshot-key"
medsynth --api-base https://api.moonshot.ai/v1 --model kimi-k2-0711-preview -v

# Anthropic Claude Haiku (via LiteLLM or any OpenAI-compatible proxy)
medsynth --api-base http://localhost:4000/v1 --model claude-haiku-4-5 -v

Options

Flag	Default	Description
`--locale`	`he_IL`	Locale code
`--num-patients`	`500`	Number of patients to generate
`--seed`	`42`	Random seed for reproducibility
`--output-dir`	`output`	Output directory for NDJSON files
`--model`	`llama4:maverick`	LLM model name
`--api-base`	`http://localhost:11434/v1`	API base URL (any OpenAI-compatible endpoint)
`--api-key`	—	API key (or set `LLM_API_KEY` / `OPENAI_API_KEY` env var)
`--skip-freetext`	off	Skip LLM calls for free text
`--force`	off	Overwrite existing output files
`-v` / `--verbose`	off	Verbose output

Python API

from medsynth import generate_documents, load_locale

# Generate documents (default: Ollama + Llama 4 Maverick)
counts = generate_documents(
    num_patients=50,
    seed=42,
    output_dir="output",
    locale_code="es_ES",
    skip_freetext=True,  # set False to generate free text via LLM
    verbose=True,
)

# Use a different provider
counts = generate_documents(
    num_patients=50,
    seed=42,
    output_dir="output",
    model="gpt-4o",
    api_base="https://api.openai.com/v1",
    api_key="sk-...",
    locale_code="es_ES",
)

# Load a locale directly
locale = load_locale("ar_SA")
print(locale.code, len(locale.facilities))

Supported Locales

Code	Region	Script	Facilities
`he_IL`	Israel	Hebrew	Alon, Hadarim, Shaked, Ofek
`ar_SA`	Saudi Arabia	Arabic	Riyadh Medical City, Royal Military, PHC, Al Hayat Labs
`ar_EG`	Egypt	Arabic	Nile Central, Delta University, Tahrir, Al Mokhtabar
`es_ES`	Spain	Latin	Reina Ficticia, San Rafael, Atencion Primaria, Iberia Labs
`es_MX`	Mexico	Latin	Nacional del Norte, Federal del Centro, Centro de Salud, Azteca Labs
`es_AR`	Argentina	Latin	Hospital del Plata, San Martin, CAPS, Austral Labs

Sample Data

Pre-generated sample data (50 patients, seed 42) ships with the package:

from importlib.resources import files

sample_dir = files("medsynth") / "sample_data" / "he_IL"

Tests

pip install -e ".[dev]"
pytest tests/ -v

Use Cases

Healthcare NLP testing — validate extraction pipelines against known-correct synthetic records
AI agent development — train/test agents that query unstructured medical text
OCR pipeline validation — test document understanding against realistic scan artifacts
Cross-border healthcare IT — test systems handling multiple languages/formats
Compliance testing — validate anonymization systems with synthetic ground truth
Education — teach healthcare informatics without privacy concerns

Who We Are

e2llm — healthcare data intelligence.

We build systems that make unstructured medical data queryable: document understanding (OCR → structured), semantic search (natural language → patient cohorts), and multi-lingual medical NLP. We work with healthcare organizations across MENA and Latin America.

Contact

Email: info@e2llm.com
For: Custom locale development, integration with production pipelines, air-gapped deployment consulting, enterprise support

Contributing

PRs welcome. See issues for open tasks.

Disclaimer

MedSynth is an independent open-source project by e2llm. It is not affiliated with, endorsed by, or related to any company or entity operating under the same or a similar name. Any resemblance in naming is purely coincidental.

This tool generates entirely synthetic data for software testing, demos, and research. No real patient information is used or produced. Facility names are fictional — inspired by real institutions for realism, but all generated records are entirely synthetic.

This is not medical software and must not be used for clinical decisions.

Free text generation calls an LLM API. The default (Ollama) runs locally at no cost. When using cloud providers (OpenAI, Moonshot, Anthropic), review their usage policies and be aware of associated costs.

License

MIT

Project details

Release history

This version

0.1.0

Feb 20, 2026

Download files

Source Distribution

e2llm_medsynth-0.1.0.tar.gz (1.1 MB)

Uploaded Feb 20, 2026 Source

Built Distribution

e2llm_medsynth-0.1.0-py3-none-any.whl (1.2 MB)

Uploaded Feb 20, 2026 Python 3

File details

Details for the file e2llm_medsynth-0.1.0.tar.gz.

File metadata

Download URL: e2llm_medsynth-0.1.0.tar.gz
Upload date: Feb 20, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for e2llm_medsynth-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`29daf830e69fb8e8bd6625bd7dea94045b5eeecd9c86baeb19ee8ab56ab9e0b5`
MD5	`787893c91f5eb38b8c53884cb8626bf4`
BLAKE2b-256	`23ca955150820782df7441c183dc9a88ede7e7f3f93057f5c78e2309d8af79bf`
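
A downloaded archive can be checked against the SHA256 digest in the table with the standard library alone. This is a generic sketch, not a PyPI or MedSynth tool; the function names `sha256_of` and `matches_published_hash` are made up for the example, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published above for e2llm_medsynth-0.1.0.tar.gz.
EXPECTED_SHA256 = "29daf830e69fb8e8bd6625bd7dea94045b5eeecd9c86baeb19ee8ab56ab9e0b5"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True only if the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected.lower()
```

A `False` result from `matches_published_hash("e2llm_medsynth-0.1.0.tar.gz")` would mean the file differs from the one that was published and should not be installed.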

Provenance

The following attestation bundles were made for e2llm_medsynth-0.1.0.tar.gz:

Publisher: publish.yml on e2llm/medsynth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: e2llm_medsynth-0.1.0.tar.gz
- Subject digest: 29daf830e69fb8e8bd6625bd7dea94045b5eeecd9c86baeb19ee8ab56ab9e0b5
- Sigstore transparency entry: 975019923
- Sigstore integration time: Feb 20, 2026
Source repository:
- Permalink: e2llm/medsynth@f1379e597aeb07297eb97d87df328f73c317803b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/e2llm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f1379e597aeb07297eb97d87df328f73c317803b
- Trigger Event: release

File details

Details for the file e2llm_medsynth-0.1.0-py3-none-any.whl.

File metadata

Download URL: e2llm_medsynth-0.1.0-py3-none-any.whl
Upload date: Feb 20, 2026
Size: 1.2 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for e2llm_medsynth-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`22125c4f37eaa66e5dba3900a8240d6fc4030c8a6b6e617430a5976ddc225244`
MD5	`496557018b0eaf8a90c93a62cfa79782`
BLAKE2b-256	`31c171fe9505004fc1163adad272a9611eefdc958a64f4ab8f70862d517005bc`

Provenance

The following attestation bundles were made for e2llm_medsynth-0.1.0-py3-none-any.whl:

Publisher: publish.yml on e2llm/medsynth

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: e2llm_medsynth-0.1.0-py3-none-any.whl
- Subject digest: 22125c4f37eaa66e5dba3900a8240d6fc4030c8a6b6e617430a5976ddc225244
- Sigstore transparency entry: 975019927
- Sigstore integration time: Feb 20, 2026
Source repository:
- Permalink: e2llm/medsynth@f1379e597aeb07297eb97d87df328f73c317803b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/e2llm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f1379e597aeb07297eb97d87df328f73c317803b
- Trigger Event: release
