neopatient: language-controlled generation of artificial patient records

[Figure: neopatient generation pipeline]


neopatient generates useful, realistic (but artificial) longitudinal patient records. Describe in natural language what the patients should and should not be like; neopatient handles the remaining steps, such as sampling, chunking, batching, structuring, and verification. It can cost-effectively generate tens of thousands of records, each up to 100K+ tokens long, in the MEDS (Medical Event Data Standard) format.

Unlike rule-based generators such as Synthea, neopatient controls patient trajectories through language -- producing a new kind of cohort requires only a description, not simulation code or state machines.

Pipeline

Given a positive description (what the patients should be like) and a negative description (what they should not be like), neopatient runs a multi-stage pipeline:

  1. Sampling -- An LLM generates individualized "patient recipes" (demographics, temporal segments, event densities) that match your descriptions, guided by reference statistics from real record distributions.
  2. Generation -- For each recipe, an LLM generates longitudinal medical events across temporal segments, with appropriate code systems (SNOMED, ICD-10, LOINC, RxNorm, CPT, etc.).
  3. Matching -- Free-text event descriptions are matched to real medical codes via a precomputed vector database (ChromaDB + embeddings over standard terminologies).
  4. Verification -- An LLM checks each completed record against the original positive/negative descriptions, filtering out records that don't meet the specification.
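The matching stage (step 3) can be pictured as a nearest-neighbor lookup over embedded code descriptions. The following is a minimal sketch of that idea only, not neopatient's implementation: it swaps in a toy bag-of-words embedding and an in-memory dictionary for the real embedding model and ChromaDB, and the names `embed`, `cosine`, `INDEX`, and `match_code` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative code table (ICD-10-CM codes and their descriptions).
CODES = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "E10.9": "Type 1 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
}


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Embed every code description once, as a precomputed vector database would.
INDEX = {code: embed(description) for code, description in CODES.items()}


def match_code(free_text: str) -> tuple[str, float]:
    """Return (code, similarity) for the description closest to free_text."""
    query = embed(free_text)
    scored = ((code, cosine(query, vec)) for code, vec in INDEX.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(match_code("type 2 diabetes without complications"))
```

A learned embedding model would also match paraphrases that share no words with the description (for example "high blood pressure" and "hypertension"), which this word-overlap stand-in cannot do.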

The output is a Parquet file in MEDS format. For large cohorts, neopatient uses LLM batch APIs with a state file for resumability.
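As a rough illustration of what a MEDS-style record looks like once loaded, the sketch below models events with the core MEDS columns (`subject_id`, `time`, `code`, `numeric_value`) and computes each patient's follow-up span. The `MedsEvent` class, the `follow_up_years` helper, and the sample rows are hypothetical and hand-written for illustration; they are not neopatient output, and the exact columns neopatient writes are an assumption here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class MedsEvent:
    """One row of a MEDS-style event table (column names assumed from MEDS)."""
    subject_id: int
    time: Optional[datetime]  # None for static, time-independent measurements
    code: str
    numeric_value: Optional[float] = None


def follow_up_years(events: list[MedsEvent]) -> dict[int, float]:
    """Map each subject_id to the years between its first and last timed event."""
    spans: dict[int, tuple[datetime, datetime]] = {}
    for ev in events:
        if ev.time is None:
            continue  # static rows carry no timestamp
        lo, hi = spans.get(ev.subject_id, (ev.time, ev.time))
        spans[ev.subject_id] = (min(lo, ev.time), max(hi, ev.time))
    return {sid: (hi - lo).days / 365.25 for sid, (lo, hi) in spans.items()}


if __name__ == "__main__":
    # Hand-written illustrative rows, not generated data.
    rows = [
        MedsEvent(1, datetime(2015, 1, 10), "E11.9"),
        MedsEvent(1, datetime(2021, 1, 10), "4548-4", 7.1),
    ]
    print(follow_up_years(rows))
```

A check like this is one way to spot-check a requirement such as "at least 5 years of follow-up" after generation. To work with a real output file, any Parquet reader should do, for example `pandas.read_parquet("cohort.parquet")`.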

Usage

Generate a single patient:

# make sure OPENAI_API_KEY is set
uvx neopatient single \
  --positive "Adult patient with type 2 diabetes managed with metformin, with at least 5 years of follow-up" \
  --negative "Patient with type 1 diabetes or gestational diabetes" \
  --out patient.parquet

Generate a cohort of patients:

uvx neopatient cohort \
  --positive "Adult patient with type 2 diabetes managed with metformin, with at least 5 years of follow-up" \
  --negative "Patient with type 1 diabetes or gestational diabetes" \
  --size 1000 \
  --state-file state.json \
  --out cohort.parquet

The --state-file option names a file that tracks pipeline progress, so if a long-running job is interrupted, rerunning the same command resumes where it left off.
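The general pattern behind a resumable state file can be sketched as follows. This is an assumption-laden illustration, not neopatient's actual state format or code: the JSON layout, the stage list, and the functions `load_state`, `save_state`, and `run_pipeline` are all hypothetical.

```python
import json
import os
import tempfile

# Stage names taken from the pipeline description; the ordering is illustrative.
STAGES = ["sampling", "generation", "matching", "verification"]


def load_state(path):
    """Return saved progress, or a fresh state if no state file exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": []}


def save_state(path, state):
    """Write atomically so an interruption cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the old state file


def run_pipeline(path, run_stage):
    """Run every stage not yet recorded as completed; return the stages run now."""
    state = load_state(path)
    ran = []
    for stage in STAGES:
        if stage in state["completed"]:
            continue  # already finished in an earlier run
        run_stage(stage)  # may raise or be interrupted
        state["completed"].append(stage)
        save_state(path, state)  # checkpoint after each finished stage
        ran.append(stage)
    return ran
```

Because progress is checkpointed only after a stage finishes, a crash mid-stage causes that stage to be repeated on the next run, while earlier stages are skipped.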

Use --record-type to choose among ehr-inpatient, ehr-outpatient (default), and claims; the record type determines the available code systems and timestamp precision. Use --generator, --sampler, and --verifier to pick the model for each pipeline stage.

Vector database

The matching stage relies on a precomputed vector database of medical codes. A default database (embedded with Qwen3-Embedding-8B) is downloaded automatically from Hugging Face on first use. To build your own from a Parquet file of codes and descriptions:

uvx --from neopatient neopatient-db --parquet_path codes.parquet --db_dir ./my_db

Then pass --db_dir ./my_db to neopatient.
