Data quality and characterization metrics for Cumulus

Data Metrics

See the qualifier repo for some metric definitions.

Running the Metrics

These metrics are designed as a Cumulus Library study and are run using the cumulus-library command.

Local Ndjson

First, you'll want to organize your ndjson into the following file tree format:

root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson

(This is the same format that Cumulus ETL writes out when using --output-format=ndjson.)
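If your ndjson starts out flat in one directory, a short script can sort it into that layout. This is a convenience sketch, not part of the study; it assumes each filename begins with the FHIR resource type (e.g. Patient.001.ndjson):

```python
import pathlib
import shutil

def organize_ndjson(src_dir: str, root_dir: str) -> None:
    """Copy flat ndjson files into per-resource subfolders (lowercased)."""
    root = pathlib.Path(root_dir)
    for path in pathlib.Path(src_dir).glob("*.ndjson"):
        # Assumes names like "Patient.001.ndjson" or "Condition.ndjson"
        resource = path.name.split(".")[0].lower()
        target = root / resource
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, target / path.name)
```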

Here's a sample command to run against that pile of ndjson data:

PYTHONPATH=. cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics \
  --study-dir .

And then you can load output-tables.db in a DuckDB session and see the results. Or read below to export the counts tables.

Athena

Here's a sample command to run against your Cumulus data in Athena:

PYTHONPATH=. cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics \
  --study-dir .

And then you can see the resulting tables in Athena. Or read below to export the counts tables.

Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics), you can export them using Cumulus Library by replacing build in the above commands with export ./output-folder, like so:

cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics \
  --study-dir .

Aggregate counts

This study generates CUBE output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that CUBE() generates), run the build step with DATA_METRICS_OUTPUT_MODE=aggregate in your environment.

That is, run it like:

env \
  DATA_METRICS_OUTPUT_MODE=aggregate \
  PYTHONPATH=. \
  cumulus-library build ...
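To illustrate the difference: CUBE output includes a row for every subset of the grouping columns (with rolled-up columns nulled out), while aggregate mode keeps only complete value combinations. A small Python sketch of the two shapes, using made-up rows:

```python
from collections import Counter

rows = [("Condition", "final"), ("Condition", "final"), ("Patient", "active")]

# Aggregate mode: one count per complete value combination.
aggregate = Counter(rows)

# CUBE-style: additionally counts every partial combination, using None
# where a grouping column is rolled up (like the NULLs SQL's CUBE() emits).
def cube_counts(rows):
    counts = Counter()
    for row in rows:
        for mask in range(2 ** len(row)):
            key = tuple(v if mask & (1 << i) else None
                        for i, v in enumerate(row))
            counts[key] += 1
    return counts

cube = cube_counts(rows)
```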

SQL Writing Guidelines

  • Don't depend on core__ tables.

    • Allows folks to build this study even if they can't or haven't built core
    • Allows core to smooth over data oddities we might be interested in
  • Consider looking at macros/logic from Google's analytics if helpful.

Differences from Original Qualifier Metrics

Across the board, we have some minor differences from the upstream metric definitions:

  • We usually stratify a metric by status as well as other fields
  • We drop MedicationAdministration from our metrics - it's not really supported in Cumulus
  • We add support for DiagnosticReport where sensible
  • We consider Observation.effectivePeriod.start and Observation.effectiveInstant in addition to Observation.effectiveDateTime
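That last point amounts to coalescing across the FHIR choice-type variants of Observation.effective[x]. A sketch in Python (the helper name is ours; the field names come from FHIR R4):

```python
def observation_effective(obs: dict):
    """Return the best-available effective time from an Observation resource."""
    if "effectiveDateTime" in obs:
        return obs["effectiveDateTime"]
    start = obs.get("effectivePeriod", {}).get("start")
    if start is not None:
        return start
    return obs.get("effectiveInstant")
```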

Other specific deltas will be noted in the code for the given metric.

Metric Prioritization

Table stakes quality:

  • q_term_use complies with US Core v1
  • q_ref_target_pop complies with US Core v1 (can be run on partial extracts)
  • q_ref_target_valid complies with US Core v1 (only on full extracts or data lake)
  • q_valid_us_core_v4
    • numerator: resources that don't have all mandatory bits of any profile
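The numerator check boils down to "is any mandatory element absent?". A toy version in Python (the element lists here are illustrative, not the study's actual profile definitions, which come from US Core v4):

```python
# Hypothetical subset of mandatory elements per resource type; the real
# study derives these from the US Core v4 profiles.
MANDATORY = {
    "Patient": ["identifier", "name", "gender"],
    "Condition": ["code", "subject"],
}

def missing_mandatory(resource: dict) -> list:
    """Names of mandatory elements the resource lacks (empty = valid)."""
    required = MANDATORY.get(resource.get("resourceType"), [])
    return [field for field in required if field not in resource]
```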

Table stakes characterization:

  • c_resource_count (by category, year, month)
  • c_pt_count (by birth year, gender, ethnicity, race)
  • c_pt_deceased_count (by gender, by age at death)
  • c_term_coverage (by resource type, by category)
  • c_resources_per_pt (include combinations?)
  • c_us_core_v4_count
    • Tells how many rows match mandatory US Core support
    • And for each separate must-support requirement, tells which rows have the value

High value quality:

  • q_date_sequence
  • q_date_in_lifetime
  • q_date_recent

High value characterization:

  • c_element_use for USCDI v1 “must support” elements
  • c_date_precision (by resource type, by category, by date element, by precision level)
  • c_identifier_coverage (by resource type)
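For c_date_precision, the precision of a FHIR date or dateTime can be read off the shape of the string. A rough sketch (the bucket names are an assumption; the study's actual levels may differ):

```python
def fhir_date_precision(value: str) -> str:
    """Classify a FHIR date/dateTime string by how much of it is filled in."""
    if "T" in value:
        return "time"  # full timestamp, e.g. 2020-03-15T10:00:00Z
    hyphens = value.count("-")
    return {0: "year", 1: "month", 2: "day"}[hyphens]
```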

Useful quality:

  • q_obs_value_range
  • q_obs_comp_value_range
