Data Metrics

Data quality and characterization metrics for Cumulus.

See the qualifier repo for some metric definitions.

Installing

pip install cumulus-library-data-metrics

Running the Metrics

These metrics are designed as a Cumulus Library study and are run using the cumulus-library command.

Local Ndjson

First, you'll want to organize your ndjson into the following file tree format:

root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson

(This is the same format that Cumulus ETL writes out when using --output-format=ndjson.)
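
For reference, each .ndjson file holds one FHIR resource as JSON per line. A minimal sketch of what patient/Patient.ndjson might look like (field values are illustrative):

{"resourceType": "Patient", "id": "pat-1", "gender": "female", "birthDate": "1970-01-01"}
{"resourceType": "Patient", "id": "pat-2", "gender": "male", "birthDate": "1985-06-15"}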

Here's a sample command to run against that pile of ndjson data:

cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics

And then you can load output-tables.db in a DuckDB session and see the results. Or read below to export the counts tables.
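
For example, a quick look with the DuckDB command-line shell (the counts table name below is illustrative; actual tables should be prefixed with data_metrics__):

$ duckdb output-tables.db
D SHOW TABLES;
D SELECT * FROM data_metrics__count_c_pt_count LIMIT 10;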

Athena

Here's a sample command to run against your Cumulus data in Athena:

cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics

And then you can see the resulting tables in Athena. Or read below to export the counts tables.
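
For example, you could peek at one of the resulting tables from the Athena console or the AWS CLI (the counts table name here is illustrative):

aws athena start-query-execution \
  --work-group your-athena-workgroup \
  --query-execution-context Database=your-glue-database \
  --query-string 'SELECT * FROM data_metrics__count_c_pt_count LIMIT 10'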

Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics), you can export them with Cumulus Library by replacing build in the above commands with export ./output-folder, like so:

cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics
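
The output folder should then contain the exportable counts tables as flat files (Cumulus Library typically writes both CSV and Parquet versions). A quick way to check, with illustrative filenames:

ls ./output-folder
# data_metrics__count_c_pt_count.csv
# data_metrics__count_c_pt_count.parquet
# ...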

Aggregate counts

This study generates CUBE output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that CUBE() generates), run the build step with DATA_METRICS_OUTPUT_MODE=aggregate in your environment.

That is, run it like:

env \
  DATA_METRICS_OUTPUT_MODE=aggregate \
  cumulus-library build ...
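
For intuition, consider a count stratified by two columns (names here are hypothetical). CUBE output includes partial rows where one or both columns are NULL, acting as subtotals; aggregate mode keeps only the fully-specified combinations:

status  gender  cnt
final   female  10
final   male    15
final   NULL    25    (CUBE-only subtotal)
NULL    NULL    25    (CUBE-only grand total; other partial rows omitted)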

SQL Writing Guidelines

  • Don't depend on core__ tables.

    • Allows folks to build this study even if they can't or haven't built core
    • Allows core to smooth over data oddities we might be interested in
  • Consider looking at macros/logic from Google's analytics work if helpful

Differences from Original Qualifier Metrics

Across the board, we have some minor differences from the upstream metric definitions:

  • We usually stratify a metric by status as well as other fields
  • We drop MedicationAdministration from our metrics - it's not really supported in Cumulus
  • We add support for DiagnosticReport where sensible
  • We consider Observation.effectivePeriod.start and Observation.effectiveInstant in addition to Observation.effectiveDateTime

Other specific deltas will be noted in the code for the given metric.

Implemented Metrics

  • c_pt_count
  • c_pt_deceased_count
  • c_resource_count
  • c_resources_per_pt
  • c_term_coverage
  • c_us_core_v4_count
  • q_date_recent
  • q_ref_target_pop
  • q_ref_target_valid
  • q_term_use
  • q_valid_us_core_v4
