Data quality and characterization metrics for Cumulus

Data Metrics

A Cumulus-based implementation of the qualifier metrics.

Implemented Metrics

The following qualifier metrics are implemented (per the June 2024 qualifier definitions).

* These are US Core profile-based metrics, and the following profiles are not yet implemented:

  • Implantable Device (due to the difficulty of identifying implantable device records)
  • The various Vital Signs sub-profiles like Blood Pressure (just haven't gotten around to them yet)

Installing

pip install cumulus-library-data-metrics
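If you want to confirm the package landed in the active Python environment, a quick standard-library check works (nothing here is specific to Cumulus):

# Sanity check: confirm the study package is installed and see its version.
from importlib.metadata import version

print(version("cumulus-library-data-metrics"))  # e.g. "5.1.0"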

Running the Metrics

These metrics are designed as a Cumulus Library study and are run using the cumulus-library command.

Local Ndjson

First, you'll want to organize your ndjson into the following file tree format:

root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson

(This is the same format that Cumulus ETL writes out when using --output-format=ndjson.)
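If your ndjson currently sits in one flat folder, a small script can sort it into that layout. The sketch below is only an illustration and assumes each file holds a single resource type; the folder names are placeholders, not anything the study requires:

# Sketch: sort a flat folder of ndjson files into per-resource-type subfolders,
# keyed on the resourceType found in each file's first line.
import json
import pathlib
import shutil

source = pathlib.Path("flat-ndjson")   # hypothetical folder of mixed ndjson
root = pathlib.Path("root")            # the layout shown above

for path in source.glob("*.ndjson"):
    with open(path, encoding="utf8") as f:
        first_line = f.readline()
    if not first_line.strip():
        continue  # skip empty files
    resource_type = json.loads(first_line)["resourceType"].lower()
    target_dir = root / resource_type
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, target_dir / path.name)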

Here's a sample command to run against that pile of ndjson data:

cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics

And then you can load output-tables.db in a DuckDB session and see the results. Or read below to export the counts tables.
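For example, a quick peek at what got built using DuckDB's Python API (table names are discovered here rather than assumed):

# Open the database the build step produced and list the tables it created.
import duckdb

con = duckdb.connect("output-tables.db")
for (name,) in con.execute("SHOW TABLES").fetchall():
    print(name)

# Then query whichever table you're interested in, e.g.:
# con.execute("SELECT * FROM some_table LIMIT 10").fetchdf()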

Athena

Here's a sample command to run against your Cumulus data in Athena:

cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics

And then you can see the resulting tables in Athena. Or read below to export the counts tables.
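If you'd rather check from a script than the Athena console, something like this generic boto3 call (with your own database name substituted) lists the tables the build created:

# List the tables now present in the Glue database that Athena queries.
# This is plain boto3, not part of the study itself.
import boto3

glue = boto3.client("glue")
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="your-glue-database"):
    for table in page["TableList"]:
        print(table["Name"])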

Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics), you can easily export those using Cumulus Library by replacing build in the above commands with export ./output-folder. Like so:

cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics
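Once the export finishes, the counts land as flat files in the folder you named. Here's a sketch of reading them back, assuming CSV output (check the folder for the exact filenames and formats your Library version writes):

# Load every exported counts CSV in the output folder into a dict of DataFrames.
import pathlib
import pandas as pd

tables = {}
for csv_path in pathlib.Path("output-folder").glob("**/*.csv"):
    tables[csv_path.stem] = pd.read_csv(csv_path)

for name, df in tables.items():
    print(name, df.shape)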

Aggregate counts

This study generates CUBE output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that CUBE() generates), run the build step with --option output-mode:aggregate.

That is, run it like:

cumulus-library build --option output-mode:aggregate ...
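To see the difference the option makes, here is a toy DuckDB illustration of CUBE output versus plain aggregate counts (the table and column names are invented for the example):

# Toy illustration: GROUP BY CUBE emits partial combinations (NULLs standing in
# for "all values"), while a plain GROUP BY emits only complete combinations.
import duckdb

con = duckdb.connect()
con.execute(
    "CREATE TABLE t AS SELECT * FROM (VALUES ('F', 2020), ('F', 2021), ('M', 2020)) AS v(gender, year)"
)

print(con.execute("SELECT gender, year, COUNT(*) FROM t GROUP BY CUBE (gender, year) ORDER BY ALL").fetchall())
# includes partial rows like (None, None, 3) and ('F', None, 2)

print(con.execute("SELECT gender, year, COUNT(*) FROM t GROUP BY gender, year ORDER BY ALL").fetchall())
# only the complete (gender, year) combinations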

Download files

Download the file for your platform.

Source Distribution

cumulus_library_data_metrics-5.1.0.tar.gz (64.8 kB)

Built Distribution

cumulus_library_data_metrics-5.1.0-py3-none-any.whl

File details

Details for the file cumulus_library_data_metrics-5.1.0.tar.gz.

File hashes

Algorithm    Hash digest
SHA256       3628a63ec21d7741307595d1caa1c73dfabfb677bc9801861e13d6bb60300457
MD5          5aeaa3a435c5418cc2bd95a48cbf8509
BLAKE2b-256  271e46c5a545f99b11ac0f1d1052639311dd0ff065303881f58e21de57da9d02
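If you want to verify a downloaded file against these hashes, the standard-library check looks like this (the local filename is assumed to be whatever pip or your browser saved):

# Verify the downloaded sdist matches the published SHA256 digest.
import hashlib

expected = "3628a63ec21d7741307595d1caa1c73dfabfb677bc9801861e13d6bb60300457"
with open("cumulus_library_data_metrics-5.1.0.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

print("OK" if actual == expected else "MISMATCH")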

File details

Details for the file cumulus_library_data_metrics-5.1.0-py3-none-any.whl.

File hashes

Algorithm    Hash digest
SHA256       8950c633ab9e219739ae3bb8f15b49d2f06a094afc5c6fd2a2c416b882eb851f
MD5          a27fa1aead3a7bf6c7b50b2810f799af
BLAKE2b-256  c27033223ae79cf4c2b307f2c2242097ca3f94129fa385bd2d1a5a55653c895e
