Data quality and characterization metrics for Cumulus

Data Metrics

A Cumulus-based implementation of the qualifier metrics.

Implemented Metrics

The following qualifier metrics are implemented (per the June 2024 qualifier definitions).

* These are US Core profile-based metrics, and the following profiles are not yet implemented:

  • Implantable Device (due to the difficulty of identifying implantable records)
  • The various Vital Signs sub-profiles like Blood Pressure (just haven't gotten around to them yet)

Installing

pip install cumulus-library-data-metrics

Running the Metrics

These metrics are designed as a Cumulus Library study and are run using the cumulus-library command.

Local Ndjson

First, you'll want to organize your ndjson into the following file tree format:

root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson

(This is the same format that Cumulus ETL writes out when using --output-format=ndjson.)
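If your NDJSON files currently sit in one flat folder, the layout above can be produced with a short script. The following is only a sketch under stated assumptions: each file holds a single resource type, the folder name is the lowercased resourceType (as in the tree above), and the names organize_ndjson and resource_type_of are hypothetical helpers, not part of Cumulus.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


def resource_type_of(ndjson_path: Path) -> str | None:
    """Return the resourceType of the first non-blank line, or None if absent."""
    with ndjson_path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                return json.loads(line).get("resourceType")
    return None


def organize_ndjson(source_dir: Path, root: Path) -> dict[str, list[str]]:
    """Copy each *.ndjson file in source_dir into root/<resourcetype>/.

    Assumption: every file contains only one resource type, so the first
    non-blank line is enough to decide where the file belongs.
    """
    placed: dict[str, list[str]] = {}
    for path in sorted(source_dir.glob("*.ndjson")):
        resource_type = resource_type_of(path)
        if resource_type is None:
            continue  # skip empty files and files without a resourceType
        folder = root / resource_type.lower()
        folder.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, folder / path.name)
        placed.setdefault(resource_type.lower(), []).append(path.name)
    return placed
```

For example, organize_ndjson(Path("downloads"), Path("root")) would copy a file of Patient resources into root/patient/ and leave the originals in place.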

Here's a sample command to run against that pile of ndjson data:

cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics

You can then load output-tables.db in a DuckDB session to see the results, or read below to export the counts tables.
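One way to look at the results is from Python with the duckdb package (pip install duckdb). This is a minimal sketch, not an official workflow: it assumes the study's tables are named with a data_metrics__ prefix, following the usual Cumulus Library study__table convention, and the function names are hypothetical.

```python
def study_tables(table_names, prefix="data_metrics__"):
    """Keep only the table names that start with the (assumed) study prefix."""
    return sorted(name for name in table_names if name.startswith(prefix))


def list_study_tables(db_path="output-tables.db"):
    """Open the DuckDB file read-only and return the study's table names."""
    import duckdb  # third-party dependency, imported lazily

    con = duckdb.connect(db_path, read_only=True)
    try:
        # SHOW TABLES returns one row per table, with the name in column 0
        rows = con.execute("SHOW TABLES").fetchall()
    finally:
        con.close()
    return study_tables(row[0] for row in rows)
```

Calling list_study_tables() after a build should return the names of the generated tables, which you can then query with ordinary SELECT statements.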

Athena

Here's a sample command to run against your Cumulus data in Athena:

cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics

You can then see the resulting tables in Athena, or read below to export the counts tables.

Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics), you can export them with Cumulus Library by replacing build in the above commands with export ./output-folder, like so:

cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics

Aggregate counts

This study generates CUBE output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that CUBE() generates), run the build step with --option output-mode:aggregate.

That is, run it like:

cumulus-library build --option output-mode:aggregate ...
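To make the difference between the two output modes concrete, here is a small pure-Python illustration of what CUBE adds on top of plain aggregate counts. It is a sketch of the general SQL concept, not the study's actual queries; the sample rows, column names, and function names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def aggregate_counts(rows, columns):
    """Count each full combination of column values (like GROUP BY a, b)."""
    return Counter(tuple(row[col] for col in columns) for row in rows)


def cube_counts(rows, columns):
    """Count every full and partial combination (like GROUP BY CUBE(a, b)).

    A rolled-up column is represented by None, mirroring the NULL SQL emits.
    """
    counts = Counter()
    for size in range(len(columns) + 1):
        for kept in combinations(columns, size):
            for row in rows:
                key = tuple(row[col] if col in kept else None for col in columns)
                counts[key] += 1
    return counts


# Hypothetical sample data
rows = [
    {"status": "active", "year": 2023},
    {"status": "active", "year": 2024},
    {"status": "stopped", "year": 2024},
]
columns = ["status", "year"]
aggregate = aggregate_counts(rows, columns)  # 3 full combinations
cube = cube_counts(rows, columns)  # the same 3, plus 5 partial combinations
```

Here cube contains everything in aggregate plus rolled-up rows such as ("active", None) with a count of 2 and the grand total (None, None) with a count of 3. The aggregate mode leaves those partial rows out.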
