
Data quality and characterization metrics for Cumulus

Project description

Data Metrics

A Cumulus-based implementation of the qualifier metrics.

Implemented Metrics

The following qualifier metrics are implemented (per the June 2024 qualifier definitions).

* These are US Core profile-based metrics, and the following profiles are not yet implemented:

  • Implantable Device (due to the difficulty of identifying implantable device records)
  • The various Vital Signs sub-profiles like Blood Pressure (just haven't gotten around to them yet)

Installing

pip install cumulus-library-data-metrics
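
The commands below use the cumulus-library CLI. It should come along as a dependency of this package, but if your environment doesn't have it for some reason, you can install it explicitly:

pip install cumulus-library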

Running the Metrics

These metrics are designed as a Cumulus Library study and are run using the cumulus-library command.

Local Ndjson

First, you'll want to organize your ndjson into the following file tree format:

root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson

(This is the same format that Cumulus ETL writes out when using --output-format=ndjson.)
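
If your ndjson currently sits in one flat directory, a small shell sketch like this can shuffle it into that layout (the source filename patterns here are placeholders; adjust them to match your own files):

# Hypothetical starting point: a flat folder of per-resource ndjson exports
mkdir -p root/condition root/medicationrequest root/patient
mv Condition*.ndjson root/condition/
mv MedicationRequest*.ndjson root/medicationrequest/
mv Patient*.ndjson root/patient/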

Here's a sample command to run against that pile of ndjson data:

cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics

And then you can load output-tables.db in a DuckDB session and see the results. Or read below to export the counts tables.
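
For example, with the duckdb command-line client installed (a separate install from this package), you can list and peek at the generated tables. The table name in the second command is just a placeholder; substitute a real one from the listing:

# List every table the build created
duckdb output-tables.db "SHOW TABLES"
# Peek at one of them (placeholder name; use a table from the listing above)
duckdb output-tables.db "SELECT * FROM data_metrics__example_table LIMIT 10"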

Athena

Here's a sample command to run against your Cumulus data in Athena:

cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics

And then you can see the resulting tables in Athena. Or read below to export the counts tables.
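
If you'd rather check from a terminal than the Athena console, the AWS CLI can list the tables the build registered in your Glue database (using the same placeholder names as above):

# List the tables now present in the Glue database
aws glue get-tables \
  --database-name your-glue-database \
  --profile your-aws-credentials-profile \
  --query 'TableList[].Name'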

Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics), you can export them with Cumulus Library by replacing build in the above commands with export ./output-folder, like so:

cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics

Aggregate counts

This study generates CUBE output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that CUBE() generates), run the build step with --option output-mode:aggregate.

That is, run it like:

cumulus-library build --option output-mode:aggregate ...
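
To make the CUBE-versus-aggregate distinction concrete, here's a small standalone DuckDB illustration with made-up columns and data (nothing here comes from the study itself): CUBE emits extra rollup rows where some grouping columns are NULL, while aggregate mode corresponds to a plain GROUP BY over the full value combination only.

# CUBE: includes rollup rows with NULLs for partial combinations
duckdb :memory: "
  SELECT gender, age_group, count(*) AS cnt
  FROM (VALUES ('female', '0-10'), ('female', '11-17'), ('male', '0-10')) AS t(gender, age_group)
  GROUP BY CUBE (gender, age_group)
"
# Plain GROUP BY: one row per full value combination only
duckdb :memory: "
  SELECT gender, age_group, count(*) AS cnt
  FROM (VALUES ('female', '0-10'), ('female', '11-17'), ('male', '0-10')) AS t(gender, age_group)
  GROUP BY gender, age_group
"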

Download files


Source Distribution

cumulus_library_data_metrics-5.0.1.tar.gz (64.7 kB)

SHA256: 2e639c611063d4c75a2963fe9a992e1f202f598c285989d3b78cdd055346f50f
MD5: 0167e2a7ebc0ec8992e46abb851d5a6d
BLAKE2b-256: b0fa63ca8c9066285e5ab9289c37db6438957865375e444590dde2d9814a3e84

Built Distribution

cumulus_library_data_metrics-5.0.1-py3-none-any.whl

SHA256: 41fbf57e5a2996ec4638a1580e8da3f4ee43348caaa6ab457dd39d6642b36eda
MD5: 4225a8e6efea0101610a645206d7da94
BLAKE2b-256: 2c6a048cc1485459b808e33a376e1ef5184b58454903bc05c2c9236d54bcd293
