Data quality and characterization metrics for Cumulus

Data Metrics

A Cumulus-based implementation of the qualifier metrics.

Implemented Metrics

The following qualifier metrics are implemented (per the May 2024 qualifier definitions).

Installing

pip install cumulus-library-data-metrics

Running the Metrics

These metrics are designed as a Cumulus Library study and are run using the cumulus-library command.

Local Ndjson

First, organize your ndjson files into the following directory tree:

root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson

(This is the same format that Cumulus ETL writes out when using --output-format=ndjson.)
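
If your ndjson files currently sit in one flat folder, a small script can sort them into this layout. The sketch below is one hypothetical way to do it in Python. `organize_ndjson` is an illustrative helper, not part of Cumulus Library, and it assumes that each file holds a single FHIR resource type and that lowercase folder names (as in the tree above) are what you want.

```python
import json
import shutil
from pathlib import Path


def organize_ndjson(source_dir: Path, root: Path) -> dict[str, int]:
    """Copy each .ndjson file in source_dir into root/<resourcetype>/.

    The resource type is read from the first non-blank line of each file,
    so this assumes every file holds a single FHIR resource type.
    Returns the number of files copied into each folder.
    """
    copied: dict[str, int] = {}
    for path in sorted(source_dir.glob("*.ndjson")):
        with path.open(encoding="utf8") as f:
            first = next((line for line in f if line.strip()), None)
        if first is None:
            continue  # empty file, nothing to classify
        resource_type = json.loads(first).get("resourceType")
        if not resource_type:
            continue  # not a FHIR resource, leave it alone
        # Assumption: folder names are the lowercased resource type.
        folder = root / resource_type.lower()
        folder.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, folder / path.name)
        copied[folder.name] = copied.get(folder.name, 0) + 1
    return copied
```

For example, `organize_ndjson(Path("downloads"), Path("root"))` would copy `downloads/Patient.ndjson` to `root/patient/Patient.ndjson`. Files are copied rather than moved, so the originals stay in place.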

Here's a sample command to run against that pile of ndjson data:

cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics

You can then open output-tables.db in a DuckDB session to see the results, or read below to export the counts tables.

Athena

Here's a sample command to run against your Cumulus data in Athena:

cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics

You can then see the resulting tables in Athena, or read below to export the counts tables.

Exporting Counts

For the metrics that have exportable counts (mostly the characterization metrics), you can export them with Cumulus Library by replacing build in the above commands with export ./output-folder, like so:

cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics
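
Once the export finishes, you may want a quick overview of what landed in the output folder. The following Python sketch assumes that the export step writes CSV files somewhere under `./output-folder`. The exact layout and file names are not described here, so the helper simply searches recursively. `summarize_exports` is a hypothetical name used for illustration only.

```python
import csv
from pathlib import Path


def summarize_exports(output_folder: Path) -> dict[str, int]:
    """Map each CSV file under output_folder to its number of data rows.

    Keys are paths relative to output_folder. The first row of each file
    is treated as a header and is not counted.
    """
    summary: dict[str, int] = {}
    for path in sorted(output_folder.rglob("*.csv")):
        with path.open(newline="", encoding="utf8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row, if there is one
            rows = sum(1 for _ in reader)
        summary[path.relative_to(output_folder).as_posix()] = rows
    return summary
```

Calling `summarize_exports(Path("./output-folder"))` returns a dictionary you can print or compare between runs.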

Aggregate counts

This study generates CUBE output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that CUBE() generates), run the build step with DATA_METRICS_OUTPUT_MODE=aggregate in your environment.

That is, run it like:

env \
  DATA_METRICS_OUTPUT_MODE=aggregate \
  cumulus-library build ...
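
To make the difference between the two output modes concrete, here is a small pure-Python illustration of what a CUBE produces compared with simple aggregate counts. It is a conceptual sketch only: the rows, the column names, and the two helper functions are made up for this example and do not reflect the study's actual tables or SQL.

```python
from collections import Counter
from itertools import combinations


def aggregate_counts(rows, columns):
    """Count each full combination of values (like a plain GROUP BY)."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def cube_counts(rows, columns):
    """Count every full and partial combination (like GROUP BY CUBE).

    A rolled-up column is shown as None, much as SQL would show NULL.
    """
    counts = Counter()
    for size in range(len(columns) + 1):
        for kept in combinations(columns, size):
            for row in rows:
                key = tuple(row[c] if c in kept else None for c in columns)
                counts[key] += 1
    return counts


# Hypothetical sample data with two columns.
rows = [
    {"status": "active", "year": 2020},
    {"status": "active", "year": 2021},
    {"status": "stopped", "year": 2021},
]
columns = ["status", "year"]

print(aggregate_counts(rows, columns))
print(cube_counts(rows, columns))
```

With this sample, the aggregate result has three entries, one per distinct (status, year) pair. The cube result has eight: those same three, plus partial combinations such as ("active", None) with a count of 2 and the grand total (None, None) with a count of 3.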
