# Data quality and characterization metrics for Cumulus
## Data Metrics
See the qualifier repo for some metric definitions.
## Running the Metrics
These metrics are designed as a Cumulus Library study and are run using the `cumulus-library` command.
### Local Ndjson
First, you'll want to organize your ndjson into the following file tree format:
```
root/
  condition/
    my-conditions.ndjson
  medicationrequest/
    1.ndjson
    2.ndjson
  patient/
    Patient.ndjson
```
(This is the same format that Cumulus ETL writes out when using `--output-format=ndjson`.)
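If your ndjson starts out in one flat folder, a short script can sort it into that layout. Here's a minimal sketch (not part of this study; the `flat-ndjson` and `root` paths are just placeholders) that reads the `resourceType` from each file's first line and files it accordingly:

```python
import json
import shutil
from pathlib import Path

source = Path("flat-ndjson")  # wherever your unsorted ndjson lives (placeholder)
root = Path("root")           # the tree layout shown above

for ndjson_file in source.glob("*.ndjson"):
    with open(ndjson_file, encoding="utf8") as f:
        first_line = f.readline()
    if not first_line.strip():
        continue  # skip empty files
    # Every FHIR resource carries its type; use it to pick the subfolder.
    resource_type = json.loads(first_line)["resourceType"].lower()
    target_dir = root / resource_type
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(ndjson_file, target_dir / ndjson_file.name)
```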
Here's a sample command to run against that pile of ndjson data:
```sh
PYTHONPATH=. cumulus-library build \
  --db-type duckdb \
  --database output-tables.db \
  --load-ndjson-dir path/to/ndjson/root \
  --target data_metrics \
  --study-dir .
```
And then you can load `output-tables.db` in a DuckDB session and see the results. Or read below to export the counts tables.
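For example, with the `duckdb` Python package (the `data_metrics__example_counts` table name below is illustrative; actual table names depend on which metrics were built):

```python
import duckdb

conn = duckdb.connect("output-tables.db", read_only=True)

# List everything the study produced.
print(conn.sql("SHOW TABLES"))

# Peek at one of the resulting tables (name is illustrative).
print(conn.sql("SELECT * FROM data_metrics__example_counts LIMIT 10"))
```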
### Athena
Here's a sample command to run against your Cumulus data in Athena:
```sh
PYTHONPATH=. cumulus-library build \
  --database your-glue-database \
  --workgroup your-athena-workgroup \
  --profile your-aws-credentials-profile \
  --target data_metrics \
  --study-dir .
```
And then you can see the resulting tables in Athena. Or read below to export the counts tables.
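If you'd rather query from code than from the Athena console, a `boto3` sketch along these lines works (the table name is again illustrative):

```python
import time

import boto3

athena = boto3.client("athena")

# Kick off the query; Athena runs it asynchronously.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM data_metrics__example_counts LIMIT 10",
    QueryExecutionContext={"Database": "your-glue-database"},
    WorkGroup="your-athena-workgroup",
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```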
## Exporting Counts
For the metrics that have exportable counts (mostly the characterization metrics), you can export them using Cumulus Library by replacing `build` in the above commands with `export ./output-folder`. Like so:
```sh
cumulus-library export \
  ./output-folder \
  --db-type duckdb \
  --database output-tables.db \
  --target data_metrics \
  --study-dir .
```
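The export drops flat files for each counts table into the output folder. Assuming CSV output, a quick way to inspect them is with pandas:

```python
from pathlib import Path

import pandas as pd

# Load every exported counts table into a dict of DataFrames,
# keyed by file name (assumes the export produced CSV files).
counts = {
    path.stem: pd.read_csv(path)
    for path in Path("output-folder").glob("*.csv")
}

for name, df in counts.items():
    print(name, df.shape)
```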
## Aggregate counts
This study generates `CUBE` output by default. If it's easier to work with simple aggregate counts of every value combination (that is, without the partial value combinations that `CUBE()` generates), run the build step with `DATA_METRICS_OUTPUT_MODE=aggregate` in your environment.
That is, run it like:

```sh
env \
  DATA_METRICS_OUTPUT_MODE=aggregate \
  PYTHONPATH=. \
  cumulus-library build ...
```
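To see the difference between the two modes, here's a small standalone DuckDB sketch (toy data, not this study's actual tables) contrasting `GROUP BY CUBE` with a plain `GROUP BY`:

```python
import duckdb

conn = duckdb.connect()  # in-memory database
conn.sql("CREATE TABLE obs (category TEXT, status TEXT)")
conn.sql("""
    INSERT INTO obs VALUES
    ('vital-signs', 'final'),
    ('vital-signs', 'final'),
    ('laboratory', 'preliminary')
""")

# CUBE output: counts for every full combination *plus* partial
# combinations, where the rolled-up columns show as NULL.
print(conn.sql("""
    SELECT category, status, count(*) AS cnt
    FROM obs GROUP BY CUBE (category, status)
"""))

# Aggregate output: only the full value combinations.
print(conn.sql("""
    SELECT category, status, count(*) AS cnt
    FROM obs GROUP BY category, status
"""))
```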
## SQL Writing Guidelines
- Don't depend on `core__` tables.
  - Allows folks to build this study even if they can't or haven't built `core`
  - Allows `core` to smooth over data oddities we might be interested in
- Consider looking at macros/logic from Google's analytics if helpful.
## Differences from Original Qualifier Metrics
Across the board, we have some minor differences from the upstream metric definitions:
- We usually stratify a metric by status as well as other fields
- We drop MedicationAdministration from our metrics - it's not really supported in Cumulus
- We add support for DiagnosticReport where sensible
- We consider `Observation.effectivePeriod.start` and `Observation.effectiveInstant` in addition to `Observation.effectiveDateTime`
Other specific deltas will be noted in the code for the given metric.
## Metric Prioritization
Table stakes quality:

- `q_term_use` complies with US Core v1
- `q_ref_target_pop` complies with US Core v1 (can be run on partial extracts)
- `q_ref_target_valid` complies with US Core v1 (only on full extracts or data lake)
- `q_valid_us_core_v4`
  - numerator: resources that don't have all mandatory bits of any profile
Table stakes characterization:

- `c_resource_count` (by category, year, month)
- `c_pt_count` (by birth year, gender, ethnicity, race)
- `c_pt_deceased_count` (by gender, by age at death)
- `c_term_coverage` (by resource type, by category)
- `c_resources_per_pt` (include combinations?)
- `c_us_core_v4_count`
  - Tells how many rows match mandatory US Core support
  - And for each separate must-support requirement, tells which rows have the value
High value quality:

- `q_date_sequence`
- `q_date_in_lifetime`
- `q_date_recent`

High value characterization:

- `c_element_use` for USCDI v1 “must support” elements
- `c_date_precision` (by resource type, by category, by date element, by precision level)
- `c_identifier_coverage` (by resource type)

Useful quality:

- `q_obs_value_range`
- `q_obs_comp_value_range`