
A pipeline for processing open medical examiner's data using GitHub Actions CI/CD.


Medical Examiner Open Data Pipeline


This repository contains the code for the Medical Examiner Open Data Pipeline.

We currently fetch data from the following sources:

The resulting data are used in several other analyses here on GitHub:

  • Cook County
    • Adds geospatial data to the Cook County data
      • This step was excluded from the automated pipeline because its requirements apply only to the Cook County data

Getting Started

This repo exists mainly to take advantage of GitHub Actions for automation.

The actions workflow is located in .github/workflows/pipeline.yml and is triggered weekly or manually.

This workflow fetches data from the data sources configured in config.json, geocodes addresses (when available) using ArcGIS, extracts drugs using the drug extraction toolbox, and then compiles and zips the results and publishes them to the GitHub Releases page.
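As a rough illustration of the final "compile and zip" step only, here is a minimal sketch using the Python standard library. The function name `zip_results` and the file names in the usage note are hypothetical, and the real workflow may package its outputs differently.

```python
import io
import zipfile


def zip_results(files: dict[str, bytes]) -> bytes:
    """Bundle named result files into one in-memory zip archive.

    Hypothetical helper for illustration; not the pipeline's actual code.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Sort by name so the archive layout is deterministic.
        for name, content in sorted(files.items()):
            archive.writestr(name, content)
    return buffer.getvalue()
```

Calling `zip_results({"records.csv": b"...", "drugs.csv": b"..."})` returns the archive bytes, which could then be written to disk or uploaded as a release asset.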

The data is then available for download from the releases page.

Further, the entire workflow effectively runs a series of commands using the CLI application opendata-pipeline, which is located in the src directory.

The CLI is also available as a Docker image hosted on ghcr.io. The benefit of using the CLI via the Docker image is that you don't need Python or the drug toolbox on your local machine 🙂.

We utilize async methods to speed up the large number of web requests we make to the data sources.
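To show why async helps here, the sketch below issues ten simulated requests concurrently with `asyncio.gather`. It is not the pipeline's code: `fetch_page`, `fetch_all`, and the source name are hypothetical, and `asyncio.sleep` stands in for a real HTTP call.

```python
import asyncio
import time


async def fetch_page(source: str, page: int, delay: float = 0.1) -> dict:
    """Hypothetical stand-in for one web request."""
    await asyncio.sleep(delay)  # simulates network latency
    return {"source": source, "page": page}


async def fetch_all(source: str, pages: int) -> list[dict]:
    # Schedule every request at once and wait for all of them together.
    # gather returns results in the same order as the inputs.
    tasks = [fetch_page(source, page) for page in range(pages)]
    return await asyncio.gather(*tasks)


def run_demo() -> tuple[list[dict], float]:
    start = time.perf_counter()
    results = asyncio.run(fetch_all("example-source", pages=10))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

Because the simulated requests overlap, the total time stays close to one request's latency rather than the sum of all ten.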

It is important to regularly fetch/pull from this repo to maintain an updated config.json.

Unfortunately, we do not currently guarantee Windows support. If you want to help make that a reality, please submit a pull request.

Further API documentation is available on the GitHub Pages website for this repo if you want to interact with the CLI. I recommend using the Docker image, as it is easier to use, and always referring to the CLI's --help output for more information.

Workflow

The workflow can best be described by looking at the pipeline.yml file.


Data Enhancements

The following table shows the fields that we add to the original datafiles:

| Column Name | Description |
| --- | --- |
| CaseIdentifier | A unique identifier across all the datasets. |
| death_day | Day of the month death occurred |
| death_month | Month name death occurred |
| death_month_num | Month number death occurred |
| death_year | Year death occurred |
| death_day_of_week | Day of week death occurred. Starting with 0 on Monday. Weekends are 5 (Saturday) & 6 (Sunday). |
| death_day_is_weekend | Death occurred on weekend day |
| death_day_week_of_year | Week of the year (of 52) that death occurred |
| geocoded_latitude | Geocoded latitude. |
| geocoded_longitude | Geocoded longitude. |
| geocoded_score | Confidence of geocoding. 70-100. |
| geocoded_address | The address that the geocoded results correspond to. Not the address provided to the geocoder. |
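The date-derived columns above could be computed along these lines. This is a minimal sketch, not the pipeline's implementation: `death_date_fields` is a hypothetical helper, and the week number uses ISO numbering as an assumption (ISO weeks can reach 53 in some years, whereas the table describes 52).

```python
import calendar
from datetime import date


def death_date_fields(death_date: date) -> dict:
    """Derive the date columns from a single death date (illustrative only)."""
    day_of_week = death_date.weekday()  # Monday is 0, Sunday is 6
    return {
        "death_day": death_date.day,
        "death_month": calendar.month_name[death_date.month],
        "death_month_num": death_date.month,
        "death_year": death_date.year,
        "death_day_of_week": day_of_week,
        "death_day_is_weekend": day_of_week >= 5,  # 5 = Saturday, 6 = Sunday
        # Assumption: ISO week numbering; the real pipeline may count differently.
        "death_day_week_of_year": death_date.isocalendar()[1],
    }
```

For example, `death_date_fields(date(2023, 1, 18))` describes a Wednesday, so `death_day_of_week` is 2 and `death_day_is_weekend` is `False`.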

Drug Columns

In addition to providing the extracted drugs as a separate file in each release, we also convert this data to wide form for each dataset. This adds columns that follow the pattern below:

| Column Name/Pattern | Description |
| --- | --- |
| `*_1` | `*` drug found in first search column provided in drug configuration |
| `*_2` | `*` drug found in second search column provided in drug configuration |
| `*_meta` | Drug of `*` category/class found in this record across any search column. |
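To make the column pattern concrete, the sketch below pivots long-form drug matches into wide-form flags. Everything here is an assumption for illustration: the input keys (`record_id`, `category`, `search_column`), the function name `to_wide`, and the use of boolean values; the real output may be encoded differently.

```python
def to_wide(long_rows: list[dict], categories: list[str], n_search_columns: int = 2) -> dict:
    """Pivot long-form drug matches into one wide-form dict per record.

    Hypothetical input: one row per match with the record id, the drug
    category, and the 1-based index of the search column it was found in.
    """
    wide: dict[str, dict[str, bool]] = {}
    for row in long_rows:
        record = wide.get(row["record_id"])
        if record is None:
            # Start each record with every flag off.
            record = {}
            for category in categories:
                for index in range(1, n_search_columns + 1):
                    record[f"{category}_{index}"] = False
                record[f"{category}_meta"] = False
            wide[row["record_id"]] = record
        category = row["category"]
        record[f"{category}_{row['search_column']}"] = True
        # The meta flag is set when the category appears in any search column.
        record[f"{category}_meta"] = True
    return wide
```

With two categories and two search columns, each record ends up with six columns: `*_1`, `*_2`, and `*_meta` for each category.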

Requirements

  • uv

Installation

To run the Python CLI, I recommend using uv. The uvx command runs the tool in a temporary environment, so no separate install step is needed.

uvx opendata-pipeline

To install the docker image, you can use the following command:

docker pull ghcr.io/uk-ipop/opendata-pipeline:latest

Usage

Usage is very similar to any other command line application. The most important thing is to follow the workflow defined above.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Help me write some tests!

License

MIT

BibTeX Citation

If you use this software or the enhanced data, please cite this repository:

@software{Anthony_Medical_Examiner_OpenData_2022,
  author = {Anthony, Nicholas},
  month = {9},
  title = {{Medical Examiner OpenData Pipeline}},
  url = {https://github.com/UK-IPOP/open-data-pipeline},
  version = {0.2.1},
  year = {2022}
}

Thank you.
