
A pipeline for processing open medical examiner data using GitHub Actions CI/CD.

Project description

TODO:

  • Add A LOT more PRINT statements
  • Add comments
  • Add documentation (README and docs site)
    • The latter will be necessary once we move to Dockerfiles and Actions
  • Add tests 😅
    • including CLI tests
  • Figure out Milwaukee data
    • Used aiohttp (async) 🙂 (see the fetch sketch after this list)
  • Add drug-extraction step
  • Add geocoding step (when lat/long not provided): use a method to identify when geocoding is needed, i.e. when lat/long is null in datasets that have those columns, or when the dataset has no lat/long at all but does have address data (see the detection sketch after this list)
    • Requires lots of specifications in config.yaml
  • Pin versions of all software used
  • Use arcgis package for geocoding (see the sketch after this list)
    • Use batch geocoding (had a problem with the token... can we register as an anonymous user?)
  • [x] Use Socrata package (register an API key) for data fetching from datasets published on Socrata (see the sketch after this list)
  • Use requests package for data fetching from datasets published on OData (see the sketch after this list)
  • Use the github Python package to keep config.yaml updated after successful runs
    • Just used requests and the API directly (see the sketch after this list)
    • Can also use it to update JS data files at the end of analysis (see below)
      • These should be very small and generated by pandas analysis of the data
  • Results should be in a GitHub release (data files; can zip them)
    • Use the GH CLI in a bash script because it is pre-installed in Actions (see the release sketch after this list)
    • We can then just use the Octokit JS package to point to the LINKS of the files, and clicking on them will download them
    • Then a web page to enable file downloads and show some basic graphs (records over time for each dataset)
      • What charting framework to use?
      • Need an action to update the frontend codebase with the new data
        • Store in JSON format
  • [ ] Add the website to the Socrata key
  • Use pydantic to read the config file and .env secrets (see the sketch after this list)
  • Make a container to run the whole pipeline (so no downloads for users)
    • Host on GHCR
  • MAKE OUR OWN UNIQUE IDENTIFIERS FOR ALL DATASETS COMBINED (see the ID sketch after this list)
    • SAME COLUMN NAME IN ALL DATASETS, THEN WE DON'T HAVE TO PROVIDE AN IDENTIFIER COLUMN IN config.yaml
    • Also allows for better merging of datasets (i.e. records + drugs + geo)
  • [ ] Do we want to publish a Web API as well?
    • [ ] Then we would need a DB
  • No Windows support due to drug extraction tool usage
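
For the Milwaukee fetch item above, a minimal sketch of the aiohttp approach, fetching pages concurrently; the URL and paging scheme are placeholders, since the real endpoint would live in config.yaml:

```python
import asyncio

import aiohttp


async def fetch_page(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch one page of records and return the parsed JSON."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()


async def fetch_all(urls: list[str]) -> list[dict]:
    """Fetch every page concurrently over one shared session."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_page(session, u) for u in urls))


# Placeholder paged endpoint; the real URL would come from config.yaml.
urls = [f"https://example.com/api/records?page={i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls))
```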
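
For the geocoding-detection item, a sketch of the null-check logic in pandas; the column names (address, lat, long) are assumptions that would really be specified per dataset in config.yaml:

```python
import pandas as pd


def needs_geocoding(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows that need geocoding.

    A row needs geocoding when it has address data but no usable
    lat/long, either because those columns are missing entirely or
    because the values are null.
    """
    has_address = df["address"].notna()
    if {"lat", "long"}.issubset(df.columns):
        has_coords = df["lat"].notna() & df["long"].notna()
    else:
        has_coords = pd.Series(False, index=df.index)
    return has_address & ~has_coords
```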
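
For the batch geocoding item, a sketch using batch_geocode from the arcgis package; authentication is exactly where the token problem above showed up, so treat the GIS login (and the env var names) as an assumption rather than a tested recipe:

```python
import os

from arcgis.gis import GIS
from arcgis.geocoding import batch_geocode

# Anonymous access was where the token problem appeared; assume an
# authenticated connection with credentials read from .env.
gis = GIS(username=os.environ["ARCGIS_USER"], password=os.environ["ARCGIS_PASS"])

addresses = ["123 Main St, Milwaukee, WI", "456 W Wells St, Milwaukee, WI"]
results = batch_geocode(addresses)  # one result per input address
for r in results:
    print(r["address"], r["location"]["x"], r["location"]["y"])
```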
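
For Socrata-published datasets, a sketch using the sodapy client; the domain and dataset identifier are placeholders for per-dataset config.yaml entries, and the registered app token would come from .env:

```python
import os

from sodapy import Socrata

# Placeholder domain and dataset id; real values live in config.yaml.
client = Socrata("data.example.gov", os.environ["SOCRATA_APP_TOKEN"])
records = client.get("abcd-1234", limit=50_000)  # list of dicts
client.close()
```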
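
For OData-published datasets, a sketch of paging with requests using $top/$skip; whether a given endpoint also needs the $format=json parameter varies, so treat that as an assumption:

```python
import requests


def fetch_odata(base_url: str, page_size: int = 1000) -> list[dict]:
    """Page through an OData endpoint with $top/$skip until it runs dry."""
    records: list[dict] = []
    skip = 0
    while True:
        resp = requests.get(
            base_url,
            params={"$top": page_size, "$skip": skip, "$format": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("value", [])
        if not batch:
            return records
        records.extend(batch)
        skip += page_size
```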
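
For keeping config.yaml updated via requests and the API directly, a sketch against the GitHub contents API; the repo path is a placeholder and the token would come from .env (or the GITHUB_TOKEN provided inside Actions):

```python
import base64
import os

import requests

REPO = "owner/opendata-pipeline"  # placeholder repo path
PATH = "config.yaml"
API = f"https://api.github.com/repos/{REPO}/contents/{PATH}"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

# The contents API requires the current blob SHA to update a file.
current = requests.get(API, headers=HEADERS, timeout=30).json()

with open("config.yaml", "rb") as f:
    encoded = base64.b64encode(f.read()).decode()

requests.put(
    API,
    headers=HEADERS,
    json={
        "message": "chore: update config.yaml after successful run",
        "content": encoded,
        "sha": current["sha"],
    },
    timeout=30,
)
```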
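
For the release step, the TODO calls for a bash script, but the same gh invocation can be sketched from Python via subprocess; the tag and file names are placeholders, and gh picks up GITHUB_TOKEN from the environment on hosted runners:

```python
import subprocess

# gh is pre-installed on GitHub-hosted runners.
subprocess.run(
    [
        "gh", "release", "create", "v0.1.0-data",
        "records.zip", "drugs.zip", "geo.zip",
        "--title", "Automated data release",
        "--notes", "Data files produced by the pipeline run",
    ],
    check=True,
)
```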
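
For the config/secrets item, a pydantic v1-style sketch (BaseSettings moved to the separate pydantic-settings package in v2); the field names are assumptions about what config.yaml and .env will hold:

```python
import yaml
from pydantic import BaseModel, BaseSettings


class DataSource(BaseModel):
    """One dataset entry in config.yaml (field names are assumptions)."""

    name: str
    url: str
    has_coords: bool = False


class Settings(BaseSettings):
    """Secrets read from the environment, falling back to .env."""

    socrata_app_token: str
    github_token: str

    class Config:
        env_file = ".env"


def load_sources(path: str = "config.yaml") -> list[DataSource]:
    with open(path) as f:
        raw = yaml.safe_load(f)
    return [DataSource(**entry) for entry in raw["sources"]]
```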
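
For the unique-identifier item, one possible scheme: stamp a case_id derived from the dataset name plus row position into the same column everywhere; the column name and format are assumptions, and the caveat in the docstring matters:

```python
import pandas as pd


def add_case_id(df: pd.DataFrame, dataset_name: str) -> pd.DataFrame:
    """Write a pipeline-owned ID into the same column for every dataset.

    Because the column name is shared, config.yaml no longer needs a
    per-dataset identifier column, and records + drugs + geo outputs
    can be merged on 'case_id'. Caveat: position-based IDs are only
    stable if the source never reorders or deletes rows; hashing a
    stable subset of row values would be the sturdier alternative.
    """
    df = df.reset_index(drop=True)
    df["case_id"] = [f"{dataset_name}-{i:06d}" for i in df.index]
    return df
```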

I think, if my math is right, we can do ~65 minutes / day of Actions (the free tier allows 2,000 minutes per month, and 2,000 ÷ 30 ≈ 66).
