A pipeline for processing open medical examiner's data using GitHub Actions CI/CD.
Project description
TODO:
- Add A LOT more PRINT statements
- Add comments
- Add documentation (README and docs site)
- The latter will be necessary once we move to Dockerfiles and Actions
- Add tests 😅
- including CLI tests
- Figure out Milwaukee data
- Used aiohttp (async) 🙂
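The async fetching could follow a bounded-concurrency pattern like the sketch below. The `fetch_json` stub stands in for a real `aiohttp.ClientSession.get` call (kept stdlib-only here so the sketch is self-contained), and the concurrency limit is an assumption:

```python
import asyncio

# Sketch of bounded-concurrency fetching. In the real pipeline the body of
# fetch_json would await an aiohttp request; a stub is used here so only the
# concurrency pattern is demonstrated.
async def fetch_json(url: str) -> dict:
    await asyncio.sleep(0)  # stand-in for the network round trip
    return {"url": url, "records": []}

async def fetch_all(urls: list[str], max_concurrent: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrent)  # don't hammer the open-data servers

    async def bounded(url: str) -> dict:
        async with sem:
            return await fetch_json(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```
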
- Add drug-extraction step
- Add geocoding step (when lat/long not provided) --> use a method to identify when geocoding is needed (i.e. when lat/long is null in datasets that have lat/long columns, or when the dataset has no lat/long at all but does have address data)
- Requires lots of specifications in config.yaml
- Pin versions used of all software
- Use arcgis package for geocoding
- Use batch geocoding (had problem with Token... can register as anonymous user?)
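The "when is geocoding needed" check from the notes above might look like this minimal sketch. The field names (`lat`, `lon`, `address`) are assumptions; real datasets would need per-dataset column mappings from config.yaml:

```python
def needs_geocoding(record: dict) -> bool:
    """True when a record has no usable lat/long but does have address data.

    Covers both cases from the notes: datasets with lat/long columns where a
    row's values are null, and datasets with no lat/long columns at all.
    Column names here are assumed; map them per-dataset in config.yaml.
    """
    lat, lon = record.get("lat"), record.get("lon")
    has_coords = lat not in (None, "") and lon not in (None, "")
    has_address = bool(record.get("address"))
    return not has_coords and has_address

rows = [
    {"lat": 43.0, "lon": -87.9, "address": "1 Main St"},  # already geocoded
    {"lat": None, "lon": None, "address": "2 Main St"},   # null coords -> geocode
    {"address": "3 Main St"},                             # no coord columns -> geocode
    {"lat": None, "lon": None},                           # no address to geocode from
]
to_geocode = [r for r in rows if needs_geocoding(r)]
```

Only records in `to_geocode` would then be sent to the batch geocoder, keeping API usage down.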
- [x] Use Socrata package (register API key) for data fetching from datasets published on Socrata
- Use requests package for data fetching from datasets published on OData
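The two fetch paths differ mainly in their paging parameters, which could be built like this sketch (the endpoint URL is a placeholder, and the actual requests/Socrata client calls are omitted; the Socrata app token goes in an `X-App-Token` request header, not the URL):

```python
from urllib.parse import urlencode

def socrata_page_url(base: str, limit: int, offset: int) -> str:
    # Socrata's SODA API pages with $limit/$offset; safe='$' keeps the
    # dollar signs readable instead of percent-encoding them.
    return f"{base}?{urlencode({'$limit': limit, '$offset': offset}, safe='$')}"

def odata_page_url(base: str, top: int, skip: int) -> str:
    # OData pages with $top/$skip
    return f"{base}?{urlencode({'$top': top, '$skip': skip}, safe='$')}"

url = socrata_page_url("https://data.example.gov/resource/abcd-1234.json", 1000, 2000)
```
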
- Use github python package to keep config.yaml updated after successful runs
- Can also use to update JS datafiles at end of analysis (see below)
- Just used requests and the API directly
- These should be very small and generated by pandas analysis of the data
- results should be in a github release (data files) (can zip them)
- Use GH CLI in a bash script because it is pre-installed in Actions
- We can then just use the Octokit JS package to point to the LINKS of the files and when you click on them it will download them
- then web page to enable file downloads and show some graphs (basic --> records over time for each dataset)
- what charting framework to use?
- Need an action to update the frontend codebase with the new data
- Store in JSON format
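The "records over time" datafiles for the frontend could be produced like this sketch. The notes say the analysis will be done with pandas; a stdlib `Counter` is used here so the example is self-contained, and the date field name is an assumption:

```python
import json
from collections import Counter

def records_over_time(records: list[dict], date_field: str = "death_date") -> str:
    # Bucket records by year-month; ISO dates mean the first 7 chars are YYYY-MM.
    counts = Counter(r[date_field][:7] for r in records if r.get(date_field))
    # Sorted, chart-ready JSON the frontend can fetch directly
    return json.dumps([{"month": m, "count": c} for m, c in sorted(counts.items())])

payload = records_over_time([
    {"death_date": "2021-01-15"},
    {"death_date": "2021-01-20"},
    {"death_date": "2021-02-03"},
])
```

The resulting files stay small (one row per month per dataset), which fits the "very small datafiles" goal above.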
- [ ] add website to socrata key
- Use pydantic to read config file and .env secrets
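A minimal sketch of the pydantic config model (the field names are assumptions about what config.yaml might hold; in the pipeline the dict would come from `yaml.safe_load`, and secrets from a `.env`-backed settings class):

```python
from typing import List, Optional

from pydantic import BaseModel

class DataSource(BaseModel):
    name: str
    kind: str                        # assumed: "socrata" or "odata"
    url: str
    id_column: Optional[str] = None  # unneeded once the shared ID scheme lands

class Config(BaseModel):
    sources: List[DataSource]

# In the pipeline this dict would come from yaml.safe_load("config.yaml" contents)
cfg = Config(sources=[{
    "name": "example_county",
    "kind": "socrata",
    "url": "https://data.example.gov/resource/abcd-1234.json",
}])
```

pydantic then validates every dataset entry up front, so a malformed config.yaml fails the Actions run immediately instead of partway through a fetch.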
- Make a container to run the whole pipeline (so no downloads for users)
- Host on GHCR
- MAKE OUR OWN UNIQUE IDENTIFIERS FOR ALL DATASETS COMBINED
- USE THE SAME COLUMN NAME IN ALL DATASETS, THEN WE DON'T HAVE TO PROVIDE AN IDENTIFIER COLUMN IN config.yaml
- Also allows for better merging of datasets (i.e. records + drugs + geo)
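The shared identifier scheme could be as simple as hashing the dataset slug plus the source row key (a sketch; the exact hash inputs and ID length are design decisions, not decided in the notes):

```python
import hashlib

def make_uid(dataset: str, row_key: str) -> str:
    """Deterministic cross-dataset ID, stored under one shared column name.

    Hashing (dataset, row_key) keeps IDs stable across re-runs and unique
    across datasets, so records, drug-extraction, and geocoding outputs can
    all be merged on the same column.
    """
    digest = hashlib.sha256(f"{dataset}:{row_key}".encode()).hexdigest()
    return digest[:12]  # 12 hex chars is an assumed, collision-unlikely length

a = make_uid("milwaukee", "case-1001")
b = make_uid("example_county", "case-1001")
```

Note that the same source row key yields different IDs in different datasets, which is exactly what makes the combined table safe to merge on.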
- [ ] Do we want to publish a Web API as well?
- [ ] Then we would need a DB
- No Windows support due to drug extraction tool usage
I think, if my math is right, we can do ~65 minutes / day of Actions (2,000 minutes per month limit for free, so 2,000 / 30 ≈ 66)
Download files
Download the file for your platform.
Source Distribution
opendata-pipeline-0.1.0.tar.gz
(12.5 kB
view hashes)
Built Distribution
Close
Hashes for opendata_pipeline-0.1.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | ec8ccdb34b5b4831ed0bc0f816ab5fd016c88ea2053111df320d4893fc67c645
MD5 | 4d63fbbbef7db74edc86748205bb543b
BLAKE2b-256 | 4357f37f3bc01a79605c528de60cfd98da0dca828317a1a6b3fd93184f82f624