datagov-harvesting-logic
This is a library for metadata extraction, validation, transformation, and loading into the data.gov catalog.
Features
- Extract
  - General-purpose fetching and downloading of web resources.
  - Catered extraction for the following data formats:
    - DCAT-US
- Validation
  - DCAT-US `jsonschema` validation using draft 2020-12 (sketched below).
- Load
  - DCAT-US
    - Conversion of a DCAT-US catalog into the CKAN dataset schema
    - Create, delete, update, and patch of CKAN packages/datasets
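For reference, a minimal sketch of draft 2020-12 validation with the `jsonschema` package. The schema here is a hypothetical trimmed-down fragment for illustration, not the full DCAT-US schema this library ships with:

```python
# Minimal sketch of JSON Schema draft 2020-12 validation with the
# jsonschema package. The schema is a hypothetical fragment, not the
# full DCAT-US schema.
from jsonschema import Draft202012Validator

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["identifier", "title"],
    "properties": {
        "identifier": {"type": "string"},
        "title": {"type": "string"},
    },
}

dataset = {"identifier": "cftc-dc1", "title": "Cotton On Call"}

validator = Draft202012Validator(schema)
errors = list(validator.iter_errors(dataset))
if errors:
    for error in errors:
        print(f"validation error at {list(error.path)}: {error.message}")
else:
    print("dataset is valid")
```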
Requirements
This project uses poetry to manage dependencies; installation instructions are available at python-poetry.org. Once poetry is installed, `poetry install` installs the dependencies into a local virtual environment.

We use Ruff to format and lint our Python files. If you use VS Code, you can install the Ruff formatter extension.
Testing
CKAN load testing
- CKAN load testing doesn't require the services provided in the `docker-compose.yml`.
- catalog-dev is used for CKAN load testing.
- Create an api-key by signing into catalog-dev.
- Create a `credentials.py` file at the root of the project containing the variable `ckan_catalog_dev_api_key` assigned to the api-key (a sketch of this file follows the list).
- Run the tests with the command `poetry run pytest ./tests/load/ckan`.
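The `credentials.py` file is plain Python; a minimal sketch with a placeholder value:

```python
# credentials.py (at the project root). The value below is a
# placeholder; substitute the api-key generated from your catalog-dev
# account, and keep this file out of version control.
ckan_catalog_dev_api_key = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
```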
Harvester testing
- These tests are found in `extract` and `validate`. Some of them rely on services in the `docker-compose.yml`. Start those services with `docker compose up -d`, then run the tests with `poetry run pytest --ignore=./tests/load/ckan`.

If you followed the instructions for CKAN load testing and Harvester testing, you can simply run `poetry run pytest` to run all tests.
Integration testing
- To run integration tests locally, add the following env variables to your `.env` file, along with their appropriate values (see the sketch after this list):
  - `CF_SERVICE_USER = "put username here"`
  - `CF_SERVICE_AUTH = "put password here"`
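A minimal sketch of reading those variables from `.env`, assuming the `python-dotenv` package (the project's actual loading mechanism may differ):

```python
# Sketch of loading the Cloud Foundry credentials from .env, assuming
# python-dotenv. Raises KeyError if either variable is missing.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
CF_SERVICE_USER = os.environ["CF_SERVICE_USER"]
CF_SERVICE_AUTH = os.environ["CF_SERVICE_AUTH"]
```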
Comparison
- `./tests/harvest_sources/ckan_datasets_resp.json`
  - Represents what CKAN would respond with after querying for the harvest source name
- `./tests/harvest_sources/dcatus_compare.json`
  - Represents a changed harvest source
  - Created:

    ```diff
    datasets[0]
    + "identifier" = "cftc-dc10"
    ```

  - Deleted:

    ```diff
    datasets[0]
    - "identifier" = "cftc-dc1"
    ```

  - Updated:

    ```diff
    datasets[1]
    - "modified": "R/P1M"
    + "modified": "R/P1M Update"

    datasets[2]
    - "keyword": ["cotton on call", "cotton on-call"]
    + "keyword": ["cotton on call", "cotton on-call", "update keyword"]

    datasets[3]
    "publisher": {
      "name": "U.S. Commodity Futures Trading Commission",
      "subOrganizationOf": {
    -   "name": "U.S. Government"
    +   "name": "Changed Value"
      }
    }
    ```

- `./tests/harvest_sources/dcatus.json`
  - Represents an original harvest source prior to any changes occurring.
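These fixtures exercise create/delete/update detection between an original and a changed harvest source. A hypothetical sketch of that kind of comparison, keyed on `identifier` (an illustration only, not the library's actual implementation):

```python
# Hypothetical sketch of the comparison these fixtures exercise:
# classify datasets as created, deleted, or updated between an original
# and a changed harvest source, keyed on "identifier".
def compare_sources(original: list[dict], changed: list[dict]) -> dict[str, list[str]]:
    old = {d["identifier"]: d for d in original}
    new = {d["identifier"]: d for d in changed}
    return {
        "created": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "updated": sorted(
            k for k in old.keys() & new.keys() if old[k] != new[k]
        ),
    }
```

Run against the datasets in `dcatus.json` and `dcatus_compare.json`, a comparison like this would report `cftc-dc10` as created, `cftc-dc1` as deleted, and the three edited datasets as updated.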
Flask App
Local development
- Set your local configurations in the `.env` file.
- Use the Makefile to set up local Docker containers, including a PostgreSQL database and the Flask application:

  ```sh
  make build
  make up
  make test
  make clean
  ```

  This will start the necessary services and execute the tests.
- When there are database DDL changes, use the following steps to generate migration scripts and update the database:

  ```sh
  docker compose up -d db
  docker compose run app flask db migrate -m "migration description"
  docker compose run app flask db upgrade
  ```
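The `flask db` commands come from Flask-Migrate; a minimal sketch of the wiring they assume (the URI and model below are illustrative, not the app's actual code):

```python
# Minimal Flask-Migrate wiring sketch. Module, URI, and model names are
# illustrative; the real app reads its database URL from configuration.
from flask import Flask
from flask_migrate import Migrate
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://user:pass@localhost/harvestdb"

db = SQLAlchemy(app)
migrate = Migrate(app, db)  # registers the `flask db` CLI commands

class HarvestRecord(db.Model):  # hypothetical model; DDL changes here drive migrations
    id = db.Column(db.Integer, primary_key=True)
    identifier = db.Column(db.String, nullable=False)
```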
Debugging
NOTE: To use the VS Code debugger, you will first need to sacrifice Flask's hot-reloading support.

- Build new containers with development requirements by running `make build-dev`
- Launch the containers by running `make up-debug`
- In VS Code, launch the debug process `Python: Remote Attach`
- Set breakpoints
- Visit the site at `http://localhost:8080` and invoke the route that contains the code you've set breakpoints on.
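`Python: Remote Attach` connects to a `debugpy` listener inside the container; a sketch of the hook it assumes (port 5678 is the conventional default and may differ in this project's debug setup):

```python
# Sketch of the debugpy listener that "Python: Remote Attach" connects
# to. Port 5678 is the conventional default; adjust to match the
# project's debug configuration.
import debugpy

debugpy.listen(("0.0.0.0", 5678))  # all interfaces, so VS Code can attach from outside the container
debugpy.wait_for_client()  # block until the debugger attaches
```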
Deployment to cloud.gov
Database Service Setup
A database service is required for use on cloud.gov.

In a given Cloud Foundry space, a database can be created with `cf create-service <service offering> <plan> <service instance>`. In dev, for example, the db was created with `cf create-service aws-rds micro-psql harvesting-logic-db`. Creating databases for the other spaces should follow the same pattern, though the size may need to be adjusted (see the available AWS RDS service offerings with `cf marketplace -e aws-rds`).

Any created service needs to be bound to an app with `cf bind-service <app> <service>`. With the above example, the db can be bound with `cf bind-service harvesting-logic harvesting-logic-db`.

Accessing the service can be done with service keys. They can be created with `cf create-service-key`, listed with `cf service-keys`, and shown with `cf service-key <service-key-name>`.
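A service key's credentials can then be used to connect directly; a sketch assuming `psycopg2` and the usual aws-rds broker fields (`host`, `port`, `db_name`, `username`, `password`; verify the field names against your own service key):

```python
# Sketch of connecting to the bound RDS database with credentials from
# `cf service-key <service-key-name>`, saved locally as JSON. Field
# names follow the typical aws-rds broker output and may differ.
import json

import psycopg2

with open("service-key.json") as f:
    creds = json.load(f)

conn = psycopg2.connect(
    host=creds["host"],
    port=creds["port"],
    dbname=creds["db_name"],
    user=creds["username"],
    password=creds["password"],
)
print(conn.server_version)
```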
Manually Deploying the Flask Application to development
- Ensure you have a `manifest.yml` and `vars.development.yml` file configured for your Flask application. The vars file may include the variables:

  ```yml
  app_name: harvesting-logic
  database_name: harvesting-logic-db
  route-external: harvester-dev-datagov.app.cloud.gov
  ```

- Deploy the application using Cloud Foundry's `cf push` command with the variable file:

  ```sh
  poetry export -f requirements.txt --output requirements.txt --without-hashes
  cf push --vars-file vars.development.yml
  ```

- When there are database DDL changes, use the following to update the database:

  ```sh
  cf run-task harvesting-logic --command "flask db upgrade" --name database-upgrade
  ```
File details
Details for the file `datagov_harvesting_logic-0.5.0.tar.gz`.
File metadata
- Download URL: datagov_harvesting_logic-0.5.0.tar.gz
- Upload date:
- Size: 15.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7cd702b9c482464cdc862f0311af800a67217cac97db09a6eacb8ae30e22056d |
| MD5 | 556f3da08caf70d467a62b9d9b0e4d25 |
| BLAKE2b-256 | 9796c4164bb1f63dfbe4b4e723b61e1d16db0d3aa5d8329f9e44953e055c049b |
File details
Details for the file `datagov_harvesting_logic-0.5.0-py3-none-any.whl`.
File metadata
- Download URL: datagov_harvesting_logic-0.5.0-py3-none-any.whl
- Upload date:
- Size: 14.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e2d2fde7fb3eee2a487347688d1819d366e88df911d0190a9356087cbcd368cb |
| MD5 | 3457fa0912c196556f3844258d9c3d08 |
| BLAKE2b-256 | f33f97c14929791a81ef6c02b136dad45461b88bac69a35828a56d344707319d |