Set of tools to harvest, process and uplift (meta)data from metadata providers within the Helmholtz association to be included in the Helmholtz Knowledge Graph (Helmholtz-KG).
Project description
Data Harvesting
This repository contains harvesters, aggregators for linked Data and tools around them. This software allows to harvest small subgraphs exposed by certain sources on the web and and enrich them such that they can be combined to a single larger linked data graph.
This software was written for and is mainly currently deployed as a part of the backend for the unified Helmholtz Information and Data Exchange (unHIDE) project by the Helmholtz Metadata Collaboration (HMC) to create a knowledge graph for the Helmholtz association which allows to monitor, check, enrich metadata as well as identify gabs and needs.
Contributions of any kind by you are always welcome!
Approach:
We establish certain data pipelines of certain data providers with linked metadata and complement it, by combining it with other sources. For the unhide project this data is annotated in schema.org semantics and serialized mainly in JSON-LD.
Data pipelines contain code to execute harvesting from a local to a global level. They are exposed through a cmdline interface (cli) and thus easily integrated in a cron job and can therefore be used to stream data on a time interval bases into some data eco system
Data harvester pipelines so far:
- gitlab pipeline: harvest all public projects in Helmholtz gitlab instances and extracts and complements codemeta.jsonld files. (todo: extend to github)
- sitemap pipeline: extract JSON-LD metadata a data provider over its sitemap, which contains links to the data entries and when they have been last updated
- oai pmh pipeline: extract metadata over oai-pmh endpoints from a data provider. it contains a list of entries and when they where last updated. This pipeline uses a converter from dublin core to schema.org, since many providers provide just dublin core so far.
- datacite pipeline: extract JSON-LD metadata from datacite.org connected to a given organization identifier.
- schoolix pipeline (todo): Extract links and related resources for a list of given PIDs of any kind
Besides the harvesters there are aggregators which allow one to specify how linked data should be processed while tracking the provenance of the processing in a reversible way. This is done by storing graph updates, so called patches, for each subgraph. These updates can also be then applied directly to a graph database. Processes changes can be provided as SPARQL updates or through python function with a specific interface.
All harvesters and Aggregators read from a single config file (as example see configs/config.yaml), which contains als sources and specific operations.
Documentation:
Currently only in code documentation. In the future under the docs folder and hosted somewhere.
Installation
git clone git@codebase.helmholtz.cloud:hmc/hmc-public/unhide/data_harvesting.git
cd data_harvesting
pip install .
as a developer install with
pip install -e .
You can also setup the project using poetry instead of pip.
poetry install --with dev
The individual pipelines have further dependencies outside of python.
For example the gitlab pipeline relies an codemeta-harvester (https://github.com/proycon/codemeta-harvester)
How to use this
For examples look at the examples
folder. Also the tests in tests
folder may provide some insight.
Also once installed there is a command line interface (CLI), 'hmc-unhide' for example one can execute the gitlab pipeline via:
hmc-unhide harvester run --name gitlab --out ~/work/data/gitlab_pipeline
further the cli exposes some other utility on the command line for example to convert linked data files into different formats.
You can also use the CLI to register two pipelines and then run them in parallel. Don't forget to set your prefect server URL.
# register the data pipeline, use any config or out folder path
hmc-unhide pipeline register --config configs/config.yaml --out /opt/data
# register the hifis pipeline
hmc-unhide stats register
License
The software is distributed under the terms and conditions of the MIT license which is specified in the LICENSE
file.
Acknowledgement
This project was supported by the Helmholtz Metadata Collaboration (HMC), an incubator-platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for data_harvesting-2.0.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a34c2bf99c5ee66bbbccc99016a8a8030f63e0915916f0c1f00cd44062f0a821 |
|
MD5 | ce223d7730c65f41592402cfbeb38619 |
|
BLAKE2b-256 | 6df3a79f29f96a47503a62dd38ba2c0cc17e9a96787c8333df0a05bc187c28d3 |