Set of tools to harvest, process and uplift (meta)data from metadata providers within the Helmholtz association to be included in the Helmholtz Knowledge Graph (Helmholtz-KG).


Data Harvesting

This repository contains harvesters and aggregators for linked data, and tools around them. The software harvests small subgraphs exposed by sources on the web and enriches them so that they can be combined into a single larger linked data graph.

This software was written for, and is currently deployed mainly as, part of the backend of the unified Helmholtz Information and Data Exchange (unHIDE) project by the Helmholtz Metadata Collaboration (HMC). Its purpose is to create a knowledge graph for the Helmholtz association which allows one to monitor, check and enrich metadata as well as to identify gaps and needs.

Contributions of any kind are always welcome!

Approach:

We establish data pipelines for data providers with linked metadata and complement that metadata by combining it with other sources. For the unHIDE project this data is annotated with schema.org semantics and serialized mainly as JSON-LD.

Data pipelines contain code to execute harvesting from a local to a global level. They are exposed through a command-line interface (CLI) and are thus easily integrated into a cron job, so they can be used to stream data into a data ecosystem at regular time intervals.
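As an illustration, a crontab entry along the following lines could run one of the harvesters once per day (the pipeline name and output path here are placeholder assumptions, not fixed defaults):

# hypothetical crontab entry: run the sitemap harvester every day at 03:00
0 3 * * * hmc-unhide harvester run --name sitemap --out /opt/data/sitemap_pipeline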

Data harvester pipelines so far:

  • gitlab pipeline: harvests all public projects in Helmholtz GitLab instances and extracts and complements codemeta.jsonld files. (todo: extend to GitHub)
  • sitemap pipeline: extracts JSON-LD metadata from a data provider via its sitemap, which contains links to the data entries and the times they were last updated (see the sketch after this list)
  • oai-pmh pipeline: extracts metadata from a data provider via OAI-PMH endpoints, which list the entries and when they were last updated. This pipeline uses a converter from Dublin Core to schema.org, since many providers expose only Dublin Core so far.
  • datacite pipeline: extracts JSON-LD metadata from datacite.org connected to a given organization identifier.
  • scholix pipeline (todo): extracts links and related resources for a list of given PIDs of any kind
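To make the sitemap pipeline's idea concrete, here is a minimal, self-contained sketch of what such a harvest does conceptually. It is not the packaged pipeline, just an illustration using the Python standard library; harvest_sitemap and JSONLDExtractor are hypothetical names.

import json
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and ('type', 'application/ld+json') in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.documents.append(json.loads(data))

def harvest_sitemap(sitemap_url):
    # A sitemap lists entry URLs in <loc> tags (alongside <lastmod>
    # timestamps, which a real pipeline would use for incremental updates).
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    tree = ET.parse(urllib.request.urlopen(sitemap_url))
    for loc in tree.findall('.//sm:loc', ns):
        html = urllib.request.urlopen(loc.text).read().decode('utf-8')
        extractor = JSONLDExtractor()
        extractor.feed(html)
        for doc in extractor.documents:
            yield loc.text, doc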

Besides the harvesters there are aggregators, which allow one to specify how linked data should be processed while tracking the provenance of the processing in a reversible way. This is done by storing graph updates, so-called patches, for each subgraph. These updates can then also be applied directly to a graph database. Processing steps can be provided as SPARQL updates or through Python functions with a specific interface.
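The patch mechanism can be pictured with rdflib (a minimal sketch, assuming rdflib 6+ with built-in JSON-LD support; the file names and the inserted triple are made up for illustration, and the package's actual aggregator interface may differ):

from rdflib import Graph

# load one harvested subgraph
subgraph = Graph().parse('entry.jsonld', format='json-ld')

# a patch expressed as a SPARQL update; storing such updates per subgraph
# records the processing provenance and keeps it reversible
patch = '''
PREFIX schema: <http://schema.org/>
INSERT { ?ds schema:publisher <https://ror.org/example> . }
WHERE  { ?ds a schema:Dataset . }
'''
subgraph.update(patch)
subgraph.serialize(destination='entry_enriched.jsonld', format='json-ld')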

All harvesters and aggregators read from a single config file (see configs/config.yaml for an example), which contains all sources and specific operations.
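The exact schema is defined by configs/config.yaml in the repository; purely as a hypothetical illustration, such a file could look like:

# hypothetical layout; consult configs/config.yaml for the real keys
sources:
  - name: example-provider        # made-up source name
    pipeline: sitemap
    url: https://data.example.org/sitemap.xml
operations:
  - type: sparql-update           # made-up operation entry
    file: updates/add_publisher.sparql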

Documentation:

Currently there is only in-code documentation. In the future it will live under the docs folder and be hosted somewhere.

Installation

git clone git@codebase.helmholtz.cloud:hmc/hmc-public/unhide/data_harvesting.git
cd data_harvesting
pip install .

As a developer, install with:

pip install -e .

You can also set up the project using poetry instead of pip.

poetry install --with dev

The individual pipelines have further dependencies outside of Python.

For example, the gitlab pipeline relies on the codemeta-harvester (https://github.com/proycon/codemeta-harvester).

How to use this

For examples, look at the examples folder. The tests in the tests folder may also provide some insight. Once installed, a command line interface (CLI) called 'hmc-unhide' is available; for example, one can execute the gitlab pipeline via:

hmc-unhide harvester run --name gitlab --out ~/work/data/gitlab_pipeline

Further, the CLI exposes some other utilities on the command line, for example to convert linked data files into different formats.

You can also use the CLI to register two pipelines and then run them in parallel. Don't forget to set your Prefect server URL.

# register the data pipeline, use any config or out folder path
hmc-unhide pipeline register --config configs/config.yaml --out /opt/data

# register the hifis pipeline
hmc-unhide stats register

License

The software is distributed under the terms and conditions of the MIT license which is specified in the LICENSE file.

Acknowledgement

This project was supported by the Helmholtz Metadata Collaboration (HMC), an incubator-platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative.
