A package for harvesting, curating and uploading metadata.


harvester-curator

harvester-curator is a Python-based automation tool designed to streamline metadata collection and management in research data management. It automates the extraction of metadata from source repositories or directories, and then seamlessly maps and adapts this metadata to comply with the designated repository's metadata schemas, preparing it for integration into datasets.

In essence, harvester-curator combines file crawling and crosswalking to automate the complex and labor-intensive processes of metadata collection and repository population. Tailored for Dataverse environments, it gives researchers a streamlined way to accelerate data management workflows and helps ensure that their research data aligns with the FAIR principles of Findability, Accessibility, Interoperability, and Reusability.

Tool Workflow

harvester-curator simplifies metadata collection and integration into research data repositories through two primary phases: the Harvester phase, which automates the extraction of metadata, and the Curator phase, which maps and adapts this metadata for integration into datasets within a target repository.

Detailed Tool Workflow

Let's delve deeper into the operational details of harvester-curator's workflow.

harvester-curator optimizes metadata collection and integration in two main phases:

Harvester Phase: Automates the extraction of metadata from sources specified by the user, including repositories or directories.

Curator Phase: Maps and adapts the harvested metadata to ensure its integration into the target repository.

Harvester Phase

During the initial Harvester phase, a crawler methodically scans files within the source directory and its subdirectories, sorting them by type and extension. This results in files being systematically grouped for further processing. Customized parsers are then utilized to extract metadata from these categorized groups, compiling the data into a well-organized JSON format.
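The crawl-and-group step described above can be sketched as follows. This is a minimal illustration, not the package's actual crawler: the function name `group_files_by_extension` is hypothetical, and grouping by lowercase extension alone is an assumption about how files are categorized.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_extension(source_dir):
    """Walk source_dir recursively and group file paths by extension.

    Hypothetical helper: the real crawler may also inspect file types.
    """
    groups = defaultdict(list)
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file():
            # Path.suffix includes the leading dot; strip it and
            # lowercase it so "mesh.VTK" and "mesh.vtk" land together.
            extension = path.suffix.lower().lstrip(".")
            groups[extension].append(path)
    return dict(groups)
```

Each group could then be handed to the parser responsible for that extension, and the parser results collected into a single JSON document.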

We currently support a variety of parsers, including VTK, HDF5, CFF, BibTeX, YAML and JSON:

VTK-parser: Supports file types such as vtk, vti, vtr, vtp, vts, vtu, pvti, pvtr, pvtp, pvts and pvtu.

HDF5-parser: Handles formats including hdf5, h5, he5.

JSON-parser: Processes types json and jsonld.
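One way to picture the relationship between file extensions and parsers is a lookup table. The sketch below uses only the extensions listed above; the table name `PARSER_EXTENSIONS` and the function `parser_for` are hypothetical and do not reflect the package's internal API. Extensions for the CFF, BibTeX and YAML parsers are not listed in this description, so they are left out.

```python
# Extensions copied from the parser list above; the structure itself
# is an illustrative assumption, not the package's real registry.
PARSER_EXTENSIONS = {
    "VTK-parser": {"vtk", "vti", "vtr", "vtp", "vts", "vtu",
                   "pvti", "pvtr", "pvtp", "pvts", "pvtu"},
    "HDF5-parser": {"hdf5", "h5", "he5"},
    "JSON-parser": {"json", "jsonld"},
}


def parser_for(filename):
    """Return the name of the parser that handles filename, or None."""
    if "." not in filename:
        return None
    extension = filename.rsplit(".", 1)[-1].lower()
    for parser_name, extensions in PARSER_EXTENSIONS.items():
        if extension in extensions:
            return parser_name
    return None
```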

Curator Phase

In the subsequent Curator phase, harvester-curator aligns the harvested metadata with the metadata schemas of the target repository, such as DaRUS. It matches the harvested metadata attributes with those defined in the metadata schemas and integrates the values into the appropriate locations. Additionally, it supports direct upload of curated metadata to the destination repository.

The Curator algorithm employs mappings to reconcile discrepancies between the naming conventions of harvested metadata and the metadata schemas of the target repository. Harvested metadata typically has a flat structure, where attributes, values, and paths sit at the same level, whereas repository schemas are usually hierarchical. The algorithm therefore adapts harvested metadata to ensure compatibility:

Mapping and Matching: It begins by updating attribute values and paths of harvested metadata based on predefined mappings, taking into account the hierarchical structure of repository schemas.
Attribute Matching: The algorithm searches for matching attributes within the target repository's schema. If no direct match is found, it combines parent and attribute information in search of a suitable match. Attributes that remain unmatched are noted for subsequent matching attempts with an alternative schema.
Parent Matching: Upon finding a match, the algorithm designates the corresponding parent from the schema as the "matching parent." If a direct parent match does not exist, or if multiple matches are found, it examines common elements between the schema and harvested metadata to determine the most appropriate matching parent.
Dictionary Preparation: Attributes that successfully match are compiled into a dictionary that includes the mapped attribute, value, parent, and schema name, ensuring the metadata is compatible with the target repository.
Similarity Matching: When exact matches are not found across all schemas, the algorithm employs similarity matching with an 85% threshold to accommodate differences in metadata schema integration.

This systematic approach ensures compatibility with the requirements of the target repository and enhances the precision of metadata integration by utilizing direct mapping, exact matching and similarity matching to overcome schema alignment challenges.
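The order of mapping, exact matching and similarity matching can be illustrated with a small sketch. It is a simplification under stated assumptions: parent matching is omitted, the function name `match_attribute` is hypothetical, exact matching is assumed to be case-insensitive, and `difflib.SequenceMatcher` stands in for whatever similarity measure the package actually uses. Only the 85% threshold comes from the description above.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the 85% threshold mentioned above


def match_attribute(harvested_name, schema_attributes, mappings=None):
    """Return (schema_attribute, match_kind) for one harvested attribute.

    match_kind is "exact", "similar" or "unmatched".
    """
    mappings = mappings or {}
    # Step 1: rename the attribute if a predefined mapping exists.
    name = mappings.get(harvested_name, harvested_name)

    # Step 2: look for an exact match (case-insensitive, an assumption).
    by_lower = {attr.lower(): attr for attr in schema_attributes}
    if name.lower() in by_lower:
        return by_lower[name.lower()], "exact"

    # Step 3: fall back to the most similar schema attribute.
    best_attr, best_score = None, 0.0
    for attr in schema_attributes:
        score = SequenceMatcher(None, name.lower(), attr.lower()).ratio()
        if score > best_score:
            best_attr, best_score = attr, score
    if best_score >= SIMILARITY_THRESHOLD:
        return best_attr, "similar"
    return None, "unmatched"
```

In this sketch an attribute that ends up "unmatched" would be the candidate for a later attempt against an alternative schema.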

Project Structure

The harvester-curator project is organized as follows:

src/harvester_curator/: The main app package directory containing all the source code.
tests/: Contains all tests for the harvester-curator application.
images/: Contains images used in the documentation, such as the workflow diagram.

How to Install harvester-curator:

harvester-curator can be installed via pip, the recommended tool for installing Python packages.

0. Install pip (if not already installed):

If you don’t have pip installed, you can install it with the following command:

python3 -m ensurepip --upgrade

For more detailed instructions on installing pip, please visit the official pip installation guide.

1. Install harvester-curator:

To install harvester-curator from PyPI, simply run:

pip install harvester-curator

This will automatically download and install harvester-curator and its dependencies.

2. Verify Installation:

After the installation, you can verify it by running:

harvester-curator --help

Usage

The harvester-curator app is designed to facilitate the efficient collection, curation and uploading of metadata. Follow these instructions to utilize the app and its available subcommands effectively.

General Help

For an overview of all commands and their options:

harvester-curator --help

Harvesting Metadata

To collect metadata from files in a specified directory:

harvester-curator harvest --dir_path "/path/to/directory" --output_filepath "/path/to/harvested_output.json"

Or, using short options:

harvester-curator harvest -d "/path/to/directory" -o "/path/to/harvested_output.json"

Important Note: Without --dir_path, the default is the example folder within the harvester_curator package. Without --output_filepath, harvested metadata is saved to output/harvested_output.json by default.

Curating Metadata

To process and align harvested metadata with specified schema metadata blocks:

harvester-curator curate --harvested_metadata_filepath "/path/to/harvested_output.json" --output_filepath "/path/to/curated_output.json" --api_endpoints_filepath "/path/to/schema_api_endpoints.json"

Or, using short options:

harvester-curator curate -h "/path/to/harvested_output.json" -o "/path/to/curated_output.json" -a "/path/to/schema_api_endpoints.json"

Important Note: Default file paths are used if options are not specified:

--harvested_metadata_filepath defaults to output/harvested_output.json.
--output_filepath defaults to output/curated_output.json.
--api_endpoints_filepath defaults to curator/api_end_points/darus_md_schema_api_endpoints.json.

Uploading Metadata

To upload curated metadata to a Dataverse repository as dataset metadata:

harvester-curator upload --server_url "https://xxx.xxx.xxx" --api_token "abc0_def123_gkg456__hijk789" --dataverse_id "mydataverse_alias" --curated_metadata_filepath "/path/to/curated_output.json"

Or, using short options:

harvester-curator upload -s "https://xxx.xxx.xxx" -a "abc0_def123_gkg456__hijk789" -d "mydataverse_alias" -c "/path/to/curated_output.json"

Important Note: The default for --curated_metadata_filepath is output/curated_output.json.
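The three subcommands can be chained from a script. The sketch below builds the command lines from the long options shown above and runs them in order. The function names are hypothetical, and the curate step omits --api_endpoints_filepath on the assumption that the documented default is acceptable.

```python
import subprocess


def build_pipeline_commands(source_dir, server_url, dataverse_id, api_token,
                            harvested="output/harvested_output.json",
                            curated="output/curated_output.json"):
    """Return the harvest, curate and upload commands as argument lists."""
    harvest = ["harvester-curator", "harvest",
               "--dir_path", source_dir,
               "--output_filepath", harvested]
    curate = ["harvester-curator", "curate",
              "--harvested_metadata_filepath", harvested,
              "--output_filepath", curated]
    upload = ["harvester-curator", "upload",
              "--server_url", server_url,
              "--api_token", api_token,
              "--dataverse_id", dataverse_id,
              "--curated_metadata_filepath", curated]
    return [harvest, curate, upload]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

When calling `build_pipeline_commands`, it is usually safer to read the API token from an environment variable than to hard-code it in a script or shell history.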

Release history

0.0.3 (this version): Sep 19, 2024
0.0.2: Sep 19, 2024

Download files


Source Distribution

harvester_curator-0.0.3.tar.gz (31.9 kB)

Uploaded Sep 19, 2024 Source

Built Distribution


harvester_curator-0.0.3-py3-none-any.whl (37.5 kB)

Uploaded Sep 19, 2024 Python 3

File details

Details for the file harvester_curator-0.0.3.tar.gz.

File metadata

Download URL: harvester_curator-0.0.3.tar.gz
Upload date: Sep 19, 2024
Size: 31.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.12.6

File hashes

Hashes for harvester_curator-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`c33dbf6bc91135f8eb93a7ec99c07e2636cba1f3796e41402a1d89458f8739de`
MD5	`f92ab815c0ea8de5abacd1e33c1fa897`
BLAKE2b-256	`38fba5f7d7a6ab2c5895841f09cf96cf8272c98a7098f4c503cb83020b985c8e`


File details

Details for the file harvester_curator-0.0.3-py3-none-any.whl.

File metadata

Download URL: harvester_curator-0.0.3-py3-none-any.whl
Upload date: Sep 19, 2024
Size: 37.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.12.6

File hashes

Hashes for harvester_curator-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`444c2004eb17e4ceda092ebf00e07020f216fb69c4cca5cd21195ac2233c8420`
MD5	`07e46082be4e390a139315e59104f711`
BLAKE2b-256	`80ec199fee2842d1c8060910ee25e9db47f6ba2d5167c092333521a43c67303a`
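A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch; the function name `sha256_matches` is hypothetical.

```python
import hashlib


def sha256_matches(file_path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, the function could be called with the path to the downloaded harvester_curator-0.0.3.tar.gz and the SHA256 value from the corresponding table.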

