oarepo OAI-PMH converter.

oarepo-oai-pmh-harvester

OAI-PMH Client for Invenio under OArepo brand.

Installation

The library is published on PyPI and can be installed with pip:

pip install oarepo-oai-pmh-harvester

Configuration

Harvesting must be configured in invenio.cfg or via app.config. All settings live under the OAREPO_OAI_PROVIDERS key. Its value is a dictionary keyed by provider code, and each provider can define several independent settings (jobs), called synchronizers.

OAREPO_OAI_PROVIDERS = {
    "provider-name": {
        "description": "Short provider description",
        "synchronizers": [
            {
                "name": "xoai",
                "oai_endpoint": "https://example.com/oai/",
                "set": "example_set",
                "metadata_prefix": "oai_dc",
                "unhandled_paths": ["/dc/unhandled"],
                "default_endpoint": "recid",
                "use_default_endpoint": True,
                "endpoint_mapping": {
                    "field_name": "doc_type",
                    "mapping": {
                        "record": "recid"
                    }
                }
            }
        ]
    },
}

Parameters:

  • description: Short textual description of the provider
  • synchronizers: List of dictionaries, one per synchronizer, each with the following settings:
    • name: name of the setting/synchronizer
    • oai_endpoint: URL of the OAI-PMH endpoint
    • set: name of the OAI set
    • metadata_prefix: name of the OAI metadata prefix
    • unhandled_paths: List of paths in the raw JSON that are intentionally not handled by any rule. Every path without a rule must be listed here; otherwise the client reports an error that the path was not processed by any rule.
    • default_endpoint: Name of the endpoint defined in RECORDS_REST_ENDPOINTS (from the invenio-records-rest library) that is used unless another endpoint is specified.
    • endpoint_mapping: If multiple invenio-records-rest endpoints are used, rules are needed to decide which endpoint is assigned to a particular record. In most cases, the endpoint can be chosen from a metadata field (field_name) together with a dictionary (mapping), where each key is a value of the metadata field and the corresponding value is the name of the endpoint to use.
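
The interplay of default_endpoint and endpoint_mapping can be sketched as follows. This is only an illustration of the behaviour described above, not the package's own code: the helper resolve_endpoint, the "thesis" field value and the "theses" endpoint name are assumptions made up for this example.

```python
# Hypothetical helper, not part of oarepo-oai-pmh-harvester: it only
# illustrates how the synchronizer keys described above could interact.
def resolve_endpoint(synchronizer, record):
    """Pick an invenio-records-rest endpoint name for a harvested record."""
    mapping_config = synchronizer.get("endpoint_mapping") or {}
    field_name = mapping_config.get("field_name")
    mapping = mapping_config.get("mapping", {})

    # Look up the value of the configured metadata field in the mapping.
    field_value = record.get(field_name) if field_name else None
    if isinstance(field_value, str) and field_value in mapping:
        return mapping[field_value]

    # Fall back to the default endpoint when no mapping rule matches.
    return synchronizer["default_endpoint"]


synchronizer = {
    "default_endpoint": "recid",
    "endpoint_mapping": {
        "field_name": "doc_type",
        "mapping": {"thesis": "theses"},  # hypothetical endpoint name
    },
}

thesis_endpoint = resolve_endpoint(synchronizer, {"doc_type": "thesis"})
other_endpoint = resolve_endpoint(synchronizer, {"doc_type": "article"})
print(thesis_endpoint, other_endpoint)
```

A record whose doc_type is listed in the mapping gets the mapped endpoint; any other record falls back to default_endpoint.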

Usage

The package integrates an OAI-PMH client into Invenio. It is a wrapper built on the Sickle library and provides integration with invenio-records. Its purpose is to keep records synchronized with a remote OAI-PMH source.

Successful data collection requires several steps, which consist of:

  1. Configuration (see configuration chapter)
  2. Parser: function that converts XML into JSON
  3. Rules: functions that convert raw JSON (from parser) into final JSON

Parsers

A parser is a function that transforms XML into JSON (represented as a Python dictionary). The module containing the function must be listed in entry_points, and the function itself must be marked with a decorator. The function takes an lxml.etree._Element as its argument and returns a dictionary.

  • entry_points:

The module is registered in entry_points under the keyword oarepo_oai_pmh_harvester.parsers, for example as follows:

entry_points={
       'oarepo_oai_pmh_harvester.parsers': [
           'xoai = example.parser',
       ],
   }
  • decorator: The decorator takes one parameter, the name of the metadata format, which must match metadata_prefix in the configuration. The decorated function must accept one positional argument (etree._Element) and return a dictionary.

from oarepo_oai_pmh_harvester.proxies import current_oai_client

@current_oai_client.parser("xoai")
def xml_to_json_parser(etree):
    dict_ = {}
    # ... fill dict_ from the XML element here ...
    return dict_
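
As a minimal sketch of what such a parser body might look like, the function below turns an element tree into nested dictionaries. The output shape (child tags as keys, values collected in lists) is an assumption chosen for illustration, and attributes are ignored. To stay self-contained the example uses the standard library's xml.etree.ElementTree as a stand-in for lxml and leaves out the decorator; the calls used here (iteration, tag, text, len) are also available on lxml elements.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def xml_to_json_parser(element):
    """Convert an XML element into a nested dictionary.

    Child tags become keys; values are collected in lists so that
    repeated tags are preserved. Attributes are ignored.
    """
    result = {}
    for child in element:
        # Skip comments and processing instructions (lxml yields them
        # as children whose tag is not a string).
        if not isinstance(child.tag, str):
            continue
        # Drop a "{namespace}" prefix from the tag, if there is one.
        tag = child.tag.split("}")[-1]
        if len(child):
            value = xml_to_json_parser(child)  # recurse into sub-elements
        else:
            value = (child.text or "").strip()
        result.setdefault(tag, []).append(value)
    return result


xml = "<dc><title>Example title</title><creator>A</creator><creator>B</creator></dc>"
parsed = xml_to_json_parser(ET.fromstring(xml))
print(parsed)
```

In a real project the function would additionally be registered with @current_oai_client.parser("xoai") as shown above.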

Rules

The transformation converts the raw parsed JSON into the final JSON. The built-in transformer recursively traverses the raw JSON and, for every path, checks whether a rule exists for that path or whether the path is listed in unhandled_paths in the configuration. If neither condition is met, it raises an error to warn the user that a metadata field has been forgotten.
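
The traversal described above could look roughly like the following sketch. It is a simplified illustration under stated assumptions, not the package's actual transformer: the function transform, the plain dictionary of rules and the ValueError are all hypothetical.

```python
# Hypothetical sketch; the real transformer in the package may differ.
def transform(raw, rules, unhandled_paths, path=""):
    """Walk raw JSON and apply the rule registered for each path."""
    result = {}
    for key, value in raw.items():
        current = f"{path}/{key}"
        if current in rules:
            # A rule takes over the whole subtree at this path.
            result.update(rules[current](value))
        elif current in unhandled_paths:
            continue  # explicitly ignored in the configuration
        elif isinstance(value, dict):
            # No rule yet: descend one level and keep looking.
            result.update(transform(value, rules, unhandled_paths, current))
        else:
            raise ValueError(f"Path {current} is not handled by any rule")
    return result


rules = {"/dc/title": lambda el: {"title": el[0]}}
raw = {"dc": {"title": ["Example title"], "unhandled": ["x"]}}

final = transform(raw, rules, {"/dc/unhandled"})
print(final)
```

Calling transform(raw, rules, set()) instead raises the error, because /dc/unhandled then has neither a rule nor an entry in the unhandled paths.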

A rule is a function that accepts the arguments el (element) and kwargs (keyword arguments) and returns the reworked element as a Python dictionary. The module that contains the rule must be listed in entry_points, and the function itself must be registered using a decorator.

  • entry_points:

The module is registered in entry_points under the keyword oarepo_oai_pmh_harvester.rules, for example as follows:

entry_points={
       'oarepo_oai_pmh_harvester.rules': [
           'xoai = example.rule',
       ],
   }
  • decorator:

The decorator has three positional arguments and one named argument (phase):

  1. provider_name: must be the same as in the configuration
  2. metadata_prefix: must be the same as in the configuration
  3. json_path: path in the raw JSON; levels are separated by "/"
  4. phase:
    • pre: the rule is applied during the creation of the final JSON.
    • post: the rule is applied after all pre rules.

The rule function itself must accept el (element) and **kwargs in its signature. el is the JSON value at the given JSON path. The function must return a dictionary (e.g. {"title": "Example title"}).

Kwargs contain several useful variables:

  • paths: a set containing the absolute JSON path and all of its relative suffixes, e.g. (/dc/title/en, dc/title/en, title/en, en)
  • results: a list of individual results, which together make up the final JSON
  • phase: the current phase, pre or post
  • record: the raw JSON as a defaultdict
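
To make the contents of paths concrete, the sketch below derives the set from an absolute path. The helper path_variants is hypothetical and only reproduces the example given in the list above; how the package itself builds the set is not shown here.

```python
# Hypothetical helper that reproduces the "paths" example above.
def path_variants(absolute_path):
    """Return the absolute path plus every relative suffix of it."""
    parts = absolute_path.strip("/").split("/")
    variants = {absolute_path}
    for i in range(len(parts)):
        # Join the remaining levels, dropping the first i of them.
        variants.add("/".join(parts[i:]))
    return variants


paths = path_variants("/dc/title/en")
print(sorted(paths))
```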

Example of a rule:

from oarepo_oai_pmh_harvester.proxies import current_oai_client


@current_oai_client.rule("provider_name", "metadata_prefix", "/dc/title/en", phase="pre")
def rule(el, **kwargs):
    value_ = el[0]["value"][0]
    return {"title": value_}
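
Because a rule is an ordinary function, its logic can be exercised without a running harvester. The sketch below repeats the rule body without the decorator and calls it by hand. The shape of el (a list of dictionaries with a "value" list) is inferred from the indexing in the example above, and the keyword arguments are filled with placeholder values; both are assumptions.

```python
# The same rule body as above, without the decorator, so it can be
# called directly. The function name title_rule is made up for this sketch.
def title_rule(el, **kwargs):
    value_ = el[0]["value"][0]
    return {"title": value_}


# Assumed input shape, inferred from el[0]["value"][0].
el = [{"value": ["Example title"]}]

result = title_rule(
    el,
    paths={"/dc/title/en", "dc/title/en", "title/en", "en"},
    results=[],
    phase="pre",
    record={},
)
print(result)
```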

CLI

Once all components (config, parser, rules) are in place, harvesting can be run via the CLI:

Usage: invenio oai run [OPTIONS]

  Starts harvesting the resources set in invenio.cfg through the
  OAREPO_OAI_PROVIDERS environment variable.

Options:
  -p, --provider TEXT      Code name of provider, defined in invenio.cfg
  -s, --synchronizer TEXT  Code name of OAI-PMH setup, defined in invenio.cfg
  --break / --no-break     Break on error, if true program is terminated when
                           record cause error

  -o, --start_oai TEXT     OAI identifier from where synchronization begin
  -i, --start_id INTEGER   The serial number from which the synchronization
                           starts. This is useful if for some reason the
                           previous synchronization was interrupted at some
                           point.

  --help                   Show this message and exit.


