OAI-PMH harvester library for Invenio. Development only, not for production.

OARepo OAI-PMH harvester

An OAI-PMH harvesting library for Invenio 3.5+. The library first converts the OAI-PMH payload into an intermediate JSON representation, which a specific transformer then converts into Invenio records.

Because these transformers are specific to each repository, they are not part of this library and have to be provided by the application.

The progress and transformation errors are captured within the database.

For now, the library does not provide error notifications, but these will be added. Sentry might be used for logging and reporting.

Installation

poetry add oarepo-oaipmh-harvester

Configuration

All configuration is stored in the OAIHarvesterConfig database model. A command-line tool adds a new configuration:

invenio oaiharvester add \
  --code nusl \
  --name NUŠL \
  --url "http://invenio.nusl.cz/oai2d/" \
  --set global \
  --prefix marcxml \
  --transformer nusl_oai.transformer.NuslTransformer

This registers an OAI-PMH harvester with the code "nusl", its URL, OAI set and metadata prefix. Records from this harvester are transformed by the NuslTransformer before they are written to the repository.

The command accepts the following options:

Usage: invenio oaiharvester add [OPTIONS]

Options:
  --code TEXT         OAI server code  [required]
  --name TEXT         OAI server name  [required]
  --url TEXT          OAI base url  [required]
  --set TEXT          OAI set  [required]
  --prefix TEXT       OAI metadata prefix  [required]
  --parser TEXT       OAI metadata parser. If not passed, a prefix-based default is used
  --transformer TEXT  Transformer class  [required]
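
The `--parser` fallback can be pictured as a lookup keyed by the metadata prefix. The sketch below is only an illustration of that idea: the registry `DEFAULT_PARSERS`, the dotted paths in it and the function `resolve_parser` are hypothetical names, not the library's actual implementation.

```python
from typing import Optional

# Hypothetical registry: metadata prefix -> dotted path of a parser class.
# The paths are placeholders, not real module names from the library.
DEFAULT_PARSERS = {
    "marcxml": "example_parsers.MarcxmlParser",
    "oai_dc": "example_parsers.DublinCoreParser",
}


def resolve_parser(prefix: str, parser: Optional[str] = None) -> str:
    """Return the explicitly passed parser, else the default for the prefix."""
    if parser:
        return parser
    try:
        return DEFAULT_PARSERS[prefix]
    except KeyError:
        # No default is known for this prefix, so the caller has to choose one.
        raise ValueError(
            f"No default parser for prefix {prefix!r}; pass --parser"
        ) from None
```

With this shape, an explicit `--parser` always wins, and an unknown prefix fails early instead of silently picking a parser.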

Usage

Command-line

On the command line, invoke:

invenio oaiharvester harvest nusl <optional list of oai identifiers to harvest>

Options:

  -a, --all-records  Re-harvest all records, not from the last timestamp
  --background       Start Harvest on background (via celery task), return immediately
  --dump-to TEXT     Do not import records, just dump (cache) them to this
                     directory (mostly for debugging)
  --load-from TEXT   Do not contact oai-pmh server but load the records from
                     this directory (created by dump-to option)
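
The `--dump-to` / `--load-from` pair amounts to a round trip through files on disk. The celery task docstring mentions `*.json.gz` files; everything else in the sketch below (one JSON list per file, the `batch-0001` file name, the helper names `dump_records` and `load_records`) is an assumption made for illustration, not the library's actual on-disk layout.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, List, Optional


def dump_records(records: List[dict], directory: str, name: str = "batch-0001") -> Path:
    """Write one batch of parsed records to <directory>/<name>.json.gz."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{name}.json.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    return path


def load_records(directory: str, names: Optional[List[str]] = None) -> Iterator[dict]:
    """Yield records from the *.json.gz files in the directory, in name order.

    If `names` is given, only files with those names are read.
    """
    paths = sorted(Path(directory).glob("*.json.gz"))
    if names is not None:
        paths = [p for p in paths if p.name in names]
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            yield from json.load(f)
```

A dump like this makes transformer debugging repeatable, because the same cached input can be replayed without contacting the OAI-PMH server.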

Celery task

from celery import shared_task


@shared_task
def oai_harvest(
        harvester_id: str, 
        start_from: str, 
        load_from: str = None, 
        dump_to: str = None,
        on_background=False, 
        identifiers=None):
    """
    @param harvester_id: id of the harvester configuration (OAIHarvesterConfig) object
    @param start_from: datestamp (either YYYY-MM-DD or YYYY-MM-DDThh:mm:ss, 
           depends on the OAI endpoint), inclusive
    @param load_from: if set, a path to the directory on the filesystem where 
           the OAI-PMH data are present (in *.json.gz files)
    @param dump_to: if set, harvested metadata will be parsed from xml to json 
           and stored into this directory, not to the repository
    @param on_background: if True, transformation and storage will be started in celery tasks and can run in parallel.
           If false, they will run sequentially inside this task
    @param identifiers: if load_from is set, it is a list of file names within the directory. 
           If load_from is not set, these are oai identifiers for GetRecord. If not set at all, all records from 
           start_from are harvested 
    """

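The interplay of `load_from`, `identifiers` and `start_from` described in the docstring can be summarised as a small decision function. This is a sketch under assumptions: `select_mode`, `format_datestamp` and the mode labels are hypothetical and only restate the docstring; they are not part of the library. GetRecord and ListRecords are the standard OAI-PMH verbs.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def format_datestamp(moment: datetime, with_time: bool = False) -> str:
    """Format a datetime in one of the two granularities named in the docstring."""
    return moment.strftime("%Y-%m-%dT%H:%M:%S" if with_time else "%Y-%m-%d")


def select_mode(
    load_from: Optional[str], identifiers: Optional[List[str]]
) -> Tuple[str, List[str]]:
    """Decide how `identifiers` is interpreted (mode labels are hypothetical)."""
    if load_from:
        # identifiers are file names inside the load_from directory
        return ("load-files", list(identifiers or []))
    if identifiers:
        # identifiers are OAI identifiers, fetched one by one via GetRecord
        return ("get-record", list(identifiers))
    # nothing given: harvest everything from start_from via ListRecords
    return ("list-records", [])
```

Which datestamp granularity to send depends on the OAI endpoint, as the docstring notes, so `with_time` has to be chosen per harvester.
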
Harvest status

Each harvest creates a row in the OAIHarvestRun database table containing the first and last datestamps and the harvest status (running, completed, errored, ...).

A run is split into chunks of records, and each chunk is represented by a row in the OAIHarvestRunBatch database table. The row contains the chunk status (running, completed, warning, failed, ...) and a list of the harvested identifiers with their status (ok, warning during harvesting the identifier, harvesting the identifier failed). The table also contains details of the warnings and errors.
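
The relationship between per-identifier statuses and the chunk status can be modelled in memory as follows. This is a minimal sketch, not the database model: the `Batch` class, the `chunked` helper and the rule that the worst identifier status becomes the chunk status are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional

# Severity order used to roll identifier statuses up to a chunk status.
SEVERITY = {"ok": 0, "warning": 1, "failed": 2}


@dataclass
class Batch:
    """Hypothetical in-memory stand-in for one OAIHarvestRunBatch row."""

    identifiers: Dict[str, str] = field(default_factory=dict)  # identifier -> status
    details: Dict[str, str] = field(default_factory=dict)  # identifier -> message

    def record(self, identifier: str, status: str, detail: Optional[str] = None) -> None:
        if status not in SEVERITY:
            raise ValueError(f"unknown status: {status}")
        self.identifiers[identifier] = status
        if detail:
            self.details[identifier] = detail

    @property
    def status(self) -> str:
        if not self.identifiers:
            return "running"
        # The most severe identifier status decides the chunk status.
        worst = max(self.identifiers.values(), key=SEVERITY.__getitem__)
        return "completed" if worst == "ok" else worst


def chunked(identifiers: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split a stream of identifiers into chunks of at most `size` items."""
    chunk: List[str] = []
    for identifier in identifiers:
        chunk.append(identifier)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```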

Custom parsers and transformers

The input OAI XML is first converted into JSON by a parser.

MARC-XML and DC parsers are supported out of the box. See the Parser section below if you need a different parser.

The JSON is then transformed into an Invenio record by a transformer class. Because different repositories give fields different semantics (even in MARC), this step cannot be generic, and implementers must provide their own transformer class.

Transformer

A simple transformer that transforms just the title from MARC-XML input might look like this:

from oarepo_oaipmh_harvester import OAITransformer, OAIRecord

from my_record.proxies import current_service
from my_record.records.api import MyRecord


class NuslTransformer(OAITransformer):
    # name of the service filter that accesses the record's OAI identifier
    oaiidentifier_search_property = 'metadata_systemIdentifiers_identifier'
    # path to the OAI record identifier inside the record
    oaiidentifier_search_path = ('metadata', 'systemIdentifiers', 'identifier')

    # Invenio service that will be used to create/update the record
    record_service = current_service
    # Invenio record class for this record
    record_model = MyRecord

    def transform_single(self, rec: OAIRecord):
        # add all your transformations here;
        # '24500a' is presumably MARC field 245, indicators 0 0, subfield a (the title)
        rec.transformed.update({
            'metadata': {
                'title': rec['24500a']
            }
        })
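
A transformation like this can be tried out without a running repository by feeding it a stand-in record. In the sketch below, `StubRecord` is a hypothetical substitute for OAIRecord that only mimics what the example above relies on (item access and a `transformed` dict), and `transform_title` is a variant of the title step that merges into existing metadata rather than replacing it.

```python
class StubRecord(dict):
    """Hypothetical stand-in for OAIRecord: parsed fields plus a `transformed` dict."""

    def __init__(self, parsed: dict):
        super().__init__(parsed)
        self.transformed: dict = {}


def transform_title(rec: StubRecord) -> None:
    """Copy the MARC title into the output, keeping metadata set by earlier steps."""
    title = rec.get("24500a")
    if title is None:
        raise KeyError("MARC field 245 subfield a (title) is missing")
    # setdefault() merges into existing metadata instead of overwriting it
    rec.transformed.setdefault("metadata", {})["title"] = title


rec = StubRecord({"24500a": "An example title"})
transform_title(rec)
print(rec.transformed)
```

Raising on a missing title is one possible policy; whether a missing field should fail the record or only produce a warning is up to the application.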

Parser

A parser is responsible for transforming the XML document into an intermediary JSON.

For implementation details see MarcxmlParser.
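
To give a feel for what such a parser does, here is a minimal sketch that flattens MARC-XML into keys shaped like the `'24500a'` used in the transformer example. The key layout (tag, both indicators, subfield code), the `_` placeholder for blank indicators and the choice to collect values in lists are assumptions inferred from that one key; the real MarcxmlParser may differ.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace of the MARC21 "slim" XML schema.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def flatten_marcxml(xml_text: str) -> Dict[str, List[str]]:
    """Flatten MARC-XML datafields into {tag + ind1 + ind2 + code: [values]}."""
    root = ET.fromstring(xml_text)
    flat: Dict[str, List[str]] = {}
    for datafield in root.iter(f"{MARC_NS}datafield"):
        tag = datafield.get("tag", "")
        # Blank indicators become "_" so that every key has the same shape.
        ind1 = (datafield.get("ind1") or " ").replace(" ", "_")
        ind2 = (datafield.get("ind2") or " ").replace(" ", "_")
        for subfield in datafield.iter(f"{MARC_NS}subfield"):
            key = f"{tag}{ind1}{ind2}{subfield.get('code', '')}"
            # Repeated subfields are kept in document order.
            flat.setdefault(key, []).append((subfield.text or "").strip())
    return flat
```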
