OAIPMH harvesters for National repository

NR OAI-PMH Harvesters

OAI-PMH metadata transformers for the Czech National Repository (Národní Repozitář). This package converts MARC 21 records harvested from external repositories into the NR metadata schema so that they can be ingested into the Invenio-based national repository infrastructure.

Overview

nr-oaipmh-harvesters is a plugin for oarepo-oaipmh-harvester that provides transformer implementations. Each transformer maps MARC 21 fields from a specific source repository to the NR documents metadata model (nr-metadata).

Currently supported sources:

Source                                    Transformer key   Description
NUSL (Národní úložiště šedé literatury)   nusl              National Repository of Grey Literature: theses, reports, conference papers, and more

Requirements

  • Python ≥ 3.9
  • A running Invenio instance with the NR stack
  • Dependencies (installed automatically): oarepo-oaipmh-harvester >= 4.0.0, dojson, Levenshtein, nr-metadata

Installation

pip install nr-oaipmh-harvesters

The package registers itself as an Invenio extension via entry points, so no additional configuration is needed beyond the standard Invenio app setup.
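
To check that the entry points were picked up after installation, the standard library can list what is registered. This is a minimal sketch, assuming the extension is exposed under one of the groups that Invenio conventionally scans (invenio_base.apps, invenio_base.api_apps); the group this package actually uses is not stated here, and registered_names is a helper name made up for the example.

```python
from importlib.metadata import entry_points


def registered_names(group):
    """Return the sorted names of all installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the returned object supports select()
        selected = eps.select(group=group)
    else:
        # Python 3.9: entry_points() returns a plain dict of lists
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    # Group names are an assumption (the conventional Invenio groups)
    for group in ("invenio_base.apps", "invenio_base.api_apps"):
        print(group, registered_names(group))
```

If the package is installed in the active environment, its extension name should appear in one of the printed lists; an empty list only means nothing is registered under that particular group.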

Usage

Registering a harvester

Use the Invenio CLI to register a new OAI-PMH harvester. For NUSL:

invenio oarepo oai harvester add nusl \
    --name "NUSL harvester" \
    --url http://invenio.nusl.cz/oai2d/ \
    --set global \
    --prefix marcxml \
    --loader sickle \
    --transformer marcxml \
    --transformer nusl \
    --writer 'service{service=nr_documents}'

This sets up a harvester that:

  1. Connects to the NUSL OAI-PMH endpoint.
  2. Fetches records using the sickle loader.
  3. Pipes them through the marcxml transformer (generic MARC XML → JSON), then the nusl transformer (NUSL-specific mapping to NR schema).
  4. Writes the resulting records via the nr_documents service.

Running the harvest

# Harvest new/updated records (incremental, from last timestamp)
invenio oarepo oai harvester run nusl

# Re-harvest everything
invenio oarepo oai harvester run nusl --all-records

# Run in the background via Celery
invenio oarepo oai harvester run nusl --on-background

# Harvest specific record(s)
invenio oarepo oai harvester run nusl --identifier oai:invenio.nusl.cz:12345

Architecture

Transformer pipeline

OAI-PMH endpoint
  │
  ▼
Loader (sickle)         ── fetches raw XML
  │
  ▼
Transformer: marcxml    ── XML → flat JSON  {marc_field_code: value}
  │
  ▼
Transformer: nusl       ── MARC JSON → NR metadata schema
  │
  ▼
Writer (service)        ── creates/updates Invenio records
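
The stages in the diagram can be illustrated with a small standard-library sketch. It is not the code used by the harvester: the function names, the flat key format (tag plus subfield code, e.g. 245a), and the output field names are all assumptions chosen for the example, and the writer stage is left out.

```python
import xml.etree.ElementTree as ET

# MARC 21 slim namespace used by MARCXML records
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">cze</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>"""


def load(xml_text):
    """Stand-in for the loader stage: parse one raw MARCXML record."""
    return ET.fromstring(xml_text)


def marcxml_to_flat(record):
    """Stand-in for the marcxml stage: XML -> {field code: [values]}."""
    flat = {}
    for cf in record.findall("marc:controlfield", MARC_NS):
        flat.setdefault(cf.get("tag"), []).append(cf.text or "")
    for df in record.findall("marc:datafield", MARC_NS):
        for sf in df.findall("marc:subfield", MARC_NS):
            key = df.get("tag") + sf.get("code")  # hypothetical key format
            flat.setdefault(key, []).append(sf.text or "")
    return flat


def flat_to_target(flat):
    """Stand-in for the source-specific stage (hypothetical field names)."""
    return {
        "system_identifier": flat.get("001", [None])[0],
        "languages": flat.get("041a", []),
        "title": flat.get("245a", [None])[0],
    }


def run_pipeline(value, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        value = stage(value)
    return value


result = run_pipeline(SAMPLE, [load, marcxml_to_flat, flat_to_target])
print(result)
```

Keeping each stage a plain function of its input mirrors the diagram: a generic step can be reused across sources while only the last mapping step is source-specific.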

NUSL transformer

The NUSLTransformer (extending OAIRuleTransformer) handles the following MARC 21 fields:

MARC field   Target metadata
001          System identifier (NUSL control number)
020 / 022    ISBN / ISSN
035          Original OAI record identifier
041          Language
046          Date issued / date modified
245 / 246    Title, translated title, alternate title, subtitle
260          Publisher
336          Certified methodology resource type
490          Series
502          Degree grantor, date defended
520          Abstract
540          Rights / license (Creative Commons parsing)
586          Defense status
598          Notes
650 / 653    Subjects and keywords (Czech / English)
656          Study field
710          Degree grantor (institutional)
711          Event (conference)
720          Creators and contributors (with ORCID, affiliation resolution)
773          Related item
856          Original record URL, external location, file attachments
970          Catalogue system number
980          Resource type
996          Accessibility
998          Collection
999          Funding references
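
As an illustration of what the Creative Commons parsing for field 540 might involve, the sketch below splits a license URL into its parts. It is a guess at the general technique, not the transformer's actual logic: parse_cc_license, the regular expression, and the output keys are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Path layout of Creative Commons license URLs:
#   /licenses/<kind>/<version>/[<jurisdiction>/]
CC_PATH = re.compile(
    r"^/licenses/(?P<kind>[a-z-]+)/(?P<version>\d+\.\d+)"
    r"(?:/(?P<jurisdiction>[a-z]{2,}))?/?$"
)


def parse_cc_license(text):
    """Return the parts of a CC license URL, or None if `text` is not one."""
    parsed = urlparse(text.strip())
    host = parsed.hostname or ""
    if host != "creativecommons.org" and not host.endswith(".creativecommons.org"):
        return None
    match = CC_PATH.match(parsed.path.lower())
    if match is None:
        return None
    # Drop the optional jurisdiction when it is absent
    return {k: v for k, v in match.groupdict().items() if v is not None}


print(parse_cc_license("https://creativecommons.org/licenses/by-nc-nd/4.0/"))
print(parse_cc_license("http://creativecommons.org/licenses/by/3.0/cz/"))
print(parse_cc_license("All rights reserved"))
```

Returning None for anything that is not a recognizable license URL lets the caller decide how to treat free-text rights statements.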

The transformer also performs post-processing such as deduplication of languages, contributors, subjects, and additional titles.
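
One common way to do this kind of deduplication is to keep the first occurrence of each item and preserve the original order. The sketch below shows that approach under the assumption that items are JSON-serializable values such as strings or dicts; dedupe is a hypothetical helper, not a function from this package.

```python
import json


def dedupe(items):
    """Drop repeated items, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for item in items:
        # dicts are unhashable, so compare by a canonical JSON rendering
        key = json.dumps(item, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


print(dedupe(["cze", "eng", "cze"]))
print(dedupe([
    {"name": "A", "role": "editor"},
    {"role": "editor", "name": "A"},  # same content, different key order
    {"name": "B"},
]))
```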

Vocabulary resolution

The package includes a VocabularyCache that resolves free-text institution names (from MARC 720 affiliations) against the NR institutions vocabulary using Lucene queries and Levenshtein distance matching. Resolved institutions are cached via invenio-cache with a configurable TTL (default: 1 hour). A fallback temporary institutions lookup table (temp_institutions.py) is used for records that cannot be matched through the vocabulary service.
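
The matching and caching ideas can be sketched with the standard library alone. This is not the VocabularyCache implementation: the candidate list is held in memory instead of being fetched with a Lucene query, the edit distance is a plain Python function rather than the Levenshtein package, and TTLCache, closest, the threshold of 3, and the sample names are assumptions made for the example.

```python
import time


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest(name, candidates, max_distance=3):
    """Return the candidate nearest to `name`, or None if none is close."""
    normalized = name.casefold().strip()
    best, best_distance = None, max_distance + 1
    for candidate in candidates:
        distance = levenshtein(normalized, candidate.casefold())
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute(key)
        self._entries[key] = (now, value)
        return value


# Sample data for the example only
VOCABULARY = ["Univerzita Karlova", "Masarykova univerzita"]

cache = TTLCache(ttl_seconds=3600)
print(cache.get_or_compute("Univerzita Karlov", lambda n: closest(n, VOCABULARY)))
print(cache.get_or_compute("Some Unknown Institute", lambda n: closest(n, VOCABULARY)))
```

A None result is where a fallback table such as the one in temp_institutions.py would be consulted.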

Project structure

nr-oaipmh-harvesters/
├── nr_oaipmh_harvesters/
│   ├── config.py                 # Registers transformers in DATASTREAMS_TRANSFORMERS
│   ├── ext.py                    # Invenio extension (NRDocsOAIHarvesterExt)
│   └── nusl/
│       ├── __init__.py           # Exports NUSLTransformer
│       ├── transformer.py        # NUSL MARC 21 → NR metadata transformer
│       └── temp_institutions.py  # Fallback institution name mapping
├── tests/
│   ├── run_transform.py          # End-to-end harvest test
│   ├── run_transform_separately.py  # Per-record transformer test with validation
│   ├── test_institutions.py      # Institution resolution tests
│   ├── get_code.py
│   └── invenio.cfg
├── format.sh                     # Code formatting (black, autoflake, isort)
├── setup.cfg
├── setup.py
├── pyproject.toml
└── README.md

Development

Setup

git clone git@github.com:Narodni-repozitar/nr-oaipmh-harvesters.git
cd nr-oaipmh-harvesters
pip install -e ".[dev]"

Code formatting

./format.sh

This runs black (target Python 3.10), autoflake (unused import removal), and isort (import sorting, black profile).

Testing transformations locally

You can test the transformer against a local directory of OAI records:

# Requires a running Invenio app context and an oai-data directory
python tests/run_transform_separately.py

Errors are written to /tmp/errors.yaml for inspection.

Adding a new source repository

To add support for harvesting from a new OAI-PMH source:

  1. Create a new sub-package under nr_oaipmh_harvesters/ (e.g., nr_oaipmh_harvesters/my_source/).
  2. Implement a transformer class extending OAIRuleTransformer from oarepo-oaipmh-harvester.
  3. Register the transformer in config.py by adding it to the DATASTREAMS_TRANSFORMERS dictionary.
  4. Register the harvester via the Invenio CLI with --transformer my_source.
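
The steps above can be sketched roughly as follows. The interface of OAIRuleTransformer is not described here, so the example uses a self-contained stand-in base class; its rules attribute, the transform method, the flat key format, and the assumption that DATASTREAMS_TRANSFORMERS maps a transformer key to a class are all hypothetical and should be checked against the real base class and config.py.

```python
class OAIRuleTransformerStandIn:
    """Hypothetical stand-in; the real OAIRuleTransformer may differ."""

    rules = {}  # maps a flat MARC field code to a rule function

    def transform(self, flat_record):
        output = {}
        for code, values in flat_record.items():
            rule = self.rules.get(code)
            if rule is not None:  # codes without a rule are skipped
                rule(output, values)
        return output


def set_title(output, values):
    """Rule for the title field: keep the first value."""
    output["title"] = values[0]


def add_languages(output, values):
    """Rule for the language field: collect all values."""
    output.setdefault("languages", []).extend(values)


# Step 2: a transformer for the new source (in nr_oaipmh_harvesters/my_source/)
class MySourceTransformer(OAIRuleTransformerStandIn):
    rules = {"245a": set_title, "041a": add_languages}


# Step 3: registration as it might look in config.py (assumed dict shape)
DATASTREAMS_TRANSFORMERS = {"my_source": MySourceTransformer}

transformer = DATASTREAMS_TRANSFORMERS["my_source"]()
result = transformer.transform(
    {"245a": ["Example title"], "041a": ["cze", "eng"], "999x": ["ignored"]}
)
print(result)
```

The key used in the dictionary is the value passed to --transformer in step 4.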

Related packages

Authors
