OAI-PMH harvesters for the National Repository

NR OAI-PMH Harvesters

OAI-PMH metadata transformers for the Czech National Repository (Národní Repozitář). This package converts harvested MARC 21 records from external repositories into the NR metadata schema so that they can be ingested into the Invenio-based national repository infrastructure.

Overview

nr-oaipmh-harvesters is a plugin for oarepo-oai-pmh-harvester that provides transformer implementations. Each transformer maps MARC 21 fields from a specific source repository to the NR documents metadata model (nr-metadata).

Currently supported sources:

Source                                    Transformer key   Description
NUSL (Národní úložiště šedé literatury)   nusl              National Repository of Grey Literature: theses, reports, conference papers, and more

Requirements

  • Python ≥ 3.9
  • A running Invenio instance with the NR stack
  • Dependencies (installed automatically): oarepo-oai-pmh-harvester >= 4.0.0, dojson, Levenshtein, nr-metadata

Installation

pip install nr-oaipmh-harvesters

The package registers itself as an Invenio extension via entry points — no additional configuration is needed beyond the standard Invenio app setup.
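
To check that the extension is visible after installation, you can list the entry points registered in the current environment. The following is a minimal sketch using only the standard library. The group names invenio_base.apps and invenio_base.api_apps are the groups Invenio conventionally uses for extensions; which of them this package registers under is an assumption, so check setup.cfg for the actual declaration.

```python
from importlib.metadata import entry_points
from typing import List


def list_entry_points(group: str) -> List[str]:
    """Return 'name = value' strings for every entry point in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry point collection
        selected = eps.select(group=group)
    else:
        # Python 3.9: plain dict mapping group name -> entry points
        selected = eps.get(group, [])
    return sorted(f"{ep.name} = {ep.value}" for ep in selected)


if __name__ == "__main__":
    # Assumed group names; adjust to whatever setup.cfg actually declares.
    for group in ("invenio_base.apps", "invenio_base.api_apps"):
        print(group, list_entry_points(group))
```

If the package is installed correctly, one of the printed lists should contain an entry pointing into nr_oaipmh_harvesters.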

Usage

Registering a harvester

Use the Invenio CLI to register a new OAI-PMH harvester. For NUSL:

invenio oarepo oai harvester add nusl \
    --name "NUSL harvester" \
    --url http://invenio.nusl.cz/oai2d/ \
    --set global \
    --prefix marcxml \
    --loader sickle \
    --transformer marcxml \
    --transformer nusl \
    --writer 'service{service=nr_documents}'

This sets up a harvester that:

  1. Connects to the NUSL OAI-PMH endpoint.
  2. Fetches records using the sickle loader.
  3. Pipes them through the marcxml transformer (generic MARC XML → JSON), then the nusl transformer (NUSL-specific mapping to NR schema).
  4. Writes the resulting records via the nr_documents service.

Running the harvest

# Harvest new/updated records (incremental, from last timestamp)
invenio oarepo oai harvester run nusl

# Re-harvest everything
invenio oarepo oai harvester run nusl --all-records

# Run in the background via Celery
invenio oarepo oai harvester run nusl --on-background

# Harvest specific record(s)
invenio oarepo oai harvester run nusl --identifier oai:invenio.nusl.cz:12345
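
At the protocol level, these modes correspond to different OAI-PMH requests: a full harvest is a ListRecords request, an incremental harvest would typically add a from datestamp, and a single-record harvest is a GetRecord request. The sketch below builds such request URLs with the standard library purely to illustrate the difference. It is not how the harvester issues requests (that is delegated to the sickle loader), the helper names are hypothetical, and the from date is a made-up example.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint, set, and prefix are taken from the registration command above.
BASE_URL = "http://invenio.nusl.cz/oai2d/"


def list_records_url(base_url: str, prefix: str, set_spec: str,
                     from_date: Optional[str] = None) -> str:
    """Build a ListRecords request, optionally limited to records changed since from_date."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix, "set": set_spec}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


def get_record_url(base_url: str, prefix: str, identifier: str) -> str:
    """Build a GetRecord request for one OAI identifier."""
    params = {"verb": "GetRecord", "metadataPrefix": prefix, "identifier": identifier}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL, "marcxml", "global"))                # full harvest
    print(list_records_url(BASE_URL, "marcxml", "global", "2024-01-01"))  # incremental
    print(get_record_url(BASE_URL, "marcxml", "oai:invenio.nusl.cz:12345"))
```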

Architecture

Transformer pipeline

OAI-PMH endpoint
  │
  ▼
Loader (sickle)         ── fetches raw XML
  │
  ▼
Transformer: marcxml    ── XML → flat JSON  {marc_field_code: value}
  │
  ▼
Transformer: nusl       ── MARC JSON → NR metadata schema
  │
  ▼
Writer (service)        ── creates/updates Invenio records
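
The two transformer stages in the diagram can be illustrated with a small standalone sketch. It is not the package's implementation: the flat key format (tag, two indicators with blanks written as "_", subfield code), the target field names, and all function names are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict, List

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example title</subfield>
  </datafield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">cze</subfield>
  </datafield>
</record>"""


def marcxml_to_flat(xml_text: str) -> Dict[str, List[str]]:
    """Stage 1: flatten MARC XML into {field key: [values]}."""
    record = ET.fromstring(xml_text)
    flat: Dict[str, List[str]] = {}
    for field in record.findall("marc:controlfield", MARC_NS):
        flat.setdefault(field.get("tag", ""), []).append(field.text or "")
    for field in record.findall("marc:datafield", MARC_NS):
        # Blank indicators are written as "_" in the key (assumed convention).
        ind1 = (field.get("ind1") or " ").replace(" ", "_")
        ind2 = (field.get("ind2") or " ").replace(" ", "_")
        for sub in field.findall("marc:subfield", MARC_NS):
            key = f"{field.get('tag')}{ind1}{ind2}{sub.get('code')}"
            flat.setdefault(key, []).append(sub.text or "")
    return flat


def flat_to_metadata(flat: Dict[str, List[str]]) -> Dict[str, Any]:
    """Stage 2: map flat MARC keys to target fields (hypothetical field names)."""
    return {
        "systemIdentifier": flat.get("001", [None])[0],
        "title": flat.get("245__a", [None])[0],
        "languages": flat.get("041__a", []),
    }


def run_pipeline(data: Any, stages: List[Callable[[Any], Any]]) -> Any:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_RECORD, [marcxml_to_flat, flat_to_metadata]))
```

Chaining the stages as plain callables mirrors the way two --transformer options are applied in order when the harvester is registered.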

NUSL transformer

The NUSLTransformer (extending OAIRuleTransformer) handles the following MARC 21 fields:

MARC field   Target metadata
001          System identifier (NUSL control number)
020 / 022    ISBN / ISSN
035          Original OAI record identifier
041          Language
046          Date issued / date modified
245 / 246    Title, translated title, alternate title, subtitle
260          Publisher
336          Certified methodology resource type
490          Series
502          Degree grantor, date defended
520          Abstract
540          Rights / license (Creative Commons parsing)
586          Defense status
598          Notes
650 / 653    Subjects and keywords (Czech / English)
656          Study field
710          Degree grantor (institutional)
711          Event (conference)
720          Creators and contributors (with ORCID, affiliation resolution)
773          Related item
856          Original record URL, external location, file attachments
970          Catalogue system number
980          Resource type
996          Accessibility
998          Collection
999          Funding references

The transformer also performs post-processing such as deduplication of languages, contributors, subjects, and additional titles.
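
As an illustration of this kind of post-processing, the sketch below removes duplicates from a list while keeping the first occurrence of each item in place. It is one plausible approach, not the transformer's actual code; the function name and the sample values are hypothetical. Items are compared by their JSON serialization so that dictionaries (for example subjects or contributors) can be deduplicated as well as plain strings.

```python
import json
from typing import Any, Iterable, List


def deduplicate(items: Iterable[Any]) -> List[Any]:
    """Drop repeated items, preserving the order of first occurrences."""
    seen = set()
    unique: List[Any] = []
    for item in items:
        # Serialize with sorted keys so dicts with the same content compare equal.
        key = json.dumps(item, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


if __name__ == "__main__":
    print(deduplicate(["cze", "eng", "cze"]))
    print(deduplicate([
        {"subject": "chemistry", "lang": "en"},
        {"lang": "en", "subject": "chemistry"},  # same content, different key order
    ]))
```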

Vocabulary resolution

The package includes a VocabularyCache that resolves free-text institution names (from MARC 720 affiliations) against the NR institutions vocabulary using Lucene queries and Levenshtein distance matching. Resolved institutions are cached via invenio-cache with a configurable TTL (default: 1 hour). A fallback temporary institutions lookup table (temp_institutions.py) is used for records that cannot be matched through the vocabulary service.
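
The matching and caching ideas can be sketched without any dependencies. The example below is a simplified stand-in, not the VocabularyCache itself: it uses a pure-Python edit distance instead of the Levenshtein library, an in-memory dictionary instead of invenio-cache, and it skips the Lucene query step. The threshold of 3 edits, the class and function names, and the institution names are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_match(name: str, candidates: Iterable[str],
                  max_distance: int = 3) -> Optional[str]:
    """Return the candidate nearest to name, or None if none is close enough."""
    best, best_distance = None, max_distance + 1
    for candidate in candidates:
        distance = levenshtein(name.casefold(), candidate.casefold())
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: forget it and report a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (value, self._clock() + self._ttl)


if __name__ == "__main__":
    institutions = ["Univerzita Karlova", "Masarykova univerzita"]
    cache = TTLCache(ttl_seconds=3600)
    raw = "Univerzta Karlova"  # typo in the harvested affiliation
    resolved = cache.get(raw) or closest_match(raw, institutions)
    if resolved is not None:
        cache.set(raw, resolved)
    print(resolved)
```

A name that finds no candidate within the threshold returns None, which is the point where a fallback table such as temp_institutions.py would come into play.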

Project structure

nr-oaipmh-harvesters/
├── nr_oaipmh_harvesters/
│   ├── config.py                 # Registers transformers in DATASTREAMS_TRANSFORMERS
│   ├── ext.py                    # Invenio extension (NRDocsOAIHarvesterExt)
│   └── nusl/
│       ├── __init__.py           # Exports NUSLTransformer
│       ├── transformer.py        # NUSL MARC 21 → NR metadata transformer
│       └── temp_institutions.py  # Fallback institution name mapping
├── tests/
│   ├── run_transform.py          # End-to-end harvest test
│   ├── run_transform_separately.py  # Per-record transformer test with validation
│   ├── test_institutions.py      # Institution resolution tests
│   ├── get_code.py
│   └── invenio.cfg
├── format.sh                     # Code formatting (black, autoflake, isort)
├── setup.cfg
├── setup.py
├── pyproject.toml
└── README.md

Development

Setup

git clone git@github.com:Narodni-repozitar/nr-oaipmh-harvesters.git
cd nr-oaipmh-harvesters
pip install -e ".[dev]"

Code formatting

./format.sh

This runs black (target Python 3.10), autoflake (unused import removal), and isort (import sorting, black profile).

Testing transformations locally

You can test the transformer against a local directory of OAI records:

# Requires a running Invenio app context and an oai-data directory
python tests/run_transform_separately.py

Errors are written to /tmp/errors.yaml for inspection.

Adding a new source repository

To add support for harvesting from a new OAI-PMH source:

  1. Create a new sub-package under nr_oaipmh_harvesters/ (e.g., nr_oaipmh_harvesters/my_source/).
  2. Implement a transformer class extending OAIRuleTransformer from oarepo-oaipmh-harvester.
  3. Register the transformer in config.py by adding it to the DATASTREAMS_TRANSFORMERS dictionary.
  4. Register the harvester via the Invenio CLI with --transformer my_source.
