OAI-PMH harvesters for the National Repository

NR OAI-PMH Harvesters

OAI-PMH metadata transformers for the Czech National Repository (Národní Repozitář). This package converts MARC 21 records harvested from external repositories into the NR metadata schema so that they can be ingested into the Invenio-based national repository infrastructure.

Overview

nr-oaipmh-harvesters is a plugin for oarepo-oai-pmh-harvester that provides transformer implementations. Each transformer maps MARC 21 fields from a specific source repository to the NR documents metadata model (nr-metadata).

Currently supported sources:

| Source | Transformer key | Description |
|---|---|---|
| NUSL (Národní úložiště šedé literatury) | nusl | National Repository of Grey Literature: theses, reports, conference papers, and more |

Requirements

  • Python ≥ 3.9
  • A running Invenio instance with the NR stack
  • Dependencies (installed automatically): oarepo-oai-pmh-harvester >= 4.0.0, dojson, Levenshtein, nr-metadata

Installation

pip install nr-oaipmh-harvesters

The package registers itself as an Invenio extension via entry points, so no configuration is needed beyond the standard Invenio app setup.
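To check which extensions are visible after installation, the installed entry points can be listed with the standard library. This is a minimal sketch: `invenio_base.apps` is the group Invenio normally uses for extensions, but the exact group and entry point name this package registers under are assumptions and are not stated above.

```python
from importlib.metadata import entry_points


def list_entry_points(group):
    """Return the sorted names of installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry_points() returns a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.9: entry_points() returns a plain dict of group -> list.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Assumed group name; an empty list simply means nothing is registered there.
installed_extensions = list_entry_points("invenio_base.apps")
```

If the list contains no entry for this package inside the environment where it was installed, the installation itself is the first thing to check.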

Usage

Registering a harvester

Use the Invenio CLI to register a new OAI-PMH harvester. For NUSL:

invenio oarepo oai harvester add nusl \
    --name "NUSL harvester" \
    --url http://invenio.nusl.cz/oai2d/ \
    --set global \
    --prefix marcxml \
    --loader sickle \
    --transformer marcxml \
    --transformer nusl \
    --writer 'service{service=nr_documents}'

This sets up a harvester that:

  1. Connects to the NUSL OAI-PMH endpoint.
  2. Fetches records using the sickle loader.
  3. Pipes them through the marcxml transformer (generic MARC XML → JSON), then the nusl transformer (NUSL-specific mapping to NR schema).
  4. Writes the resulting records via the nr_documents service.
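The four steps above amount to a loader feeding a chain of transformers whose output goes to a writer. The sketch below illustrates only that data flow; `run_pipeline`, `fake_marcxml`, and `fake_nusl` are hypothetical stand-ins and not the harvester's actual classes.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Transformer = Callable[[Record], Record]


def run_pipeline(
    loader: Iterable[Record],
    transformers: List[Transformer],
    writer: Callable[[Record], None],
) -> int:
    """Pass every loaded record through each transformer in order, then write it."""
    count = 0
    for record in loader:
        for transform in transformers:
            record = transform(record)
        writer(record)
        count += 1
    return count


# Hypothetical stand-ins for the marcxml and nusl transformers.
def fake_marcxml(record: Record) -> Record:
    return {"245__a": record["raw_title"]}


def fake_nusl(record: Record) -> Record:
    return {"metadata": {"title": record["245__a"]}}


written: List[Record] = []
processed = run_pipeline(
    [{"raw_title": "Example thesis"}],  # stands in for the sickle loader
    [fake_marcxml, fake_nusl],          # order matters: generic first, then source-specific
    written.append,                     # stands in for the service writer
)
```

The order of the `--transformer` options matters for the same reason the list order matters here: each transformer consumes the output of the previous one.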

Running the harvest

# Harvest new/updated records (incremental, from last timestamp)
invenio oarepo oai harvester run nusl

# Re-harvest everything
invenio oarepo oai harvester run nusl --all-records

# Run in the background via Celery
invenio oarepo oai harvester run nusl --on-background

# Harvest specific record(s)
invenio oarepo oai harvester run nusl --identifier oai:invenio.nusl.cz:12345
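When harvests are triggered from a script (for example a scheduled job), the same commands can be assembled programmatically. The helper below is a hypothetical convenience, not part of the package; it uses only the options shown above and assumes that `--identifier` may be repeated for several records.

```python
import shlex


def build_run_command(code, all_records=False, on_background=False, identifiers=()):
    """Build the argv list for `invenio oarepo oai harvester run <code>`."""
    argv = ["invenio", "oarepo", "oai", "harvester", "run", code]
    if all_records:
        argv.append("--all-records")
    if on_background:
        argv.append("--on-background")
    for identifier in identifiers:
        # Assumption: the option can be given once per record identifier.
        argv += ["--identifier", identifier]
    return argv


command = build_run_command("nusl", identifiers=["oai:invenio.nusl.cz:12345"])
printable = shlex.join(command)  # shell-safe string, useful for logging
```

The resulting list can be passed to `subprocess.run(command, check=True)` from an environment in which the `invenio` CLI is available.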

Architecture

Transformer pipeline

OAI-PMH endpoint
  │
  ▼
Loader (sickle)         ── fetches raw XML
  │
  ▼
Transformer: marcxml    ── XML → flat JSON  {marc_field_code: value}
  │
  ▼
Transformer: nusl       ── MARC JSON → NR metadata schema
  │
  ▼
Writer (service)        ── creates/updates Invenio records
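To make the "XML → flat JSON" step concrete, here is a rough sketch of flattening one MARCXML record with the standard library. The key layout (tag, two indicators with blanks written as `_`, then the subfield code) and the list-valued entries are assumptions made for illustration; the real marcxml transformer may structure its output differently.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC 21 "slim" XML schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def marcxml_to_flat(xml_text):
    """Flatten a single MARCXML <record> into {field_code: [values]}."""
    root = ET.fromstring(xml_text)
    flat = {}
    for control in root.findall("marc:controlfield", MARC_NS):
        flat.setdefault(control.get("tag"), []).append(control.text or "")
    for data in root.findall("marc:datafield", MARC_NS):
        # Blank or missing indicators are written as underscores.
        ind1 = (data.get("ind1") or " ").replace(" ", "_")
        ind2 = (data.get("ind2") or " ").replace(" ", "_")
        for sub in data.findall("marc:subfield", MARC_NS):
            key = f"{data.get('tag')}{ind1}{ind2}{sub.get('code')}"
            flat.setdefault(key, []).append(sub.text or "")
    return flat


SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example title</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">grey literature</subfield>
  </datafield>
</record>"""

flat_record = marcxml_to_flat(SAMPLE)
```

Keeping every value in a list avoids special cases for repeatable fields such as 653.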

NUSL transformer

The NUSLTransformer (extending OAIRuleTransformer) handles the following MARC 21 fields:

| MARC field | Target metadata |
|---|---|
| 001 | System identifier (NUSL control number) |
| 020 / 022 | ISBN / ISSN |
| 035 | Original OAI record identifier |
| 041 | Language |
| 046 | Date issued / date modified |
| 245 / 246 | Title, translated title, alternate title, subtitle |
| 260 | Publisher |
| 336 | Certified methodology resource type |
| 490 | Series |
| 502 | Degree grantor, date defended |
| 520 | Abstract |
| 540 | Rights / license (Creative Commons parsing) |
| 586 | Defense status |
| 598 | Notes |
| 650 / 653 | Subjects and keywords (Czech / English) |
| 656 | Study field |
| 710 | Degree grantor (institutional) |
| 711 | Event (conference) |
| 720 | Creators and contributors (with ORCID, affiliation resolution) |
| 773 | Related item |
| 856 | Original record URL, external location, file attachments |
| 970 | Catalogue system number |
| 980 | Resource type |
| 996 | Accessibility |
| 998 | Collection |
| 999 | Funding references |
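As an illustration of the Creative Commons parsing mentioned for field 540, the sketch below extracts the license code, version, and optional jurisdiction from a standard Creative Commons license URL. The function name and the returned dictionary are hypothetical; how the transformer actually maps licenses to NR rights vocabulary entries is not described here.

```python
import re

# Matches URLs such as https://creativecommons.org/licenses/by-nc/4.0/
# and ported variants such as http://creativecommons.org/licenses/by/3.0/cz/.
CC_URL = re.compile(
    r"creativecommons\.org/licenses/(?P<code>[a-z-]+)/(?P<version>\d+\.\d+)"
    r"(?:/(?P<jurisdiction>[a-z]{2})(?=/|$))?"
)


def parse_cc_license(text):
    """Return the parts of a Creative Commons license URL, or None if absent."""
    match = CC_URL.search(text.strip().lower())
    if match is None:
        return None
    # Drop the jurisdiction key when the URL does not carry one.
    return {key: value for key, value in match.groupdict().items() if value is not None}


international = parse_cc_license("https://creativecommons.org/licenses/by-nc-nd/4.0/")
ported = parse_cc_license("http://creativecommons.org/licenses/by/3.0/cz/")
unlicensed = parse_cc_license("All rights reserved")
```

Public domain dedications (URLs under `/publicdomain/`) are deliberately not covered by this pattern.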

The transformer also performs post-processing such as deduplication of languages, contributors, subjects, and additional titles.
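A deduplication step of this kind can be sketched as an order-preserving filter. The helper below is illustrative only; the name `dedupe` and the choice to compare entries by their full JSON content are assumptions, and the transformer may use narrower criteria (for example, comparing only identifiers).

```python
import json


def dedupe(items):
    """Remove repeated entries while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        # Dicts are not hashable, so compare a canonical JSON serialization.
        key = json.dumps(item, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


languages = dedupe([{"id": "cs"}, {"id": "en"}, {"id": "cs"}])
subjects = dedupe(
    [
        {"subject": "šedá literatura", "lang": "cs"},
        {"lang": "cs", "subject": "šedá literatura"},  # same content, different key order
    ]
)
```

Preserving the original order keeps the first language or title in its leading position, which can matter when consumers treat the first entry as the primary one.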

Vocabulary resolution

The package includes a VocabularyCache that resolves free-text institution names (from MARC 720 affiliations) against the NR institutions vocabulary using Lucene queries and Levenshtein distance matching. Resolved institutions are cached via invenio-cache with a configurable TTL (default: 1 hour). A fallback temporary institutions lookup table (temp_institutions.py) is used for records that cannot be matched through the vocabulary service.
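The matching and caching idea can be sketched in plain Python. Everything below is an assumption made for illustration: the real VocabularyCache queries the vocabulary service and stores results via invenio-cache, whereas this sketch matches against an in-memory candidate list, uses a hand-written edit distance instead of the Levenshtein package, and picks an arbitrary distance threshold.

```python
import time


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: forget the entry
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)


def resolve_institution(name, candidates, cache, max_distance=3):
    """Return the closest candidate within max_distance edits, or None."""
    key = name.strip().lower()
    cached = cache.get(key)
    if cached is not None:
        return cached
    scored = [(levenshtein(key, candidate.lower()), candidate) for candidate in candidates]
    if not scored:
        return None
    distance, best = min(scored)
    if distance > max_distance:
        return None  # the caller would fall back to the temporary lookup table
    cache.set(key, best)
    return best


institutions = ["Univerzita Karlova", "Masarykova univerzita"]
cache = TTLCache(ttl_seconds=3600)
resolved = resolve_institution("Masarykova univerzta", institutions, cache)
unresolved = resolve_institution("Completely different", institutions, cache)
```

Only successful matches are cached here, so a name that fails once is retried on the next record; whether misses should also be cached is a design choice this sketch leaves open.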

Project structure

nr-oaipmh-harvesters/
├── nr_oaipmh_harvesters/
│   ├── config.py                 # Registers transformers in DATASTREAMS_TRANSFORMERS
│   ├── ext.py                    # Invenio extension (NRDocsOAIHarvesterExt)
│   └── nusl/
│       ├── __init__.py           # Exports NUSLTransformer
│       ├── transformer.py        # NUSL MARC 21 → NR metadata transformer
│       └── temp_institutions.py  # Fallback institution name mapping
├── tests/
│   ├── run_transform.py          # End-to-end harvest test
│   ├── run_transform_separately.py  # Per-record transformer test with validation
│   ├── test_institutions.py      # Institution resolution tests
│   ├── get_code.py
│   └── invenio.cfg
├── format.sh                     # Code formatting (black, autoflake, isort)
├── setup.cfg
├── setup.py
├── pyproject.toml
└── README.md

Development

Setup

git clone git@github.com:Narodni-repozitar/nr-oaipmh-harvesters.git
cd nr-oaipmh-harvesters
pip install -e ".[dev]"

Code formatting

./format.sh

This runs black (target Python 3.10), autoflake (unused import removal), and isort (import sorting, black profile).

Testing transformations locally

You can test the transformer against a local directory of OAI records:

# Requires a running Invenio app context and an oai-data directory
python tests/run_transform_separately.py

Errors are written to /tmp/errors.yaml for inspection.

Adding a new source repository

To add support for harvesting from a new OAI-PMH source:

  1. Create a new sub-package under nr_oaipmh_harvesters/ (e.g., nr_oaipmh_harvesters/my_source/).
  2. Implement a transformer class extending OAIRuleTransformer from oarepo-oaipmh-harvester.
  3. Register the transformer in config.py by adding it to the DATASTREAMS_TRANSFORMERS dictionary.
  4. Register the harvester via the Invenio CLI with --transformer my_source.

Related packages

Authors
