OAI-PMH harvesters for the National Repository

NR OAI-PMH Harvesters

OAI-PMH metadata transformers for the Czech National Repository (Národní Repozitář). This package converts MARC 21 records harvested from external repositories into the NR metadata schema so that they can be ingested into the Invenio-based national repository infrastructure.

Overview

nr-oaipmh-harvesters is a plugin for oarepo-oai-pmh-harvester that provides transformer implementations. Each transformer maps MARC 21 fields from a specific source repository to the NR documents metadata model (nr-metadata).

Currently supported sources:

Source                                    Transformer key   Description
NUSL (Národní úložiště šedé literatury)   nusl              National Repository of Grey Literature: theses, reports, conference papers, and more

Requirements

  • Python ≥ 3.9
  • A running Invenio instance with the NR stack
  • Dependencies (installed automatically): oarepo-oai-pmh-harvester >= 4.0.0, dojson, Levenshtein, nr-metadata

Installation

pip install nr-oaipmh-harvesters

The package registers itself as an Invenio extension via entry points, so no configuration is needed beyond the standard Invenio app setup.

Usage

Registering a harvester

Use the Invenio CLI to register a new OAI-PMH harvester. For NUSL:

invenio oarepo oai harvester add nusl \
    --name "NUSL harvester" \
    --url http://invenio.nusl.cz/oai2d/ \
    --set global \
    --prefix marcxml \
    --loader sickle \
    --transformer marcxml \
    --transformer nusl \
    --writer 'service{service=nr_documents}'

This sets up a harvester that:

  1. Connects to the NUSL OAI-PMH endpoint.
  2. Fetches records using the sickle loader.
  3. Pipes them through the marcxml transformer (generic MARC XML → JSON), then the nusl transformer (NUSL-specific mapping to NR schema).
  4. Writes the resulting records via the nr_documents service.
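
To make step 3 more concrete, here is a minimal sketch of what a generic MARC XML to flat JSON conversion can look like. It is not the marcxml transformer's actual code: the function name marcxml_to_flat, the sample record, and the key format (tag + indicators + subfield code, with blank indicators written as "_") are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace (MARC21 "slim" schema).
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, not taken from NUSL.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example title</subfield>
    <subfield code="b">an example subtitle</subfield>
  </datafield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">cze</subfield>
  </datafield>
</record>"""


def marcxml_to_flat(xml_text: str) -> dict:
    """Flatten one MARCXML record into {field key: [values]}.

    The key layout is an assumption of this sketch; the real
    transformer may use a different one.
    """
    root = ET.fromstring(xml_text)
    flat = {}
    # Control fields have a tag and text, but no indicators or subfields.
    for cf in root.findall("marc:controlfield", MARC_NS):
        flat.setdefault(cf.get("tag"), []).append(cf.text or "")
    for df in root.findall("marc:datafield", MARC_NS):
        # Blank indicators become "_" so that keys read like "245__a".
        ind = "".join(
            (df.get(name) or " ").replace(" ", "_") for name in ("ind1", "ind2")
        )
        for sf in df.findall("marc:subfield", MARC_NS):
            key = f"{df.get('tag')}{ind}{sf.get('code')}"
            flat.setdefault(key, []).append(sf.text or "")
    return flat


flat_record = marcxml_to_flat(SAMPLE)
```

Values are kept as lists because MARC fields and subfields are repeatable; a source-specific transformer such as nusl can then decide which fields are single-valued.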

Running the harvest

# Harvest new/updated records (incremental, from last timestamp)
invenio oarepo oai harvester run nusl

# Re-harvest everything
invenio oarepo oai harvester run nusl --all-records

# Run in the background via Celery
invenio oarepo oai harvester run nusl --on-background

# Harvest specific record(s)
invenio oarepo oai harvester run nusl --identifier oai:invenio.nusl.cz:12345

Architecture

Transformer pipeline

OAI-PMH endpoint
  │
  ▼
Loader (sickle)         ── fetches raw XML
  │
  ▼
Transformer: marcxml    ── XML → flat JSON  {marc_field_code: value}
  │
  ▼
Transformer: nusl       ── MARC JSON → NR metadata schema
  │
  ▼
Writer (service)        ── creates/updates Invenio records
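
The flow in the diagram can be sketched in plain Python as a loop that passes each record through an ordered list of transformers and hands the result to a writer. This is only an illustration of the chaining idea: run_pipeline, to_flat, to_nr, and the record shapes are hypothetical stand-ins, not the harvester's real classes.

```python
from typing import Callable, Iterable, List

Record = dict
Transformer = Callable[[Record], Record]


def run_pipeline(
    records: Iterable[Record],
    transformers: List[Transformer],
    write: Callable[[Record], None],
) -> int:
    """Apply every transformer in order to each record, then write it."""
    count = 0
    for record in records:
        for transform in transformers:
            record = transform(record)
        write(record)
        count += 1
    return count


def to_flat(record: Record) -> Record:
    # Stand-in for the generic marcxml stage (raw input -> flat MARC keys).
    return {"245__a": record["raw_title"]}


def to_nr(record: Record) -> Record:
    # Stand-in for the nusl stage; the target layout here is assumed.
    return {"metadata": {"title": record["245__a"]}}


written: List[Record] = []  # stand-in for the service writer
processed = run_pipeline([{"raw_title": "Example"}], [to_flat, to_nr], written.append)
```

The order of the transformer list matters in the same way the order of the --transformer options does: the source-specific stage expects the output of the generic stage.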

NUSL transformer

The NUSLTransformer (extending OAIRuleTransformer) handles the following MARC 21 fields:

MARC field   Target metadata
001          System identifier (NUSL control number)
020 / 022    ISBN / ISSN
035          Original OAI record identifier
041          Language
046          Date issued / date modified
245 / 246    Title, translated title, alternate title, subtitle
260          Publisher
336          Certified methodology resource type
490          Series
502          Degree grantor, date defended
520          Abstract
540          Rights / license (Creative Commons parsing)
586          Defense status
598          Notes
650 / 653    Subjects and keywords (Czech / English)
656          Study field
710          Degree grantor (institutional)
711          Event (conference)
720          Creators and contributors (with ORCID, affiliation resolution)
773          Related item
856          Original record URL, external location, file attachments
970          Catalogue system number
980          Resource type
996          Accessibility
998          Collection
999          Funding references
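
As an illustration of the Creative Commons parsing mentioned for field 540, the sketch below extracts a license code and version from a license URL. It assumes the field value contains a creativecommons.org license URL; the function name parse_cc_license and the returned dictionary layout are hypothetical and do not reflect the rights format the transformer actually produces.

```python
import re
from typing import Optional

# Matches e.g. "creativecommons.org/licenses/by-nc-nd/4.0".
CC_URL = re.compile(
    r"creativecommons\.org/licenses/"
    r"(?P<code>[a-z]+(?:-[a-z]+)*)/(?P<version>\d+\.\d+)",
    re.IGNORECASE,
)


def parse_cc_license(text: str) -> Optional[dict]:
    """Return {"code", "version"} for a CC license URL, or None if absent."""
    match = CC_URL.search(text)
    if match is None:
        return None
    return {"code": match["code"].lower(), "version": match["version"]}


parsed = parse_cc_license("https://creativecommons.org/licenses/by-nc-nd/4.0/")
unparsed = parse_cc_license("All rights reserved")
```

Returning None for text without a recognizable URL leaves it to the caller to decide how to treat free-text rights statements.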

The transformer also performs post-processing such as deduplication of languages, contributors, subjects, and additional titles.
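
Deduplicating lists of structured entries is usually done while preserving order. The following is a minimal sketch of one way to do that, not the transformer's actual post-processing; the helper name dedupe and the example entries are assumptions.

```python
import json


def dedupe(items: list) -> list:
    """Drop repeated entries, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for item in items:
        # Dicts are not hashable, so compare a canonical JSON form instead.
        key = json.dumps(item, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


languages = dedupe([{"id": "cs"}, {"id": "en"}, {"id": "cs"}])
```

Serializing with sort_keys=True means two dictionaries with the same content but a different key order count as duplicates.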

Vocabulary resolution

The package includes a VocabularyCache that resolves free-text institution names (from MARC 720 affiliations) against the NR institutions vocabulary, using Lucene queries and Levenshtein distance matching. Resolved institutions are cached via invenio-cache with a configurable TTL (default: 1 hour). Records that cannot be matched through the vocabulary service fall back to a temporary institutions lookup table (temp_institutions.py).
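
The combination of fuzzy matching, caching, and a fallback table can be sketched as follows. This is a simplified, self-contained illustration and not the VocabularyCache implementation: an in-memory dictionary stands in for the vocabulary service and its Lucene queries, a pure-Python edit distance replaces the Levenshtein library, a small TTLCache class replaces invenio-cache, and the vocabulary entries, identifiers, and max_distance threshold are all hypothetical.

```python
import time
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (stand-in for the Levenshtein library)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


class TTLCache:
    """Tiny in-memory cache with expiry (stand-in for invenio-cache)."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired entries are dropped on access
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (time.monotonic() + self.ttl, value)


# Hypothetical vocabulary labels and ids, and a hypothetical fallback table.
VOCABULARY = {"Univerzita Karlova": "uk", "Masarykova univerzita": "mu"}
FALLBACK = {"UK Praha": "uk"}


def resolve_institution(name: str, cache: TTLCache, max_distance: int = 3) -> Optional[str]:
    cached = cache.get(name)
    if cached is not None:
        return cached
    # Pick the vocabulary label closest to the free-text name.
    best = min(VOCABULARY, key=lambda label: levenshtein(name.lower(), label.lower()))
    if levenshtein(name.lower(), best.lower()) <= max_distance:
        result = VOCABULARY[best]
    else:
        result = FALLBACK.get(name)  # last resort: static lookup table
    if result is not None:
        cache.set(name, result)
    return result


cache = TTLCache(ttl_seconds=3600)
resolved = resolve_institution("Univerzita Karlová", cache)
```

The distance threshold trades recall for precision: a larger value tolerates more typos in affiliations but raises the risk of matching the wrong institution.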

Project structure

nr-oaipmh-harvesters/
├── nr_oaipmh_harvesters/
│   ├── config.py                 # Registers transformers in DATASTREAMS_TRANSFORMERS
│   ├── ext.py                    # Invenio extension (NRDocsOAIHarvesterExt)
│   └── nusl/
│       ├── __init__.py           # Exports NUSLTransformer
│       ├── transformer.py        # NUSL MARC 21 → NR metadata transformer
│       └── temp_institutions.py  # Fallback institution name mapping
├── tests/
│   ├── run_transform.py          # End-to-end harvest test
│   ├── run_transform_separately.py  # Per-record transformer test with validation
│   ├── test_institutions.py      # Institution resolution tests
│   ├── get_code.py
│   └── invenio.cfg
├── format.sh                     # Code formatting (black, autoflake, isort)
├── setup.cfg
├── setup.py
├── pyproject.toml
└── README.md

Development

Setup

git clone git@github.com:Narodni-repozitar/nr-oaipmh-harvesters.git
cd nr-oaipmh-harvesters
pip install -e ".[dev]"

Code formatting

./format.sh

This runs black (target Python 3.10), autoflake (unused import removal), and isort (import sorting, black profile).

Testing transformations locally

You can test the transformer against a local directory of OAI records:

# Requires a running Invenio app context and an oai-data directory
python tests/run_transform_separately.py

Errors are written to /tmp/errors.yaml for inspection.

Adding a new source repository

To add support for harvesting from a new OAI-PMH source:

  1. Create a new sub-package under nr_oaipmh_harvesters/ (e.g., nr_oaipmh_harvesters/my_source/).
  2. Implement a transformer class extending OAIRuleTransformer from oarepo-oaipmh-harvester.
  3. Register the transformer in config.py by adding it to the DATASTREAMS_TRANSFORMERS dictionary.
  4. Register the harvester via the Invenio CLI with --transformer my_source.
