OAI-PMH harvesters for the National Repository

NR OAI-PMH Harvesters

OAI-PMH metadata transformers for the Czech National Repository (Národní Repozitář). This package converts MARC 21 records harvested from external repositories into the NR metadata schema so that they can be ingested into the Invenio-based national repository infrastructure.

Overview

nr-oaipmh-harvesters is a plugin for oarepo-oai-pmh-harvester that provides transformer implementations. Each transformer maps MARC 21 fields from a specific source repository to the NR documents metadata model (nr-metadata).

Currently supported sources:

Source                                    Transformer key   Description
NUSL (Národní úložiště šedé literatury)   nusl              National Repository of Grey Literature: theses, reports, conference papers, and more

Requirements

  • Python ≥ 3.9
  • A running Invenio instance with the NR stack
  • Dependencies (installed automatically): oarepo-oai-pmh-harvester >= 4.0.0, dojson, Levenshtein, nr-metadata

Installation

pip install nr-oaipmh-harvesters

The package registers itself as an Invenio extension via entry points — no additional configuration is needed beyond the standard Invenio app setup.
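To confirm that the extension is discoverable after installation, one option is to list the installed entry points. The following is a minimal sketch, not part of the package: find_entry_points is a hypothetical helper, and the group name invenio_base.apps is an assumption (check setup.cfg for the group this package actually uses).

```python
from importlib.metadata import entry_points


def find_entry_points(group, name_contains=""):
    """Return sorted names of installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry point collection
        selected = eps.select(group=group)
    else:
        # Python 3.9: plain dict of group name -> list of entry points
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected if name_contains in ep.name)


if __name__ == "__main__":
    # Assumed group name; the package may register under a different one.
    print(find_entry_points("invenio_base.apps"))
```

An empty list means nothing is registered under that group in the current environment.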

Usage

Registering a harvester

Use the Invenio CLI to register a new OAI-PMH harvester. For NUSL:

invenio oarepo oai harvester add nusl \
    --name "NUSL harvester" \
    --url http://invenio.nusl.cz/oai2d/ \
    --set global \
    --prefix marcxml \
    --loader sickle \
    --transformer marcxml \
    --transformer nusl \
    --writer 'service{service=nr_documents}'

This sets up a harvester that:

  1. Connects to the NUSL OAI-PMH endpoint.
  2. Fetches records using the sickle loader.
  3. Pipes them through the marcxml transformer (generic MARC XML → JSON), then the nusl transformer (NUSL-specific mapping to NR schema).
  4. Writes the resulting records via the nr_documents service.
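The four steps above amount to function composition: each record goes through the transformers in the order given, and the result is handed to the writer. The sketch below illustrates that idea only. run_pipeline, fake_marcxml, and fake_nusl are hypothetical stand-ins, not the harvester's actual classes, and the 245__a style keys are an assumed convention.

```python
from functools import reduce


def run_pipeline(records, transformers, writer):
    """Pass each record through every transformer in order, then write it."""
    for record in records:
        result = reduce(lambda acc, step: step(acc), transformers, record)
        writer(result)


# Stand-ins for the real components (all hypothetical).
def fake_marcxml(raw):
    # Pretend to parse XML into a flat {field_code: value} dict.
    return {"245__a": raw["title"], "041__a": raw["lang"]}


def fake_nusl(marc):
    # Pretend to map MARC codes onto target metadata keys.
    return {"title": marc["245__a"], "languages": [marc["041__a"]]}


written = []
run_pipeline(
    records=[{"title": "Example thesis", "lang": "cze"}],
    transformers=[fake_marcxml, fake_nusl],  # order matters
    writer=written.append,
)
print(written)
```

Swapping the two transformers would fail here, which mirrors why marcxml has to come before nusl on the command line.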

Running the harvest

# Harvest new/updated records (incremental, from last timestamp)
invenio oarepo oai harvester run nusl

# Re-harvest everything
invenio oarepo oai harvester run nusl --all-records

# Run in the background via Celery
invenio oarepo oai harvester run nusl --on-background

# Harvest specific record(s)
invenio oarepo oai harvester run nusl --identifier oai:invenio.nusl.cz:12345

Architecture

Transformer pipeline

OAI-PMH endpoint
  │
  ▼
Loader (sickle)         ── fetches raw XML
  │
  ▼
Transformer: marcxml    ── XML → flat JSON  {marc_field_code: value}
  │
  ▼
Transformer: nusl       ── MARC JSON → NR metadata schema
  │
  ▼
Writer (service)        ── creates/updates Invenio records
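To make the "XML → flat JSON" stage concrete, here is a rough sketch of flattening a MARCXML record with the standard library. It is not the marcxml transformer's implementation. The key layout (tag, two indicators with blanks written as "_", then subfield code) and the list-valued result are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example title</subfield>
    <subfield code="b">a subtitle</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">keyword one</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">keyword two</subfield>
  </datafield>
</record>"""


def flatten_marcxml(xml_text):
    """Flatten one MARCXML record into {field_code: [values]}."""
    root = ET.fromstring(xml_text)
    flat = {}
    for cf in root.findall("marc:controlfield", MARC_NS):
        flat.setdefault(cf.get("tag"), []).append(cf.text or "")
    for df in root.findall("marc:datafield", MARC_NS):
        # Blank indicators become "_" (assumed key convention).
        ind1 = (df.get("ind1") or " ").replace(" ", "_")
        ind2 = (df.get("ind2") or " ").replace(" ", "_")
        prefix = df.get("tag") + ind1 + ind2
        for sf in df.findall("marc:subfield", MARC_NS):
            flat.setdefault(prefix + sf.get("code"), []).append(sf.text or "")
    return flat


print(flatten_marcxml(SAMPLE))
```

Repeated fields such as 653 accumulate into one list, which is what lets a later stage treat keywords as a collection.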

NUSL transformer

The NUSLTransformer (extending OAIRuleTransformer) handles the following MARC 21 fields:

MARC field   Target metadata
001          System identifier (NUSL control number)
020 / 022    ISBN / ISSN
035          Original OAI record identifier
041          Language
046          Date issued / date modified
245 / 246    Title, translated title, alternate title, subtitle
260          Publisher
336          Certified methodology resource type
490          Series
502          Degree grantor, date defended
520          Abstract
540          Rights / license (Creative Commons parsing)
586          Defense status
598          Notes
650 / 653    Subjects and keywords (Czech / English)
656          Study field
710          Degree grantor (institutional)
711          Event (conference)
720          Creators and contributors (with ORCID, affiliation resolution)
773          Related item
856          Original record URL, external location, file attachments
970          Catalogue system number
980          Resource type
996          Accessibility
998          Collection
999          Funding references
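
As an illustration of what a single mapping might involve, the sketch below extracts a Creative Commons license code and version from a rights statement like those carried in field 540. The regular expression, the parse_cc_license name, and the returned dictionary shape are all assumptions. The transformer's own parsing logic may differ.

```python
import re

# Matches URLs such as creativecommons.org/licenses/by-nc-nd/4.0/
CC_URL = re.compile(
    r"creativecommons\.org/licenses/(?P<code>[a-z\-]+)/(?P<version>\d\.\d)",
    re.IGNORECASE,
)


def parse_cc_license(text):
    """Return {'code', 'version'} for a CC license URL in `text`, else None."""
    match = CC_URL.search(text)
    if match is None:
        return None
    return {"code": match["code"].lower(), "version": match["version"]}


print(parse_cc_license("Licence: https://creativecommons.org/licenses/by-nc-nd/4.0/"))
print(parse_cc_license("All rights reserved"))
```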

The transformer also performs post-processing such as deduplication of languages, contributors, subjects, and additional titles.
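
Deduplication of this kind can be sketched as an order-preserving filter. The example below is illustrative only: dedupe is a hypothetical helper, and comparing entries by their JSON serialization is an assumed equality rule, not necessarily the one the transformer applies.

```python
import json


def dedupe(items):
    """Drop repeated items while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        # Dicts are unhashable, so compare by a canonical JSON form.
        key = json.dumps(item, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


languages = [{"id": "cs"}, {"id": "en"}, {"id": "cs"}]
print(dedupe(languages))
```

Keeping first-seen order matters for lists such as contributors, where position can carry meaning.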

Vocabulary resolution

The package includes a VocabularyCache that resolves free-text institution names (from MARC 720 affiliations) against the NR institutions vocabulary using Lucene queries and Levenshtein distance matching. Resolved institutions are cached via invenio-cache with a configurable TTL (default: 1 hour). A fallback temporary institutions lookup table (temp_institutions.py) is used for records that cannot be matched through the vocabulary service.
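
The matching-plus-caching idea can be sketched without any Invenio dependencies. The code below is not the VocabularyCache implementation: it uses a pure-Python edit distance instead of the Levenshtein library, a small in-memory cache instead of invenio-cache, and a plain list of candidate names in place of vocabulary query results. The max_distance threshold is a made-up parameter.

```python
import time


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution
            ))
        previous = current
    return previous[-1]


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._data[key]  # expired
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock())


def resolve_institution(name, candidates, cache, max_distance=3):
    """Return the closest candidate within `max_distance`, or None."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    normalized = name.strip().lower()
    scored = [(levenshtein(normalized, c.lower()), c) for c in candidates]
    if not scored:
        return None
    distance, best = min(scored)
    if distance > max_distance:
        return None
    cache.set(name, best)
    return best


cache = TTLCache(ttl_seconds=3600)
names = ["Univerzita Karlova", "Masarykova univerzita"]
print(resolve_institution("Univerzita Karlva", names, cache))
```

A name that matches nothing within the threshold returns None, which is the point where a fallback table such as temp_institutions.py would come into play.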

Project structure

nr-oaipmh-harvesters/
├── nr_oaipmh_harvesters/
│   ├── config.py                 # Registers transformers in DATASTREAMS_TRANSFORMERS
│   ├── ext.py                    # Invenio extension (NRDocsOAIHarvesterExt)
│   └── nusl/
│       ├── __init__.py           # Exports NUSLTransformer
│       ├── transformer.py        # NUSL MARC 21 → NR metadata transformer
│       └── temp_institutions.py  # Fallback institution name mapping
├── tests/
│   ├── run_transform.py          # End-to-end harvest test
│   ├── run_transform_separately.py  # Per-record transformer test with validation
│   ├── test_institutions.py      # Institution resolution tests
│   ├── get_code.py
│   └── invenio.cfg
├── format.sh                     # Code formatting (black, autoflake, isort)
├── setup.cfg
├── setup.py
├── pyproject.toml
└── README.md

Development

Setup

git clone git@github.com:Narodni-repozitar/nr-oaipmh-harvesters.git
cd nr-oaipmh-harvesters
pip install -e ".[dev]"

Code formatting

./format.sh

This runs black (target Python 3.10), autoflake (unused import removal), and isort (import sorting, black profile).

Testing transformations locally

You can test the transformer against a local directory of OAI records:

# Requires a running Invenio app context and an oai-data directory
python tests/run_transform_separately.py

Errors are written to /tmp/errors.yaml for inspection.

Adding a new source repository

To add support for harvesting from a new OAI-PMH source:

  1. Create a new sub-package under nr_oaipmh_harvesters/ (e.g., nr_oaipmh_harvesters/my_source/).
  2. Implement a transformer class extending OAIRuleTransformer from oarepo-oaipmh-harvester.
  3. Register the transformer in config.py by adding it to the DATASTREAMS_TRANSFORMERS dictionary.
  4. Register the harvester via the Invenio CLI with --transformer my_source.
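
Steps 2 and 3 can be pictured with the sketch below. It deliberately avoids the real OAIRuleTransformer API, which is not documented here: RuleTransformerStandIn is a hypothetical stand-in base class, the rule functions and 245__a style keys are invented for illustration, and the shape of the values in DATASTREAMS_TRANSFORMERS (class versus import path) is an assumption.

```python
class RuleTransformerStandIn:
    """Hypothetical stand-in for OAIRuleTransformer; not the real API."""

    rules = {}  # field code -> handler(metadata, values)

    def transform(self, marc_record):
        metadata = {}
        for code, values in marc_record.items():
            handler = self.rules.get(code)
            if handler is not None:  # fields without a rule are skipped
                handler(metadata, values)
        return metadata


def title_rule(metadata, values):
    metadata["title"] = values[0]


def keyword_rule(metadata, values):
    metadata.setdefault("subjects", []).extend(values)


class MySourceTransformer(RuleTransformerStandIn):
    rules = {"245__a": title_rule, "653__a": keyword_rule}


# Mirrors step 3; the real config.py may store a different kind of value.
DATASTREAMS_TRANSFORMERS = {"my_source": MySourceTransformer}

transformer = DATASTREAMS_TRANSFORMERS["my_source"]()
print(transformer.transform({
    "245__a": ["A title"],
    "653__a": ["k1", "k2"],
    "999__a": ["ignored"],
}))
```

The dictionary key is what the CLI's --transformer option would refer to in step 4.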

Related packages

Authors
