Colav OAI-PMH Harvester

Oxomoc

Colav OAI-PMH Harvesting / Goddess of the night, the astrology and the calendar.

Description

Package to download metadata records from repositories using OAI-PMH. Supports:

  • Download XML records using OAI-PMH protocol.
  • Download XML records in multiple XML schemas.
  • Parallel execution, to download multiple repositories at the same time.
  • Rate limiting to avoid overloading repositories (an unintentional DDoS) and 429 errors. Limits are applied asynchronously during parallel execution, so every repository can have its own rate limit.
  • Parses the XML into a dictionary without losing information, thanks to the xmltodict package, which also allows saving the records in MongoDB.
  • Command line tool oxomoc_run.
  • Checkpoints to save the state of the execution. Two algorithms are available, selective and non-selective. A selective checkpoint uses the from/until arguments of the ListIdentifiers verb; the non-selective one exists because not all endpoints support selective harvesting.
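
The per-repository rate limit described above can be illustrated with a small sliding-window limiter. This is only a sketch of the idea, not Oxomoc's own implementation: the RateLimiter class and its clock and sleep parameters are hypothetical names introduced here, and it uses threads and a lock rather than whatever mechanism the package uses internally. The calls and secs values mirror the rate_limit entries of the example configuration.

```python
import threading
import time
from collections import deque


class RateLimiter:
    """Allow at most `calls` acquisitions in any window of `secs` seconds.

    Hypothetical helper for illustration; not part of the oxomoc API.
    """

    def __init__(self, calls, secs, clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.secs = secs
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # start times of the most recent calls
        self._lock = threading.Lock()

    def acquire(self):
        """Block until another call is allowed, then record it."""
        with self._lock:
            while True:
                now = self._clock()
                # Forget calls that have already left the window.
                while self._stamps and now - self._stamps[0] >= self.secs:
                    self._stamps.popleft()
                if len(self._stamps) < self.calls:
                    self._stamps.append(now)
                    return
                # Wait until the oldest recorded call leaves the window.
                self._sleep(self.secs - (now - self._stamps[0]))


# One limiter per endpoint, so each repository keeps its own budget.
limiters = {
    "dspace_udea": RateLimiter(calls=10000, secs=1),
    "dspace_uext": RateLimiter(calls=1000, secs=1),
}
```

A worker would call limiters[name].acquire() before each request to the matching repository, so a slow limit on one endpoint never delays another.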

Installation

MongoDB

This package requires a MongoDB engine to save the results. Please read https://www.mongodb.com/docs/manual/administration/install-community/

Package

pip install oxomoc

Usage

Create a config file, for example config.py.
Read the comments in the example below for more information.

endpoints = {}
endpoints["dspace_udea"] = {}
endpoints["dspace_udea"]["enabled"] = True  # whether this endpoint is enabled
endpoints["dspace_udea"]["url"] = "http://bibliotecadigital.udea.edu.co/oai/request"
endpoints["dspace_udea"]["metadataPrefix"] = "dim"  # XML format; check the list of formats offered by the repository
endpoints["dspace_udea"]["rate_limit"] = {"calls": 10000, "secs": 1}
endpoints["dspace_udea"]["checkpoint"] = {}
endpoints["dspace_udea"]["checkpoint"]["enabled"] = True
# uses selective harvesting to create the checkpoint.
# check http://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvesting
endpoints["dspace_udea"]["checkpoint"]["selective"] = True
endpoints["dspace_udea"]["checkpoint"]["days"] = 30  # if selective, time step

endpoints["dspace_uext"] = {}
endpoints["dspace_uext"]["enabled"] = True
endpoints["dspace_uext"]["url"] = "http://bdigital.uexternado.edu.co/oai/request"
endpoints["dspace_uext"]["metadataPrefix"] = "dim"
endpoints["dspace_uext"]["rate_limit"] = {"calls": 1000, "secs": 1}  # calls per second
endpoints["dspace_uext"]["checkpoint"] = {}
endpoints["dspace_uext"]["checkpoint"]["enabled"] = True
endpoints["dspace_uext"]["checkpoint"]["selective"] = True
endpoints["dspace_uext"]["checkpoint"]["days"] = 30
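
Before launching a long harvest it can help to sanity-check the configuration. The sketch below is a hypothetical helper, not something shipped with the package: the function name enabled_endpoints and the assumption that all five keys are mandatory are ours, inferred only from the example above.

```python
# Keys used by every endpoint in the example config (assumed mandatory here).
REQUIRED_KEYS = ("enabled", "url", "metadataPrefix", "rate_limit", "checkpoint")


def enabled_endpoints(endpoints):
    """Return the enabled endpoints, raising ValueError on missing keys."""
    selected = {}
    for name, cfg in endpoints.items():
        missing = [key for key in REQUIRED_KEYS if key not in cfg]
        if missing:
            raise ValueError(f"endpoint {name!r} is missing keys: {missing}")
        if cfg["enabled"]:
            selected[name] = cfg
    return selected


# Same structure as config.py, written as a literal; the second endpoint is
# disabled here only to show the filtering.
endpoints = {
    "dspace_udea": {
        "enabled": True,
        "url": "http://bibliotecadigital.udea.edu.co/oai/request",
        "metadataPrefix": "dim",
        "rate_limit": {"calls": 10000, "secs": 1},
        "checkpoint": {"enabled": True, "selective": True, "days": 30},
    },
    "dspace_uext": {
        "enabled": False,
        "url": "http://bdigital.uexternado.edu.co/oai/request",
        "metadataPrefix": "dim",
        "rate_limit": {"calls": 1000, "secs": 1},
        "checkpoint": {"enabled": True, "selective": True, "days": 30},
    },
}

active = enabled_endpoints(endpoints)
```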

We suggest using selective checkpoints when the repository supports them; they are more efficient.
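
To make the idea of a selective checkpoint concrete, the sketch below splits a date range into windows of "days" length and builds one ListIdentifiers request per window. It is an illustration under assumptions, not the package's code: date_windows and list_identifiers_url are hypothetical names, and the windows are made non-overlapping because OAI-PMH treats both from and until as inclusive bounds.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def date_windows(start, end, days):
    """Yield inclusive (from, until) date pairs covering start..end."""
    one_day = timedelta(days=1)
    current = start
    while current <= end:
        # Each window spans `days` calendar days, clipped to `end`.
        until = min(current + timedelta(days=days) - one_day, end)
        yield current, until
        current = until + one_day


def list_identifiers_url(base_url, metadata_prefix, from_date, until_date):
    """Build a selective-harvesting ListIdentifiers request URL."""
    query = urlencode({
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": from_date.isoformat(),    # YYYY-MM-DD day granularity
        "until": until_date.isoformat(),
    })
    return f"{base_url}?{query}"


windows = list(date_windows(date(2024, 1, 1), date(2024, 3, 15), days=30))
first_url = list_identifiers_url(
    "http://bibliotecadigital.udea.edu.co/oai/request", "dim", *windows[0]
)
```

Each finished window is a natural place to record progress, so an interrupted run only has to repeat the window it was working on.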

To execute it run:

oxomoc_run --config config.py

By default:

  • It runs in parallel with 2 threads because there are 2 endpoints; with more endpoints it tries to use the maximum number of threads available. Use the --max_thread parameter to control the parallel execution.
  • It tries to connect to a local MongoDB instance without credentials.
  • The database with the results is oxomo.
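
The one-thread-per-endpoint default can be pictured with the standard library's thread pool. This is a rough sketch, not how the command line tool is implemented: harvest_all and the harvest_one callback are hypothetical names, and capping the pool at the number of enabled endpoints is our simplification of the behaviour described above.

```python
from concurrent.futures import ThreadPoolExecutor


def harvest_all(endpoints, harvest_one, max_threads=None):
    """Run harvest_one(name, cfg) for every enabled endpoint in parallel.

    Returns a dict mapping endpoint name to the callback's result.
    """
    enabled = {n: c for n, c in endpoints.items() if c.get("enabled")}
    if not enabled:
        return {}
    # Default: one thread per enabled endpoint; max_threads caps it.
    if max_threads is None:
        workers = len(enabled)
    else:
        workers = min(max_threads, len(enabled))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {n: pool.submit(harvest_one, n, c) for n, c in enabled.items()}
        # result() re-raises any exception from the worker thread.
        return {n: f.result() for n, f in futures.items()}
```

For instance, harvest_all(endpoints, my_worker, max_threads=1) would process the repositories one after another, which can be useful when debugging a single endpoint.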

The collections produced are:

dspace_udea_identifiers
dspace_udea_identity
dspace_udea_invalid
dspace_udea_errors
dspace_udea_records

where:

  • dspace_udea_identifiers: the list of identifiers used for the checkpoints. It also holds additional useful information, such as deleted records and the setSpec of every record id.
  • dspace_udea_identity: information about the repository, obtained with the verb Identify.
  • dspace_udea_invalid: records that the repository does not mark as deleted but for which it returns an "id does not exist" response or some other OAI-PMH error.
  • dspace_udea_errors: if a request fails with an error such as 500 or 429, the error is saved in this collection.
  • dspace_udea_records: all the records that were downloaded correctly.

Please check oxomoc_run for more options.

License

BSD-3-Clause License

Links

http://colav.udea.edu.co/
