Colav OAI-PMH Harvester

Oxomo

Colav OAI-PMH Harvesting / Goddess of the night, the astrology and the calendar.

Description

Package to download metadata records from repositories using OAI-PMH. It supports:

  • Download XML records using OAI-PMH protocol.
  • Download XML records in multiple XML schemas.
  • Parallel execution, to download multiple repositories at the same time.
  • Rate limiting to avoid overloading repositories (an unintentional DDoS) and HTTP 429 errors. It works asynchronously during parallel execution, so every repository can have its own rate limit.
  • Parses the XML into a dictionary without losing information, thanks to the package xmltodict, which also allows saving the records in MongoDB.
  • Command line tool oxomoc_run.
  • Checkpoint to save the state of the execution. This feature is available using two different algorithms: selective or non-selective. With the selective one, the checkpoint is created using selective harvesting (from/until) in the verb ListIdentifiers. The non-selective one exists because not all endpoints support this.
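
As an illustration of the per-repository rate limit mentioned in the list, here is a minimal sketch of a sliding-window limiter. It is not Oxomoc's implementation: the class `RateLimiter` and its `acquire` method are hypothetical, and only the `calls` and `secs` parameters mirror the `rate_limit` keys used in the configuration file.

```python
import threading
import time
from collections import deque


class RateLimiter:
    """Hypothetical limiter: at most `calls` requests per `secs` seconds."""

    def __init__(self, calls, secs, clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.secs = secs
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # start times of the most recent requests
        self._lock = threading.Lock()

    def acquire(self):
        """Block until one more request is allowed, then record it."""
        while True:
            with self._lock:
                now = self._clock()
                # Forget requests that have left the time window.
                while self._stamps and now - self._stamps[0] >= self.secs:
                    self._stamps.popleft()
                if len(self._stamps) < self.calls:
                    self._stamps.append(now)
                    return
                # Time until the oldest request leaves the window.
                wait = self.secs - (now - self._stamps[0])
            self._sleep(wait)


# One independent limiter per repository, so each can have its own limit.
limiters = {
    "dspace_udea": RateLimiter(calls=10000, secs=1),
    "dspace_uext": RateLimiter(calls=1000, secs=1),
}
```

A worker would call `limiters[name].acquire()` before each HTTP request. Because every repository has its own object, a slow limit on one endpoint does not hold back the others.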

Installation

MongoDB

This package requires a MongoDB engine to save the results. Please read https://www.mongodb.com/docs/manual/administration/install-community/

Package

pip install oxomoc

Usage

Create a config file, e.g. config.py.
Read the comments in the example below for more information.

endpoints = {}
endpoints["dspace_udea"] = {}
endpoints["dspace_udea"]["enabled"] = True  # whether this endpoint is enabled
endpoints["dspace_udea"]["url"] = "http://bibliotecadigital.udea.edu.co/oai/request"
endpoints["dspace_udea"]["metadataPrefix"] = "dim"  # XML format; check the list of formats available in the repository
endpoints["dspace_udea"]["rate_limit"] = {"calls": 10000, "secs": 1}
endpoints["dspace_udea"]["checkpoint"] = {}
endpoints["dspace_udea"]["checkpoint"]["enabled"] = True
# uses selective harvesting to create the checkpoint.
# check http://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvesting
endpoints["dspace_udea"]["checkpoint"]["selective"] = True
endpoints["dspace_udea"]["checkpoint"]["days"] = 30  # if selective, time step

endpoints["dspace_uext"] = {}
endpoints["dspace_uext"]["enabled"] = True
endpoints["dspace_uext"]["url"] = "http://bdigital.uexternado.edu.co/oai/request"
endpoints["dspace_uext"]["metadataPrefix"] = "dim"
endpoints["dspace_uext"]["rate_limit"] = {"calls": 1000, "secs": 1}  # calls per second
endpoints["dspace_uext"]["checkpoint"] = {}
endpoints["dspace_uext"]["checkpoint"]["enabled"] = True
endpoints["dspace_uext"]["checkpoint"]["selective"] = True
endpoints["dspace_uext"]["checkpoint"]["days"] = 30

We suggest using the selective checkpoint if the repository supports it, because it is more efficient.
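
To make the selective checkpoint more concrete, the sketch below splits a date range into consecutive windows of `days` days and builds one ListIdentifiers request per window with the OAI-PMH `from` and `until` arguments. The function names `date_windows` and `list_identifiers_url` are hypothetical, and how Oxomoc actually slices the range is an assumption here.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def date_windows(start, end, days):
    """Yield consecutive (from, until) date pairs covering start..end inclusive."""
    step = timedelta(days=days)
    one_day = timedelta(days=1)
    current = start
    while current <= end:
        # OAI-PMH treats both bounds as inclusive, so windows must not overlap.
        until = min(current + step - one_day, end)
        yield current, until
        current = until + one_day


def list_identifiers_url(base_url, metadata_prefix, from_date, until_date):
    """Build a ListIdentifiers request restricted to one date window."""
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": from_date.isoformat(),    # day granularity: YYYY-MM-DD
        "until": until_date.isoformat(),
    }
    return base_url + "?" + urlencode(params)


url = "http://bibliotecadigital.udea.edu.co/oai/request"
for start, stop in date_windows(date(2023, 1, 1), date(2023, 3, 15), days=30):
    print(list_identifiers_url(url, "dim", start, stop))
```

With windows like these, a checkpoint only needs to remember the last window that finished, so an interrupted run can resume from the next one instead of listing every identifier again.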

To execute it run:

oxomo_run --config config.py

By default:

  • It will run in parallel with 2 threads because there are 2 endpoints. If there are more endpoints, it will try to use the maximum number of threads available. Use the --max_thread parameter to control the parallel execution.
  • It will try to connect to a local MongoDB instance without credentials.
  • The database with the results is oxomo.
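
The thread-count default described above can be pictured with the following sketch. It is only an illustration, not the tool's code: `pick_workers`, `harvest` and `run_all` are hypothetical names, and using the CPU count as "the maximum number of threads available" is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_workers(n_endpoints, max_threads=None):
    """One thread per endpoint, capped by max_threads or (assumed) the CPU count."""
    limit = max_threads if max_threads is not None else (os.cpu_count() or 1)
    return max(1, min(n_endpoints, limit))


def harvest(name, cfg):
    """Placeholder for the real work of harvesting one endpoint."""
    return f"{name}: {cfg['url']}"


def run_all(endpoints, max_threads=None):
    """Harvest every enabled endpoint, each one in a worker thread."""
    enabled = {k: v for k, v in endpoints.items() if v.get("enabled")}
    workers = pick_workers(len(enabled), max_threads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(harvest, name, cfg)
            for name, cfg in enabled.items()
        }
        # Collect results once every endpoint has finished.
        return {name: fut.result() for name, fut in futures.items()}
```

Calling `run_all(endpoints, max_threads=2)` with the `endpoints` dictionary from config.py would process both repositories concurrently and skip any endpoint whose `enabled` flag is false.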

The collections produced are:

dspace_udea_identifiers
dspace_udea_identity
dspace_udea_invalid
dspace_udea_errors
dspace_udea_records

where:

  • dspace_udea_identifiers: the list of identifiers used for the checkpoints. It also holds additional useful information, such as deleted records and the setSpec for every record id.
  • dspace_udea_identity: information about the repository, obtained with the verb Identify.
  • dspace_udea_invalid: records that are not marked as deleted by the repository but for which it returns idDoesNotExist or some other OAI-PMH error.
  • dspace_udea_errors: if there is an error in the request, such as 500 or 429, the error is saved in this collection.
  • dspace_udea_records: all the records that were downloaded correctly.
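
The naming pattern above (endpoint name plus a suffix) makes it easy to inspect the results from Python. The sketch below is a hedged example, not part of Oxomoc: `collection_names` and `count_documents` are hypothetical helpers, and the connection URI assumes the default local MongoDB instance without credentials. The second helper needs the third-party package pymongo and a running MongoDB.

```python
SUFFIXES = ("identifiers", "identity", "invalid", "errors", "records")


def collection_names(endpoint):
    """Map each suffix to the collection name used for one endpoint."""
    return {suffix: f"{endpoint}_{suffix}" for suffix in SUFFIXES}


def count_documents(endpoint, db_name="oxomo", uri="mongodb://localhost:27017"):
    """Count the documents stored in each collection of one endpoint."""
    from pymongo import MongoClient  # third-party; imported lazily on purpose

    db = MongoClient(uri)[db_name]
    return {
        suffix: db[name].count_documents({})
        for suffix, name in collection_names(endpoint).items()
    }


print(collection_names("dspace_udea")["records"])
```

Comparing the counts for `records`, `invalid` and `errors` after a run gives a quick view of how much of the repository was harvested successfully.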

Please check oxomo_run for more options.

License

BSD-3-Clause License

Links

http://colav.udea.edu.co/
