Colav OAI-PMH Harvester
Oxomo
Colav OAI-PMH Harvesting / Goddess of the night, the astrology and the calendar.
Description
A package to download metadata records from repositories using OAI-PMH. It supports:
- Download XML records using OAI-PMH protocol.
- Download XML records in multiple XML schemas.
- Parallel execution, to download multiple repositories at the same time.
- Rate limiting to avoid overloading repositories (DDoS-like traffic) and 429 errors. It works asynchronously in the parallel execution, which means that every repository can have a different rate limit.
- Parsing the XML into a dictionary without losing information, thanks to the xmltodict package, which also allows saving the records in MongoDB.
- Command line tool oxomoc_run.
- Checkpoints to save the state of the execution. This feature is available using two different algorithms, selective or not. The selective algorithm creates the checkpoint using the from/until arguments of the ListIdentifiers verb; the non-selective one exists because not all endpoints support them.
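The per-repository rate limit mentioned in the list above can be pictured with a small sliding-window limiter. This is only a sketch under stated assumptions, not Oxomoc's actual implementation: the names RateLimiter and build_limiters are hypothetical, and the only thing taken from this README is the shape of the rate_limit setting ({"calls": ..., "secs": ...}).

```python
import threading
import time


class RateLimiter:
    """Hypothetical helper: allow at most `calls` requests per `secs` seconds."""

    def __init__(self, calls, secs, clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.secs = secs
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._lock = threading.Lock()
        self._stamps = []        # times of the calls inside the current window

    def acquire(self):
        """Block until one more call fits inside the window."""
        with self._lock:
            while True:
                now = self._clock()
                # Forget calls that have already left the window.
                self._stamps = [t for t in self._stamps if now - t < self.secs]
                if len(self._stamps) < self.calls:
                    self._stamps.append(now)
                    return
                # Wait until the oldest recorded call leaves the window.
                self._sleep(self.secs - (now - self._stamps[0]))


def build_limiters(endpoints):
    """Create one independent limiter per enabled endpoint."""
    return {
        name: RateLimiter(**cfg["rate_limit"])
        for name, cfg in endpoints.items()
        if cfg.get("enabled", True)
    }
```

Because each endpoint gets its own limiter object, a thread harvesting one repository would call acquire() on that repository's limiter before every request and never wait on the limits of another repository.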
Installation
MongoDB
This package requires a MongoDB engine to save the results. Please read https://www.mongodb.com/docs/manual/administration/install-community/
Package
pip install oxomoc
Usage
Create a config file, e.g. config.py.
Read the comments in the following example for more information.
endpoints = {}
endpoints["dspace_udea"] = {}
endpoints["dspace_udea"]["enabled"] = True #if this endpoint is enabled
endpoints["dspace_udea"]["url"] = "http://bibliotecadigital.udea.edu.co/oai/request"
endpoints["dspace_udea"]["metadataPrefix"] = "dim" #xml format, check if the list in the repository using
endpoints["dspace_udea"]["rate_limit"] = {"calls": 10000, "secs": 1}
endpoints["dspace_udea"]["checkpoint"] = {}
endpoints["dspace_udea"]["checkpoint"]["enabled"] = True
# uses selective harvesting to create the checkpoint.
# check http://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvesting
endpoints["dspace_udea"]["checkpoint"]["selective"] = True
endpoints["dspace_udea"]["checkpoint"]["days"] = 30 # if selective, time step
endpoints["dspace_uext"] = {}
endpoints["dspace_uext"]["enabled"] = True
endpoints["dspace_uext"]["url"] = "http://bdigital.uexternado.edu.co/oai/request"
endpoints["dspace_uext"]["metadataPrefix"] = "dim"
endpoints["dspace_uext"]["rate_limit"] = {"calls": 1000, "secs": 1} # calls per second
endpoints["dspace_uext"]["checkpoint"] = {}
endpoints["dspace_uext"]["checkpoint"]["enabled"] = True
endpoints["dspace_uext"]["checkpoint"]["selective"] = True
endpoints["dspace_uext"]["checkpoint"]["days"] = 30
We suggest using the selective checkpoint if the repository supports it; it is more efficient.
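To make the "days" time step more concrete, the sketch below splits a date range into consecutive from/until windows and builds the matching ListIdentifiers request for each one. It is an illustration under assumptions, not the code Oxomoc runs: the helper names date_windows and list_identifiers_url are hypothetical, and only the verb and its metadataPrefix, from and until arguments come from the OAI-PMH protocol.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def date_windows(start, end, days):
    """Yield consecutive (from, until) date pairs covering start..end inclusive."""
    step = timedelta(days=days)
    one_day = timedelta(days=1)
    cursor = start
    while cursor <= end:
        # OAI-PMH treats both bounds as inclusive, so a window of `days`
        # days ends `days - 1` days after it starts.
        until = min(cursor + step - one_day, end)
        yield cursor, until
        cursor = until + one_day


def list_identifiers_url(base_url, metadata_prefix, from_date, until_date):
    """Build a selective-harvesting ListIdentifiers request URL."""
    query = urlencode({
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": from_date.isoformat(),    # day granularity, YYYY-MM-DD
        "until": until_date.isoformat(),
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    base = "http://bibliotecadigital.udea.edu.co/oai/request"
    for lo, hi in date_windows(date(2023, 1, 1), date(2023, 3, 31), days=30):
        print(list_identifiers_url(base, "dim", lo, hi))
```

With windows like these, a checkpoint only needs to remember which windows have already been harvested, so a restarted run could skip them instead of listing every identifier again.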
To execute it run:
oxomo_run --config config.py
By default:
- It runs in parallel with 2 threads because there are 2 endpoints; if there are more endpoints, it tries to use the maximum number of threads available. Use the --max_thread parameter to control the parallel execution.
- It tries to connect to a local MongoDB instance without credentials.
- The database with the results is oxomo.
The collections produced are:
dspace_udea_identifiers
dspace_udea_identity
dspace_udea_invalid
dspace_udea_errors
dspace_udea_records
where:
- dspace_udea_identifiers: the list of identifiers used for the checkpoints. It also holds additional useful information, such as deleted records and the setSpec of every record id.
- dspace_udea_identity: information about the repository, obtained with the verb Identify.
- dspace_udea_invalid: records that are not marked as deleted by the repository, but for which it returns "id does not exist" or some other OAI-PMH error.
- dspace_udea_errors: if a request fails with an error such as 500 or 429, the error is saved in this collection.
- dspace_udea_records: all the records that were downloaded correctly.
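As a quick way to inspect the collections described above, the following sketch counts the documents in each one for a given endpoint. It assumes the naming pattern shown in this README (endpoint name plus a suffix); the helpers collection_names and summarize are hypothetical and not part of Oxomoc. The db argument is expected to behave like a pymongo Database object.

```python
# Suffixes taken from the collection list in this README.
SUFFIXES = ("identifiers", "identity", "invalid", "errors", "records")


def collection_names(endpoint):
    """Return the collection names expected for one endpoint."""
    return [f"{endpoint}_{suffix}" for suffix in SUFFIXES]


def summarize(db, endpoint):
    """Return {collection name: document count} for one endpoint.

    `db` is any object that supports db[name].count_documents(filter),
    for example a pymongo Database.
    """
    return {
        name: db[name].count_documents({})
        for name in collection_names(endpoint)
    }


# Example with pymongo against a local MongoDB instance without credentials
# (assumes pymongo is installed and the harvest has already run):
#
#   from pymongo import MongoClient
#   db = MongoClient()["oxomo"]
#   for name, count in summarize(db, "dspace_udea").items():
#       print(name, count)
```

A large count in the errors or invalid collection compared with the records collection is a hint that the rate limit or the endpoint configuration deserves a second look.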
Please check oxomo_run for more options.
License
BSD-3-Clause License