Highly configurable oai-harvester based on sickle.
oaipmharvest
Description
oaipmharvest is a harvester for OAI-PMH written in Python and based on sickle (for now). Its special focus lies on support for advanced, non-standard use cases. If you just need the standard feature set, you might be better off with something more mature and better tested.
oaipmharvest connects to a given OAI endpoint and stores its responses in a given output folder. It lets you make incremental requests to that endpoint or restrict the result set by date. It also provides several features to dynamically construct set specifiers from smaller parts.
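As a rough illustration of what a date-restricted request looks like at the protocol level, the following sketch builds an OAI-PMH ListRecords URL with the standard library only. The function name build_list_records_url and its parameters are hypothetical and not part of oaipmharvest; only the query arguments (verb, metadataPrefix, from, until, set) come from the OAI-PMH protocol itself.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_records_url(
    endpoint_url: str,
    metadata_prefix: str,
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build a ListRecords request URL (hypothetical helper, not the tool's API)."""
    # verb and metadataPrefix are required for a first ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # from/until/set are optional selective-harvesting arguments.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return f"{endpoint_url}?{urlencode(params)}"


url = build_list_records_url(
    "https://www.contributions-to-entomology.org/oai/",
    "marcxml",
    from_date="2021-01-01",
)
print(url)
```

An incremental harvest would typically pass the date of the previous run as from_date; how oaipmharvest tracks that date internally is not shown here.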
This is an alpha release. Use with caution.
Features
- Configuration via TOML
- Advanced configuration support for dynamic sets (e.g. those supported by BASE)
Installation
After cloning the git repository locally, set up a virtual environment and run
pip install .
Running
After installation, run the application with the CLI command oaipm_harvest. Calling oaipm_harvest -h prints the help text:
    usage: oaipm_harvest [-h] [--from FROM] [--until UNTIL] file

    positional arguments:
      file                  Config file (TOML)

    optional arguments:
      -h, --help            show this help message and exit
      --from FROM, -f FROM  Harvest only items that where published after the specified date
      --until UNTIL, -u UNTIL
                            Harvest only items that where published before the specified date
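To make the argument handling concrete, here is a minimal argparse sketch that accepts the same arguments as the help text above. It is a hypothetical reconstruction for illustration, not the project's actual source; the function name build_parser and the attribute name from_date are assumptions (a separate dest is used because from is a reserved word in Python).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI, not the tool's own code."""
    parser = argparse.ArgumentParser(prog="oaipm_harvest")
    parser.add_argument("file", help="Config file (TOML)")
    parser.add_argument(
        "--from", "-f",
        dest="from_date",  # assumed name; "from" cannot be used as an attribute
        metavar="FROM",
        help="Harvest only items published after the specified date",
    )
    parser.add_argument(
        "--until", "-u",
        metavar="UNTIL",
        help="Harvest only items published before the specified date",
    )
    return parser


# Equivalent to: oaipm_harvest --from 2021-01-01 conf/my-journal.conf
args = build_parser().parse_args(["--from", "2021-01-01", "conf/my-journal.conf"])
print(args.file, args.from_date, args.until)
```

The date format 2021-01-01 is an assumption based on the day granularity that OAI-PMH defines; the help text does not state which formats the tool accepts.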
To harvest a specific OAI endpoint, you have to provide a configuration file. An example configuration file for the most basic use case could be conf/my-journal.conf, containing, for example:
endpoint_url = "https://www.contributions-to-entomology.org/oai/"
metadata_prefixes = ["marcxml"]
out_dir = "./out_cte"
use_sets = false
where
endpoint_url is the OAI base URL you want to connect to.
metadata_prefixes is a list of formats you want to download. Each format is simply handed to the OAI interface, so whether a given format is supported depends on that interface.
out_dir is the directory where all the downloaded data will be stored. If the given folder(s) do not exist, they will be created.
use_sets is set to false in this basic example, so the harvest is not restricted to sets.
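The configuration keys described above can be read and checked with a few lines of Python. This is a minimal sketch, not the loader oaipmharvest itself uses: the helper names load_config and ensure_out_dir and the choice of required keys are assumptions made for illustration. It relies on tomllib from the standard library (Python 3.11+) and falls back to the tomli backport on older versions.

```python
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11 on
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: backport installed on older Pythons

EXAMPLE_CONF = """
endpoint_url = "https://www.contributions-to-entomology.org/oai/"
metadata_prefixes = ["marcxml"]
out_dir = "./out_cte"
use_sets = false
"""

# Assumed to be mandatory for this sketch; the tool may be more lenient.
REQUIRED_KEYS = ("endpoint_url", "metadata_prefixes", "out_dir")


def load_config(text: str) -> dict:
    """Parse TOML text and check that the basic keys are present."""
    config = tomllib.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing config keys: {', '.join(missing)}")
    return config


def ensure_out_dir(config: dict) -> Path:
    """Create the output directory, including parents, if it does not exist."""
    out_dir = Path(config["out_dir"])
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


config = load_config(EXAMPLE_CONF)
print(config["metadata_prefixes"], config["use_sets"])
```

Calling ensure_out_dir(config) would then create ./out_cte, matching the documented behaviour that missing folders are created.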
Licence
All parts of this code are copyrighted by the University Library JCS, Frankfurt a. M. The project is made available under the Mozilla Public License 2.0.
Acknowledgement
This is a project created and maintained by the Specialised Information Service for Linguistics at the University Library J. C. Senckenberg and funded by the German Research Foundation (DFG; project identifier 326024153).
Hashes for oaipmharvest-0.0.2-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 2546a1ab4d604920d01737a08e763b28123122707d49a4d961695382591b14ee |
MD5 | 883597f3cfdad703d5bda97cb932d04c |
BLAKE2b-256 | 1ce3776abe919a78d5dff2cffd2fa1dcc67751f7003ac2b4c28d87630044574c |