A harvester to collect records from an OAI-PMH enabled provider.

License: BSD

Description

The harvester can be used to carry out one-time harvesting of all records from a particular OAI-PMH provider by giving its base URL. It can also be used for selective harvesting, e.g. to harvest only records updated after or before specified dates.

To assist in regular harvesting from one or more OAI-PMH providers, there’s a provider registry. A short, memorable name for a provider can be associated with its base URL, a destination directory for harvested records, and the format (metadataPrefix) in which records should be harvested. The registry also records the date and time of the most recent harvest, and automatically adds this to subsequent requests in order to avoid repeatedly harvesting unmodified records.

The registry could be used in conjunction with a scheduler (e.g. CRON) to maintain a reasonably up-to-date copy of the records held by one or more providers. Examples of how to accomplish these tasks are available below.

Author(s)

John Harrison <john.harrison@liv.ac.uk> at the University of Liverpool

Latest Version

The latest release version is available in the Python Package Index:

https://pypi.python.org/pypi/oaiharvest

Source code is under version control and available from:

http://github.com/bloomonkey/oai-harvest

Documentation

All executable commands are self-documenting, i.e. you can get help on how to use them with the -h or --help option.

At this time the only additional documentation that exists can be found in this README file!

Requirements / Dependencies

Note that Python 3.x support requires pyoai 2.4.6+.

As this release is not yet available on PyPI, use pip3 install git+https://github.com/infrae/pyoai.git

Python 3 support is still in beta and might have some bugs.

Installation

Users

pip install git+http://github.com/bloomonkey/oai-harvest.git#egg=oaiharvest

Developers

I recommend that you use virtualenv to isolate your development environment from system Python and any packages that may be installed there.

  1. In GitHub, fork the repository

  2. Clone your fork:

    git clone git@github.com:<username>/oai-harvest.git

  3. Set up development virtualenv using tox:

    pip install tox
    tox -e dev

  4. Activate development virtualenv:

    *nix:

    source env/bin/activate

    Windows:

    env\Scripts\activate

Bugs, Feature requests etc.

Bug reports and feature requests can be submitted to the GitHub issue tracker: http://github.com/bloomonkey/oai-harvest/issues

If you’d like to contribute code, patches etc., please email the author or submit a pull request on GitHub.

Examples

Harvesting records from an OAI-PMH provider URL

All records

oai-harvest http://example.com/oai

Records modified since a certain date

oai-harvest --from 2013-01-01 http://example.com/oai

Records from a named set

oai-harvest --set "some:set" http://example.com/oai
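For orientation, the selective harvesting options above correspond to arguments that the OAI-PMH protocol defines for its ListRecords verb (metadataPrefix, from, until, set). The following is a minimal sketch of how such a request URL could be assembled. The function name build_list_records_url and the default prefix oai_dc are illustrative assumptions, and this is not necessarily how oai-harvest builds its requests internally.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL (hypothetical helper, for illustration).

    Only arguments that were actually supplied are added to the query string.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # harvest records modified on/after this date
    if until_date:
        params["until"] = until_date    # harvest records modified on/before this date
    if set_spec:
        params["set"] = set_spec        # restrict the harvest to a named set
    return base_url + "?" + urlencode(params)
```

Calling build_list_records_url("http://example.com/oai", from_date="2013-01-01") yields a URL whose query string carries verb, metadataPrefix and from, in that order.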

Limiting the number of records to harvest

oai-harvest --limit 50 http://example.com/oai
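The OAI-PMH ListRecords request has no argument for capping the number of records, so a limit of this kind has to be applied on the client side. As a rough sketch of the idea (the helper name limit_records is an assumption, and oai-harvest may implement its --limit option differently), a record iterator can simply be truncated:

```python
from itertools import islice


def limit_records(records, limit=None):
    """Return an iterator over at most `limit` items of `records`.

    `records` can be any iterable; `limit=None` means no limit.
    """
    return islice(records, limit)
```

Because islice is lazy, records beyond the limit are never pulled from the underlying iterator.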

Getting help on all available options

oai-harvest --help

OAI-PMH Provider Registry

Adding a provider

oai-reg add provider1 http://example.com/oai/1

If you don’t supply --metadataPrefix and --directory options, you will be interactively prompted to supply alternatives, or accept the defaults.

Removing an existing provider

oai-reg rm provider1 [provider2]

Listing existing providers

oai-reg list

Harvesting from OAI-PMH providers in the registry

You can harvest from one or more providers in the registry using the short names that they were registered with:

oai-harvest provider1 [provider2]

By default, this will harvest all records modified since the last harvest from each provider. You can override this behavior using the --from and --until options.
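The default and the overrides described above can be pictured with a small sketch. It assumes a hypothetical in-memory registry (a dict keyed by provider name, with a last_harvest entry) and a hypothetical helper harvest_window; the real registry's storage and field names are not documented here and may differ.

```python
from datetime import datetime, timezone

# UTC datestamp format defined by the OAI-PMH protocol
OAI_DATESTAMP = "%Y-%m-%dT%H:%M:%SZ"


def harvest_window(registry, name, now, from_override=None, until_override=None):
    """Return the (from, until) pair for a harvest and record its start time.

    `registry` is a hypothetical dict: provider name -> settings dict.
    An explicit override wins; otherwise the last harvest time is used.
    A provider that was never harvested gets from=None (harvest everything).
    """
    entry = registry[name]
    start = from_override or entry.get("last_harvest")
    # Remember when this harvest started so the next one can resume from here.
    entry["last_harvest"] = now.strftime(OAI_DATESTAMP)
    return start, until_override
```

For example, with registry = {"provider1": {}}, a first call returns (None, None), and a later call returns the timestamp recorded by the first call as its from value.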

You can also harvest from all providers in the registry:

oai-harvest all

Scheduling Regular Harvesting

To maintain a reasonably up-to-date copy of all the records held by those providers, one could configure a scheduler to periodically harvest from all registered providers. For example, to tell CRON to harvest from all providers at 2am every day, one might add the following to crontab:

0 2 * * * oai-harvest all
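If the scheduled job should also log failures, the command can be wrapped in a small script that the scheduler calls instead. This is only a sketch: the function name run_harvest and the injectable runner parameter are assumptions made for illustration, and it relies solely on the oai-harvest command line shown above.

```python
import logging
import subprocess


def run_harvest(providers=("all",), runner=subprocess.run):
    """Run `oai-harvest` for the given registry names and return its exit code.

    `runner` defaults to subprocess.run; it is a parameter only so that the
    function can be exercised without actually invoking the harvester.
    """
    command = ["oai-harvest", *providers]
    result = runner(command, capture_output=True, text=True)
    if result.returncode != 0:
        logging.error("%s failed (%d): %s",
                      " ".join(command), result.returncode, result.stderr)
    else:
        logging.info("%s completed", " ".join(command))
    return result.returncode
```

A script that ends with raise SystemExit(run_harvest()) passes the harvester's exit code back to the scheduler, so a failed run remains visible to it.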
