
Tool for harvesting Trove digitised newspaper articles.

Project Description

This is a tool for harvesting large quantities of digitised newspaper articles from Trove.

It has been tested on Mac OS X and Windows 7, and should work with both Python 2.7 and Python 3.

Installation

Assuming you have Python and virtualenv installed, just run:

$ virtualenv mytroveharvests
$ cd mytroveharvests
$ source bin/activate
$ pip install troveharvester

On Windows it should be:

> virtualenv mytroveharvests
> cd mytroveharvests
> Scripts\activate
> pip install troveharvester

Basic usage

Before you do any harvesting you need to get yourself a Trove API key.

There are three basic commands:

  • start – start a new harvest
  • restart – restart a stalled harvest
  • report – view harvest details

Start a harvest

To start a new harvest you can just do:

$ cd mytroveharvests
$ source bin/activate
$ troveharvester start "[Trove query]" [Trove API key]

Or on Windows:

> cd mytroveharvests
> Scripts\activate
> troveharvester start "[Trove query]" [Trove API key]

The Trove query can be either a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes.

A data directory will be created automatically to hold all of your harvests. Each harvest is saved into its own directory, named with the current timestamp. Details of the harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.
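
The two files in a harvest directory can be inspected with a few lines of Python. This is a minimal sketch and not part of troveharvester itself: the function name summarise_harvest is hypothetical, the code assumes Python 3 and UTF-8 encoded files, and it makes no assumption about which columns results.csv contains. It simply reports whatever header row and row count it finds.

```python
import csv
import json
import os


def summarise_harvest(harvest_dir):
    """Summarise one harvest directory.

    Hypothetical helper, not part of troveharvester. Returns the CSV
    column names, the number of data rows, and the saved configuration.
    """
    csv_path = os.path.join(harvest_dir, "results.csv")
    meta_path = os.path.join(harvest_dir, "metadata.json")

    # The csv module recommends newline="" when opening CSV files.
    # UTF-8 is an assumption about how the files were written.
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        rows = list(reader)
        columns = reader.fieldnames or []

    with open(meta_path, encoding="utf-8") as meta_file:
        metadata = json.load(meta_file)

    return {"columns": columns, "articles": len(rows), "metadata": metadata}
```

Pass it the path of a single timestamped harvest directory, then look at the "columns" entry to see what fields your version of the harvester actually writes.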

Options:

--max [integer]
specify a maximum number of articles to harvest (multiples of 20)
--pdf
save a copy of each article as a PDF (this makes the harvest a lot slower, as you have to allow a couple of seconds for each PDF to generate)
--text
save the OCR'd text of each article into a separate .txt file
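
The maximum is described as being in multiples of 20. If the number of articles you want is not a multiple of 20, one option is to round it up before passing it to the max option. The helper below is a sketch of that arithmetic only. The name round_up_to_batch is hypothetical, and how troveharvester itself treats a value that is not a multiple of 20 is not stated here.

```python
def round_up_to_batch(wanted, batch_size=20):
    """Round a desired article count up to the next multiple of batch_size.

    Hypothetical helper; batch_size=20 mirrors the "multiples of 20" note.
    """
    if wanted <= 0:
        raise ValueError("wanted must be a positive integer")
    # Ceiling division using integers only: -(-a // b) equals ceil(a / b).
    return -(-wanted // batch_size) * batch_size
```

For example, asking for 50 articles would give a maximum of 60, while 40 is already a multiple of 20 and is returned unchanged.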

Restart a harvest

Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:

$ troveharvester restart

By default the script will try to restart the most recent harvest. You can also restart an earlier harvest:

$ troveharvester restart --harvest [harvest timestamp]
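
To find the timestamp of an earlier harvest, you can list the harvest directories. The sketch below rests on two assumptions that are not confirmed here: that the data directory is literally named data and sits in the current directory, and that the timestamp names sort chronologically when compared as strings. The function name list_harvests is hypothetical.

```python
import os


def list_harvests(data_dir="data"):
    """Return the names of harvest directories under data_dir, sorted.

    Hypothetical helper. The default "data" is an assumed directory name,
    and sorted order equals chronological order only if the timestamp
    names compare that way as strings.
    """
    names = [
        name
        for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    ]
    return sorted(names)
```

Under those assumptions the last item in the returned list is the most recent harvest, and any other item can be passed as the harvest timestamp.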

Get a summary of a harvest

If you’d like to quickly check the status of a harvest, just try:

$ troveharvester report

By default the script will report on the most recent harvest. You can get a summary for an earlier harvest:

$ troveharvester report --harvest [harvest timestamp]

Release History

  • 0.1.10 (this version)
  • 0.1.8
  • 0.1.7
  • 0.1.6
  • 0.1.5
  • 0.1.4
  • 0.1.3
  • 0.1.2
  • 0.1.1
  • 0.1.0

Download Files

troveharvester-0.1.10.tar.gz (13.3 kB), source distribution, uploaded Apr 14, 2017
