Tool for harvesting Trove digitised newspaper articles.
This is a tool for harvesting large quantities of digitised newspaper articles from Trove.
Assuming you have Python and Virtualenv installed, just run:
$ virtualenv mytroveharvests
$ cd mytroveharvests
$ source bin/activate
$ pip install troveharvester
Before you do any harvesting you need to get yourself a Trove API key.
There are three basic commands:
start – start a new harvest
restart – restart a stalled harvest
report – view harvest details
Start a harvest
To start a new harvest you can just do:
$ cd mytroveharvests
$ source bin/activate
$ troveharvester start "[Trove query]" [Trove API key]
The Trove query can be either a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes so that the shell does not interpret characters such as & and ?.
A data directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named with the current timestamp. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.
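Once a harvest has finished, the two files can be inspected with a few lines of Python. The following is a minimal sketch, assuming only the layout described above (a harvest directory containing results.csv and metadata.json). The function name summarise_harvest is hypothetical and not part of troveharvester, and no particular CSV column names are assumed: they are read from the file's header row.

```python
import csv
import json
from pathlib import Path


def summarise_harvest(harvest_dir):
    """Return (metadata, column names, row count) for one harvest directory.

    Assumes the directory holds results.csv and metadata.json; the CSV
    columns are taken from the header row rather than hard-coded.
    """
    harvest_dir = Path(harvest_dir)

    # The harvest configuration is stored as JSON.
    with open(harvest_dir / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)

    # Each row of results.csv describes one harvested article.
    with open(harvest_dir / "results.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []

    return metadata, columns, len(rows)
```

Calling summarise_harvest with the path to one of your harvest directories returns the saved configuration, the list of CSV columns, and the number of articles harvested so far.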
- --max [integer]
specify a maximum number of articles to harvest (multiples of 20)
save a copy of each article as a PDF (this makes the harvest a lot slower, as you have to allow a couple of seconds for each PDF to generate)
save the OCRd text of each article into a separate .txt file
Restart a harvest
Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:
$ troveharvester restart
By default the script will try to restart the most recent harvest. You can also restart an earlier harvest:
$ troveharvester restart --harvest [harvest timestamp]
Get a summary of a harvest
If you’d like to quickly check the status of a harvest, just try:
$ troveharvester report
By default the script will report on the most recent harvest. You can get a summary for an earlier harvest:
$ troveharvester report --harvest [harvest timestamp]
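To find a timestamp to pass to --harvest, you can list the harvest directories. The following is a minimal sketch, assuming the harvests sit in a directory called data under your working directory; that directory name, and the function name list_harvests, are assumptions for illustration rather than anything troveharvester guarantees.

```python
from pathlib import Path


def list_harvests(data_dir="data"):
    """Return the names of the harvest directories in data_dir, sorted by name.

    The default location "data" is an assumption; pass the real path if
    your harvests are stored elsewhere.
    """
    root = Path(data_dir)
    if not root.is_dir():
        # No harvests have been run here yet (or the path is wrong).
        return []
    # Ignore stray files; only subdirectories are treated as harvests.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Each name returned is a candidate value for the [harvest timestamp] argument shown above.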