Tool for harvesting Trove digitised newspaper articles.
TroveHarvester
This is a tool for harvesting large quantities of digitised newspaper articles from Trove.
It has been tested on macOS and Windows 7, and should work with Python 3.
Installation options
No installation required!
If you want to use the harvester without installing anything, just head over to the Trove Newspaper Harvester section in my GLAM Workbench.
Installation via pip
Assuming you have Python 3 installed, just run:
$ python3 -m venv mytroveharvests
$ cd mytroveharvests
$ source bin/activate
$ pip install troveharvester
Basic usage
Before you do any harvesting you need to get yourself a Trove API key.
There are three basic commands:
- start -- start a new harvest
- restart -- restart a stalled harvest
- report -- view harvest details
Start a harvest
To start a new harvest you can just do:
$ cd mytroveharvests
$ source bin/activate
$ troveharvester start "[Trove query]" [Trove API key]
Or on Windows:
> cd mytroveharvests
> Scripts\activate
> troveharvester start "[Trove query]" [Trove API key]
The Trove query can either be a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes.
A data directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named with a current timestamp. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.
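Once a harvest has finished, the files described above can be read with the Python standard library. The following is a minimal sketch, not part of the harvester itself. It assumes the layout is data/[harvest timestamp]/results.csv and metadata.json, that the timestamp directory names sort chronologically as strings, and that results.csv starts with a header row. The function name load_latest_harvest is hypothetical.

```python
import csv
import json
from pathlib import Path


def load_latest_harvest(data_dir="data"):
    """Return (harvest_dir, metadata, rows) for the newest harvest directory."""
    # Assumption: each harvest lives in its own subdirectory of data_dir and
    # the timestamp names sort chronologically when compared as strings.
    harvests = sorted(p for p in Path(data_dir).iterdir() if p.is_dir())
    if not harvests:
        raise FileNotFoundError(f"no harvest directories found in {data_dir}")
    latest = harvests[-1]

    # metadata.json holds the harvest configuration details.
    with open(latest / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)

    # results.csv holds the details of harvested articles. DictReader takes
    # the column names from the header row, so none are hard-coded here.
    with open(latest / "results.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    return latest, metadata, rows
```

Calling load_latest_harvest() from the mytroveharvests directory should return the harvest directory, the saved configuration as a dictionary, and a list with one dictionary per row of results.csv.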
Options:
--max [integer] specify a maximum number of articles to harvest (multiples of 20)
--pdf save a copy of each article as a PDF (this makes the harvest a lot slower, as you have to allow a couple of seconds for each PDF to generate)
--text save the OCRd text of each article into a separate .txt file
--image save an image of each article into a separate .jpg file (if the article is split over more than one page there will be multiple images)
--include_linebreaks preserve linebreaks in saved text files
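If you want to launch harvests from a Python script rather than typing the command, the options above can be assembled into an argument list. This is only a sketch: the helper names build_start_command and run_harvest are hypothetical, and it assumes the options can follow the query and API key on the command line. Rejecting --max values that are not multiples of 20 is a choice made by this sketch, not documented behaviour of the tool.

```python
import subprocess


def build_start_command(query_url, api_key, max_articles=None, pdf=False,
                        text=False, image=False, include_linebreaks=False):
    """Build the argument list for `troveharvester start`."""
    command = ["troveharvester", "start", query_url, api_key]
    if max_articles is not None:
        # --max is described as taking multiples of 20, so this sketch
        # refuses anything else rather than guessing how it would be handled.
        if max_articles <= 0 or max_articles % 20 != 0:
            raise ValueError("max_articles should be a positive multiple of 20")
        command += ["--max", str(max_articles)]
    if pdf:
        command.append("--pdf")
    if text:
        command.append("--text")
    if image:
        command.append("--image")
    if include_linebreaks:
        command.append("--include_linebreaks")
    return command


def run_harvest(command):
    """Run the harvester and raise CalledProcessError if it exits with an error."""
    # Passing a list (no shell) means the query URL needs no extra quoting.
    return subprocess.run(command, check=True)
```

Run it with the virtual environment activated so that the troveharvester command is on your path.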
Restart a harvest
Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:
$ troveharvester restart
By default the script will try to restart the most recent harvest. You can also restart an earlier harvest:
$ troveharvester restart --harvest [harvest timestamp]
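To find the timestamp of an earlier harvest, you can list the directories inside data. The sketch below assumes that every subdirectory of data is one harvest and that its name is the timestamp expected by --harvest. The function name list_harvests is hypothetical.

```python
from pathlib import Path


def list_harvests(data_dir="data"):
    """Return the names of the harvest directories, sorted as strings."""
    root = Path(data_dir)
    if not root.is_dir():
        # No data directory yet means no harvests have been started here.
        return []
    # Assumption: directory names are timestamps, so string order is also
    # chronological order (oldest first).
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Printing the result of list_harvests() from the mytroveharvests directory gives the candidate values to pass after --harvest.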
Get a summary of a harvest
If you'd like to quickly check the status of a harvest, just try:
$ troveharvester report
By default the script will report on the most recent harvest. You can get a summary for an earlier harvest:
$ troveharvester report --harvest [harvest timestamp]