Tool for harvesting Trove digitised newspaper articles.

TroveHarvester

This is a tool for harvesting large quantities of digitised newspaper articles from Trove.

It has been tested on macOS and Windows 7, and should work with both Python 2.7 and Python 3.

Installation options

No installation required!

If you want to use the harvester without installing anything, just head over to the Trove Newspaper Harvester repository in my GLAM Workbench.

Installation via Docker

Assuming you have Docker installed and running, just spin up a troveharvester container:

    $ docker run -v $(pwd):/troveharvester/data -it wragge/troveharvester /bin/bash

Note that this will store the harvested data in the current working directory on your local filesystem.

Installation via pip

Assuming you have Python and virtualenv installed, just run:

    $ virtualenv mytroveharvests
    $ cd mytroveharvests
    $ source bin/activate
    $ pip install troveharvester

On Windows it should be:

    > virtualenv mytroveharvests
    > cd mytroveharvests
    > Scripts\activate
    > pip install troveharvester

Basic usage

Before you do any harvesting you need to get yourself a Trove API key.

There are three basic commands:

  • start -- start a new harvest
  • restart -- restart a stalled harvest
  • report -- view harvest details

Start a harvest

To start a new harvest you can just do:

    $ cd mytroveharvests
    $ source bin/activate
    $ troveharvester start "[Trove query]" [Trove API key]

Or on Windows:

    > cd mytroveharvests
    > Scripts\activate
    > troveharvester start "[Trove query]" [Trove API key]

The Trove query can either be a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes.

A data directory will be created automatically to hold all of your harvests. Each harvest is saved into its own directory, named with the current timestamp. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.
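
If you want to check a harvest directory from your own Python code, something like the sketch below might help. It counts the rows in results.csv and loads metadata.json. The function name `summarise_harvest` is made up for this example, and the sketch assumes that results.csv starts with a header row and that both files are UTF-8 encoded. Neither assumption is confirmed here, so check your own files first.

```python
import csv
import json
from pathlib import Path


def summarise_harvest(harvest_dir):
    """Return (article_count, metadata) for one harvest directory.

    Hypothetical helper, not part of troveharvester itself.
    """
    harvest_dir = Path(harvest_dir)
    # Assumption: results.csv has a header row followed by one row per article.
    with open(harvest_dir / "results.csv", newline="", encoding="utf-8") as f:
        article_count = sum(1 for _ in csv.DictReader(f))
    # metadata.json holds the harvest configuration details.
    with open(harvest_dir / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)
    return article_count, metadata
```

Call it with the path to a single timestamped harvest directory. No column or key names are hard-coded, because the exact contents of the two files are not described here.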

Options:

--max [integer] specify a maximum number of articles to harvest (use a multiple of 20)

--pdf save a copy of each article as a PDF (this makes the harvest a lot slower as you have to allow a couple of seconds for each PDF to generate)

--text save the OCRd text of each article into a separate .txt file
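
If you are launching harvests from a Python script, the options above can be assembled into an argument list. This is only a sketch: `build_start_command` is a hypothetical helper, and it assumes the options can be placed after the query and API key. It refuses a `--max` value that is not a positive multiple of 20, which is a cautious reading of the note above and not documented behaviour.

```python
def build_start_command(query, api_key, max_articles=None, pdf=False, text=False):
    """Build the argument list for `troveharvester start`.

    Hypothetical helper; option placement after the positional
    arguments is an assumption.
    """
    command = ["troveharvester", "start", query, api_key]
    if max_articles is not None:
        # The options list says --max works in multiples of 20.
        if max_articles <= 0 or max_articles % 20 != 0:
            raise ValueError("max_articles should be a positive multiple of 20")
        command += ["--max", str(max_articles)]
    if pdf:
        command.append("--pdf")
    if text:
        command.append("--text")
    return command
```

The list can be handed to `subprocess.run(command, check=True)`. Because the query is passed as its own list item, it does not need the double quotes that the shell requires.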

Restart a harvest

Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:

    $ troveharvester restart

By default the script will try to restart the most recent harvest. You can also restart an earlier harvest:

    $ troveharvester restart --harvest [harvest timestamp]
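
To find a timestamp to pass to `--harvest`, you could list the harvest directories. The sketch below assumes the data directory is called `data` and sits in the current working directory, and that the timestamp names sort chronologically when compared as strings. `list_harvests` is a made-up name for this example.

```python
from pathlib import Path


def list_harvests(data_dir="data"):
    """Return the names of harvest directories, oldest first.

    Hypothetical helper; the default location of the data directory
    is an assumption.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return []
    # Assumption: timestamp names sort chronologically as plain strings.
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())
```

Under those assumptions, the last item in the returned list is the most recent harvest.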

Get a summary of a harvest

If you'd like to quickly check the status of a harvest, just try:

    $ troveharvester report

By default the script will report on the most recent harvest. You can get a summary for an earlier harvest:

    $ troveharvester report --harvest [harvest timestamp]
