Tool for harvesting Trove digitised newspaper articles.

Project description

TroveHarvester

This is a tool for harvesting large quantities of digitised newspaper articles from Trove.

It has been tested on macOS and Windows 7, and should work with Python 3.

Installation options

No installation required!

If you want to use the harvester without installing anything, just head over to the Trove Newspaper Harvester section in my GLAM Workbench.

Installation via pip

Assuming you have Python 3 installed, just run:

    $ python3 -m venv mytroveharvests
    $ cd mytroveharvests
    $ source bin/activate
    $ pip install troveharvester

Basic usage

Before you do any harvesting you need to get yourself a Trove API key.

There are three basic commands:

start -- start a new harvest
restart -- restart a stalled harvest
report -- view harvest details

Start a harvest

To start a new harvest you can just do:

    $ cd mytroveharvests
    $ source bin/activate
    $ troveharvester start "[Trove query]" [Trove API key]

Or on Windows:

    > cd mytroveharvests
    > Scripts\activate
    > troveharvester start "[Trove query]" [Trove API key]

The Trove query can be either a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes.

A data directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named with the current timestamp. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.
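
Once a harvest has finished, the two files described above can be read with the Python standard library. The following is a minimal sketch, not part of the harvester itself: the function name `load_harvest` is hypothetical, and because the column names in results.csv are not listed here, the code takes them from the header row rather than assuming any.

```python
import csv
import json
from pathlib import Path


def load_harvest(harvest_dir):
    """Return (metadata, rows) for one harvest directory.

    Hypothetical helper: assumes the directory contains the
    metadata.json and results.csv files described above.
    """
    harvest_dir = Path(harvest_dir)

    # metadata.json holds the harvest configuration details.
    with open(harvest_dir / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)

    # results.csv holds the details of harvested articles. Column names
    # come from the file's header row; none are assumed here.
    with open(harvest_dir / "results.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    return metadata, rows
```

You would call it with the path to one timestamped harvest directory inside the data directory, then inspect `rows[0].keys()` to see which columns your harvest actually contains.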

Options:

--max [integer] specify a maximum number of articles to harvest (multiples of 20)

--pdf save a copy of each article as a PDF (this makes the harvest a lot slower as you have to allow a couple of seconds for each PDF to generate)

--text save the OCRd text of each article into a separate .txt file

--image save an image of each article into a separate .jpg file (if the article is split over more than one page there will be multiple images)

--include_linebreaks preserve linebreaks in saved text files
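
If you want to launch harvests from a Python script, the options above can be assembled into an argument list. This is only a sketch: `build_start_command` is a hypothetical helper, it uses just the flags listed above, and rounding `--max` up to the next multiple of 20 is an assumption about how you might want to handle other values, not documented behaviour of the harvester.

```python
import math


def build_start_command(query, api_key, max_articles=None, pdf=False,
                        text=False, image=False, include_linebreaks=False):
    """Build the argument list for `troveharvester start` (hypothetical helper)."""
    cmd = ["troveharvester", "start", query, api_key]

    if max_articles is not None:
        # Assumption: round up to the next multiple of 20, since --max
        # is described as working in multiples of 20.
        rounded = math.ceil(max_articles / 20) * 20
        cmd += ["--max", str(rounded)]
    if pdf:
        cmd.append("--pdf")
    if text:
        cmd.append("--text")
    if image:
        cmd.append("--image")
    if include_linebreaks:
        cmd.append("--include_linebreaks")

    return cmd
```

Passing the resulting list to `subprocess.run(cmd, check=True)` inside the activated virtual environment avoids shell quoting problems with the query URL, because each list item is handed to the program as a single argument.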

Restart a harvest

Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:

    $ troveharvester restart

By default the script will try to restart the most recent harvest. You can also restart an earlier harvest:

    $ troveharvester restart --harvest [harvest timestamp]
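
To find the timestamp of an earlier harvest, you can list the directories that hold your harvests. The sketch below makes two assumptions that you should check against your own setup: that the data directory is named `data` and sits in the directory where you ran the harvester, and that the timestamp names sort chronologically when sorted as text. The function name `list_harvests` is hypothetical.

```python
from pathlib import Path


def list_harvests(data_dir="data"):
    """Return the names of harvest directories, sorted by name.

    Assumption: each harvest lives in its own subdirectory of data_dir,
    named with a timestamp.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        # No harvests have been created here yet.
        return []
    # Ignore stray files; only subdirectories are treated as harvests.
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())
```

Any name returned by `list_harvests()` can then be used as the value of `--harvest`.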

Get a summary of a harvest

If you'd like to quickly check the status of a harvest, just try:

    $ troveharvester report

By default the script will report on the most recent harvest. You can get a summary for an earlier harvest:

    $ troveharvester report --harvest [harvest timestamp]

Project details

Release history

0.5.2 -- Sep 22, 2022
0.5.1 -- Aug 29, 2022
0.5.0 -- Jun 22, 2022
0.4.2 -- Apr 23, 2021
0.4.1 -- Mar 7, 2021 (this version)
0.4.0 -- Nov 25, 2020
0.3.3 -- Sep 23, 2020
0.3.2 -- Jul 29, 2020
0.3.1 -- Jun 28, 2020
0.3.0 -- Jun 28, 2020
0.2.3 -- Feb 4, 2020
0.2.2 -- Jan 19, 2019
0.2.1 -- Jan 18, 2019
0.1.13 -- May 10, 2018
0.1.12 -- Feb 23, 2018
0.1.11 -- Feb 23, 2018
0.1.10 -- Apr 14, 2017
0.1.8 -- Dec 3, 2016
0.1.7 -- Dec 3, 2016
0.1.6 -- Nov 5, 2016
0.1.5 -- May 23, 2016
0.1.4 -- Apr 25, 2016
0.1.3 -- Apr 25, 2016
0.1.2 -- Apr 25, 2016
0.1.1 -- Apr 24, 2016
0.1.0 -- Apr 23, 2016

Download files

Source Distribution

troveharvester-0.4.1.tar.gz (14.0 kB)

Uploaded Mar 7, 2021 Source

Built Distribution

troveharvester-0.4.1-py3-none-any.whl (13.1 kB)

Uploaded Mar 7, 2021 Python 3

File details

Details for the file troveharvester-0.4.1.tar.gz.

File metadata

Download URL: troveharvester-0.4.1.tar.gz
Upload date: Mar 7, 2021
Size: 14.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.8.5

File hashes

Hashes for troveharvester-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`af72c1e2715d68d0bfeec3ca99404a2ecdcf3e4a4ae848a9f9901ee4daa727a3`
MD5	`1e38c7ba3d573ba4072f2bace0388f6e`
BLAKE2b-256	`5aa16754aa5e7b4f44a8a80d970101a42edea14fa16f0039f5fe993140cbe300`

File details

Details for the file troveharvester-0.4.1-py3-none-any.whl.

File metadata

Download URL: troveharvester-0.4.1-py3-none-any.whl
Upload date: Mar 7, 2021
Size: 13.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.8.5

File hashes

Hashes for troveharvester-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8d1577325d6a67b1caf505f663ac1f0b2960e90ded724f71d78ccb89b059496b`
MD5	`d909f865d1cc3a64e38cb0f1b2332729`
BLAKE2b-256	`8055c4b1c96cd5044af948e75433ef1700900de15fb226e7381ebdb6cd5e5907`
