Tool for bulk harvests of digitised newspaper articles from Trove

Project description

trove-newspaper-harvester

View the full documentation

The Trove Newspaper (& Gazette) Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs. While the web interface will only show you the first 2,000 results matching your search, the Newspaper Harvester will get everything.

No installation required!

If you want to use the harvester without installing anything, just head over to the Trove Newspaper Harvester section in my GLAM Workbench.

Installation

pip install trove-newspaper-harvester

Before you do any harvesting you need to get yourself a Trove API key.
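One way to keep the key out of your scripts is to read it from an environment variable. The sketch below assumes a variable named TROVE_API_KEY; that name and the load_api_key helper are chosen here for illustration and are not something the harvester itself looks for.

```python
import os


def load_api_key(env_var="TROVE_API_KEY"):
    """Return the Trove API key stored in an environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Trove API key"
        )
    return key
```

The returned value can then be passed wherever an API key is expected.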

Use as a library

from trove_newspaper_harvester.core import prepare_query, Harvester

Generate a set of query parameters using prepare_query.

my_query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
my_api_key = "mYSecREtkEy"

my_query_params = prepare_query(query=my_query)
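If you would rather build the search url in code than copy it from the browser, something like the following may help. It is only a sketch: build_search_url is a hypothetical helper, and it assumes the web interface accepts a form-encoded keyword parameter like the one in the example url above.

```python
from urllib.parse import urlencode

# Base of the web interface search url used in the example above.
SEARCH_URL = "https://trove.nla.gov.au/search/category/newspapers"


def build_search_url(keyword):
    """Build a newspaper search url, encoding the keyword for a query string."""
    return f"{SEARCH_URL}?{urlencode({'keyword': keyword})}"


# Spaces in the keyword are encoded as "+".
print(build_search_url("cyclone wragge"))
```

The resulting string could then be passed to prepare_query as the query argument.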

Initialise the Harvester with your query parameters and API key.

harvester = Harvester(query_params=my_query_params, key=my_api_key)

Start the harvest!

harvester.harvest()

If the harvest fails, just run harvester.harvest() again.
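Re-running a failed harvest can be automated with a small retry loop. This is a minimal sketch, not part of the package: run_with_retries is a hypothetical helper, and it assumes that any exception raised by the callable means the harvest failed and is safe to try again.

```python
import time


def run_with_retries(harvest_fn, max_attempts=3, wait_seconds=5.0):
    """Call harvest_fn until it succeeds or max_attempts is reached.

    Returns the number of attempts used and re-raises the last error
    if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            harvest_fn()
            return attempt
        except Exception as error:  # assumption: any exception is a failed harvest
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({error}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
```

With the harvester created above, this would be called as run_with_retries(harvester.harvest).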

See the core module documentation for more options and examples.

Use as a command-line tool

There are three basic commands:

  • start – start a new harvest
  • restart – restart a stalled harvest
  • report – view harvest details

Start a harvest

To start a new harvest, run:

troveharvester start "[Trove query]" [Trove API key]

The Trove query can be either a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes.
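If you want to launch the command-line tool from a script, the start command can be assembled as an argument list. The sketch below uses only the start command shown above; build_start_command and start_harvest are hypothetical helper names. Because the arguments are passed as a list and no shell is involved, the url does not need the surrounding double quotes.

```python
import subprocess


def build_start_command(query_url, api_key):
    """Return the argument list for `troveharvester start`."""
    return ["troveharvester", "start", query_url, api_key]


def start_harvest(query_url, api_key):
    """Run the command, raising CalledProcessError on a non-zero exit."""
    return subprocess.run(build_start_command(query_url, api_key), check=True)
```

Calling start_harvest requires the troveharvester command to be installed and on your PATH.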

See the CLI module documentation for more details.


Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.
