A tool for extracting machine-readable data from RecordSearch, the online database of the National Archives of Australia.

RecordSearch Data Scraper

The National Archives of Australia’s online database, RecordSearch, contains a wealth of rich historical data. Unfortunately there’s no API, so we have to resort to screen scrapers to get it out in a reusable form. This is a library of scrapers that extract data about the main entities in RecordSearch – Items, Series, and Agencies – from both individual records and search results.

The main classes are:

  • RSItem() – an individual item
  • RSItemSearch() – an advanced search for items
  • RSSeries() – an individual series
  • RSSeriesSearch() – an advanced search for series
  • RSAgency() – an individual agency
  • RSAgencySearch() – an advanced search for agencies

RecordSearch makes use of an odd assortment of sessions, redirects, and hidden forms, which make scraping a challenge. Please let me know if something isn’t working as expected, as problems can be difficult to pin down!

This is a replacement for the original Recordsearch_tools library. The main changes are:

  • Requirements have been updated (dropping RoboBrowser, which no longer seems to be maintained)
  • The full range of search parameters is now supported for Items, Series, and Agencies
  • There’s a built-in cache for improved efficiency and speed

See the documentation for more details. And check out the RecordSearch section of the GLAM Workbench for examples of what’s possible.

Install

pip install recordsearch-data-scraper

How to use

Retrieve an item using its Item ID.

from recordsearch_data_scraper.scrapers import RSItem, RSItemSearch

item = RSItem('3445411')

View the item data.

item.data
{'title': 'WRAGGE Clement Lionel Egerton : SERN 647 : POB Cheadle England : POE Enoggera QLD : NOK  (Father) WRAGGE Clement Lindley',
 'identifier': '3445411',
 'series': 'B2455',
 'control_symbol': 'WRAGGE C L E',
 'digitised_status': True,
 'digitised_pages': 47,
 'access_status': 'Open',
 'access_decision_reasons': [],
 'location': 'Canberra',
 'retrieved': '2021-04-25T21:12:22.620414+10:00',
 'contents_date_str': '1914 - 1920',
 'contents_start_date': '1914',
 'contents_end_date': '1920',
 'access_decision_date_str': '12 Apr 2001',
 'access_decision_date': '2001-04-12'}
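
The date fields come back as strings, so they usually need converting before they can be sorted or compared. The following is a minimal sketch using only the standard library, assuming a dictionary shaped like the item data above. The `parse_item_dates` helper is hypothetical and not part of the library, and the sample values are copied from the output above rather than fetched live.

```python
from datetime import date, datetime

# Sample values copied from the item data shown above (not a live request).
item_data = {
    "identifier": "3445411",
    "retrieved": "2021-04-25T21:12:22.620414+10:00",
    "contents_start_date": "1914",
    "contents_end_date": "1920",
    "access_decision_date": "2001-04-12",
}


def parse_item_dates(data):
    """Hypothetical helper: convert the string date fields to Python objects."""
    return {
        # ISO 8601 timestamp with a UTC offset
        "retrieved": datetime.fromisoformat(data["retrieved"]),
        # ISO 8601 calendar date
        "access_decision_date": date.fromisoformat(data["access_decision_date"]),
        # Assumption: contents dates are bare years, as in this sample;
        # other records may need more careful handling.
        "contents_years": (
            int(data["contents_start_date"]),
            int(data["contents_end_date"]),
        ),
    }


parsed = parse_item_dates(item_data)
print(parsed["access_decision_date"].year)
```

In real use you would probably pass `item.data` to the helper instead of the hard-coded sample.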

Search for items.

search = RSItemSearch(kw='wragge')

View the total number of items in the result set.

search.total_results
209

Access the first page of results.

items = search.get_results()

View the first result.

items['results'][0]
{'series': 'A2479',
 'control_symbol': '17/1306',
 'title': 'The Wragge Estate. Property for sale.',
 'identifier': '149309',
 'access_status': 'Open',
 'location': 'Canberra',
 'contents_date_str': '1917 - 1917',
 'contents_start_date': '1917',
 'contents_end_date': '1917',
 'digitised_status': True}
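
Because each result is a plain dictionary, a page of results can be filtered and exported with the standard library alone. The sketch below assumes the structure shown above (a dictionary with a `results` list). The `results_to_csv` helper is hypothetical, and the single record is copied from the output above rather than fetched live.

```python
import csv
import io

# One record copied from the first search result shown above (not a live request).
items = {
    "results": [
        {
            "series": "A2479",
            "control_symbol": "17/1306",
            "title": "The Wragge Estate. Property for sale.",
            "identifier": "149309",
            "access_status": "Open",
            "location": "Canberra",
            "contents_date_str": "1917 - 1917",
            "contents_start_date": "1917",
            "contents_end_date": "1917",
            "digitised_status": True,
        }
    ]
}


def results_to_csv(results, fields):
    """Hypothetical helper: serialise selected fields of each result as CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not listed in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Keep only digitised items, then export a few columns.
digitised = [r for r in items["results"] if r.get("digitised_status")]
csv_text = results_to_csv(
    digitised, ["identifier", "series", "control_symbol", "title"]
)
print(csv_text)
```

In real use you would probably pass the dictionary returned by `search.get_results()` instead of the hard-coded sample.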

The Series and Agency classes follow exactly the same pattern. See the docs for more examples.

