Tool for extracting machine-readable data from the National Archives of Australia online database, RecordSearch.
RecordSearch Data Scraper
The National Archives of Australia’s online database, RecordSearch, contains lots of rich, historical data. Unfortunately there’s no API, so we have to resort to screen scrapers to get it out in reusable form. This is a library of scrapers to extract data about the main entities in RecordSearch – Items, Series, and Agencies – from both individual records, and search results.
The main classes are:
- RSItem(): an individual item
- RSItemSearch(): an advanced search for items
- RSSeries(): an individual series
- RSSeriesSearch(): an advanced search for series
- RSAgency(): an individual agency
- RSAgencySearch(): an advanced search for agencies
RecordSearch makes use of an odd assortment of sessions, redirects, and hidden forms, which make scraping a challenge. Please let me know if something isn’t working as expected, as problems can be difficult to pin down!
This is a replacement for the original Recordsearch_tools library. The main changes are:
- Requirements have been updated (dropping RoboBrowser which seems to be no longer maintained)
- The full range of search parameters is now supported for Items, Series, and Agencies
- There’s a built-in cache for improved efficiency and speed
See the documentation for more details. And check out the RecordSearch section of the GLAM Workbench for examples of what’s possible.
Install
pip install recordsearch-data-scraper
How to use
Retrieve an item using its Item ID.
from recordsearch_data_scraper.scrapers import RSItem, RSItemSearch
item = RSItem('3445411')
View the item data.
item.data
{'title': 'WRAGGE Clement Lionel Egerton : SERN 647 : POB Cheadle England : POE Enoggera QLD : NOK (Father) WRAGGE Clement Lindley',
'identifier': '3445411',
'series': 'B2455',
'control_symbol': 'WRAGGE C L E',
'digitised_status': True,
'digitised_pages': 47,
'access_status': 'Open',
'access_decision_reasons': [],
'location': 'Canberra',
'retrieved': '2021-04-25T21:12:22.620414+10:00',
'contents_date_str': '1914 - 1920',
'contents_start_date': '1914',
'contents_end_date': '1920',
'access_decision_date_str': '12 Apr 2001',
'access_decision_date': '2001-04-12'}
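The date fields in the item data are plain strings. Here is a minimal sketch of converting them into Python date objects, assuming the values follow the formats shown in the output above (ISO 8601 for retrieved and access_decision_date, a leading four-digit year for the contents dates). The sample dictionary is copied by hand from that output rather than fetched live, and parse_item_dates is a hypothetical helper, not part of the library.

```python
from datetime import date, datetime

# Subset of the item data shown above, copied by hand so the example runs offline.
sample = {
    'identifier': '3445411',
    'retrieved': '2021-04-25T21:12:22.620414+10:00',
    'contents_start_date': '1914',
    'contents_end_date': '1920',
    'access_decision_date': '2001-04-12',
}


def parse_item_dates(data):
    """Hypothetical helper: return typed versions of the date strings."""
    return {
        # Timezone-aware datetime recording when the record was scraped.
        'retrieved': datetime.fromisoformat(data['retrieved']),
        'access_decision_date': date.fromisoformat(data['access_decision_date']),
        # Assumption: the first four characters of each contents date are the year.
        'contents_years': (
            int(data['contents_start_date'][:4]),
            int(data['contents_end_date'][:4]),
        ),
    }


parsed = parse_item_dates(sample)
print(parsed['retrieved'].year, parsed['contents_years'])
```

Other records may have missing or differently formatted dates, so real code would probably want to guard these conversions.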
Search for items.
search = RSItemSearch(kw='wragge')
View the total number of items in the results set.
search.total_results
209
Access the first page of results.
items = search.get_results()
View the first result.
items['results'][0]
{'series': 'A2479',
'control_symbol': '17/1306',
'title': 'The Wragge Estate. Property for sale.',
'identifier': '149309',
'access_status': 'Open',
'location': 'Canberra',
'contents_date_str': '1917 - 1917',
'contents_start_date': '1917',
'contents_end_date': '1917',
'digitised_status': True}
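Because each result is an ordinary dictionary, a page of results can be filtered with standard Python. The sketch below keeps only digitised items. It assumes every result carries a digitised_status key, as the one above does. The first record is copied from the output above; the second is a made-up placeholder, not a real RecordSearch item, and digitised_only is a hypothetical helper.

```python
# A stand-in for the dictionary returned by search.get_results().
page = {
    'results': [
        {   # First result, copied from the output above.
            'series': 'A2479',
            'control_symbol': '17/1306',
            'title': 'The Wragge Estate. Property for sale.',
            'identifier': '149309',
            'access_status': 'Open',
            'digitised_status': True,
        },
        {   # Hypothetical placeholder for illustration, NOT a real record.
            'series': 'EXAMPLE',
            'control_symbol': 'EXAMPLE/1',
            'title': 'Placeholder record for illustration',
            'identifier': '0',
            'access_status': 'Open',
            'digitised_status': False,
        },
    ]
}


def digitised_only(results_page):
    """Hypothetical helper: return the results flagged as digitised."""
    return [r for r in results_page['results'] if r.get('digitised_status')]


digitised = digitised_only(page)
for record in digitised:
    print(record['identifier'], record['series'], record['control_symbol'])
```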
The Series and Agency classes follow exactly the same pattern. See the docs for more examples.
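Since the record classes share one pattern (construct with an identifier, then read the data attribute), a single helper can serve all of them. This is only a sketch: fetch_data is a hypothetical helper, and FakeRecord is a stand-in used so the example runs without the library or a network connection.

```python
def fetch_data(scraper_class, identifier):
    """Hypothetical helper: build a record with the given class and return its data."""
    record = scraper_class(identifier)
    return record.data


class FakeRecord:
    """Stand-in with the same shape as the scraper classes; not part of the library."""

    def __init__(self, identifier):
        self.data = {'identifier': identifier}


print(fetch_data(FakeRecord, 'B2455'))
```

With the library installed, something like fetch_data(RSSeries, 'B2455') should behave the same way, on the assumption that RSSeries accepts a series identifier just as RSItem accepts an Item ID.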