description_harvester
A tool for working with archival description for public access. description_harvester reads archival description into a minimalist data model for public-facing archival description and then converts it to the ArcLight data model and POSTs it into an ArcLight Solr index using PySolr.
description_harvester is designed to be extensible and to harvest archival description from a number of sources. Currently, the only available sources are the ArchivesSpace API (using ArchivesSnake) and EAD 2002 XML files. It's also possible to add output modules that serialize description to EAD or other formats, in addition to or instead of sending description to an ArcLight Solr instance. This opens up new possibilities for managing description using low-barrier formats and tools.
description_harvester is designed to be a drop-in replacement for the ArcLight Traject indexer. It also includes a plugin that attempts to recognize IIIF manifests included as file versions and uses those manifests to fully index digital objects from digital repositories and other sources, including item-level metadata fields, embedded text, OCR text, and transcriptions.
Tested on ASpace up to v3.5.1, but it needs better error handling. Validation is also very minimal, but there is potential to add detailed validation with jsonschema.
Installation
pip install description_harvester
First, you need to configure ArchivesSnake by creating a ~/.archivessnake.yml file with your API credentials, as detailed in the ArchivesSnake configuration docs.
Next, you also need a ~/.description_harvester/config.yml file that lists your Solr URL and the core you want to index to. These can also be overridden with args. description_harvester reads your config.yml as utf-8, so if you're creating this file in a Windows environment you should ensure it's utf-8.
solr_url: http://127.0.0.1:8983/solr
solr_core: blacklight-core
last_query: 0
cache_expiration: 3600
component_id_separator: "_"
online_content_label: "Online access"
The component_id_separator setting customizes the separator between the collection and component IDs in ArcLight URLs. Set component_id_separator: "" for the pre-ArcLight v1.1.0 default, which had no separator. If the setting is missing from config.yml, it defaults to _, as ArcLight now does.
The online_content_label setting allows you to customize the label displayed for items with online content in ArcLight. The default is "Online access".
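To make the effect of component_id_separator concrete, here is a minimal sketch of how a collection ID and a component ID might be joined. The function name build_component_id and the sample IDs are hypothetical and are not taken from the package; the sketch only illustrates the behavior the setting describes.

```python
def build_component_id(collection_id: str, component_id: str, separator: str = "_") -> str:
    """Join a collection ID and a component ID with the configured separator."""
    return f"{collection_id}{separator}{component_id}"


# Default separator, as in ArcLight v1.1.0 and later
with_separator = build_component_id("apap106", "aspace_abc123")

# Empty separator, as in the pre-v1.1.0 default
without_separator = build_component_id("apap106", "aspace_abc123", separator="")
```

Changing the separator changes the IDs of every component document, so it probably needs to match whatever your ArcLight application expects in its URLs.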
Adding custom digital object metadata
You can also add custom digital object metadata fields by adding them to your config.yml under the Solr suffix you would like them to be indexed as. These fields must match metadata fields in your IIIF manifests.
metadata:
  - ssi:
      - date_uploaded
  - ssm:
      - date_digitized
      - extent
  - ssim:
      - legacy_id
      - resource_type
      - coverage
      - preservation_package
      - creator
      - contributor
      - preservation_format
      - source
  - tesm:
      - processing_activity
  - tesim:
      - description
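As a rough illustration of what grouping fields under a Solr suffix means, the sketch below flattens a parsed metadata setting into field names. It assumes the usual Solr dynamic-field convention of name plus underscore plus suffix; the function solr_field_names is hypothetical, and the package's actual naming may differ.

```python
from typing import Dict, List

# The shape of the `metadata` setting once the YAML is parsed:
# a list of single-key mappings from Solr suffix to manifest field names.
metadata_config: List[Dict[str, List[str]]] = [
    {"ssi": ["date_uploaded"]},
    {"ssm": ["date_digitized", "extent"]},
    {"tesim": ["description"]},
]


def solr_field_names(config: List[Dict[str, List[str]]]) -> Dict[str, str]:
    """Map each manifest field name to an assumed `<name>_<suffix>` Solr field."""
    fields = {}
    for group in config:
        for suffix, names in group.items():
            for name in names:
                fields[name] = f"{name}_{suffix}"
    return fields


field_names = solr_field_names(metadata_config)
```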
Repositories
By default, when reading from ArchivesSpace, description_harvester will use the repository name stored there.
To enable the --repo argument, place a copy of your ArcLight repositories.yml file at ~/.description_harvester/repositories.yml. You can then use harvest --id mss001 --repo slug to index using the slug from repositories.yml. This will override the ArchivesSpace repository name.
There is also the option to customize this with a plugin.
Encoding note: While ArcLight does not explicitly read repositories.yml as utf-8, its Rails stack means that you're likely reading it in a utf-8 (non-Windows) environment. Since description_harvester enables you to index from a Windows machine, it expects your ~/.description_harvester/repositories.yml file to be utf-8.
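If you are unsure whether a file saved on Windows is really utf-8, a quick check like the following may help. This is a standalone sketch using only the standard library; is_utf8 is a hypothetical helper and not part of description_harvester.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same text saved two ways: UTF-8, and the Windows-1252 code page
saved_as_utf8 = "name: Café Archive".encode("utf-8")
saved_as_cp1252 = "name: Café Archive".encode("cp1252")
```

To check a real file, pass its raw bytes, for example is_utf8(Path("repositories.yml").read_bytes()) with Path imported from pathlib. Pure ASCII files pass this check regardless of how they were saved.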
Indexing from ArchivesSpace API to ArcLight
Once description_harvester is set up, you can index from the ASpace API to ArcLight using the harvest command.
Index by id_0
You can provide one or more IDs to index using a resource's id_0 field:
harvest --id ua807
harvest --id mss123 apap106
Index by URI
You can also use integers from ASpace URIs for resources, such as 263 for https://my.aspace.edu/resources/263:
harvest --uri 435
harvest --uri 1 755
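If you have full resource URLs rather than bare integers, a small helper along these lines could pull out the number to pass to --uri. The function resource_number is a hypothetical convenience, not something the package provides.

```python
from urllib.parse import urlparse


def resource_number(uri: str) -> int:
    """Return the trailing integer from a resource URI such as .../resources/263."""
    path = urlparse(uri).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    if not last_segment.isdigit():
        raise ValueError(f"No numeric resource ID in {uri!r}")
    return int(last_segment)


number = resource_number("https://my.aspace.edu/resources/263")
```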
Indexing by modified time
Index collections modified in the past hour: harvest --hour
Index collections modified in the past day: harvest --today
Index collections modified since the last run: harvest --updated
Index collections not already in the index: harvest --new
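One way to think about the time-based options is as a cutoff timestamp, since the ArchivesSpace API can filter records by a modified-since time in seconds since the epoch. The sketch below is an assumption about how such cutoffs could be computed, not the package's actual code: the function modified_since_cutoff is hypothetical, and it assumes that last_query in config.yml holds the Unix timestamp of the previous run.

```python
import time

# Length of each look-back window in seconds (assumed meanings of the flags)
WINDOWS = {"hour": 60 * 60, "today": 24 * 60 * 60}


def modified_since_cutoff(option: str, now=None, last_query: int = 0) -> int:
    """Return the epoch-seconds cutoff for a time-based harvest option."""
    now = int(time.time()) if now is None else int(now)
    if option == "updated":
        # Everything modified since the previous run
        return int(last_query)
    return now - WINDOWS[option]
```

Under this assumption, a last_query of 0 would make the first --updated run cover every collection.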
Indexing from EAD 2002
harvest --ead path/to/ead.xml
You can also give it a directory and it will harvest all *.xml files:
harvest --ead path/to/ead_files
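To preview which files a directory harvest would likely pick up, you could list them yourself first. This sketch assumes a non-recursive match on *.xml, which may not be exactly how the package walks the directory; ead_files is a hypothetical helper.

```python
from pathlib import Path


def ead_files(path):
    """Return the EAD files for a path: every *.xml in a directory, or the file itself."""
    target = Path(path)
    if target.is_dir():
        # Sorted so the order is predictable across platforms
        return sorted(target.glob("*.xml"))
    return [target]
```

For example, ead_files("path/to/ead_files") returns a list of Path objects that you can inspect before running the harvest.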
Verbose output
harvest --id ger071 -v
Caching
description_harvester will cache collections from the ArchivesSpace API, storing them by default to ~/.description_harvester/cache after they are converted to the description model. Cache time is set in seconds as cache_expiration in ~/.description_harvester/config.yml. Thus, cache_expiration: 3600 will use the cached data instead of the ArchivesSpace API for data less than 1 hour old.
You can override the cache path in config or turn caching off globally with cache_dir: false.
cache_dir: "~/path/to/my_cache"
cache_dir: "C:/Users/username/my_cache"
cache_dir: false
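The expiration rule can be pictured as a simple age comparison. The following is a minimal sketch of that logic under the description above, not the package's implementation; cache_is_fresh and its parameters are hypothetical names.

```python
import time


def cache_is_fresh(cached_at: float, cache_expiration: int, now=None) -> bool:
    """Return True if data cached at `cached_at` is younger than `cache_expiration` seconds."""
    now = time.time() if now is None else now
    return (now - cached_at) < cache_expiration


# With cache_expiration: 3600, data cached 30 minutes ago is reused,
# while data cached 2 hours ago would be fetched from the API again.
reused = cache_is_fresh(cached_at=0, cache_expiration=3600, now=1800)
refetched = not cache_is_fresh(cached_at=0, cache_expiration=3600, now=7200)
```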
Deleting collections
You can delete one or more collections using the --delete argument. This uses the Solr document ID, such as apap106 for https://my.ArcLight.edu/catalog/apap106.
harvest --delete apap101 apap301
Plugins
Plugins let you add institution-specific customization without modifying the core package. Common use cases might be:
- Customizing repository names based on collection identifiers
- Enriching digital objects with data from local systems (e.g., IIIF manifests, preservation systems)
UAlbany's local plugins may be a helpful example.
Creating a Plugin
- Copy the template: Copy default.py to ~/.description_harvester/ (or use the DESCRIPTION_HARVESTER_PLUGIN_DIR environment variable). You can rename the file.
- Rename the class: The class name is optional and only for your readability. You could use any class name, but the plugin_name class variable must be unique to your implementation:
    class MyInstitutionPlugin(Plugin):
        plugin_name = "my_institution_plugin"
- Implement methods: Override any combination of customization hooks:
  - custom_repository(self, resource): Customize repository names
    - Input: ArchivesSpace resource API object
    - Output: Repository name string or None for default behavior
  - update_record_id(self, record_id, record): Customize record IDs
    - Input: Generated record ID string and Component object
    - Output: Custom ID string or None for default behavior
  - update_dao(self, dao): Enrich digital objects
    - Input: DigitalObject with identifier, label, metadata, etc.
    - Output: Modified DigitalObject with additional metadata
Plugin Discovery
Plugins are automatically loaded from (in order):
- Built-in plugins in the package (e.g., default.py)
- ~/.description_harvester/ directory
- Custom directory set via the DESCRIPTION_HARVESTER_PLUGIN_DIR environment variable
Note: The .py filename doesn't matter - plugin identification is based on the unique plugin_name class variable. You can have multiple plugin classes in a single .py file, or split them across multiple files.
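The idea of identifying plugins by plugin_name rather than by filename can be sketched with a small registry. Everything here is illustrative: the Plugin class is a stand-in for the real base class, register is a hypothetical function, and raising an error on a duplicate name is an assumption rather than documented behavior.

```python
class Plugin:
    """Stand-in for description_harvester.plugins.Plugin, for illustration only."""
    plugin_name = None


class RepositoryNames(Plugin):
    plugin_name = "repository_names"


class ManifestEnricher(Plugin):
    plugin_name = "manifest_enricher"


def register(plugin_classes):
    """Build a registry keyed by plugin_name, rejecting missing or repeated names."""
    registry = {}
    for cls in plugin_classes:
        name = cls.plugin_name
        if not name:
            raise ValueError(f"{cls.__name__} does not set plugin_name")
        if name in registry:
            raise ValueError(f"Duplicate plugin_name {name!r}")
        registry[name] = cls
    return registry


# Both classes could live in one .py file or in separate files
registry = register([RepositoryNames, ManifestEnricher])
```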
Example Plugin
from description_harvester.plugins import Plugin
from description_harvester.iiif_utils import enrich_dao_from_manifest

class MyInstitutionPlugin(Plugin):
    plugin_name = "my_institution"

    def custom_repository(self, resource):
        # Use custom names for special collections
        if resource['id_0'].startswith('sc'):
            return "Special Collections & Archives"
        return None  # Use default for others

    def update_record_id(self, record_id, record):
        # Add repository prefix to collection-level IDs
        if record.level and record.level.lower() == "collection":
            return f"sc_{record_id}"
        return None  # Use default for component IDs

    def update_dao(self, dao):
        # Enrich digital objects with IIIF manifest data
        if 'manifest.json' in dao.identifier:
            enrich_dao_from_manifest(dao, manifest_url=dao.identifier)
        # Add custom logic
        dao.metadata['institution_id'] = 'my_institution'
        return dao
IIIF Utilities
For plugins working with IIIF manifests, description_harvester.iiif_utils provides helper functions:
from description_harvester.iiif_utils import (
    fetch_manifest,              # Fetch and parse manifest from URL
    extract_text_from_manifest,  # Extract OCR/transcription text
    get_thumbnail_url,           # Get thumbnail image URL
    get_rights_statement,        # Get rights/license info
    extract_metadata_fields,     # Get all metadata as dict
    enrich_dao_from_manifest,    # All-in-one enrichment
)

def update_dao(self, dao):
    if 'manifest.json' in dao.identifier:
        # Option 1: Use convenience function
        enrich_dao_from_manifest(dao, manifest_url=dao.identifier)

        # Option 2: Fine-grained control
        manifest = fetch_manifest(dao.identifier)
        if manifest:
            dao.text_content = extract_text_from_manifest(manifest)
            dao.thumbnail_href = get_thumbnail_url(manifest)
            dao.rights_statement = get_rights_statement(manifest)
            # Optionally add metadata fields from manifests
            dao.metadata.update(extract_metadata_fields(manifest))

    # Set metadata fields with whatever local logic
    dao.metadata['custom_field'] = 'custom_value'
    return dao
See the iiif_utils module documentation for all available functions.
SSL Bypass
You may encounter SSL certificate verification errors when fetching IIIF manifests, indicating that the server is not sending the complete certificate chain.
The recommended solution is to work with the server administrator to fix their certificate chain configuration. But if you do need to bypass SSL verification, you can do this by setting the DESCRIPTION_HARVESTER_VERIFY_SSL environment variable to "false".
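For reference, an environment flag like this is typically read along the following lines. This is a sketch of the general pattern only: ssl_verification_enabled is a hypothetical function, and treating the value case-insensitively is an assumption about the package's behavior.

```python
import os


def ssl_verification_enabled(environ=None) -> bool:
    """Return False only when DESCRIPTION_HARVESTER_VERIFY_SSL is set to "false"."""
    environ = os.environ if environ is None else environ
    value = environ.get("DESCRIPTION_HARVESTER_VERIFY_SSL", "true")
    return value.strip().lower() != "false"
```

From a script, you could set the variable before harvesting with os.environ["DESCRIPTION_HARVESTER_VERIFY_SSL"] = "false". Because this turns off certificate checks for manifest requests, it is best limited to servers you trust.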
Use as a library
You can also use description_harvester in a script:
from description_harvester import harvest
harvest(["--id", "myid001"])
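Since harvest takes the same arguments as the command line, a script can assemble the argument list programmatically. The helpers below are hypothetical wrappers that assume the list mirrors the CLI examples in this README (--id followed by one or more IDs, and -v for verbose output).

```python
def build_args(ids, verbose=False):
    """Build an argument list equivalent to `harvest --id <ids...> [-v]`."""
    args = ["--id", *ids]
    if verbose:
        args.append("-v")
    return args


def index_collections(ids, verbose=False):
    """Index the given collection IDs by calling harvest with the built arguments."""
    # Imported here so build_args can be used without the package installed
    from description_harvester import harvest
    harvest(build_args(ids, verbose=verbose))
```

Calling index_collections(["mss123", "apap106"], verbose=True) would then be the scripted equivalent of harvest --id mss123 apap106 -v.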