A tool for working with archival description for public access.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

description_harvester

A tool for working with archival description for public access. description_harvester reads archival description into a minimalist data model for public-facing archival description and then converts it to the ArcLight data model and POSTs it into an ArcLight Solr index using PySolr.

description_harvester is designed to be extensible and to harvest archival description from a number of sources. Currently, the only available sources are the ArchivesSpace API (using ArchivesSnake) and EAD 2002 XML files. It's also possible to add output modules that serialize description to EAD or other formats, in addition to or in place of sending description to an ArcLight Solr instance. This opens up new possibilities for managing description using low-barrier formats and tools.

description_harvester is designed to be a drop-in replacement for the ArcLight Traject indexer. It also includes a plugin that attempts to recognize IIIF manifests included as file versions and uses those manifests to fully index digital objects from digital repositories and other sources, including item-level metadata fields, embedded text, OCR text, and transcriptions.

This is still a bit drafty: it has only been tested on ASpace v2.8.0, and it needs tests and better error handling. Validation is also very minimal, but there is potential to add detailed validation with jsonschema.

Installation

pip install description_harvester

First, you need to configure ArchivesSnake by creating a ~/.archivessnake.yml file with your API credentials, as detailed in the ArchivesSnake configuration docs.

Next, you also need a ~/.description_harvester/config.yml file that lists your Solr URL and the core you want to index to. These can also be overridden with args. description_harvester reads your config.yml as UTF-8, so if you're creating this file in a Windows environment, make sure it's saved as UTF-8.

solr_url: http://127.0.0.1:8983/solr
solr_core: blacklight-core
last_query: 0
cache_expiration: 3600
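Because config.yml is read as UTF-8, one way to avoid encoding surprises on Windows is to write the file from Python with an explicit encoding. This is only a minimal sketch: write_config is a hypothetical helper, not part of description_harvester, and it writes just the four keys shown above to whatever path you give it.

```python
from pathlib import Path


def write_config(path, solr_url, solr_core, cache_expiration=3600):
    """Write a minimal config.yml as UTF-8 (hypothetical helper, not part of the package)."""
    lines = [
        f"solr_url: {solr_url}",
        f"solr_core: {solr_core}",
        "last_query: 0",
        f"cache_expiration: {cache_expiration}",
    ]
    path = Path(path)
    # Create the parent directory (e.g. ~/.description_harvester) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # An explicit encoding avoids platform defaults such as cp1252 on Windows;
    # newline="\n" keeps line endings identical on every platform.
    with path.open("w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")
    return path
```

For example, write_config(Path.home() / ".description_harvester" / "config.yml", "http://127.0.0.1:8983/solr", "blacklight-core") would produce the file shown above.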

Adding custom digital object metadata

You can also add custom digital object metadata fields by adding them to your config.yml under the Solr suffix you would like them to be indexed as. These fields must match metadata fields in your IIIF manifests.

metadata:
- ssi:
  - date_uploaded
- ssm:
  - date_digitized
  - extent
- ssim:
  - legacy_id
  - resource_type
  - coverage
  - preservation_package
  - creator
  - contributor
  - preservation_format
  - source
- tesm:
  - processing_activity
- tesim:
  - description
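To make the structure of this setting concrete, the sketch below walks a parsed copy of the metadata list and pairs each manifest field with a Solr field name. The naming scheme is an assumption: it joins field and suffix with an underscore, following the usual Blacklight dynamic-field convention, and the source does not confirm that description_harvester builds names this way. solr_field_names is a hypothetical helper.

```python
# Same shape as the `metadata` section of config.yml after YAML parsing:
# a list of one-key mappings from Solr suffix to manifest field names.
metadata = [
    {"ssi": ["date_uploaded"]},
    {"ssm": ["date_digitized", "extent"]},
    {"tesim": ["description"]},
]


def solr_field_names(metadata_config):
    """Map each manifest field to a Solr field, ASSUMING '<field>_<suffix>' naming."""
    names = {}
    for group in metadata_config:
        for suffix, fields in group.items():
            for field in fields:
                names[field] = f"{field}_{suffix}"
    return names


for manifest_field, solr_field in solr_field_names(metadata).items():
    print(f"{manifest_field} -> {solr_field}")
```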

Repositories

By default, when reading from ArchivesSpace, description_harvester will use the repository name stored there.

To enable the --repo argument, place a copy of your ArcLight repositories.yml file at ~/.description_harvester/repositories.yml. You can then use harvest --id mss001 --repo slug to index using the slug from repositories.yml. This will override the ArchivesSpace repository name.

There is also the option to customize this with a plugin.

Encoding note: While ArcLight does not explicitly read repositories.yml as utf-8, its Rails stack means that you're likely reading it in a utf-8 (non-Windows) environment. Since description_harvester enables you to index from a Windows machine, it expects your ~/.description_harvester/repositories.yml file to be utf-8.

Indexing from ArchivesSpace API to ArcLight

Once description_harvester is set up, you can index from the ASpace API to ArcLight using the harvest command.

Index by id_0

You can provide one or more IDs to index using a resource's id_0 field:

harvest --id ua807

harvest --id mss123 apap106

Index by URI

You can also use the integers from ASpace resource URIs, such as 263 for https://my.aspace.edu/resources/263:

harvest --uri 435

harvest --uri 1 755
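If you are starting from full resource URLs rather than bare numbers, the integer can be pulled off the end of the path. The helper below is a hypothetical convenience for scripts, not part of description_harvester.

```python
from urllib.parse import urlparse


def resource_number(uri):
    """Return the trailing integer of an ArchivesSpace resource URI."""
    path = urlparse(uri).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    if not last_segment.isdigit():
        raise ValueError(f"no numeric resource ID at the end of {uri!r}")
    return int(last_segment)


print(resource_number("https://my.aspace.edu/resources/263"))
```

The number it returns is the value you would pass to harvest --uri.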

Indexing by modified time

Index collections modified in the past hour: harvest --hour

Index collections modified in the past day: harvest --today

Index collections modified since last run: harvest --updated

Index collections not already in the index: harvest --new
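As a rough illustration of what these time windows mean, the sketch below turns "past hour" and "past day" into epoch-second cutoffs. It assumes timestamps are compared as seconds since the epoch, which is how the ArchivesSpace API's modified_since parameter works; how description_harvester computes its cutoffs internally, and whether last_query in config.yml holds such a timestamp, are not confirmed by the source. modified_since_cutoff is a hypothetical name.

```python
import time

HOUR = 60 * 60
DAY = 24 * HOUR


def modified_since_cutoff(window_seconds, now=None):
    """Return the epoch-second cutoff for 'modified within the last window_seconds'."""
    now = int(time.time()) if now is None else int(now)
    return now - window_seconds


# Cutoffs comparable to --hour and --today (assumed semantics, see above).
print(modified_since_cutoff(HOUR))
print(modified_since_cutoff(DAY))
```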

Indexing from EAD 2002

harvest --ead path/to/ead.xml

You can also give it a directory, and it will harvest all *.xml files in it:

harvest --ead path/to/ead_files
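To preview which files a directory argument would pick up, a small script can apply the same *.xml pattern. This is a sketch under assumptions: it does not recurse into subdirectories and it sorts the results, neither of which is stated above. ead_files is a hypothetical helper.

```python
from pathlib import Path


def ead_files(path):
    """Return the file itself, or the *.xml files directly inside a directory."""
    path = Path(path)
    if path.is_dir():
        # Non-recursive and sorted; whether the tool recurses or sorts is not stated.
        return sorted(path.glob("*.xml"))
    return [path]
```

Calling ead_files("path/to/ead_files") lists the candidates before you run the harvest.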

Verbose output

harvest --id ger071 -v

Caching

description_harvester will cache collections from the ArchivesSpace API, storing them by default to ~/.description_harvester/cache after they are converted to the description model. Cache time is set in seconds as cache_expiration in ~/.description_harvester/config.yml. Thus, cache_expiration: 3600 will use the cached data instead of the ArchivesSpace API for data less than 1 hour old.

You can override the cache path in config or turn caching off globally with cache_dir: false.

cache_dir: "~/path/to/my_cache"
cache_dir: "C:/Users/username/my_cache"
cache_dir: false
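The expiration rule can be read as a simple age check. The sketch below is an assumption about the logic, not the package's actual code: it treats a cached collection as usable while its age in seconds is strictly less than cache_expiration. cache_is_fresh is a hypothetical name.

```python
import time


def cache_is_fresh(cached_at, cache_expiration, now=None):
    """True if data cached at `cached_at` (epoch seconds) is younger than `cache_expiration` seconds."""
    now = time.time() if now is None else now
    return (now - cached_at) < cache_expiration


# With cache_expiration: 3600, data cached 30 minutes ago would still be used,
# while data cached 2 hours ago would be fetched from the API again.
now = time.time()
print(cache_is_fresh(now - 1800, 3600, now=now))
print(cache_is_fresh(now - 7200, 3600, now=now))
```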

Deleting collections

You can delete one or more collections using the --delete argument. This uses the Solr document ID, such as apap106 for https://my.ArcLight.edu/catalog/apap106.

harvest --delete apap101 apap301

Plugins

Local implementations may have to override some description_harvester logic. Indexing digital objects from local systems may be a common use case.

To create a plugin, create a plugin directory, either at ~/.description_harvester or a path you pass with a DESCRIPTION_HARVESTER_PLUGIN_DIR environment variable.

Use the example default.py and make a copy in your plugin directory.

Use custom_repository() to customize how repository names are set. This has access to an ArchivesSpace resource API object.

Use read_data() to customize DigitalObject objects.

The plugin importer first imports plugins from within the package, then looks in ~/.description_harvester, and finally looks in the DESCRIPTION_HARVESTER_PLUGIN_DIR path.
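The lookup order for user-supplied plugins can be sketched as follows. This only models the two directories outside the package (the packaged plugins come first) and is not the importer's real code; plugin_search_dirs is a hypothetical helper.

```python
import os
from pathlib import Path


def plugin_search_dirs(environ=None, home=None):
    """Return user plugin directories in the order described above."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    dirs = [home / ".description_harvester"]
    custom = environ.get("DESCRIPTION_HARVESTER_PLUGIN_DIR")
    if custom:
        # The environment variable is consulted last.
        dirs.append(Path(custom))
    return dirs


for directory in plugin_search_dirs():
    print(directory)
```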

Use as a library

You can also use description_harvester in a script:

from description_harvester import harvest

harvest(["--id", "myid001"])
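Since harvest() takes a list of command-line style arguments, a script may want to assemble that list from its own data. The helper below is a hypothetical sketch that uses only the --id, --repo, and -v flags documented above; it assumes harvest() accepts them in a list the same way the command line does.

```python
def build_harvest_args(ids, repo=None, verbose=False):
    """Build an argument list for harvest() from the flags documented above."""
    if not ids:
        raise ValueError("at least one ID is required")
    args = ["--id", *ids]
    if repo:
        args += ["--repo", repo]
    if verbose:
        args.append("-v")
    return args


print(build_harvest_args(["mss123", "apap106"], verbose=True))
```

The resulting list could then be handed to harvest() in place of the literal list in the example above.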

Release history

0.10.0

Feb 17, 2026

0.9.3

Feb 9, 2026

0.9.2

Jan 30, 2026

0.9.1

Jan 30, 2026

0.9.0

Jan 30, 2026

0.8.3

Jan 30, 2026

0.8.2

Jan 14, 2026

0.8.1

Dec 9, 2025

0.8.0

Dec 9, 2025

0.7.2

Dec 8, 2025

This version

0.7.1

Dec 5, 2025

0.7.0

Dec 4, 2025

0.6.0

Nov 6, 2025

0.5.2

Oct 27, 2025

0.5.1

Oct 24, 2025

0.5.0

Oct 22, 2025

0.4.2

Jun 20, 2025

0.4.1

May 27, 2025

0.4.0

Apr 28, 2025

0.3.12

Apr 25, 2025

0.3.11

Apr 10, 2025

0.3.10

Apr 10, 2025

0.3.9

Apr 8, 2025

0.3.8

Apr 7, 2025

0.3.7

Apr 4, 2025

0.3.6

Apr 1, 2025

0.3.5

Mar 31, 2025

0.3.4

Mar 28, 2025

0.3.3

Mar 28, 2025

0.3.2

Mar 28, 2025

0.3.1

Mar 27, 2025

0.3.0

Mar 21, 2025

0.2.1

Mar 21, 2025

0.2.0

Mar 19, 2025

0.1.1

Mar 19, 2025

0.1.0

Mar 19, 2025

0.0.5

Aug 22, 2024

Download files

Source Distribution

description_harvester-0.7.1.tar.gz (53.1 kB)

Uploaded Dec 5, 2025 Source

Built Distribution

description_harvester-0.7.1-py3-none-any.whl (58.4 kB)

Uploaded Dec 5, 2025 Python 3

File details

Details for the file description_harvester-0.7.1.tar.gz.

File metadata

Download URL: description_harvester-0.7.1.tar.gz
Upload date: Dec 5, 2025
Size: 53.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for description_harvester-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`7381eabc1af5bb40bd76c0d9c1b769a5ccc3221a635152121f87b34afd86ffa3`
MD5	`958675ace32d75e188ba910ab043cee6`
BLAKE2b-256	`62606dc147423be538bc0bac101cf108a4d63a37e256020c50a379a18cb906c4`
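A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch; sha256_hex is a hypothetical helper, and the file name in the comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_hex(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for description_harvester-0.7.1.tar.gz (copied from the table above).
EXPECTED = "7381eabc1af5bb40bd76c0d9c1b769a5ccc3221a635152121f87b34afd86ffa3"

# After downloading, compare:
#   sha256_hex("description_harvester-0.7.1.tar.gz") == EXPECTED
```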

File details

Details for the file description_harvester-0.7.1-py3-none-any.whl.

File metadata

Download URL: description_harvester-0.7.1-py3-none-any.whl
Upload date: Dec 5, 2025
Size: 58.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for description_harvester-0.7.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`95045814d8bdbef2d944f273f64557e8019ea61f5fca614c4f17ee8333af7f42`
MD5	`9ce61dc706e2f9bfaa998812786b0c22`
BLAKE2b-256	`b7e68a3e49fc9cbee0a2b0536d1470cb74ddab3fff0129d23948554901ce5e7a`
