HErbarium Specimen sheet PIpeline
Hespi takes images of specimen sheets from herbaria and first detects the various components of the sheet. These components include:
small database label
handwritten data
stamp
annotation label
scale
swing tag
full database label
database label
swatch
institutional label
number
Then it takes any institutional label and detects the following fields from it:
family
genus
species
infrasp_taxon
authority
collector_number
collector
locality
geolocation
year
month
day
These text fields are then run through the OCR program Tesseract.
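As a rough illustration of the data involved, the twelve fields listed above can be held in a plain Python record. This is only a sketch: the names `LABEL_FIELDS`, `empty_label_record` and `fill_record` are hypothetical helpers invented for this example and are not part of hespi's API.

```python
from typing import Dict, Optional

# Field names copied from the list above. The container and the helper
# functions are hypothetical and are not provided by hespi.
LABEL_FIELDS = (
    "family", "genus", "species", "infrasp_taxon", "authority",
    "collector_number", "collector", "locality", "geolocation",
    "year", "month", "day",
)


def empty_label_record() -> Dict[str, Optional[str]]:
    """Return a record with every field present and set to None."""
    return {field: None for field in LABEL_FIELDS}


def fill_record(recognised: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Merge recognised text into a full record, rejecting unknown field names."""
    unknown = set(recognised) - set(LABEL_FIELDS)
    if unknown:
        raise KeyError(f"unknown field(s): {sorted(unknown)}")
    record = empty_label_record()
    record.update(recognised)
    return record
```

Fields that were not detected on a label stay as `None`, which keeps every record the same shape.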
Installation
Install hespi using pip:
pip install hespi
The first time hespi runs, it downloads the required model weights from the internet.
Installing Tesseract as well is recommended, so that it can be used in the text recognition part of the pipeline.
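A quick way to check whether Tesseract is installed is to look for its executable on the PATH. The sketch below assumes the executable is named `tesseract`, which is its usual name. How hespi itself locates Tesseract is not stated here, so treat this as an approximate check only.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    # Assumption: the Tesseract binary is called "tesseract".
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH.")
    else:
        print("Tesseract not found on PATH.")
```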
Usage
To run the pipeline, use the executable hespi and give it any number of images:
hespi image1.jpg image2.jpg
This will prompt you to specify an output directory. You can also set the output directory on the command line with the --output-dir argument:
hespi images/*.tif --output-dir ./hespi-output
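The same command can be launched from a Python script, for example when processing batches. The sketch below only wraps the command line shown above with `subprocess`. The helper names `build_hespi_command` and `run_hespi` are hypothetical, and it assumes the `hespi` executable is on the PATH.

```python
import subprocess
from typing import Iterable, List


def build_hespi_command(images: Iterable[str], output_dir: str) -> List[str]:
    """Build the argument list for the hespi command line shown above."""
    image_args = [str(path) for path in images]
    if not image_args:
        raise ValueError("at least one image is required")
    return ["hespi", *image_args, "--output-dir", str(output_dir)]


def run_hespi(images: Iterable[str], output_dir: str) -> subprocess.CompletedProcess:
    """Run hespi as a subprocess; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_hespi_command(images, output_dir), check=True)
```

Because the arguments are passed as a list, no shell is involved and a pattern such as `images/*.tif` is not expanded. Expand it in Python first, for example with `sorted(str(p) for p in pathlib.Path("images").glob("*.tif"))`, and pass the result to `run_hespi`.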
The detected components and text fields will be cropped and stored in the output directory. There will also be a CSV file with the text recognition results for any institutional labels found.
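The CSV results can be loaded with Python's standard `csv` module. The file name and column names of the CSV are not given here, so this sketch simply reads every `.csv` file found under the output directory. It assumes UTF-8 encoding and a header row, and `read_csv_results` is a hypothetical helper name.

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_csv_results(output_dir: str) -> Dict[str, List[dict]]:
    """Read every CSV file under output_dir into a list of row dictionaries.

    Keys of the returned dict are file paths relative to output_dir.
    """
    root = Path(output_dir)
    results: Dict[str, List[dict]] = {}
    for csv_path in sorted(root.rglob("*.csv")):
        # Assumption: the file is UTF-8 encoded and starts with a header row.
        with csv_path.open(newline="", encoding="utf-8") as handle:
            results[str(csv_path.relative_to(root))] = list(csv.DictReader(handle))
    return results
```

For example, `read_csv_results("./hespi-output")` returns a dictionary with one entry per CSV file, each holding that file's rows keyed by column name.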
Credits
Robert Turnbull, Karen Thompson, Emily Fitzgerald, Jo Birch.
Publication and citation details to follow.
This pipeline depends on YOLOv5, torchapp, and Microsoft’s TrOCR.
Logo derived from artwork by ka reemov.