This project automatically downloads taxon-specific images from the GBIF API (https://techdocs.gbif.org/en/openapi/), processes them, and stores both images and metadata in a taxonomically organized structure in a MinIO (https://www.min.io/) bucket.

🌳 GBIF Image Downloader


Features

  • Loads Latin taxon names from .csv or .xlsx files
  • Resolves taxonKeys automatically via the GBIF API
  • Downloads associated media (images) from GBIF
  • Stores metadata and images in a taxonomic folder structure in MinIO
  • Optionally processes only new GBIF occurrences (crawl_new_entries)
  • Multithreading for parallel processing and uploads
  • Logging directly to MinIO

Usage

Installation

Install dependencies via:

pip install -r requirements.txt

1. Prepare your input file

Create a .csv or .xlsx file with at least the following column:

latin_name
Quercus robur
Fagus sylvatica
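
If the taxon list is generated rather than written by hand, a minimal sketch using Python's standard csv module could look like the following. Only the latin_name column comes from the description above; the function name write_taxon_list and the output file name are assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_taxon_list(names, path):
    """Write Latin taxon names to a CSV file with a single latin_name column."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["latin_name"])  # header column named in this README
        for name in names:
            writer.writerow([name.strip()])  # drop stray surrounding whitespace
    return path


if __name__ == "__main__":
    # Hypothetical output file; point tree_list_input_path in the config at it.
    write_taxon_list(["Quercus robur", "Fagus sylvatica"], "tree_list.csv")
```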

2. Adjust your configuration

Edit the file config/config.yaml to set your MinIO connection, output paths, and processing options.
A typical configuration looks like this:

minio:
  bucket: meinewaldki-gbif         # Name of your MinIO bucket
  endpoint: s3.anhalt.ai           # MinIO/S3 endpoint URL
  secure: true                     # Use HTTPS (true/false)
  cert_check: true                 # Check SSL certificates (true/false)

paths:
  output: gbif-test/               # Output directory for images and metadata
  tree_list_input_path: data/tree_list.xlsx   # Path to your input taxon list
  processed_tree_list_path: data/species_key.csv # Path for the processed taxonKey list
  log_dir: logs/                   # Directory for log files

query_params:
  mediaType: StillImage            # Only download images
  limit: 100                       # Number of records per API call
  offset: 0                        # Start offset

options:
  already_preprocessed: True       # Set False to process the taxon list again
  crawl_new_entries: False         # Only process new occurrences if True
  max_threads: 10                  # Number of parallel threads for downloads/uploads

Query Parameters for GBIF API URL

The parameters used to build the GBIF API request URL are defined in the query_params section of your config/config.yaml. These parameters control which records are fetched from the GBIF API.

Supported parameters:

  • mediaType (e.g. StillImage): Only download records with images.
  • taxonKey: The GBIF taxon key of the taxon whose occurrences are fetched.
  • datasetKey: Filter by dataset.
  • country: Filter by country code (e.g. DE for Germany).
  • hasCoordinate: Filter by whether a record has coordinates (true or false).
  • year, month: Filter by year or month of occurrence.
  • basisOfRecord: Type of record (e.g. HUMAN_OBSERVATION).
  • recordedBy: Filter by collector/observer.
  • institutionCode, collectionCode: Filter by institution or collection.
  • limit: Number of records per API call (pagination, max. 300).
  • offset: Start offset for pagination.
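
To illustrate how these parameters could end up in a request URL, here is a hedged sketch that targets the public GBIF occurrence search endpoint and pages through results with limit and offset. Whether the project builds its URLs exactly this way is an assumption; the function names and the taxon key 123 in the usage note are placeholders, and the fallback limit of 100 simply mirrors the example configuration.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint. That the project calls exactly this
# URL is an assumption made for this sketch.
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(taxon_key, query_params):
    """Combine a taxonKey with the configured query parameters into one URL."""
    params = dict(query_params)
    params["taxonKey"] = taxon_key
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


def page_urls(taxon_key, query_params, total_records):
    """Yield one URL per page by advancing offset in steps of limit."""
    limit = int(query_params.get("limit", 100))  # 100 as in the example config
    start = int(query_params.get("offset", 0))
    for offset in range(start, total_records, limit):
        page_params = {**query_params, "limit": limit, "offset": offset}
        yield build_search_url(taxon_key, page_params)
```

For example, list(page_urls(123, {"mediaType": "StillImage", "limit": 100, "offset": 0}, 250)) produces three URLs with offsets 0, 100, and 200.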

How it works:

  • All parameters in query_params are automatically validated at startup.
  • Only the above parameters are allowed. Invalid parameters will cause the program to stop with an error.
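
The startup check described above can be sketched roughly as follows. This is not the project's actual validation code: the names ALLOWED_QUERY_PARAMS and validate_query_params are hypothetical, and enforcing the upper bound on limit at this stage is an assumption based on the documented maximum of 300.

```python
# Parameter names taken from the "Supported parameters" list in this README.
ALLOWED_QUERY_PARAMS = {
    "mediaType", "taxonKey", "datasetKey", "country", "hasCoordinate",
    "year", "month", "basisOfRecord", "recordedBy", "institutionCode",
    "collectionCode", "limit", "offset",
}
MAX_LIMIT = 300  # documented maximum number of records per API call


def validate_query_params(query_params):
    """Return a copy of query_params, or raise ValueError if it is invalid."""
    unknown = sorted(set(query_params) - ALLOWED_QUERY_PARAMS)
    if unknown:
        raise ValueError(f"Unsupported query parameter(s): {', '.join(unknown)}")
    limit = query_params.get("limit")
    if limit is not None and int(limit) > MAX_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_LIMIT}, got {limit}")
    return dict(query_params)
```

Raising an exception on the first invalid key matches the described behavior of stopping with an error instead of silently ignoring a misspelled parameter.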

3. Process the taxon list and resolve taxonKeys

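from anhaltai.gbif_downloader.tree_list_processor import TreeListProcessor

# Read the taxon list from the given sheet; taxon selects the key type to resolve.
processor = TreeListProcessor(input_path="data/tree_list.xlsx",
                              sheet_name="Gehölzarten", taxon="speciesKey")
# Write the resolved keys to the file configured as processed_tree_list_path.
processor.process_tree_list(output_path="data/species_key.csv")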
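
For readers curious how a Latin name can be turned into a key, the sketch below uses the public GBIF species match endpoint with only the standard library. It is an illustration under assumptions, not the internals of TreeListProcessor: the helper names are hypothetical, and the README does not state which endpoint the project actually uses.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GBIF name-matching endpoint; its use by the project is an assumption.
GBIF_SPECIES_MATCH = "https://api.gbif.org/v1/species/match"


def match_url(latin_name):
    """Build the match request URL for one Latin name."""
    return f"{GBIF_SPECIES_MATCH}?{urlencode({'name': latin_name})}"


def extract_key(match_response, taxon="speciesKey"):
    """Return the requested key from a match response, or None if unmatched."""
    if match_response.get("matchType") == "NONE":
        return None
    return match_response.get(taxon)


def resolve_taxon_key(latin_name, taxon="speciesKey", timeout=30):
    """Query GBIF for latin_name and return the requested key (needs network)."""
    with urlopen(match_url(latin_name), timeout=timeout) as response:
        return extract_key(json.load(response), taxon)
```

Keeping URL construction and response parsing in separate functions lets both be checked without network access.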

4. Download media and metadata from GBIF

Run the main program:

PYTHONPATH=src python3 src/gbif_extractor/main.py

Note:

  • MinIO credentials must be set in .env; see .env-example for the required format.
  • Log files are automatically uploaded to MinIO.
  • Parallel processing and uploads are limited by the configurable max_threads option.
  • Semaphores control the number of concurrent threads during uploads to MinIO.
  • If crawl_new_entries is set to True, the program skips old entries and processes only new GBIF occurrences.
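
The semaphore-based limit mentioned in the notes can be sketched like this. It is a generic illustration of the technique, not the project's upload code: run_limited and its arguments are hypothetical names, and the tasks stand in for whatever callables perform the real uploads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_limited(tasks, max_threads=10):
    """Run callables in parallel, with at most max_threads active at once."""
    slots = threading.BoundedSemaphore(max_threads)

    def guarded(task):
        with slots:  # waits while max_threads tasks are already running
            return task()

    # The pool is deliberately larger than the limit so that the semaphore,
    # not the pool size, is what bounds concurrency in this sketch.
    with ThreadPoolExecutor(max_workers=max_threads * 2) as pool:
        return list(pool.map(guarded, tasks))
```

Results come back in the same order as the tasks were given, because pool.map preserves input order.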
