This project automatically downloads taxon-specific images from the GBIF API (https://techdocs.gbif.org/en/openapi/), processes them, and stores both images and metadata in a taxonomically organized structure in a MinIO (https://www.min.io/) bucket.

🌳 GBIF Image Downloader

Features

Loads Latin taxon names from .csv or .xlsx files
Resolves taxonKeys automatically via the GBIF API
Downloads associated media (images) from GBIF
Stores metadata and images in a taxonomic folder structure in MinIO
Optionally processes only new GBIF occurrences (crawl_new_entries)
Multithreading for parallel processing and uploads
Saves log files to a persistent volume
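
To illustrate what a taxonomic folder structure can look like, here is a minimal sketch that derives an object key from the rank fields of an occurrence record. The rank order, the `object_key` helper, and the file name are assumptions made for this example; the layout the project actually writes may differ.

```python
from pathlib import PurePosixPath

# Assumed rank order for the folder hierarchy (hypothetical, not taken from the project).
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def object_key(prefix, record, filename):
    """Build a bucket object key from the ranks present in `record`."""
    # Skip ranks that are missing and neutralise slashes inside names.
    parts = [str(record[rank]).replace("/", "_") for rank in RANKS if record.get(rank)]
    return str(PurePosixPath(prefix, *parts, filename))


record = {
    "kingdom": "Plantae",
    "family": "Fagaceae",
    "genus": "Quercus",
    "species": "Quercus robur",
}
key = object_key("gbif/", record, "example.jpg")
print(key)
```

Missing ranks are simply left out of the path, so incomplete records still get a valid key.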

Project Structure

├── config/
│   └── config.yaml                    # Central configuration (bucket, paths, etc.)
├── data/
│   ├── species_key.csv                # Output: species list with GBIF speciesKeys
│   └── tree_list.xlsx                 # Input: original species list
├── src/
│   └── anhaltai/
│       ├── gbif_downloader/
│       │   ├── crawler/
│       │   │   ├── __init__.py        # Package initialization
│       │   │   └── base_crawler.py    # Base logic for crawling occurrences
│       │   ├── __init__.py            # Package initialization
│       │   ├── config.py              # Loads global configuration
│       │   ├── config_loader.py       # Loads configuration from YAML
│       │   ├── downloader.py          # Download & upload of occurrences and media
│       │   ├── local_log_handler.py   # Log handler that writes logs to MinIO
│       │   ├── main.py                # Entry point, orchestrates all steps
│       │   ├── tree_list_processor.py # Processes taxon lists, resolves taxonKeys
│       │   └── utils.py               # Utility functions (hashing, upload, etc.)
│       └──  __init__.py               # Package initialization
│
├── .dockerignore                      # Files to ignore in Docker build
├── .env                               # MinIO credentials (not in repo)
├── .env-example                       # Example MinIO credentials format
├── .gitattributes                     # Git attributes (Git LFS settings)
├── .gitignore                         # Files to ignore in git
├── .gitlab-ci.yml                     # GitLab CI/CD configuration
├── Dockerfile                         # Container build
├── LICENSE                            # License information
├── pyproject.toml                     # Python project configuration
├── README.md                          # Project documentation
├── requirements.txt                   # Python dependencies
└── sonar-project.properties           # SonarQube configuration

Usage

Installation

Install dependencies via:

pip install -r requirements.txt

1. Prepare your input file

Create a .csv or .xlsx file with at least the following column:

latin_name
Quercus robur
Fagus sylvatica
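
If you want to check an input file before running the pipeline, a small sketch like the following reads the latin_name column from CSV data. The `read_latin_names` helper is hypothetical and only uses the standard library; it is not part of the package, and .xlsx files would need a separate reader.

```python
import csv
import io


def read_latin_names(handle):
    """Return the non-empty values of the `latin_name` column, in file order."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "latin_name" not in reader.fieldnames:
        raise ValueError("input file needs a 'latin_name' column")
    names = []
    for row in reader:
        value = (row.get("latin_name") or "").strip()
        if value:
            names.append(value)
    return names


# In-memory sample that mirrors the example file above.
sample = io.StringIO("latin_name\nQuercus robur\nFagus sylvatica\n")
names = read_latin_names(sample)
print(names)
```

For a real file, pass an open file object instead, for example `open("data/tree_list.csv", newline="", encoding="utf-8")` (the path is a placeholder).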

2. Adjust your configuration

Edit the file config/config.yaml to set your MinIO connection, output paths, and processing options.
A typical configuration looks like this:

minio:
  bucket: meinewaldki-gbif         # Name of your MinIO bucket
  endpoint: 10.144.46.54:9000           # MinIO/S3 endpoint URL
  secure: false                     # Use HTTPS (true/false)

paths:
  output: gbif/                    # Output directory for images and metadata
  tree_list_input_path: data/tree_list.xlsx      # Path to your input taxon list
  processed_tree_list_path: data/species_key.csv # Path for the processed taxonKey list
  log_dir: logs/                   # Directory for log files

query_params:
  mediaType: StillImage            # Only download images
  limit: 100                       # Number of records per API call
  offset: 0                        # Start offset

options:
  already_preprocessed: True         # Set False to process the taxon list again
  crawl_new_entries: False           # Only process new occurrences if True
  max_threads: 300                   # Number of parallel threads for downloads/uploads
  max_pool_size: 50                  # Max connections in the MinIO connection pool
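
Once the YAML file is parsed, the configuration is a nested dictionary. The sketch below checks such a dictionary for the sections and keys shown in the example above. Treating all of them as required is an assumption made here for illustration, and `missing_settings` is a hypothetical helper, not the project's config loader.

```python
# Sections and keys taken from the example configuration; assumed to be required.
REQUIRED_SETTINGS = {
    "minio": ("bucket", "endpoint", "secure"),
    "paths": ("output", "tree_list_input_path", "processed_tree_list_path", "log_dir"),
    "query_params": (),
    "options": ("already_preprocessed", "crawl_new_entries", "max_threads", "max_pool_size"),
}


def missing_settings(config):
    """Return dotted names of expected settings that are absent from `config`."""
    missing = []
    for section, keys in REQUIRED_SETTINGS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in values)
    return missing


# Parsed configuration with paths.log_dir left out on purpose.
config = {
    "minio": {"bucket": "meinewaldki-gbif", "endpoint": "10.144.46.54:9000", "secure": False},
    "paths": {
        "output": "gbif/",
        "tree_list_input_path": "data/tree_list.xlsx",
        "processed_tree_list_path": "data/species_key.csv",
    },
    "query_params": {"mediaType": "StillImage", "limit": 100, "offset": 0},
    "options": {
        "already_preprocessed": True,
        "crawl_new_entries": False,
        "max_threads": 300,
        "max_pool_size": 50,
    },
}
print(missing_settings(config))
```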

Query Parameters for GBIF API URL

The parameters used to build the GBIF API request URL are defined in the query_params section of your config/config.yaml. These parameters control which records are fetched from the GBIF API.

Supported parameters:

mediaType (e.g. StillImage): Only download records with images.
taxonKey: The taxon key.
datasetKey: Filter by dataset.
country: Filter by country code (e.g. DE for Germany).
hasCoordinate: Only records with coordinates (true or false).
year, month: Filter by year or month of occurrence.
basisOfRecord: Type of record (e.g. HUMAN_OBSERVATION).
recordedBy: Filter by collector/observer.
institutionCode, collectionCode: Filter by institution or collection.
limit: Number of records per API call (pagination, max. 300).
offset: Start offset for pagination.
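
As a rough sketch of how limit and offset drive pagination, the following generator builds one request URL per page against the GBIF occurrence search endpoint. The `page_urls` helper, the taxonKey value, and the total of 250 records are placeholders for this example; in practice the total comes from the count field of the API response.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"


def page_urls(params, total):
    """Yield one request URL per page until `total` records are covered."""
    limit = params.get("limit", 100)
    offset = params.get("offset", 0)
    while offset < total:
        query = {**params, "limit": limit, "offset": offset}
        yield f"{SEARCH_URL}?{urlencode(query)}"
        offset += limit  # move on to the next page


# taxonKey 1234567 is a placeholder value, not a real taxon.
params = {"taxonKey": 1234567, "mediaType": "StillImage", "limit": 100, "offset": 0}
urls = list(page_urls(params, total=250))
for url in urls:
    print(url)
```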

How it works:

All parameters in query_params are automatically validated at startup.
Only the above parameters are allowed. Invalid parameters will cause the program to stop with an error.
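
The allow-list check described above can be sketched as follows. This is an illustration of the idea, not the project's validation code; `validate_query_params` and the error message are assumptions.

```python
# Parameter names from the "Supported parameters" list.
ALLOWED_QUERY_PARAMS = frozenset({
    "mediaType", "taxonKey", "datasetKey", "country", "hasCoordinate",
    "year", "month", "basisOfRecord", "recordedBy",
    "institutionCode", "collectionCode", "limit", "offset",
})


def validate_query_params(params):
    """Return a copy of `params`, or raise ValueError on unknown names."""
    unknown = sorted(set(params) - ALLOWED_QUERY_PARAMS)
    if unknown:
        raise ValueError("Unsupported query parameter(s): " + ", ".join(unknown))
    return dict(params)


checked = validate_query_params({"mediaType": "StillImage", "limit": 100, "offset": 0})

try:
    validate_query_params({"mediaType": "StillImage", "colour": "green"})
except ValueError as error:
    message = str(error)
    print(message)
```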

3. Process the taxon list and resolve taxonKeys

from anhaltai.gbif_downloader.tree_list_processor import TreeListProcessor

processor = TreeListProcessor(input_path="data/tree_list.xlsx",
                              sheet_name="Gehölzarten", taxon="speciesKey")
processor.process_tree_list(output_path="data/species_key.csv")
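
For orientation, this is roughly what resolving a single name against the GBIF species match endpoint looks like. It is a standalone sketch that assumes the response fields matchType and speciesKey; the helper names are hypothetical, and how TreeListProcessor resolves keys internally may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(latin_name):
    """Return the species match URL for one Latin name."""
    return f"{MATCH_URL}?{urlencode({'name': latin_name})}"


def extract_key(match, field="speciesKey"):
    """Return the requested key, or None when GBIF reports no match."""
    if match.get("matchType") == "NONE":
        return None
    return match.get(field)


def resolve(latin_name, field="speciesKey"):
    """Query GBIF (network access required) and return the key or None."""
    with urlopen(build_match_url(latin_name), timeout=30) as response:
        return extract_key(json.load(response), field)


print(build_match_url("Quercus robur"))
```

Calling `resolve("Quercus robur")` performs the actual request; the URL builder and the key extraction work offline.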

4. Download media and metadata from GBIF

Run the main program:

PYTHONPATH=src python3 src/anhaltai/gbif_downloader/main.py

Note:

MinIO credentials must be set in .env; see .env-example for the required format.
Log files are saved automatically to the persistent volume at mnt/logs/.
Parallel processing and uploads are limited by the configurable thread limit (max_threads).
The program skips old entries when crawl_new_entries is set to True.
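
A thread limit like max_threads is commonly enforced with a bounded thread pool. The sketch below shows that pattern with the standard library; `run_in_parallel` and `fake_download` are hypothetical stand-ins and not the project's downloader.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(task, items, max_threads):
    """Run `task` on every item with at most `max_threads` worker threads."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(task, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # keep going when a single item fails
                errors[item] = exc
    return results, errors


def fake_download(url):
    """Stand-in for a real download; fails for one item on purpose."""
    if url.endswith("broken"):
        raise ValueError(f"cannot fetch {url}")
    return len(url)


results, errors = run_in_parallel(fake_download, ["a.jpg", "bb.jpg", "broken"], max_threads=2)
print(results, list(errors))
```

Collecting failures per item instead of raising immediately lets the remaining downloads finish.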

anhaltai-gbif-downloader 0.1.0a204
