🌳 GBIF Image Downloader
This project automatically downloads taxon-specific images from the GBIF API (https://techdocs.gbif.org/en/openapi/), processes them, and stores both images and metadata in a taxonomically organized structure in a MinIO (https://www.min.io/) bucket.
Features
- Loads Latin taxon names from `.csv` or `.xlsx` files
- Resolves `taxonKeys` automatically via the GBIF API
- Downloads associated media (images) from GBIF
- Stores metadata and images in a taxonomic folder structure in MinIO
- Optionally processes only new GBIF occurrences (`crawl_new_entries`)
- Uses multithreading for parallel processing and uploads
- Saves log files to a persistent volume
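To make the "taxonomic folder structure" concrete, here is a minimal sketch of how an object prefix could be derived from the rank fields of a GBIF occurrence record. The function name `taxonomic_prefix`, the rank order, and the underscore replacement are assumptions for illustration; the layout the project actually writes may differ.

```python
from pathlib import PurePosixPath

# Rank fields found on GBIF occurrence records, from broadest to narrowest.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def taxonomic_prefix(record: dict, root: str = "gbif") -> str:
    """Build a folder-like object prefix from the ranks present in a record.

    Hypothetical helper: ranks missing from the record are simply skipped.
    """
    parts = [str(record[rank]).replace(" ", "_") for rank in RANKS if record.get(rank)]
    return str(PurePosixPath(root, *parts))


record = {
    "kingdom": "Plantae",
    "family": "Fagaceae",
    "genus": "Quercus",
    "species": "Quercus robur",
}
print(taxonomic_prefix(record))  # gbif/Plantae/Fagaceae/Quercus/Quercus_robur
```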
Project Structure
├── config/
│   └── config.yaml                   # Central configuration (bucket, paths, etc.)
├── data/
│   ├── species_key.csv               # Output: species list with GBIF speciesKeys
│   └── tree_list.xlsx                # Input: original species list
├── src/
│   └── anhaltai/
│       ├── gbif_downloader/
│       │   ├── crawler/
│       │   │   ├── __init__.py       # Package initialization
│       │   │   └── base_crawler.py   # Base logic for crawling occurrences
│       │   ├── __init__.py           # Package initialization
│       │   ├── config.py             # Loads global configuration
│       │   ├── config_loader.py      # Loads configuration from YAML
│       │   ├── downloader.py         # Download & upload of occurrences and media
│       │   ├── local_log_handler.py  # Log handler that writes logs to MinIO
│       │   ├── main.py               # Entry point, orchestrates all steps
│       │   ├── tree_list_processor.py # Processes taxon lists, resolves taxonKeys
│       │   └── utils.py              # Utility functions (hashing, upload, etc.)
│       └── __init__.py               # Package initialization
│
├── .dockerignore                     # Files to ignore in Docker build
├── .env                              # MinIO credentials (not in repo)
├── .env-example                      # Example MinIO credentials format
├── .gitattributes                    # Git attributes for large file systems
├── .gitignore                        # Files to ignore in git
├── .gitlab-ci.yml                    # GitLab CI/CD configuration
├── Dockerfile                        # Container build
├── LICENSE                           # License information
├── pyproject.toml                    # Python project configuration
├── README.md                         # Project documentation
├── requirements.txt                  # Python dependencies
└── sonar-project.properties          # SonarQube configuration
Usage
Installation
Install dependencies via:
pip install -r requirements.txt
1. Prepare your input file
Create a .csv or .xlsx file with at least the following column:
| latin_name |
|---|
| Quercus robur |
| Fagus sylvatica |
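If you prefer to generate the input file from code, the following sketch writes a one-column CSV with the required `latin_name` header using only the standard library. The helper name `write_taxon_list` and the target path `data/tree_list.csv` are illustrative assumptions, not part of the package.

```python
import csv
from pathlib import Path


def write_taxon_list(path: str, latin_names: list[str]) -> None:
    """Write a one-column CSV whose header is the required latin_name column."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["latin_name"])
        writer.writerows([name] for name in latin_names)


if __name__ == "__main__":
    # Hypothetical output path; point tree_list_input_path at whatever you choose.
    write_taxon_list("data/tree_list.csv", ["Quercus robur", "Fagus sylvatica"])
```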
2. Adjust your configuration
Edit the file config/config.yaml to set your MinIO connection, output paths, and processing options.
A typical configuration looks like this:
minio:
  bucket: meinewaldki-gbif                        # Name of your MinIO bucket
  endpoint: 10.144.46.54:9000                     # MinIO/S3 endpoint (host:port)
  secure: false                                   # Use HTTPS (true/false)
paths:
  output: gbif/                                   # Output directory for images and metadata
  tree_list_input_path: data/tree_list.xlsx       # Path to your input taxon list
  processed_tree_list_path: data/species_key.csv  # Path for the processed taxonKey list
  log_dir: logs/                                  # Directory for log files
query_params:
  mediaType: StillImage                           # Only download images
  limit: 100                                      # Number of records per API call
  offset: 0                                       # Start offset
options:
  already_preprocessed: True                      # Set False to process the taxon list again
  crawl_new_entries: False                        # Only process new occurrences if True
  max_threads: 300                                # Number of parallel threads for downloads/uploads
  max_pool_size: 50                               # Max connections in the MinIO connection pool
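As a rough illustration of how the `options` section could be consumed, the sketch below maps an already parsed configuration dictionary (for example the result of PyYAML's `yaml.safe_load`) onto a typed object. The `Options` dataclass and `parse_options` function are hypothetical; the package's own `config_loader.py` may work differently.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Options:
    # Defaults mirror the example configuration above.
    already_preprocessed: bool = True
    crawl_new_entries: bool = False
    max_threads: int = 300
    max_pool_size: int = 50


def parse_options(config: dict) -> Options:
    """Read the options section, rejecting unknown keys and non-positive sizes."""
    raw = config.get("options", {})
    known = {field.name for field in fields(Options)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"Unknown option(s): {', '.join(unknown)}")
    options = Options(**raw)
    if options.max_threads < 1 or options.max_pool_size < 1:
        raise ValueError("max_threads and max_pool_size must be positive")
    return options


# The dict stands in for a parsed config.yaml.
config = {"options": {"crawl_new_entries": True, "max_threads": 32}}
print(parse_options(config))
```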
Query Parameters for GBIF API URL
The parameters used to build the GBIF API request URL are defined in the query_params section of your
config/config.yaml. These parameters control which records are fetched from the GBIF API.
Supported parameters:
- `mediaType` (e.g. `StillImage`): Only download records with images.
- `taxonKey`: The taxon key.
- `datasetKey`: Filter by dataset.
- `country`: Filter by country code (e.g. `DE` for Germany).
- `hasCoordinate`: Only records with coordinates (`true` or `false`).
- `year`, `month`: Filter by year or month of occurrence.
- `basisOfRecord`: Type of record (e.g. `HUMAN_OBSERVATION`).
- `recordedBy`: Filter by collector/observer.
- `institutionCode`, `collectionCode`: Filter by institution or collection.
- `limit`: Number of records per API call (pagination, max. 300).
- `offset`: Start offset for pagination.
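The sketch below shows one way these parameters could be turned into a request URL for the GBIF occurrence search endpoint. The helper name `build_search_url` and the taxon key `1234567` are placeholders for illustration, and the lowercase conversion of booleans is an assumption about how values such as `hasCoordinate` should be serialized.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(taxon_key: int, query_params: dict) -> str:
    """Combine the configured filters with a taxonKey into one request URL."""
    params = {**query_params, "taxonKey": taxon_key}
    # Python renders booleans as "True"/"False"; send them in lowercase instead.
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(encoded)}"


query_params = {"mediaType": "StillImage", "limit": 100, "offset": 0}
print(build_search_url(1234567, query_params))
# https://api.gbif.org/v1/occurrence/search?mediaType=StillImage&limit=100&offset=0&taxonKey=1234567
```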
How it works:
- All parameters in `query_params` are automatically validated at startup.
- Only the above parameters are allowed. Invalid parameters will cause the program to stop with an error.
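A startup check of this kind can be sketched in a few lines. The names `ALLOWED_PARAMS` and `validate_query_params` are hypothetical, and the upper bound check on `limit` is an added assumption based on the documented maximum of 300; the package's own validation may differ in detail.

```python
# Parameter names taken from the list of supported parameters.
ALLOWED_PARAMS = {
    "mediaType", "taxonKey", "datasetKey", "country", "hasCoordinate",
    "year", "month", "basisOfRecord", "recordedBy",
    "institutionCode", "collectionCode", "limit", "offset",
}
MAX_LIMIT = 300


def validate_query_params(query_params: dict) -> None:
    """Raise ValueError for unsupported parameter names or an oversized limit."""
    invalid = sorted(set(query_params) - ALLOWED_PARAMS)
    if invalid:
        raise ValueError(f"Unsupported query parameter(s): {', '.join(invalid)}")
    limit = query_params.get("limit")
    if limit is not None and int(limit) > MAX_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_LIMIT}, got {limit}")


# Passes silently: every key is supported.
validate_query_params({"mediaType": "StillImage", "limit": 100, "offset": 0})

try:
    validate_query_params({"mediaType": "StillImage", "colour": "green"})
except ValueError as error:
    print(error)  # Unsupported query parameter(s): colour
```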
3. Process the taxon list and resolve taxonKeys
from anhaltai.gbif_downloader.tree_list_processor import TreeListProcessor
processor = TreeListProcessor(input_path="data/tree_list.xlsx",
                              sheet_name="Gehölzarten", taxon="speciesKey")
processor.process_tree_list(output_path="data/species_key.csv")
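For orientation, resolving a Latin name to a key can be done with GBIF's species match endpoint. The sketch below only builds the request URL and reads the key from a response dictionary; whether `TreeListProcessor` uses this endpoint internally is an assumption, the helper names are hypothetical, and the sample response is a made-up fragment with a placeholder key.

```python
from urllib.parse import urlencode

GBIF_SPECIES_MATCH = "https://api.gbif.org/v1/species/match"


def build_match_url(latin_name: str) -> str:
    """Return the species match URL for one Latin name."""
    return f"{GBIF_SPECIES_MATCH}?{urlencode({'name': latin_name})}"


def extract_taxon_key(match: dict, taxon: str = "speciesKey"):
    """Return the requested key from a match response, or None if nothing matched."""
    if match.get("matchType") == "NONE":
        return None
    return match.get(taxon)


# Illustrative response fragment; the key value is a placeholder, not a real GBIF key.
sample = {"usageKey": 1234567, "speciesKey": 1234567, "rank": "SPECIES", "matchType": "EXACT"}

print(build_match_url("Quercus robur"))  # https://api.gbif.org/v1/species/match?name=Quercus+robur
print(extract_taxon_key(sample))         # 1234567
```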
4. Download media and metadata from GBIF
Run the main program:
PYTHONPATH=src python3 src/anhaltai/gbif_downloader/main.py
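Conceptually, fetching all occurrences for a taxon means walking through the result pages with `limit` and `offset`. The sketch below shows that loop, relying on the `results` and `endOfRecords` fields of GBIF's paged responses. The `fetch_page` callable and the `fake_fetch` stand-in are hypothetical, so the example runs without network access; it is not the package's actual crawler.

```python
from typing import Callable, Iterator


def iter_occurrences(fetch_page: Callable[[int, int], dict], limit: int = 100) -> Iterator[dict]:
    """Yield records page by page until the response reports endOfRecords."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        results = page.get("results", [])
        yield from results
        # Stop on the last page, or defensively when a page comes back empty.
        if page.get("endOfRecords", True) or not results:
            break
        offset += limit


def fake_fetch(offset: int, limit: int) -> dict:
    """Stand-in for an HTTP request against the occurrence search endpoint."""
    records = [{"key": n} for n in range(5)]
    chunk = records[offset:offset + limit]
    return {"results": chunk, "endOfRecords": offset + limit >= len(records)}


print([record["key"] for record in iter_occurrences(fake_fetch, limit=2)])  # [0, 1, 2, 3, 4]
```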
Note:
- MinIO credentials must be set in `.env`; see `.env-example` for the required format.
- Log files are automatically saved in the persistent volume `mnt/logs/`.
- Parallel processing and uploads are controlled by a configurable thread limit.
- The program will skip old entries if `crawl_new_entries` is set to `True`.
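The thread limit mentioned above can be pictured with a bounded thread pool from the standard library. This is a minimal sketch under stated assumptions: `process_all` and `fake_upload` are hypothetical names, and `fake_upload` merely stands in for "download an image, then upload it to MinIO".

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(items, worker, max_threads: int = 300):
    """Run worker(item) for every item with at most max_threads running at once."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as error:  # one failed item should not stop the rest
                failures[item] = error
    return results, failures


def fake_upload(url: str) -> int:
    """Hypothetical stand-in for the real download and upload step."""
    if not url.startswith("https://"):
        raise ValueError(f"unexpected URL: {url}")
    return len(url)


urls = ["https://example.org/a.jpg", "ftp://example.org/b.jpg"]
done, failed = process_all(urls, fake_upload, max_threads=4)
print(sorted(done), sorted(failed))
```

Collecting failures instead of raising keeps a long crawl running when individual media URLs are broken; the errors can then be logged in one place.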