
a crawler script to extract and author metadata of spatial datasets

Project description

pyGeoDataCrawler

The tool crawls a data folder or tree. For each spatial file identified, it processes the file, extracting as much information as possible and storing it in a sidecar metadata file.

The tool can also look for existing metadata using common conventions. For metadata imports the tool uses owslib, which supports a number of metadata formats.

Several options exist for using the results of the generated index:

  • The resulting indexed content can be converted to iso19139 or OGCAPI-records and inserted into an instance of pycsw, geonetwork or pygeoapi, to make it searchable.
  • Automated creation of a mapserver mapfile to provide OGC services on top of the spatial files identified.

Installation

The tool requires GDAL 3.3.2 and pysqlite3 0.4.6 to be installed. We recommend using conda to install them.

conda create --name pgdc python=3.9 
conda activate pgdc
conda install -c conda-forge gdal==3.3.2
conda install -c conda-forge pysqlite3==0.4.6

Then run:

pip install geodatacrawler

Usage

The tools are typically called from the command line or from a bash script.

Index metadata

crawl-metadata --mode=init --dir=/myproject/data [--out-dir=/mnt/myoutput]

Mode explained:

  • init; creates new metadata for files which do not have it yet (does not overwrite)
  • update; updates the metadata, merging new content into existing records (does not create new ones)
  • export; exports the mcf metadata to xml and stores it in a folder (to be loaded into pycsw) or a database (todo)
  • import-csv; imports a csv file of metadata fields into a series of mcf files, typically combined with a .j2 file of the same name, which maps the csv fields to mcf fields
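The mapping step behind import-csv can be pictured as follows. This is a simplified sketch only: the CSV columns and output fields are hypothetical, and the real tool renders rows through a Jinja2 (.j2) template via pygeometa rather than mapping them directly in code.

```python
import csv
import io

# Hypothetical CSV of dataset metadata; in practice this is a file on disk.
CSV_TEXT = """id,title,abstract
soil-ph,Soil pH,Predicted topsoil pH
soil-oc,Soil organic carbon,Predicted organic carbon stock
"""

def rows_to_mcf(csv_text):
    """Map each CSV row to a minimal MCF-like dict (one per output file)."""
    records = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The tool would render these through a .j2 template; here we map directly.
        records[row["id"]] = {
            "mcf": {"version": 1.0},
            "metadata": {"identifier": row["id"]},
            "identification": {"title": row["title"], "abstract": row["abstract"]},
        }
    return records

mcfs = rows_to_mcf(CSV_TEXT)
print(sorted(mcfs))  # → ['soil-oc', 'soil-ph']
```

Each resulting dict corresponds to one mcf file written next to the data.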

The export utility merges each metadata file with any index.yml found in its parent folders. This allows you to create minimal metadata at the detailed level, while providing more generic metadata higher up the tree. The index.yml is also used as the configuration for any mapfile creation (service metadata).
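The inheritance described above amounts to a recursive dictionary merge, where values set at the detailed level win over values inherited from parent folders. A minimal sketch, assuming a simple override-on-conflict merge (the tool's actual merge behaviour may differ in detail):

```python
def merge(parent, child):
    """Recursively merge two metadata dicts; child values override parent values."""
    out = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# index.yml at the root of the tree: generic info shared by every dataset
parent = {"contact": {"pointOfContact": {"organization": "Example Org"}},
          "identification": {"language": "en"}}
# minimal per-file metadata: only the title is set at the detailed level
child = {"identification": {"title": "Roads 2023"}}

merged = merge(parent, child)
print(merged["identification"])  # → {'language': 'en', 'title': 'Roads 2023'}
```

The per-file record keeps its own title while picking up the contact details and language defined once at the root.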

Most parameters are configured from the command line; check --help for an explanation. Two parameters can be set as environment variables:

  • pgdc_host is the url on which the data will be hosted in mapserver or a webdav folder.
  • pgdc_schema_path is a physical path to an override of the default iso19139 schema of pygeometa, containing jinja templates to format the exported xml

Some parameters can be set in index.yml, in a robot section. Note that config is inherited from parent folders.

mcf:
    version: 1.0
robot: 
  skip-subfolders: True # do not move into subfolders, typically if subfolder is a set of tiles, default: False 
  skip-files: "temp.*" # do not process files matching a regexp, default: None 
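How these two settings could drive a crawl is sketched below. This is an illustration only; the tool's actual traversal logic lives in geodatacrawler itself, and the file names here are made up.

```python
import os
import re
import tempfile

def crawl(root, skip_subfolders=False, skip_files=None):
    """Yield file paths under root, honouring skip-subfolders and skip-files."""
    pattern = re.compile(skip_files) if skip_files else None
    for dirpath, dirnames, filenames in os.walk(root):
        if skip_subfolders:
            dirnames.clear()          # do not descend into subfolders
        for name in sorted(filenames):
            if pattern and pattern.match(name):
                continue              # skip files matching the regexp
            yield os.path.join(dirpath, name)

# Build a small throwaway tree to crawl.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "tiles"))
for name in ("roads.gpkg", "temp_scratch.tif", os.path.join("tiles", "t1.tif")):
    open(os.path.join(root, name), "w").close()

found = [os.path.relpath(p, root)
         for p in crawl(root, skip_subfolders=True, skip_files="temp.*")]
print(found)  # → ['roads.gpkg']
```

With both settings active, the tiles subfolder is never entered and the temp file is filtered out.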

OGR/GDAL formats

Some GDAL (raster) or OGR (vector) formats, such as FileGDB, GeoPackage and Parquet, require an additional plugin. Verify for each of the common formats in your organisation that the relevant GDAL plugins are installed.

Create mapfile

The metadata identified can be used to create OGC services exposing the files. Currently the tool creates mapserver mapfiles, which are placed in an output folder. An index.yml configuration file is expected at the root of the folder to be indexed; if absent, it will be created.

crawl-mapfile --dir=/mnt/data [--out-dir=/mnt/mapserver/mapfiles]

Some parameters in the mapfile can be set using environment variables:

  • pgdc_out_dir; a folder where files are placed (can be overridden with --dir-out)
  • pgdc_md_url; a pattern for linking to metadata, with {0} substituted by the record uuid, or empty to not include a metadata link, e.g. https://example.com/{0}
  • pgdc_ms_url; the base url of mapserver, e.g. http://example.com/maps
  • pgdc_webdav_url; the base url on which data files are published, or empty if not published, e.g. http://example.com/data
  • pgdc_md_link_types; which service links to add, e.g. OGC:WMS,OGC:WFS,OGC:WCS,OGCAPI:Features

For example:

export pgdc_webdav_url="https://example.com/data"
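Under these settings, the links written into a record could be assembled roughly as follows. This is a sketch: the variable names mirror the environment variables above, but the exact URL layout produced by the tool is an assumption, not its verified output.

```python
pgdc_md_url = "https://example.com/{0}"   # {0} is replaced by the record uuid
pgdc_ms_url = "http://example.com/maps"   # base url of mapserver
pgdc_md_link_types = "OGC:WMS,OGC:WFS"    # which service links to add

def record_links(uuid, mapfile):
    """Build metadata + service links for one record (layout is illustrative)."""
    links = []
    if pgdc_md_url:
        links.append(("metadata", pgdc_md_url.format(uuid)))
    for link_type in pgdc_md_link_types.split(","):
        # e.g. OGC:WMS → a GetCapabilities url on the relevant mapfile
        service = link_type.split(":")[1].lower()
        links.append((link_type, f"{pgdc_ms_url}/{mapfile}?service={service}"
                                 "&request=GetCapabilities"))
    return links

links = record_links("abc-123", "myproject")
print(links[0])  # → ('metadata', 'https://example.com/abc-123')
```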

A mapserver docker image provided by Camptocamp is able to expose a number of mapfiles as map services, e.g. http://example.com/{mapfile}?request=GetCapabilities&service=WMS. Each mapfile needs to be configured as an alias in the mapserver config file.

Layer styling

You can now set dedicated layer styling for grids and vectors.

  • Add mapserver mapfile syntax to the mcf robot section
robot:
  map:
    styles: |
      CLASS
        NAME "style"
        STYLE
          COLOR 100 100 100
          SIZE 8
          WIDTH 1
        END
      END

For grids, several additional options exist:

  • A range of colors; the min-max range of the band is divided by the number of colors. Note that you can define multiple styles per layer; the last one is used as the default.
robot:
  map:
    styles:
      - name: rainbow
        classes: "#ff0000,#ffff00,#00ff00,#00ffff,#0000ff"
      - name: grays
        classes: "#000000,#333333,#666666,#999999,#cccccc,#ffffff"
  • A range of distinct values, you can also use rgb colors
robot:
  map:
    styles:
      - name: rainbow
        classes: 
          - label: True
            val: 1
            color: "0 255 0"
          - label: False
            val: 0
            color: "255 0 0" 
  • A range of classes
robot:
  map:
    styles:
      - name: Scale
        classes: 
          - label: Low
            min: 0
            max: 100
            color: "#0000ff"
          - label: Medium
            min: 100
            max: 200
            color: "#00ff00"
          - label: High
            min: 200
            max: 300
            color: "#ff0000" 
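The first variant above (a list of colors spread over the band's min-max range) amounts to computing equal-width class breaks. A minimal sketch of that computation, assuming equal-width classes (the tool's exact break placement may differ):

```python
def color_classes(band_min, band_max, colors):
    """Divide the band's min-max range into len(colors) equal-width classes."""
    step = (band_max - band_min) / len(colors)
    classes = []
    for i, color in enumerate(colors):
        low = band_min + i * step
        classes.append({"min": low, "max": low + step, "color": color})
    return classes

rainbow = ["#ff0000", "#ffff00", "#00ff00", "#00ffff", "#0000ff"]
for c in color_classes(0, 100, rainbow):
    print(c)
# first class → {'min': 0.0, 'max': 20.0, 'color': '#ff0000'}
```

Each resulting class corresponds to one CLASS block in the generated mapfile.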

Development

Python Poetry

The project is based on common coding conventions from the python poetry community.

On the sources, either run scripts directly:

poetry run crawl-mapfile --dir=/mnt/data

or run a shell in the poetry environment:

poetry shell 

The GDAL dependency has installation issues under poetry; a workaround is to install GDAL on the system and then install the matching Python bindings inside the poetry environment:

poetry shell
sudo apt-get install gdal-bin libgdal-dev
gdalinfo --version
# GDAL 3.3.2, released 2021/09/01
pip install gdal==3.3.2
exit

Release
