Skip to main content

Crawl, extract and push climate metadata for indexing.

Project description

metadata-crawler

License PyPI Conda Version Docs Tests Test-Coverage

Harvest, normalise, and index climate / earth-system metadata from POSIX, S3/MinIO, and OpenStack Swift using configurable DRS dialects (CMIP6, CMIP5, CORDEX, …). Output to a temporary catalogue (JSONLines) and then index into systems such as Solr or MongoDB. Configuration is TOML with inheritance, templating, and computed rules.

TL;DR

  • Define datasets + dialects in drs_config.toml
  • mdc add → write a temporary catalogue (jsonl.gz)
  • mdc config → inspect a the (merged) crawler config.
  • mdc walk-intake → inspect the content of an intake catalogue.
  • mdc <backend> index → push records from catalogue into your index backend
  • mdc <backend> delete → remove records by facet match

Features

  • Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
  • Two-stage pipeline: crawl → catalogue then catalogue → index
  • Schema driven: strong types (e.g. string, datetime[2], float[4], string[])
  • DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
  • Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
  • Special rules: conditionals, cache lookups and function calls (e.g. CMIP6 realm, time aggregation)
  • Index backends: MongoDB (Motor), Solr
  • Sync + Async APIs and a clean CLI
  • Docs: Sphinx with pydata_sphinx_theme

Install

   pip install metadata-crawler
   conda install -c conda-forge metadata-crawler

Quickstart (CLI)

   # 1) Crawl  write catalogue
   mdc add \
     cat.yaml \
     --config-file drs_config.toml \
     --dataset cmip6-fs,obs-fs \
     --threads 4 --batch-size 100

   # 2) Index from catalogue  Solr (or Mongo)
   mdc solr index \
     cat.yaml \
     --server localhot:8983

   # 3) Delete by facets (supports globs on values)
   mdc delete \
     --server localhost:8983 \
     --facets "file *.nc" --facets "project CMIP6"

[!NOTE] The CLI is a custom framework inspired by Typer (not Typer itself). Use --help on any subcommand to see all options.

Minimal config (drs_config.toml)

   # === Canonical schema ===
   [drs_settings.schema.file]
   key      = "file"
   type     = "path"
   required = true
   indexed  = true
   unique   = true

   [drs_settings.schema.uri]
   key      = "uri"
   type     = "uri"
   required = true
   indexed  = true

   [drs_settings.schema.variable]
   key          = "variable"
   type         = "string[]"
   multi_valued = true
   indexed      = true

   [drs_settings.schema.time]
   key     = "time"
   type    = "datetime[2]"     # [start, end]
   indexed = true
   default = []

   [drs_settings.schema.bbox]
   key     = "bbox"
   type    = "float[4]"        # [W,E,S,N]
   default = [0, 360, -90, 90]

   # === Dialect: CMIP6 (example) ===
   [drs_settings.dialect.cmip6]
   sources   = ["path","data"]         # path | data | storage
   defaults.grid_label = "gn"
   specs_dir  = ["mip_era","activity_id","institution_id","source_id","experiment_id","member_id","table_id","variable_id","grid_label","version"]
   specs_file = ["variable_id","table_id","source_id","experiment_id","member_id","grid_label","time"]

   [drs_settings.dialect.cmip6.special.realm]
   type   = "method"
   method = "_get_realm"
   args   = ["table_id","variable_id","__file_name__"]

   [drs_settings.dialect.cmip6.special.time_aggregation]
   type   = "method"
   method = "_get_aggregation"
   args   = ["table_id","variable_id","__file_name__"]

   # === Dialect: CORDEX (bbox by domain) ===
   [drs_settings.dialect.cordex]
   sources   = ["path","data"]
   specs_dir = ["project","product","domain","institution","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","variable","version"]
   specs_file= ["variable","domain","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","time"]

   [drs_settings.dialect.cordex.special.bbox]
   type   = "call"
   method = "dialect['cordex']['domains'].get('{{domain | upper }}', [0,360,-90,90])"

   [drs_settings.dialect.cordex.domains]
   EUR-11 = [-44.14, 64.40, 22.20, 72.42]
   AFR-44 = [-24.64, 60.28, -45.76, 42.24]

   # === Datasets ===
   [cmip6-fs]
   root_path  = "/data/model/global/cmip6"
   drs_format = "cmip6"             # dialect name
   fs_type    = "posix"

   [cmip6-s3]
   root_path        = "s3://test-bucket/data/model/global/cmip6"
   drs_format       = "cmip6"
   fs_type          = "s3"
   storage_options.endpoint_url = "http://127.0.0.1:9000"
   storage_options.aws_access_key_id = "minioadmin"
   storage_options.aws_secret_access_key = "minioadmin"
   storage_options.region_name = "us-east-1"
   storage_options.url_style   = "path"
   storage_options.use_ssl     = false

   [obs-fs]
   root_path  = "/arch/observations"
   drs_format = "custom"
   # define your specs_dir/specs_file or inherit from another dialect

Concepts

Schema (facet definitions)

Each canonical facet describes:

  • key: where to read value ("project", "variable",)
  • type: string, integer, float, datetime, with arrays like float[4], string[], datetime[2], or special types like file, uri, fs_type, dataset, fmt
  • required, default, indexed, unique, multi_valued

Dialects

A dialect tells the crawler how to interpret paths and read data:

  • sources: which sources to consult (path, data, storage) in priority
  • specs_dir / specs_file: ordered facet names encoded in directory and file names
  • data_specs: pull values from dataset content (attrs/variables); supports __variable__ and templated specs
  • special: computed fields (conditional | method | function)
  • Optional lookups (e.g., CORDEX domains for bbox)

Path specs vs data specs

  • Path specs parse segments from the path, e.g.: /project/product/institute/model/experiment/.../variable_time.nc
  • Data specs read from the dataset itself (e.g., xarray/global attribute, variable attributes, per-var stats). Example: gather all variables __variable__, then their units with a templated selector.

Inheritance

Create new dialects/datasets by inheriting:

   [drs_settings.dialect.reana]
   inherits_from = "cmip5"
   sources       = ["path","data"]
   [drs_settings.dialect.reana.data_specs.read_kws]
   engine = "h5netcdf"

Python API

Async

   import asyncio
   from metadata_crawler.run import async_add, async_index, async_delete

   async def main():
       # crawl → catalogue
       await async_add(
           "cat.yaml",
           config_file="drs_config.toml",
           dataset_names=["cmip6-fs"],
           threads=4,
           batch_size=100,
       )
       # index → backend
       await async_index(
           "solr",
           "cat.yaml",
           config_file="drs_config.toml",
           server="localhost:8983",
       )
       # delete by facets
       await async_delete(
           config_path="drs_config.toml",
           index_store="solr",
           facets=[("file", "*.nc")],
       )

   asyncio.run(main())

Sync (simple wrapper)

   import asyncio
   from metadata_crawler import add

   add(
       store="cat.yaml",
       config_file="drs_config.toml",
       dataset_names=["cmip6-fs"],
   )

Index backends

  • MongoDB (Motor): upserts by unique facet (e.g., file), bulk deletes (glob → regex)
  • Solr: fields align with managed schema; supports multi-valued facets

Contributing

Development install:

   git clone https://github.com/freva-org/metadata-crawler.git
   cd metadata-crawler
   pip install -e .

PRs and issues welcome. Please add tests and keep examples minimal & reproducible (use the MinIO compose stack). Run:

   python -m pip install tox
   tox -e test lint types

Benchmarks

For benchmarking you can create a directory tree with roughly 1.5 M files by calling the create-cordex.sh script in the dev-env folder:

./dev-env/create-cordex.sh
python dev-env/benchmark.py --num-files 20000

See code-of-conduct.rst and whatsnew.rst for guidelines and changelog.

Use MinIO or LocalStack via docker-compose and seed a bucket (e.g., test-bucket). Then point a dataset’s fs_type = "s3" and set storage_options.

Documentation

Built with Sphinx + pydata_sphinx_theme. Build locally:

   tox -e docs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metadata_crawler-2511.1.0.tar.gz (103.6 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

metadata_crawler-2511.1.0-cp311-abi3-win_amd64.whl (876.2 kB view details)

Uploaded CPython 3.11+Windows x86-64

metadata_crawler-2511.1.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ x86-64

metadata_crawler-2511.1.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ ARM64

metadata_crawler-2511.1.0-cp311-abi3-macosx_11_0_arm64.whl (981.9 kB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

File details

Details for the file metadata_crawler-2511.1.0.tar.gz.

File metadata

  • Download URL: metadata_crawler-2511.1.0.tar.gz
  • Upload date:
  • Size: 103.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for metadata_crawler-2511.1.0.tar.gz
Algorithm Hash digest
SHA256 2fe00b88980161db0204c6ed3c78af8b5177ad35a51b329a1815f21660e43574
MD5 772e115a2062e8e0c80ec24b3b7abd53
BLAKE2b-256 dd014af3df5503fae45fa800117b2d03c8dcd2fc5fedebe4e690089f74c44ec7

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 74c589b30dedf36884a71cf44da001316c15ac1f4e9f9234e894b86382d78fac
MD5 411d631a6e9761d6b7f2b422daf3ff75
BLAKE2b-256 9ef1d2bf2496471d149395d7bf2be130cbddff350798b6cb43c8cb663b9945cd

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 442bf9cd09b23812daeaafac008ddd9f3dfdcb8ea7e5b6fa33e73457f2ba79bb
MD5 16045afdc47b13e49b17f5cbee879ea6
BLAKE2b-256 943b0d58b949ca2fea6699e34f409fb223dc8e3ffb496bd5b6235582e764eeb6

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 5636f677334f39fc28516a5620dd2e8d9bff040170e66159398ea40fff8e61f8
MD5 2ce840b33e7d10d03334b466917222d7
BLAKE2b-256 0478c8ecf71bcd0d8c3856d362ee684628c363aff96c143178253bd261f8b6ec

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 7fecac640a94739ddf29d4ad94ddbda6cbad865406286620fa2d4cfafc80ed89
MD5 d4d80c1ef1c42c6ebab617f8e0583aeb
BLAKE2b-256 92aa85a13a879351f7e767b0514ad5534a936945335412df221070257c1b753a

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-cp311-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 35cd98ee9022c466450148de40120f378874e482810dd8ccf3b9b17a303be4a0
MD5 b243fd5644de7667fd76554d820142ec
BLAKE2b-256 f0d24eaf37cb377ef959af8336c1d6048d62cfe4e1c4d72a9b3b21f53dfe188f

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 39ad265d7baf665c160c4f000be6bf5360325774263651d9f96c46397f830b41
MD5 6e1a89cb38b4624ee526fce84dab999b
BLAKE2b-256 f4c331a22286e7c4a1ce2021a7ff56fed587aded2f78cfdc08df6f37d652601e

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 75f329ecb0b24c605d23aedeb192078f2082b15b3d4ab446f0662ac70597c58f
MD5 8ace4dc0a2909ba47c7408426217a28b
BLAKE2b-256 b8a83fe5ebc752d7fbdf60ecffec4a711d4b51b0a40c27057b9dbbf78d442d39

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.0-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.0-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 25e89e87bfb6aa1dd006211c5cc4c0fa66e471b9eb79a81f604f48647eb92188
MD5 f4c11190739a2b74914f9df79502001a
BLAKE2b-256 b41ad354f14c958cce95154795725367dee59b6668d3fc357fb5569ec7302ba7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page