Skip to main content

Crawl, extract and push climate metadata for indexing.

Project description

metadata-crawler

License PyPI Conda Version Docs Tests Test-Coverage

Harvest, normalise, and index climate / earth-system metadata from POSIX, S3/MinIO, and OpenStack Swift using configurable DRS dialects (CMIP6, CMIP5, CORDEX, …). Output to a temporary catalogue (JSONLines) and then index into systems such as Solr or MongoDB. Configuration is TOML with inheritance, templating, and computed rules.

TL;DR

  • Define datasets + dialects in drs_config.toml
  • mdc add → write a temporary catalogue (jsonl.gz)
  • mdc config → inspect a the (merged) crawler config.
  • mdc walk-intake → inspect the content of an intake catalogue.
  • mdc <backend> index → push records from catalogue into your index backend
  • mdc <backend> delete → remove records by facet match

Features

  • Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
  • Two-stage pipeline: crawl → catalogue then catalogue → index
  • Schema driven: strong types (e.g. string, datetime[2], float[4], string[])
  • DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
  • Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
  • Special rules: conditionals, cache lookups and function calls (e.g. CMIP6 realm, time aggregation)
  • Index backends: MongoDB (Motor), Solr
  • Sync + Async APIs and a clean CLI
  • Docs: Sphinx with pydata_sphinx_theme

Install

   pip install metadata-crawler
   conda install -c conda-forge metadata-crawler

Quickstart (CLI)

   # 1) Crawl  write catalogue
   mdc add \
     cat.yaml \
     --config-file drs_config.toml \
     --dataset cmip6-fs,obs-fs \
     --threads 4 --batch-size 100

   # 2) Index from catalogue  Solr (or Mongo)
   mdc solr index \
     cat.yaml \
     --server localhot:8983

   # 3) Delete by facets (supports globs on values)
   mdc delete \
     --server localhost:8983 \
     --facets "file *.nc" --facets "project CMIP6"

[!NOTE] The CLI is a custom framework inspired by Typer (not Typer itself). Use --help on any subcommand to see all options.

Minimal config (drs_config.toml)

   # === Canonical schema ===
   [drs_settings.schema.file]
   key      = "file"
   type     = "path"
   required = true
   indexed  = true
   unique   = true

   [drs_settings.schema.uri]
   key      = "uri"
   type     = "uri"
   required = true
   indexed  = true

   [drs_settings.schema.variable]
   key          = "variable"
   type         = "string[]"
   multi_valued = true
   indexed      = true

   [drs_settings.schema.time]
   key     = "time"
   type    = "datetime[2]"     # [start, end]
   indexed = true
   default = []

   [drs_settings.schema.bbox]
   key     = "bbox"
   type    = "float[4]"        # [W,E,S,N]
   default = [0, 360, -90, 90]

   # === Dialect: CMIP6 (example) ===
   [drs_settings.dialect.cmip6]
   sources   = ["path","data"]         # path | data | storage
   defaults.grid_label = "gn"
   specs_dir  = ["mip_era","activity_id","institution_id","source_id","experiment_id","member_id","table_id","variable_id","grid_label","version"]
   specs_file = ["variable_id","table_id","source_id","experiment_id","member_id","grid_label","time"]

   [drs_settings.dialect.cmip6.special.realm]
   type   = "method"
   method = "_get_realm"
   args   = ["table_id","variable_id","__file_name__"]

   [drs_settings.dialect.cmip6.special.time_aggregation]
   type   = "method"
   method = "_get_aggregation"
   args   = ["table_id","variable_id","__file_name__"]

   # === Dialect: CORDEX (bbox by domain) ===
   [drs_settings.dialect.cordex]
   sources   = ["path","data"]
   specs_dir = ["project","product","domain","institution","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","variable","version"]
   specs_file= ["variable","domain","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","time"]

   [drs_settings.dialect.cordex.special.bbox]
   type   = "call"
   method = "dialect['cordex']['domains'].get('{{domain | upper }}', [0,360,-90,90])"

   [drs_settings.dialect.cordex.domains]
   EUR-11 = [-44.14, 64.40, 22.20, 72.42]
   AFR-44 = [-24.64, 60.28, -45.76, 42.24]

   # === Datasets ===
   [cmip6-fs]
   root_path  = "/data/model/global/cmip6"
   drs_format = "cmip6"             # dialect name
   fs_type    = "posix"

   [cmip6-s3]
   root_path        = "s3://test-bucket/data/model/global/cmip6"
   drs_format       = "cmip6"
   fs_type          = "s3"
   storage_options.endpoint_url = "http://127.0.0.1:9000"
   storage_options.aws_access_key_id = "minioadmin"
   storage_options.aws_secret_access_key = "minioadmin"
   storage_options.region_name = "us-east-1"
   storage_options.url_style   = "path"
   storage_options.use_ssl     = false

   [obs-fs]
   root_path  = "/arch/observations"
   drs_format = "custom"
   # define your specs_dir/specs_file or inherit from another dialect

Concepts

Schema (facet definitions)

Each canonical facet describes:

  • key: where to read value ("project", "variable",)
  • type: string, integer, float, datetime, with arrays like float[4], string[], datetime[2], or special types like file, uri, fs_type, dataset, fmt
  • required, default, indexed, unique, multi_valued

Dialects

A dialect tells the crawler how to interpret paths and read data:

  • sources: which sources to consult (path, data, storage) in priority
  • specs_dir / specs_file: ordered facet names encoded in directory and file names
  • data_specs: pull values from dataset content (attrs/variables); supports __variable__ and templated specs
  • special: computed fields (conditional | method | function)
  • Optional lookups (e.g., CORDEX domains for bbox)

Path specs vs data specs

  • Path specs parse segments from the path, e.g.: /project/product/institute/model/experiment/.../variable_time.nc
  • Data specs read from the dataset itself (e.g., xarray/global attribute, variable attributes, per-var stats). Example: gather all variables __variable__, then their units with a templated selector.

Inheritance

Create new dialects/datasets by inheriting:

   [drs_settings.dialect.reana]
   inherits_from = "cmip5"
   sources       = ["path","data"]
   [drs_settings.dialect.reana.data_specs.read_kws]
   engine = "h5netcdf"

Python API

Async

   import asyncio
   from metadata_crawler.run import async_add, async_index, async_delete

   async def main():
       # crawl → catalogue
       await async_add(
           "cat.yaml",
           config_file="drs_config.toml",
           dataset_names=["cmip6-fs"],
           threads=4,
           batch_size=100,
       )
       # index → backend
       await async_index(
           "solr",
           "cat.yaml",
           config_file="drs_config.toml",
           server="localhost:8983",
       )
       # delete by facets
       await async_delete(
           config_path="drs_config.toml",
           index_store="solr",
           facets=[("file", "*.nc")],
       )

   asyncio.run(main())

Sync (simple wrapper)

   import asyncio
   from metadata_crawler import add

   add(
       store="cat.yaml",
       config_file="drs_config.toml",
       dataset_names=["cmip6-fs"],
   )

Index backends

  • MongoDB (Motor): upserts by unique facet (e.g., file), bulk deletes (glob → regex)
  • Solr: fields align with managed schema; supports multi-valued facets

Contributing

Development install:

   git clone https://github.com/freva-org/metadata-crawler.git
   cd metadata-crawler
   pip install -e .

PRs and issues welcome. Please add tests and keep examples minimal & reproducible (use the MinIO compose stack). Run:

   python -m pip install tox
   tox -e test lint types

Benchmarks

For benchmarking you can create a directory tree with roughly 1.5 M files by calling the create-cordex.sh script in the dev-env folder:

./dev-env/create-cordex.sh
python dev-env/benchmark.py --num-files 20000

See code-of-conduct.rst and whatsnew.rst for guidelines and changelog.

Use MinIO or LocalStack via docker-compose and seed a bucket (e.g., test-bucket). Then point a dataset’s fs_type = "s3" and set storage_options.

Documentation

Built with Sphinx + pydata_sphinx_theme. Build locally:

   tox -e docs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metadata_crawler-2602.0.0.tar.gz (103.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

metadata_crawler-2602.0.0-cp311-abi3-win_amd64.whl (856.9 kB view details)

Uploaded CPython 3.11+Windows x86-64

metadata_crawler-2602.0.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ x86-64

metadata_crawler-2602.0.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ ARM64

metadata_crawler-2602.0.0-cp311-abi3-macosx_11_0_arm64.whl (965.3 kB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

File details

Details for the file metadata_crawler-2602.0.0.tar.gz.

File metadata

  • Download URL: metadata_crawler-2602.0.0.tar.gz
  • Upload date:
  • Size: 103.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for metadata_crawler-2602.0.0.tar.gz
Algorithm Hash digest
SHA256 102c14ae24ddd47bdf8f36a28505f28a20877839f4e0100d3afbd7116ef94b16
MD5 ea2c6992807f0b42f36b3fd58d7eb03d
BLAKE2b-256 22ee908ea3c96fb10da5347b01651e3c68445e35c2dbe496bc988f87e6c0a8d5

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 e506e906b83eb24f325cbb70e189019062b2011ce1f2b854f1244f8a7d309eb3
MD5 78eb545682332a91b6ae53c32928c6ed
BLAKE2b-256 d872a77591dc4d4a15443e15aca965f6a895c29ccaa6df2c76326bc1569455aa

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 e00ba06e7f219f006f3e53beda2b283f5871fe109b7a50ebc80d07a4f3360bb2
MD5 cba7231abc317665a06aaf7c433fb376
BLAKE2b-256 a4d1cf4f284f38395112bd4584f327ca0b7e0724fae3e30ecd30b3cfb6df6f31

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 dce6c7fe0b4853aa3c461f79a29599aff7929abf35450322350727582889a2dd
MD5 ac04959f85eaac715057f26770254ebb
BLAKE2b-256 1f8dd620c3aa17b97c17391fea1dfd15f248f359f9f1e5334d23bc60595d3d97

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8444c01c8766870dbaf85d2e7cf21d697d8b21bbabaca2bbc4485a136c30991b
MD5 13223536c63de3e797c6ee53327bba24
BLAKE2b-256 3f846d8cd3e524190602c53d1d1eb5fb2aca8464fd5f5c5432bf077c6e0266e1

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-cp311-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b62730fb5ecb2bb56102b53085bce086efc9d0b6e579daba10d894072ad199fa
MD5 372e7db936e77b5d8b6512f48d20143f
BLAKE2b-256 f9c1164c26066b7fc2deaf8a93a2ee05d8802c195a694c05df782c6525cdf52f

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8151a3dd58f350d8e7746acacbf5b66520b9e6387ff52600ff2456d2e6bddadf
MD5 56c6fc84a3bb4b12f7375ce3e3c2fb8f
BLAKE2b-256 840e1837c8c34d09c3f1095d3bae211e9a538b632df1fa4c3541f4bed147d093

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 effc1511711dc7eefa61c7c0ee93cb0ead29724618e75ac4b2a86440736dbb6d
MD5 8522c78fa0f915246420e0aa4bf3787c
BLAKE2b-256 927f553068268eec15c05fd9273c44fb8e409ac7e900b6462049ae4a32d80656

See more details on using hashes here.

File details

Details for the file metadata_crawler-2602.0.0-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2602.0.0-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b436e7ecb21ae36adec90c5d20e6dd6592646027fc3e85e928f846959b123824
MD5 9eb13f0bc857747366d734e7f80c21a3
BLAKE2b-256 c2707b35bd8a058d42476fb07b9507810f63d7ee0df91d508906cda16f0e068d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page