Skip to main content

Crawl, extract and push climate metadata for indexing.

Project description

metadata-crawler

License PyPI Conda Version Docs Tests Test-Coverage

Harvest, normalise, and index climate / earth-system metadata from POSIX, S3/MinIO, and OpenStack Swift using configurable DRS dialects (CMIP6, CMIP5, CORDEX, …). Output to a temporary catalogue (JSONLines) and then index into systems such as Solr or MongoDB. Configuration is TOML with inheritance, templating, and computed rules.

TL;DR

  • Define datasets + dialects in drs_config.toml
  • mdc add → write a temporary catalogue (jsonl.gz)
  • mdc config → inspect a the (merged) crawler config.
  • mdc walk-intake → inspect the content of an intake catalogue.
  • mdc <backend> index → push records from catalogue into your index backend
  • mdc <backend> delete → remove records by facet match

Features

  • Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
  • Two-stage pipeline: crawl → catalogue then catalogue → index
  • Schema driven: strong types (e.g. string, datetime[2], float[4], string[])
  • DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
  • Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
  • Special rules: conditionals, cache lookups and function calls (e.g. CMIP6 realm, time aggregation)
  • Index backends: MongoDB (Motor), Solr
  • Sync + Async APIs and a clean CLI
  • Docs: Sphinx with pydata_sphinx_theme

Install

   pip install metadata-crawler
   conda install -c conda-forge metadata-crawler

Quickstart (CLI)

   # 1) Crawl  write catalogue
   mdc add \
     cat.yaml \
     --config-file drs_config.toml \
     --dataset cmip6-fs,obs-fs \
     --threads 4 --batch-size 100

   # 2) Index from catalogue  Solr (or Mongo)
   mdc solr index \
     cat.yaml \
     --server localhot:8983

   # 3) Delete by facets (supports globs on values)
   mdc delete \
     --server localhost:8983 \
     --facets "file *.nc" --facets "project CMIP6"

[!NOTE] The CLI is a custom framework inspired by Typer (not Typer itself). Use --help on any subcommand to see all options.

Minimal config (drs_config.toml)

   # === Canonical schema ===
   [drs_settings.schema.file]
   key      = "file"
   type     = "path"
   required = true
   indexed  = true
   unique   = true

   [drs_settings.schema.uri]
   key      = "uri"
   type     = "uri"
   required = true
   indexed  = true

   [drs_settings.schema.variable]
   key          = "variable"
   type         = "string[]"
   multi_valued = true
   indexed      = true

   [drs_settings.schema.time]
   key     = "time"
   type    = "datetime[2]"     # [start, end]
   indexed = true
   default = []

   [drs_settings.schema.bbox]
   key     = "bbox"
   type    = "float[4]"        # [W,E,S,N]
   default = [0, 360, -90, 90]

   # === Dialect: CMIP6 (example) ===
   [drs_settings.dialect.cmip6]
   sources   = ["path","data"]         # path | data | storage
   defaults.grid_label = "gn"
   specs_dir  = ["mip_era","activity_id","institution_id","source_id","experiment_id","member_id","table_id","variable_id","grid_label","version"]
   specs_file = ["variable_id","table_id","source_id","experiment_id","member_id","grid_label","time"]

   [drs_settings.dialect.cmip6.special.realm]
   type   = "method"
   method = "_get_realm"
   args   = ["table_id","variable_id","__file_name__"]

   [drs_settings.dialect.cmip6.special.time_aggregation]
   type   = "method"
   method = "_get_aggregation"
   args   = ["table_id","variable_id","__file_name__"]

   # === Dialect: CORDEX (bbox by domain) ===
   [drs_settings.dialect.cordex]
   sources   = ["path","data"]
   specs_dir = ["project","product","domain","institution","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","variable","version"]
   specs_file= ["variable","domain","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","time"]

   [drs_settings.dialect.cordex.special.bbox]
   type   = "call"
   method = "dialect['cordex']['domains'].get('{{domain | upper }}', [0,360,-90,90])"

   [drs_settings.dialect.cordex.domains]
   EUR-11 = [-44.14, 64.40, 22.20, 72.42]
   AFR-44 = [-24.64, 60.28, -45.76, 42.24]

   # === Datasets ===
   [cmip6-fs]
   root_path  = "/data/model/global/cmip6"
   drs_format = "cmip6"             # dialect name
   fs_type    = "posix"

   [cmip6-s3]
   root_path        = "s3://test-bucket/data/model/global/cmip6"
   drs_format       = "cmip6"
   fs_type          = "s3"
   storage_options.endpoint_url = "http://127.0.0.1:9000"
   storage_options.aws_access_key_id = "minioadmin"
   storage_options.aws_secret_access_key = "minioadmin"
   storage_options.region_name = "us-east-1"
   storage_options.url_style   = "path"
   storage_options.use_ssl     = false

   [obs-fs]
   root_path  = "/arch/observations"
   drs_format = "custom"
   # define your specs_dir/specs_file or inherit from another dialect

Concepts

Schema (facet definitions)

Each canonical facet describes:

  • key: where to read value ("project", "variable",)
  • type: string, integer, float, datetime, with arrays like float[4], string[], datetime[2], or special types like file, uri, fs_type, dataset, fmt
  • required, default, indexed, unique, multi_valued

Dialects

A dialect tells the crawler how to interpret paths and read data:

  • sources: which sources to consult (path, data, storage) in priority
  • specs_dir / specs_file: ordered facet names encoded in directory and file names
  • data_specs: pull values from dataset content (attrs/variables); supports __variable__ and templated specs
  • special: computed fields (conditional | method | function)
  • Optional lookups (e.g., CORDEX domains for bbox)

Path specs vs data specs

  • Path specs parse segments from the path, e.g.: /project/product/institute/model/experiment/.../variable_time.nc
  • Data specs read from the dataset itself (e.g., xarray/global attribute, variable attributes, per-var stats). Example: gather all variables __variable__, then their units with a templated selector.

Inheritance

Create new dialects/datasets by inheriting:

   [drs_settings.dialect.reana]
   inherits_from = "cmip5"
   sources       = ["path","data"]
   [drs_settings.dialect.reana.data_specs.read_kws]
   engine = "h5netcdf"

Python API

Async

   import asyncio
   from metadata_crawler.run import async_add, async_index, async_delete

   async def main():
       # crawl → catalogue
       await async_add(
           "cat.yaml",
           config_file="drs_config.toml",
           dataset_names=["cmip6-fs"],
           threads=4,
           batch_size=100,
       )
       # index → backend
       await async_index(
           "solr",
           "cat.yaml",
           config_file="drs_config.toml",
           server="localhost:8983",
       )
       # delete by facets
       await async_delete(
           config_path="drs_config.toml",
           index_store="solr",
           facets=[("file", "*.nc")],
       )

   asyncio.run(main())

Sync (simple wrapper)

   import asyncio
   from metadata_crawler import add

   add(
       store="cat.yaml",
       config_file="drs_config.toml",
       dataset_names=["cmip6-fs"],
   )

Index backends

  • MongoDB (Motor): upserts by unique facet (e.g., file), bulk deletes (glob → regex)
  • Solr: fields align with managed schema; supports multi-valued facets

Contributing

Development install:

   git clone https://github.com/freva-org/metadata-crawler.git
   cd metadata-crawler
   pip install -e .

PRs and issues welcome. Please add tests and keep examples minimal & reproducible (use the MinIO compose stack). Run:

   python -m pip install tox
   tox -e test lint types

Benchmarks

For benchmarking you can create a directory tree with roughly 1.5 M files by calling the create-cordex.sh script in the dev-env folder:

./dev-env/create-cordex.sh
python dev-env/benchmark.py --num-files 20000

See code-of-conduct.rst and whatsnew.rst for guidelines and changelog.

Use MinIO or LocalStack via docker-compose and seed a bucket (e.g., test-bucket). Then point a dataset’s fs_type = "s3" and set storage_options.

Documentation

Built with Sphinx + pydata_sphinx_theme. Build locally:

   tox -e docs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metadata_crawler-2511.2.0.tar.gz (103.7 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

metadata_crawler-2511.2.0-cp311-abi3-win_amd64.whl (876.4 kB view details)

Uploaded CPython 3.11+Windows x86-64

metadata_crawler-2511.2.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ x86-64

metadata_crawler-2511.2.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ ARM64

metadata_crawler-2511.2.0-cp311-abi3-macosx_11_0_arm64.whl (982.0 kB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

File details

Details for the file metadata_crawler-2511.2.0.tar.gz.

File metadata

  • Download URL: metadata_crawler-2511.2.0.tar.gz
  • Upload date:
  • Size: 103.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for metadata_crawler-2511.2.0.tar.gz
Algorithm Hash digest
SHA256 a64354dce42f5abc42dda102b4812ae15c88dc0c25c51a349d66c6d3ebb82a8d
MD5 611c59a3a6d6ecb6db14eecbcbe8296b
BLAKE2b-256 d688107836a1de04c431c45e41bcb41614dae4dec3a92fe5cad608fe385b96d5

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 4c65b59606ad771b2cbe9a8a60a22863fd21c7da8e67cf9cae15a9e7eb959192
MD5 84a261a150a67779959e8c21d3702108
BLAKE2b-256 c3ed32d078495805937e656397e27fa2ac1a2f4b406086d99c803609556e5d70

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 5a2c54eda4edb7d7fa3c87e2916b1497c1f414c70181063815744d1dc93caf7f
MD5 f5b18544c5d1c318799c2ba5f58c4349
BLAKE2b-256 ed8d85df0ab5b3957e95560dea4f685cc0d93406bbb5d57090ceb0d485b90829

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 f44e579a33a1da27019e95a688ce444a283b280bbfa673928b29b7a74cea18b2
MD5 82f58823acb17648a431b32bade9d512
BLAKE2b-256 2501f294e04f6edcce8743e91c959560bd1b64510fb3d0eb0ae4324641717e0f

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 0e5b97318671e16cdb940b95cc46ca615b645e52eaef1d1347ffe956977cfcc0
MD5 52b1c6bef667b885de403b0fde3893f3
BLAKE2b-256 0e2d0739340519022c11cc7e7305b63a24a2fc6d551025c2f68305c76bef3164

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-cp311-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b854babe0eabdda616a3dbc01e83bbb980b040cfbccc071f3351003782176c6a
MD5 fdebbb838d728124eb61ec7dcf169ad2
BLAKE2b-256 036af6f1d0163b971f2f97f9b7d207aba6b7b37876f186fc331028b1d8e7c03f

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 75813931e15e0eedeca3302c841dca58accc86fa4dc9e9a749195a146d2e55ac
MD5 ddfbe9e7e54593fa71aa12232b3e9ceb
BLAKE2b-256 7ba3729074b1016c620f548422bc4f45ce2ac90ef9bba63005dade5b874d190f

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8fcd818b9b9428341d78a99248cb996e387d1bc47bebf4a59442ebdfea3d692a
MD5 191ace483e70c050660e94660624cef3
BLAKE2b-256 26e03b645427023d0a584542557d8a683cac066a73cf5706a80a74a8a560817b

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.2.0-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.2.0-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0ffb2f97716425c61092bd1b3886988b3c748ab215c34718d7a7f62f47a69f13
MD5 5aa4e5b4bff3c1ef3917d8185ae92e91
BLAKE2b-256 433379c86e38e8455c1ea3d4c2b7eed8b77509178bd8c2b2d3979f0afc28926c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page