Skip to main content

Crawl, extract and push climate metadata for indexing.

Project description

metadata-crawler

License PyPI Conda Version Docs Tests Test-Coverage

Harvest, normalise, and index climate / earth-system metadata from POSIX, S3/MinIO, and OpenStack Swift using configurable DRS dialects (CMIP6, CMIP5, CORDEX, …). Output to a temporary catalogue (JSONLines) and then index into systems such as Solr or MongoDB. Configuration is TOML with inheritance, templating, and computed rules.

TL;DR

  • Define datasets + dialects in drs_config.toml
  • mdc add → write a temporary catalogue (jsonl.gz)
  • mdc config → inspect a the (merged) crawler config.
  • mdc walk-intake → inspect the content of an intake catalogue.
  • mdc <backend> index → push records from catalogue into your index backend
  • mdc <backend> delete → remove records by facet match

Features

  • Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
  • Two-stage pipeline: crawl → catalogue then catalogue → index
  • Schema driven: strong types (e.g. string, datetime[2], float[4], string[])
  • DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
  • Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
  • Special rules: conditionals, cache lookups and function calls (e.g. CMIP6 realm, time aggregation)
  • Index backends: MongoDB (Motor), Solr
  • Sync + Async APIs and a clean CLI
  • Docs: Sphinx with pydata_sphinx_theme

Install

   pip install metadata-crawler
   conda install -c conda-forge metadata-crawler

Quickstart (CLI)

   # 1) Crawl  write catalogue
   mdc add \
     cat.yaml \
     --config-file drs_config.toml \
     --dataset cmip6-fs,obs-fs \
     --threads 4 --batch-size 100

   # 2) Index from catalogue  Solr (or Mongo)
   mdc solr index \
     cat.yaml \
     --server localhot:8983

   # 3) Delete by facets (supports globs on values)
   mdc delete \
     --server localhost:8983 \
     --facets "file *.nc" --facets "project CMIP6"

[!NOTE] The CLI is a custom framework inspired by Typer (not Typer itself). Use --help on any subcommand to see all options.

Minimal config (drs_config.toml)

   # === Canonical schema ===
   [drs_settings.schema.file]
   key      = "file"
   type     = "path"
   required = true
   indexed  = true
   unique   = true

   [drs_settings.schema.uri]
   key      = "uri"
   type     = "uri"
   required = true
   indexed  = true

   [drs_settings.schema.variable]
   key          = "variable"
   type         = "string[]"
   multi_valued = true
   indexed      = true

   [drs_settings.schema.time]
   key     = "time"
   type    = "datetime[2]"     # [start, end]
   indexed = true
   default = []

   [drs_settings.schema.bbox]
   key     = "bbox"
   type    = "float[4]"        # [W,E,S,N]
   default = [0, 360, -90, 90]

   # === Dialect: CMIP6 (example) ===
   [drs_settings.dialect.cmip6]
   sources   = ["path","data"]         # path | data | storage
   defaults.grid_label = "gn"
   specs_dir  = ["mip_era","activity_id","institution_id","source_id","experiment_id","member_id","table_id","variable_id","grid_label","version"]
   specs_file = ["variable_id","table_id","source_id","experiment_id","member_id","grid_label","time"]

   [drs_settings.dialect.cmip6.special.realm]
   type   = "method"
   method = "_get_realm"
   args   = ["table_id","variable_id","__file_name__"]

   [drs_settings.dialect.cmip6.special.time_aggregation]
   type   = "method"
   method = "_get_aggregation"
   args   = ["table_id","variable_id","__file_name__"]

   # === Dialect: CORDEX (bbox by domain) ===
   [drs_settings.dialect.cordex]
   sources   = ["path","data"]
   specs_dir = ["project","product","domain","institution","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","variable","version"]
   specs_file= ["variable","domain","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","time"]

   [drs_settings.dialect.cordex.special.bbox]
   type   = "call"
   method = "dialect['cordex']['domains'].get('{{domain | upper }}', [0,360,-90,90])"

   [drs_settings.dialect.cordex.domains]
   EUR-11 = [-44.14, 64.40, 22.20, 72.42]
   AFR-44 = [-24.64, 60.28, -45.76, 42.24]

   # === Datasets ===
   [cmip6-fs]
   root_path  = "/data/model/global/cmip6"
   drs_format = "cmip6"             # dialect name
   fs_type    = "posix"

   [cmip6-s3]
   root_path        = "s3://test-bucket/data/model/global/cmip6"
   drs_format       = "cmip6"
   fs_type          = "s3"
   storage_options.endpoint_url = "http://127.0.0.1:9000"
   storage_options.aws_access_key_id = "minioadmin"
   storage_options.aws_secret_access_key = "minioadmin"
   storage_options.region_name = "us-east-1"
   storage_options.url_style   = "path"
   storage_options.use_ssl     = false

   [obs-fs]
   root_path  = "/arch/observations"
   drs_format = "custom"
   # define your specs_dir/specs_file or inherit from another dialect

Concepts

Schema (facet definitions)

Each canonical facet describes:

  • key: where to read value ("project", "variable",)
  • type: string, integer, float, datetime, with arrays like float[4], string[], datetime[2], or special types like file, uri, fs_type, dataset, fmt
  • required, default, indexed, unique, multi_valued

Dialects

A dialect tells the crawler how to interpret paths and read data:

  • sources: which sources to consult (path, data, storage) in priority
  • specs_dir / specs_file: ordered facet names encoded in directory and file names
  • data_specs: pull values from dataset content (attrs/variables); supports __variable__ and templated specs
  • special: computed fields (conditional | method | function)
  • Optional lookups (e.g., CORDEX domains for bbox)

Path specs vs data specs

  • Path specs parse segments from the path, e.g.: /project/product/institute/model/experiment/.../variable_time.nc
  • Data specs read from the dataset itself (e.g., xarray/global attribute, variable attributes, per-var stats). Example: gather all variables __variable__, then their units with a templated selector.

Inheritance

Create new dialects/datasets by inheriting:

   [drs_settings.dialect.reana]
   inherits_from = "cmip5"
   sources       = ["path","data"]
   [drs_settings.dialect.reana.data_specs.read_kws]
   engine = "h5netcdf"

Python API

Async

   import asyncio
   from metadata_crawler.run import async_add, async_index, async_delete

   async def main():
       # crawl → catalogue
       await async_add(
           "cat.yaml",
           config_file="drs_config.toml",
           dataset_names=["cmip6-fs"],
           threads=4,
           batch_size=100,
       )
       # index → backend
       await async_index(
           "solr",
           "cat.yaml",
           config_file="drs_config.toml",
           server="localhost:8983",
       )
       # delete by facets
       await async_delete(
           config_path="drs_config.toml",
           index_store="solr",
           facets=[("file", "*.nc")],
       )

   asyncio.run(main())

Sync (simple wrapper)

   import asyncio
   from metadata_crawler import add

   add(
       store="cat.yaml",
       config_file="drs_config.toml",
       dataset_names=["cmip6-fs"],
   )

Index backends

  • MongoDB (Motor): upserts by unique facet (e.g., file), bulk deletes (glob → regex)
  • Solr: fields align with managed schema; supports multi-valued facets

Contributing

Development install:

   git clone https://github.com/freva-org/metadata-crawler.git
   cd metadata-crawler
   pip install -e .

PRs and issues welcome. Please add tests and keep examples minimal & reproducible (use the MinIO compose stack). Run:

   python -m pip install tox
   tox -e test lint types

Benchmarks

For benchmarking you can create a directory tree with roughly 1.5 M files by calling the create-cordex.sh script in the dev-env folder:

./dev-env/create-cordex.sh
python dev-env/benchmark.py --num-files 20000

See code-of-conduct.rst and whatsnew.rst for guidelines and changelog.

Use MinIO or LocalStack via docker-compose and seed a bucket (e.g., test-bucket). Then point a dataset’s fs_type = "s3" and set storage_options.

Documentation

Built with Sphinx + pydata_sphinx_theme. Build locally:

   tox -e docs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metadata_crawler-2511.1.2.tar.gz (103.7 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

metadata_crawler-2511.1.2-cp311-abi3-win_amd64.whl (876.4 kB view details)

Uploaded CPython 3.11+Windows x86-64

metadata_crawler-2511.1.2-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ x86-64

metadata_crawler-2511.1.2-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.17+ ARM64

metadata_crawler-2511.1.2-cp311-abi3-macosx_11_0_arm64.whl (982.0 kB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

File details

Details for the file metadata_crawler-2511.1.2.tar.gz.

File metadata

  • Download URL: metadata_crawler-2511.1.2.tar.gz
  • Upload date:
  • Size: 103.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for metadata_crawler-2511.1.2.tar.gz
Algorithm Hash digest
SHA256 eeaf8feb6bfb8380aca6b7245365e6c723a27c70d99291c2f9a29ffc27dad75e
MD5 ddb96e4327f75cf5c30d4762749f19b6
BLAKE2b-256 4d1a7a2e9379f97b2b5d6e9536053e709b8c814c0719c21f835f210d9f27a773

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 025b1dec17e2d63d605791ac656b74e0d82ff73717db75c623e0485e34d541f5
MD5 5786468290aac9d3cbc7e27be0afe33e
BLAKE2b-256 32e12fc393bfcbf9fbdc73fbc30005aefbdb09760af517ecde504590db440700

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8f7220a06628685572273326bef0d5a7a67265f4dfadf63709476172ee8962b2
MD5 8a991c835036ff776a153d52b630d40a
BLAKE2b-256 7fcf3d5c8c51684c8e19a25a1b2e11bb47861abeef0e9f62417fd12af2f33ad3

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 e4b052a94b675e130bbc9c1a5fffc387dc35ded9a080c5e730c96602cf149857
MD5 b1f65b8cdbce3950180dd6a39b5cda70
BLAKE2b-256 3bdbef434a96677dc81f08e3eb7403d1485c3bb591a3ae12e35a2be6e494bbf2

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6e99da380a2bcd16f89feb2410e5b8301264e89a558c4dad41323ca659a36b84
MD5 bdfefc7ec5c8666924914ce1a310f300
BLAKE2b-256 881946b0b0b4b0f03b2725f45f420729d8986594821994f9eb08a7cd11de5075

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-cp311-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 d8a94691fa1cee4b5c1ecf852b3b937bbb456c80327267ddd35ddb7c5e66431c
MD5 f98f3d55467baa04634bad288d389c8b
BLAKE2b-256 7cee0e803f1994d9dc3aa79468e979053d97e621d93afe4c701c188fb5f125c2

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ec4d8f0c6aa1c548fb6f8a65dee2e1883ca916191c524a392c6f746c40f59186
MD5 34d7115bd81bee01c07d43bdac2189f5
BLAKE2b-256 20aa5bbe085d7ef63908de6e4c7033d415319db1d70e1342960ea50206f77837

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-cp311-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 91310977e3de226adee10cc475f45e444871d0595ce331f5c8dbd2feba73f315
MD5 d945f66edd2bbd30f9434104dafac96d
BLAKE2b-256 49b19858f286edb58d9300336dee872572fe2ce3412c84c7a3c678fb0f8ccd86

See more details on using hashes here.

File details

Details for the file metadata_crawler-2511.1.2-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for metadata_crawler-2511.1.2-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3f54d76adeab2a1507b5f6cae10affcf15cc8baef07205338b9d49ad4618a30f
MD5 aad42d4614550008298fc5285be8fd2f
BLAKE2b-256 014e88a3761b335222b69cb6c63af2ac5392744a26e00ed3a52da0453de8a3b2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page