Crawl, extract and push climate metadata for indexing.
metadata-crawler
Harvest, normalise, and index climate / earth-system metadata from POSIX, S3/MinIO, and OpenStack Swift using configurable DRS dialects (CMIP6, CMIP5, CORDEX, …). Output to a temporary catalogue (JSONLines) and then index into systems such as Solr or MongoDB. Configuration is TOML with inheritance, templating, and computed rules.
TL;DR
- Define datasets + dialects in drs_config.toml
- mdc add → write a temporary catalogue (jsonl.gz)
- mdc config → inspect the (merged) crawler config
- mdc walk-intake → inspect the content of an intake catalogue
- mdc <backend> index → push records from the catalogue into your index backend
- mdc <backend> delete → remove records by facet match
Features
- Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
- Two-stage pipeline: crawl → catalogue then catalogue → index
- Schema driven: strong types (e.g. string, datetime[2], float[4], string[])
- DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
- Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
- Special rules: conditionals, cache lookups and function calls (e.g. CMIP6 realm, time aggregation)
- Index backends: MongoDB (Motor), Solr
- Sync + Async APIs and a clean CLI
- Docs: Sphinx with pydata_sphinx_theme
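The two-stage pipeline (crawl → catalogue, then catalogue → index) can be pictured with a minimal standard-library sketch. It assumes the temporary catalogue is gzip-compressed JSON Lines, as described above. The helper names `write_catalogue` and `read_batches` and the record fields are hypothetical and are not part of the metadata-crawler API.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def write_catalogue(path: Path, records: Iterable[dict]) -> int:
    """Stage 1 (sketch): write one JSON object per line, gzip-compressed."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as stream:
        for record in records:
            stream.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_batches(path: Path, batch_size: int) -> Iterator[list[dict]]:
    """Stage 2 (sketch): stream records back in batches for bulk indexing."""
    batch: list[dict] = []
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# Hypothetical records standing in for crawled metadata.
records = [
    {"file": f"/data/tas_{i}.nc", "project": "CMIP6", "variable": ["tas"]}
    for i in range(5)
]
catalogue = Path(tempfile.mkdtemp()) / "catalogue.jsonl.gz"
written = write_catalogue(catalogue, records)
batch_sizes = [len(batch) for batch in read_batches(catalogue, batch_size=2)]
print(written, batch_sizes)
```

Decoupling the two stages this way means a failed index run can be repeated from the catalogue without crawling the storage again.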
Install
pip install metadata-crawler
conda install -c conda-forge metadata-crawler
Quickstart (CLI)
# 1) Crawl → write catalogue
mdc add \
cat.yaml \
--config-file drs_config.toml \
--dataset cmip6-fs,obs-fs \
--threads 4 --batch-size 100
# 2) Index from catalogue → Solr (or Mongo)
mdc solr index \
cat.yaml \
--server localhost:8983
# 3) Delete by facets (supports globs on values)
mdc solr delete \
--server localhost:8983 \
--facets "file *.nc" --facets "project CMIP6"
[!NOTE] The CLI is a custom framework inspired by Typer (not Typer itself). Use --help on any subcommand to see all options.
Minimal config (drs_config.toml)
# === Canonical schema ===
[drs_settings.schema.file]
key = "file"
type = "path"
required = true
indexed = true
unique = true
[drs_settings.schema.uri]
key = "uri"
type = "uri"
required = true
indexed = true
[drs_settings.schema.variable]
key = "variable"
type = "string[]"
multi_valued = true
indexed = true
[drs_settings.schema.time]
key = "time"
type = "datetime[2]" # [start, end]
indexed = true
default = []
[drs_settings.schema.bbox]
key = "bbox"
type = "float[4]" # [W,E,S,N]
default = [0, 360, -90, 90]
# === Dialect: CMIP6 (example) ===
[drs_settings.dialect.cmip6]
sources = ["path","data"] # path | data | storage
defaults.grid_label = "gn"
specs_dir = ["mip_era","activity_id","institution_id","source_id","experiment_id","member_id","table_id","variable_id","grid_label","version"]
specs_file = ["variable_id","table_id","source_id","experiment_id","member_id","grid_label","time"]
[drs_settings.dialect.cmip6.special.realm]
type = "method"
method = "_get_realm"
args = ["table_id","variable_id","__file_name__"]
[drs_settings.dialect.cmip6.special.time_aggregation]
type = "method"
method = "_get_aggregation"
args = ["table_id","variable_id","__file_name__"]
# === Dialect: CORDEX (bbox by domain) ===
[drs_settings.dialect.cordex]
sources = ["path","data"]
specs_dir = ["project","product","domain","institution","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","variable","version"]
specs_file = ["variable","domain","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","time"]
[drs_settings.dialect.cordex.special.bbox]
type = "call"
method = "dialect['cordex']['domains'].get('{{domain | upper }}', [0,360,-90,90])"
[drs_settings.dialect.cordex.domains]
EUR-11 = [-44.14, 64.40, 22.20, 72.42]
AFR-44 = [-24.64, 60.28, -45.76, 42.24]
# === Datasets ===
[cmip6-fs]
root_path = "/data/model/global/cmip6"
drs_format = "cmip6" # dialect name
fs_type = "posix"
[cmip6-s3]
root_path = "s3://test-bucket/data/model/global/cmip6"
drs_format = "cmip6"
fs_type = "s3"
storage_options.endpoint_url = "http://127.0.0.1:9000"
storage_options.aws_access_key_id = "minioadmin"
storage_options.aws_secret_access_key = "minioadmin"
storage_options.region_name = "us-east-1"
storage_options.url_style = "path"
storage_options.use_ssl = false
[obs-fs]
root_path = "/arch/observations"
drs_format = "custom"
# define your specs_dir/specs_file or inherit from another dialect
Concepts
Schema (facet definitions)
Each canonical facet describes:
- key: where to read the value ("project", "variable", ...)
- type: string, integer, float, datetime, with arrays like float[4], string[], datetime[2], or special types like file, uri, fs_type, dataset, fmt
- required, default, indexed, unique, multi_valued
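To make the type notation concrete, here is a minimal sketch of how a type string such as float[4] or string[] could be split into a base type and an optional fixed length. The `FacetType` tuple and `parse_facet_type` function are hypothetical illustrations, not the crawler's own parser.

```python
import re
from typing import NamedTuple, Optional


class FacetType(NamedTuple):
    base: str               # e.g. "string", "float", "datetime"
    is_array: bool          # True for "string[]" or "float[4]"
    length: Optional[int]   # fixed array length; None for scalars and "[]"


# base name, optionally followed by "[]" or "[<digits>]"
_TYPE_RE = re.compile(r"^(?P<base>[a-z_]+)(?P<array>\[(?P<length>\d*)\])?$")


def parse_facet_type(spec: str) -> FacetType:
    """Split a schema type string into base type, array flag, and length."""
    match = _TYPE_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"Unsupported type spec: {spec!r}")
    length = match.group("length")
    return FacetType(
        base=match.group("base"),
        is_array=match.group("array") is not None,
        length=int(length) if length else None,
    )


for spec in ("string", "string[]", "float[4]", "datetime[2]"):
    print(spec, "->", parse_facet_type(spec))
```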
Dialects
A dialect tells the crawler how to interpret paths and read data:
- sources: which sources to consult (path, data, storage), in priority order
- specs_dir / specs_file: ordered facet names encoded in directory and file names
- data_specs: pull values from dataset content (attrs/variables); supports __variable__ and templated specs
- special: computed fields (conditional | method | function)
- Optional lookups (e.g., CORDEX domains for bbox)
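The CORDEX bbox rule in the minimal config amounts to a dictionary lookup on the upper-cased domain with a global fallback. A rough Python equivalent, using the two domains from the example config, might look like this. The function name `bbox_for_domain` is hypothetical, and the crawler's own template evaluation may differ.

```python
# Global fallback, matching the schema default for bbox: [W, E, S, N]
DEFAULT_BBOX = [0, 360, -90, 90]

# Domain table copied from the example config above.
DOMAINS = {
    "EUR-11": [-44.14, 64.40, 22.20, 72.42],
    "AFR-44": [-24.64, 60.28, -45.76, 42.24],
}


def bbox_for_domain(domain: str) -> list[float]:
    """Look up the bounding box for a domain, case-insensitively."""
    # Return a copy so callers cannot mutate the lookup table.
    return list(DOMAINS.get(domain.upper(), DEFAULT_BBOX))


print(bbox_for_domain("eur-11"))
print(bbox_for_domain("unknown-domain"))
```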
Path specs vs data specs
- Path specs parse segments from the path, e.g.: /project/product/institute/model/experiment/.../variable_time.nc
- Data specs read from the dataset itself (e.g., xarray/global attributes, variable attributes, per-var stats). Example: gather all variables with __variable__, then their units with a templated selector.
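As an illustration of path specs, the sketch below maps directory levels and underscore-separated file-name parts onto the CMIP6 specs_dir / specs_file lists from the example config. The function `parse_path` and the sample path are hypothetical. Letting directory values take precedence over file-name values is an assumption of this sketch, not documented crawler behaviour.

```python
from pathlib import PurePosixPath

SPECS_DIR = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]
SPECS_FILE = [
    "variable_id", "table_id", "source_id", "experiment_id",
    "member_id", "grid_label", "time",
]


def parse_path(path: str, root: str) -> dict[str, str]:
    """Derive facets from a path below `root` using ordered spec lists."""
    relative = PurePosixPath(path).relative_to(root)
    *dir_parts, file_name = relative.parts
    if len(dir_parts) != len(SPECS_DIR):
        raise ValueError(
            f"Expected {len(SPECS_DIR)} directory levels, got {len(dir_parts)}"
        )
    facets = dict(zip(SPECS_DIR, dir_parts))
    stem = file_name.rsplit(".", 1)[0]  # drop the file suffix
    for name, value in zip(SPECS_FILE, stem.split("_")):
        # Assumption: keep the directory value if both sources provide one.
        facets.setdefault(name, value)
    return facets


facets = parse_path(
    "/data/model/global/cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/"
    "r1i1p1f1/Amon/tas/gn/v20190710/"
    "tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc",
    root="/data/model/global/cmip6",
)
print(facets["variable_id"], facets["version"], facets["time"])
```

Because `zip` stops at the shorter sequence, a file name without a trailing time part simply yields no time facet in this sketch.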
Inheritance
Create new dialects/datasets by inheriting:
[drs_settings.dialect.reana]
inherits_from = "cmip5"
sources = ["path","data"]
[drs_settings.dialect.reana.data_specs.read_kws]
engine = "h5netcdf"
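One plausible reading of inherits_from is a recursive merge in which the child overrides the parent key by key. The sketch below shows that idea on plain dictionaries. The helpers `deep_merge` and `resolve_dialect` and the contents of the cmip5 parent are hypothetical, and the crawler's actual merge semantics may differ (for example in how lists are combined).

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in recursively."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)  # scalars and lists are replaced
    return result


def resolve_dialect(name: str, dialects: dict) -> dict:
    """Resolve `name`, following inherits_from (no cycle detection here)."""
    own = dialects[name]
    parent_name = own.get("inherits_from")
    settings = {k: v for k, v in own.items() if k != "inherits_from"}
    if parent_name is None:
        return deepcopy(settings)
    return deep_merge(resolve_dialect(parent_name, dialects), settings)


dialects = {
    "cmip5": {  # hypothetical parent content, for illustration only
        "sources": ["path"],
        "specs_file": ["variable", "time"],
        "data_specs": {"read_kws": {"engine": "netcdf4", "decode_times": False}},
    },
    "reana": {
        "inherits_from": "cmip5",
        "sources": ["path", "data"],
        "data_specs": {"read_kws": {"engine": "h5netcdf"}},
    },
}
reana = resolve_dialect("reana", dialects)
print(reana)
```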
Python API
Async
import asyncio
from metadata_crawler.run import async_add, async_index, async_delete
async def main():
# crawl → catalogue
await async_add(
"cat.yaml",
config_file="drs_config.toml",
dataset_names=["cmip6-fs"],
threads=4,
batch_size=100,
)
# index → backend
await async_index(
"solr",
"cat.yaml",
config_file="drs_config.toml",
server="localhost:8983",
)
# delete by facets
await async_delete(
config_path="drs_config.toml",
index_store="solr",
facets=[("file", "*.nc")],
)
asyncio.run(main())
Sync (simple wrapper)
from metadata_crawler import add
add(
store="cat.yaml",
config_file="drs_config.toml",
dataset_names=["cmip6-fs"],
)
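A synchronous wrapper around an async function is commonly built on `asyncio.run`. The sketch below shows that general pattern with stand-in functions. `async_add_stub` and `add_stub` are hypothetical and do not reflect how metadata-crawler implements `add`.

```python
import asyncio


async def async_add_stub(store: str, **kwargs: object) -> str:
    """Stand-in for an async crawl; yields control once and returns."""
    await asyncio.sleep(0)
    return f"wrote {store}"


def add_stub(store: str, **kwargs: object) -> str:
    """Blocking wrapper: start an event loop, run the coroutine, return."""
    return asyncio.run(async_add_stub(store, **kwargs))


result = add_stub("cat.yaml", dataset_names=["cmip6-fs"])
print(result)
```

Note that `asyncio.run` raises a `RuntimeError` when called from a thread that already has a running event loop (for example inside a notebook cell), so in such environments awaiting the async function directly is the usual choice.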
Index backends
- MongoDB (Motor): upserts by unique facet (e.g., file), bulk deletes (glob → regex)
- Solr: fields align with managed schema; supports multi-valued facets
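The glob → regex step used for deletes can be illustrated with the standard library's `fnmatch.translate`. This sketch filters in-memory records rather than talking to a database. The helper names and sample records are hypothetical, and the backends may translate globs differently.

```python
import fnmatch
import re
from typing import Iterable


def compile_facets(facets: Iterable[tuple[str, str]]) -> dict[str, re.Pattern]:
    """Turn (facet, glob) pairs into anchored, case-sensitive regexes."""
    return {key: re.compile(fnmatch.translate(glob)) for key, glob in facets}


def matches(record: dict, compiled: dict[str, re.Pattern]) -> bool:
    """True if every facet pattern matches at least one value of the record."""
    for key, regex in compiled.items():
        values = record.get(key)
        if values is None:
            return False
        if not isinstance(values, list):
            values = [values]  # treat single values like multi-valued facets
        if not any(regex.match(str(value)) for value in values):
            return False
    return True


records = [
    {"file": "/data/tas.nc", "project": "CMIP6"},
    {"file": "/data/readme.txt", "project": "CMIP6"},
    {"file": "/data/pr.nc", "project": "CORDEX"},
]
compiled = compile_facets([("file", "*.nc"), ("project", "CMIP6")])
to_delete = [record["file"] for record in records if matches(record, compiled)]
print(to_delete)
```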
Contributing
Development install:
git clone https://github.com/freva-org/metadata-crawler.git
cd metadata-crawler
pip install -e .
PRs and issues welcome. Please add tests and keep examples minimal & reproducible (use the MinIO compose stack). Run:
python -m pip install tox
tox -e test,lint,types
Benchmarks
For benchmarking you can create a directory tree with roughly 1.5 M files by
calling the create-cordex.sh script in the dev-env folder:
./dev-env/create-cordex.sh
python dev-env/benchmark.py --num-files 20000
See code-of-conduct.rst and whatsnew.rst for guidelines and changelog.
Use MinIO or LocalStack via docker-compose and seed a bucket (e.g., test-bucket).
Then set a dataset’s fs_type = "s3" and configure its storage_options.
Documentation
Built with Sphinx + pydata_sphinx_theme. Build locally:
tox -e docs