DataHub ingestion source for Arraylake (Earthmover) Icechunk datasets
arraylake-datahub
DataHub ingestion source for Arraylake / Icechunk datasets.
The plugin crawls the Arraylake catalog over HTTPS and emits one DataHub
Dataset per xarray-compatible Zarr group in your repos. Access
control, credential vending, and querying stay in Arraylake — DataHub
becomes the discovery and metadata-search surface, with externalUrl
links back into the Arraylake web app for the actual data.
📖 Full documentation: Arraylake docs › Integrations › Catalogs › DataHub
What you get in DataHub
For every xarray-compatible group, a Dataset named <org>/<repo>/<group_path> with:
- Schema: one field per Zarr array. Coordinates and data variables are distinguished via a classification flag in each field's jsonProps, alongside shape, chunk_shape, dimension_names, codecs, and the full CF attribute bag (GRIB_* keys are filtered as noise).
- Description: the group's CF title + summary when present, otherwise the repo description.
- externalUrl: direct link into the Arraylake page for that group.
- customProperties spread:
  - Arraylake metadata: provider, product_type, spatial/temporal coverage, spatial_resolution, update_freq, etc.
  - Storage: bucket platform/name/region, computed storage_uri.
  - CF group attributes: license, institution, creator/publisher, time and geospatial coverage, references, history.
  - Marketplace subscription details if the repo is from a listing.
Repos with no xarray-compatible groups still emit one Dataset (repo
landing only) so every catalog entry is discoverable.
Orphan repos — catalog entries whose underlying Icechunk storage no
longer exists — are tagged with arraylake_storage_status=orphan.
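To make the naming concrete, here is a minimal sketch of how a Dataset name of the form <org>/<repo>/<group_path> maps onto a standard DataHub Dataset URN. It is an illustration, not the plugin's code: the helper names, the repo name era5, and the group path surface/hourly are hypothetical, and dropping an empty group path for the repo-landing case is an assumption.

```python
def dataset_name(org: str, repo: str, group_path: str = "") -> str:
    """Join the parts into '<org>/<repo>/<group_path>'.

    Assumption: an empty group path (repo landing only) is simply omitted.
    """
    parts = [org, repo, group_path.strip("/")]
    return "/".join(p for p in parts if p)


def dataset_urn(name: str, platform: str = "earthmover", env: str = "PROD") -> str:
    """Format a DataHub Dataset URN from platform, dataset name, and fabric (env)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical repo and group, used only for illustration.
name = dataset_name("earthmover-public", "era5", "surface/hourly")
print(dataset_urn(name))
```

The platform and env defaults mirror the config defaults (earthmover and PROD); changing either in the recipe would change the resulting URN.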
Install
pip install arraylake-datahub 'acryl-datahub[datahub-rest]'
One-time platform registration
Register the earthmover custom data platform in your DataHub instance
once:
datahub put platform \
--name earthmover \
--display_name "Earthmover" \
--logo https://app.earthmover.io/icon.svg
Run
Save the following as recipe.yml:
source:
  type: earthmover
  config:
    # token: ${ARRAYLAKE_TOKEN}   # default: read from env
    # api_url: https://api.earthmover.io
    orgs:                         # omit to crawl every org the token sees
      - earthmover-public
    repo_pattern:
      allow: [".*"]
      # deny: [".*-archive$"]
    env: PROD                     # DataHub fabric
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_GMS_TOKEN}
Then run:
export ARRAYLAKE_TOKEN=ema_xxxxxxxxxxxx
export DATAHUB_GMS_TOKEN=...
datahub ingest -c recipe.yml --preview # dry run
datahub ingest -c recipe.yml # for real
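The same recipe can also be driven from Python through DataHub's Pipeline API, which can be handy in a scheduler. The sketch below is a rough equivalent of recipe.yml, assuming the earthmover source type is registered once arraylake-datahub is installed. The helper names build_recipe and run_ingest are hypothetical. The source token is left out so the plugin falls back to ARRAYLAKE_TOKEN from the environment.

```python
import os
from typing import Any, Dict, Iterable


def build_recipe(orgs: Iterable[str], server: str = "http://localhost:8080") -> Dict[str, Any]:
    """Build a recipe dict mirroring recipe.yml (hypothetical helper)."""
    return {
        "source": {
            "type": "earthmover",
            "config": {
                "orgs": list(orgs),
                "repo_pattern": {"allow": [".*"]},
                "env": "PROD",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": server,
                # YAML recipes expand ${DATAHUB_GMS_TOKEN}; a dict needs it read explicitly.
                "token": os.environ.get("DATAHUB_GMS_TOKEN"),
            },
        },
    }


def run_ingest(recipe: Dict[str, Any]) -> None:
    """Run the recipe with DataHub's Pipeline and raise if the run reported failures."""
    # Imported lazily so build_recipe stays usable without acryl-datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling run_ingest(build_recipe(["earthmover-public"])) would then perform a real ingest against the configured sink, so it needs both tokens exported as shown above.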
The most useful knobs are orgs (allowlist) and repo_pattern (regex
allow/deny). Full config below.
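As a rough illustration of how allow/deny filtering on <org>/<repo> behaves, the sketch below approximates the semantics of DataHub's AllowDenyPattern: a name is kept when it matches at least one allow regex and no deny regex. It is not the library's code. The function name is_allowed and the repo names are hypothetical, and matching from the start of the string while ignoring case is an assumption about the defaults.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
) -> bool:
    """Return True if name matches an allow pattern and no deny pattern.

    Assumption: patterns are anchored at the start (re.match) and case-insensitive.
    """
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


# Hypothetical repo names, filtered with the deny pattern from the recipe comment.
for repo in ("earthmover-public/era5", "earthmover-public/era5-archive"):
    print(repo, is_allowed(repo, deny=[".*-archive$"]))
```

Deny is checked first, so a repo that matches both lists is skipped.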
Config
| Field | Default | Notes |
|---|---|---|
| token | $ARRAYLAKE_TOKEN | Arraylake API token (ema_*). Read-only is sufficient. |
| api_url | https://api.earthmover.io | Arraylake catalog API base URL. |
| web_url | https://app.earthmover.io | Used for externalUrl when a repo's web_url is missing. |
| orgs | all visible | Allowlist of org slugs. Omit to crawl every org the token sees. |
| repo_pattern | allow .* | AllowDenyPattern matched against <org>/<repo>. |
| env | PROD | DataHub fabric segment of the Dataset URN. |
| platform | earthmover | Must match the platform registered above. |
| walk_max_workers | 8 | Parallel HTTP fetches per repo when walking groups. |
| request_timeout_s | 30 | |
| max_retries | 3 | |
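To give a feel for what walk_max_workers and max_retries control, here is a generic sketch of bounded parallel fetching with retries. It does not reproduce the plugin's implementation: the function names, the fetch callback, and the reading of max_retries as "retries after the first attempt" are all assumptions, and request timeouts are not modeled.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    path: str,
    max_retries: int = 3,
    backoff_s: float = 0.0,
) -> T:
    """Call fetch(path), retrying up to max_retries times after a failure."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(path)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise AssertionError("unreachable")


def walk_groups(
    fetch: Callable[[str], T],
    paths: Iterable[str],
    max_workers: int = 8,
    max_retries: int = 3,
) -> Dict[str, T]:
    """Fetch every path with at most max_workers requests in flight."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda p: fetch_with_retries(fetch, p, max_retries), path_list
        )
        return dict(zip(path_list, results))
```

Raising walk_max_workers shortens the walk for repos with many groups at the cost of more concurrent requests against the Arraylake API.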
Required Arraylake API access
The token needs read access to:
- GET /user/orgs
- GET /orgs/{org}/repos/paginated
- GET /repos/{org}/{repo}
- GET /repos/icechunk/{org}/{repo}/dataset-node
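When checking token permissions by hand, it can help to expand these endpoint templates for a specific repo. The sketch below only builds the URLs and sends no requests. The helper name endpoint_urls and the repo name era5 are hypothetical, and the base URL is the api_url default from the config.

```python
from typing import List

# Endpoint templates listed above; {org} and {repo} are filled in per repo.
ENDPOINT_TEMPLATES = (
    "/user/orgs",
    "/orgs/{org}/repos/paginated",
    "/repos/{org}/{repo}",
    "/repos/icechunk/{org}/{repo}/dataset-node",
)


def endpoint_urls(org: str, repo: str, api_url: str = "https://api.earthmover.io") -> List[str]:
    """Expand every endpoint template against the given API base URL."""
    base = api_url.rstrip("/")
    return [base + template.format(org=org, repo=repo) for template in ENDPOINT_TEMPLATES]


# Hypothetical repo name, used only for illustration.
for url in endpoint_urls("earthmover-public", "era5"):
    print("GET", url)
```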
Verification
After a successful ingest, in DataHub's UI you should see:
- The Earthmover platform with logo.
- One Dataset per xarray-compatible Zarr group, named <org>/<repo>/<group>.
- A Schema panel listing every coordinate and data variable with units and CF descriptions where available.
- A clickable "View in Source" link that lands on the Arraylake page for that group, where authentication and querying happen.
Tested against acryl-datahub 0.15.x on Python 3.10–3.13.
Arraylake Support
Email — support@earthmover.io
Email us with any questions, bug reports, or feature requests.
License
Apache-2.0