DataHub ingestion source for Arraylake (Earthmover) Icechunk datasets

arraylake-datahub

DataHub ingestion source for Arraylake / Icechunk datasets.

The plugin crawls the Arraylake catalog over HTTPS and emits one DataHub Dataset per xarray-compatible Zarr group in your repos. Access control, credential vending, and querying stay in Arraylake — DataHub becomes the discovery and metadata-search surface, with externalUrl links back into the Arraylake web app for the actual data.

📖 Full documentation: Arraylake docs › Integrations › Catalogs › DataHub

What you get in DataHub

For every xarray-compatible group, a Dataset named <org>/<repo>/<group_path> with:

  • Schema — one field per Zarr array. Coordinates and data variables are distinguished via a classification flag in each field's jsonProps, alongside shape, chunk_shape, dimension_names, codecs, and the full CF attribute bag (GRIB_* keys are filtered as noise).
  • Description — the group's CF title + summary when present, otherwise the repo description.
  • externalUrl — direct link into the Arraylake page for that group.
  • customProperties spread:
    • Arraylake metadata: provider, product_type, spatial/temporal coverage, spatial_resolution, update_freq, etc.
    • Storage: bucket platform/name/region, computed storage_uri.
    • CF group attributes: license, institution, creator/publisher, time and geospatial coverage, references, history.
    • Marketplace subscription details if the repo is from a listing.
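
As a rough illustration of the Schema bullet above, the per-field jsonProps payload could be assembled along these lines. This is a minimal sketch, not the plugin's actual code: the function name, the input dictionary layout, the JSON key names, and the classification values ("coordinate" / "data_variable") are all assumptions made for illustration. Only the listed properties and the GRIB_* filtering come from the description above.

```python
import json


def field_json_props(name, array_meta, coord_names):
    """Build a jsonProps string for one Zarr array (hypothetical layout)."""
    # Drop GRIB_* attributes, which the README describes as noise.
    attrs = {
        key: value
        for key, value in array_meta.get("attributes", {}).items()
        if not key.startswith("GRIB_")
    }
    props = {
        # Assumed flag values; the real plugin may use different strings.
        "classification": "coordinate" if name in coord_names else "data_variable",
        "shape": array_meta["shape"],
        "chunk_shape": array_meta["chunk_shape"],
        "dimension_names": array_meta["dimension_names"],
        "codecs": array_meta.get("codecs", []),
        "attributes": attrs,
    }
    return json.dumps(props, sort_keys=True)


example = field_json_props(
    "t2m",
    {
        "shape": [8760, 721, 1440],
        "chunk_shape": [24, 721, 1440],
        "dimension_names": ["time", "latitude", "longitude"],
        "attributes": {"units": "K", "GRIB_paramId": 167},
    },
    coord_names={"time", "latitude", "longitude"},
)
print(example)
```

The example array and its values are made up; the point is only that each field carries one self-describing JSON blob that DataHub can index.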

Repos with no xarray-compatible groups still emit one Dataset (repo landing only) so every catalog entry is discoverable.

Orphan repos — catalog entries whose underlying Icechunk storage no longer exists — are tagged with arraylake_storage_status=orphan.
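
To make the naming scheme concrete, here is a small sketch of how a Dataset name and URN could be derived from an org, repo, and group path. The URN layout is DataHub's standard dataset URN format. The helper name, the example repo and group path, and the assumption that a repo-landing Dataset is named <org>/<repo> are illustrative and not confirmed by the plugin.

```python
def dataset_urn(org, repo, group_path="", platform="earthmover", env="PROD"):
    """Return a DataHub dataset URN for one group (hypothetical helper)."""
    # Join the non-empty parts so a repo-landing entry becomes <org>/<repo>.
    name = "/".join(part for part in (org, repo, group_path.strip("/")) if part)
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


print(dataset_urn("earthmover-public", "era5", "surface/hourly"))
print(dataset_urn("earthmover-public", "era5"))
```

The platform and env arguments correspond to the platform and env settings in the recipe, which is why the registered platform name has to match.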

Install

pip install arraylake-datahub 'acryl-datahub[datahub-rest]'

One-time platform registration

Register the earthmover custom data platform in your DataHub instance once:

datahub put platform \
  --name earthmover \
  --display_name "Earthmover" \
  --logo https://app.earthmover.io/icon.svg

Run

Save the following as recipe.yml:

source:
  type: earthmover
  config:
    # token: ${ARRAYLAKE_TOKEN}      # default: read from env
    # api_url: https://api.earthmover.io
    orgs:                            # omit to crawl every org the token sees
      - earthmover-public
    repo_pattern:
      allow: [".*"]
      # deny: [".*-archive$"]
    env: PROD                        # DataHub fabric

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_GMS_TOKEN}

Then run:

export ARRAYLAKE_TOKEN=ema_xxxxxxxxxxxx
export DATAHUB_GMS_TOKEN=...

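datahub ingest -c recipe.yml --dry-run   # dry run: nothing is written to the sink
datahub ingest -c recipe.yml --preview   # quick preview: ingests only the first few workunits
datahub ingest -c recipe.yml             # for real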

The most useful knobs are orgs (allowlist) and repo_pattern (regex allow/deny). Full config below.
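
The allow/deny behaviour of repo_pattern can be approximated in a few lines. This sketch mimics the semantics of DataHub's AllowDenyPattern as commonly documented (deny rules take precedence, patterns are matched from the start of the string, case-insensitively by default). It is a standalone approximation for reasoning about your patterns, not the plugin's code, and the function name and example repo names are hypothetical.

```python
import re


def repo_allowed(full_name, allow=(".*",), deny=()):
    """Return True if '<org>/<repo>' passes the allow/deny patterns."""
    # A single matching deny pattern rejects the repo outright.
    if any(re.match(pattern, full_name, re.IGNORECASE) for pattern in deny):
        return False
    # Otherwise at least one allow pattern has to match.
    return any(re.match(pattern, full_name, re.IGNORECASE) for pattern in allow)


for candidate in ("earthmover-public/era5", "earthmover-public/era5-archive"):
    print(candidate, repo_allowed(candidate, deny=[".*-archive$"]))
```

Because matching starts at the beginning of <org>/<repo>, a pattern that targets only the repo name usually needs a leading `.*` or an explicit org prefix.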

Config

| Field | Default | Notes |
|---|---|---|
| token | $ARRAYLAKE_TOKEN | Arraylake API token (ema_*). Read-only is sufficient. |
| api_url | https://api.earthmover.io | Arraylake catalog API base URL. |
| web_url | https://app.earthmover.io | Used for externalUrl when a repo's web_url is missing. |
| orgs | all visible | Allowlist of org slugs. Omit to crawl every org the token sees. |
| repo_pattern | allow .* | AllowDenyPattern matched against <org>/<repo>. |
| env | PROD | DataHub fabric segment of the Dataset URN. |
| platform | earthmover | Must match the platform registered above. |
| walk_max_workers | 8 | Parallel HTTP fetches per repo when walking groups. |
| request_timeout_s | 30 | |
| max_retries | 3 | |
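
For intuition about how walk_max_workers and max_retries might interact, the following sketch fans group fetches out over a thread pool and retries each one on transient failure. It is an assumption-laden illustration, not the plugin's implementation: the function names, the backoff policy, the choice of OSError as the retryable error, and the reading of max_retries as "additional attempts after the first" are all hypothetical. The fetch callable stands in for an HTTP call that would apply request_timeout_s itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retries(fetch, path, max_retries=3, backoff_s=0.0):
    """Call fetch(path), retrying up to max_retries times on OSError."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(path)
        except OSError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff


def walk_groups(fetch, group_paths, walk_max_workers=8, max_retries=3):
    """Fetch every group path in parallel and return {path: result}."""
    with ThreadPoolExecutor(max_workers=walk_max_workers) as pool:
        results = pool.map(
            lambda path: fetch_with_retries(fetch, path, max_retries),
            group_paths,
        )
        # pool.map preserves input order, so zip pairs paths with results.
        return dict(zip(group_paths, results))
```

Lowering walk_max_workers is the obvious lever if the catalog API starts rate-limiting a crawl; raising it mainly helps repos with many groups.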

Required Arraylake API access

The token needs read access to:

  • GET /user/orgs
  • GET /orgs/{org}/repos/paginated
  • GET /repos/{org}/{repo}
  • GET /repos/icechunk/{org}/{repo}/dataset-node
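
Before a first ingest it can be handy to confirm that the token reaches each of these endpoints. The sketch below builds the four URLs from the list above and records the HTTP status for each. The function names are hypothetical, and sending the token as a Bearer Authorization header is an assumption, as is calling the endpoints without query parameters: some of them may require parameters, so a 4xx status other than 401 or 403 does not necessarily indicate a permissions problem.

```python
import urllib.error
import urllib.request


def required_endpoints(org, repo, api_url="https://api.earthmover.io"):
    """Return the catalog URLs the ingestion source needs to read."""
    base = api_url.rstrip("/")
    return [
        f"{base}/user/orgs",
        f"{base}/orgs/{org}/repos/paginated",
        f"{base}/repos/{org}/{repo}",
        f"{base}/repos/icechunk/{org}/{repo}/dataset-node",
    ]


def check_access(token, org, repo, timeout_s=30):
    """Return {url: HTTP status}. Bearer auth is an assumption."""
    statuses = {}
    for url in required_endpoints(org, repo):
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout_s) as response:
                statuses[url] = response.status
        except urllib.error.HTTPError as err:
            statuses[url] = err.code  # e.g. 401/403 when access is missing
    return statuses
```

A call such as check_access(os.environ["ARRAYLAKE_TOKEN"], "earthmover-public", "<repo>") would then show at a glance which endpoint, if any, the token cannot read.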

Verification

After a successful ingest, in DataHub's UI you should see:

  • The Earthmover platform with logo.
  • One Dataset per xarray-compatible Zarr group, named <org>/<repo>/<group_path>.
  • A Schema panel listing every coordinate and data variable with units and CF descriptions where available.
  • A clickable "View in Source" link that lands on the Arraylake page for that group, where authentication and querying happen.

Tested against acryl-datahub 0.15.x on Python 3.10–3.13.

Arraylake Support

Email: support@earthmover.io

Email us with any questions, bug reports, or feature requests.

License

Apache-2.0
