
DataHub ingestion source for Arraylake (Earthmover) Icechunk datasets

arraylake-datahub

DataHub ingestion source for Arraylake / Icechunk datasets.

The plugin crawls the Arraylake catalog over HTTPS and emits one DataHub Dataset per xarray-compatible Zarr group in your repos. Access control, credential vending, and querying stay in Arraylake — DataHub becomes the discovery and metadata-search surface, with externalUrl links back into the Arraylake web app for the actual data.

📖 Full documentation: Arraylake docs › Integrations › Catalogs › DataHub

What you get in DataHub

For every xarray-compatible group, a Dataset named <org>/<repo>/<group_path> with:

  • Schema — one field per Zarr array. Coordinates and data variables are distinguished via a classification flag in each field's jsonProps, alongside shape, chunk_shape, dimension_names, codecs, and the full CF attribute bag (GRIB_* keys are filtered as noise).
  • Description — the group's CF title + summary when present, otherwise the repo description.
  • externalUrl — direct link into the Arraylake page for that group.
  • customProperties spread:
    • Arraylake metadata: provider, product_type, spatial/temporal coverage, spatial_resolution, update_freq, etc.
    • Storage: bucket platform/name/region, computed storage_uri.
    • CF group attributes: license, institution, creator/publisher, time and geospatial coverage, references, history.
    • Marketplace subscription details if the repo is from a listing.
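To make the per-field metadata concrete, here is a minimal sketch of how one array's jsonProps payload could be assembled. The key names (`classification`, `attributes`), the input dictionary layout, and the helper `build_field_json_props` are assumptions made for illustration; the plugin's actual keys and internals may differ.

```python
import json


def build_field_json_props(array_meta, is_coordinate):
    """Build a jsonProps string for one Zarr array (hypothetical key names)."""
    # Filter GRIB_* attributes, which the plugin is described as dropping as noise.
    attrs = {
        key: value
        for key, value in array_meta.get("attributes", {}).items()
        if not key.startswith("GRIB_")
    }
    props = {
        # Assumed flag values; the real plugin may label these differently.
        "classification": "coordinate" if is_coordinate else "data_variable",
        "shape": list(array_meta["shape"]),
        "chunk_shape": list(array_meta["chunk_shape"]),
        "dimension_names": list(array_meta["dimension_names"]),
        "codecs": list(array_meta.get("codecs", [])),
        "attributes": attrs,
    }
    return json.dumps(props, sort_keys=True)


# Made-up array metadata for a 2 m temperature variable.
example = {
    "shape": (24, 721, 1440),
    "chunk_shape": (1, 721, 1440),
    "dimension_names": ("time", "latitude", "longitude"),
    "codecs": ["bytes", "zstd"],
    "attributes": {"units": "K", "long_name": "2 metre temperature", "GRIB_paramId": 167},
}

print(build_field_json_props(example, is_coordinate=False))
```

Storing the payload as a JSON string keeps array-specific details searchable in DataHub without requiring a custom aspect.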

Repos with no xarray-compatible groups still emit one Dataset (repo landing only) so every catalog entry is discoverable.

Orphan repos — catalog entries whose underlying Icechunk storage no longer exists — are tagged with arraylake_storage_status=orphan.

Install

pip install arraylake-datahub 'acryl-datahub[datahub-rest]'

One-time platform registration

Register the earthmover custom data platform in your DataHub instance once:

datahub put platform \
  --name earthmover \
  --display_name "Earthmover" \
  --logo https://app.earthmover.io/icon.svg

Run

Save the following as recipe.yml:

source:
  type: earthmover
  config:
    # token: ${ARRAYLAKE_TOKEN}      # default: read from env
    # api_url: https://api.earthmover.io
    orgs:                            # omit to crawl every org the token sees
      - earthmover-public
    repo_pattern:
      allow: [".*"]
      # deny: [".*-archive$"]
    env: PROD                        # DataHub fabric

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_GMS_TOKEN}

Then run:

export ARRAYLAKE_TOKEN=ema_xxxxxxxxxxxx
export DATAHUB_GMS_TOKEN=...

datahub ingest -c recipe.yml --preview   # dry run
datahub ingest -c recipe.yml             # for real

The most useful knobs are orgs (allowlist) and repo_pattern (regex allow/deny). Full config below.
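As a rough illustration of how allow/deny filtering behaves, the sketch below mimics the usual semantics: a name is kept if it matches at least one allow regex and no deny regex. It is a standalone approximation, not the plugin's code; the helper `repo_allowed` and the repo names are hypothetical, and DataHub's own AllowDenyPattern class has extra options (such as case handling) that are not modeled here.

```python
import re


def repo_allowed(full_name, allow=(".*",), deny=()):
    """Return True if '<org>/<repo>' matches an allow regex and no deny regex."""
    # Deny rules win over allow rules.
    if any(re.match(pattern, full_name) for pattern in deny):
        return False
    return any(re.match(pattern, full_name) for pattern in allow)


# Hypothetical repo names, matched as <org>/<repo>.
repos = ["earthmover-public/era5", "earthmover-public/era5-archive"]
kept = [name for name in repos if repo_allowed(name, deny=[".*-archive$"])]
print(kept)  # ['earthmover-public/era5']
```

Because patterns are matched against `<org>/<repo>`, an allow entry such as `^earthmover-public/` restricts by org while a deny entry can still exclude individual repos.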

Config

Field              Default                     Notes
token              $ARRAYLAKE_TOKEN            Arraylake API token (ema_*). Read-only is sufficient.
api_url            https://api.earthmover.io   Arraylake catalog API base URL.
web_url            https://app.earthmover.io   Used for externalUrl when a repo's web_url is missing.
orgs               all visible                 Allowlist of org slugs. Omit to crawl every org the token sees.
repo_pattern       allow .*                    AllowDenyPattern matched against <org>/<repo>.
env                PROD                        DataHub fabric segment of the Dataset URN.
platform           earthmover                  Must match the platform registered above.
walk_max_workers   8                           Parallel HTTP fetches per repo when walking groups.
request_timeout_s  30
max_retries        3
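The platform and env settings both end up in the Dataset URN. The sketch below shows the standard DataHub dataset URN layout, assuming the name segment is the Dataset name `<org>/<repo>/<group_path>` exactly as displayed; the plugin may encode the name differently, and the helper `dataset_urn` and the example path are hypothetical.

```python
def dataset_urn(name, platform="earthmover", env="PROD"):
    """Format a DataHub dataset URN from platform, dataset name, and fabric."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical org/repo/group path, used only to show the URN shape.
print(dataset_urn("earthmover-public/my-repo/my/group"))
```

Changing platform or env after an ingest yields different URNs, so previously ingested Datasets would not be updated in place.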

Required Arraylake API access

The token needs read access to:

  • GET /user/orgs
  • GET /orgs/{org}/repos/paginated
  • GET /repos/{org}/{repo}
  • GET /repos/icechunk/{org}/{repo}/dataset-node
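To check a token before a full ingest, one option is to expand the endpoint templates above for a known org and repo and probe each URL. This is a sketch only: the helpers `required_endpoints` and `probe` are hypothetical, the `Authorization: Bearer` header is an assumption about how the API accepts tokens, and some endpoints may need query parameters that are not shown here.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def required_endpoints(api_url, org, repo):
    """Expand the four endpoint templates for one org/repo pair."""
    base = api_url.rstrip("/")
    org_q, repo_q = quote(org, safe=""), quote(repo, safe="")
    return [
        f"{base}/user/orgs",
        f"{base}/orgs/{org_q}/repos/paginated",
        f"{base}/repos/{org_q}/{repo_q}",
        f"{base}/repos/icechunk/{org_q}/{repo_q}/dataset-node",
    ]


def probe(url, token, timeout=30):
    """Return the HTTP status of a GET request (Bearer auth is an assumption)."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# "demo" is a placeholder repo name; printing the URLs makes no network calls.
for url in required_endpoints("https://api.earthmover.io", "earthmover-public", "demo"):
    print(url)
```

Calling `probe(url, token)` on each URL and looking for 401 or 403 responses would indicate which permission is missing.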

Verification

After a successful ingest, the DataHub UI should show:

  • The Earthmover platform with logo.
  • One Dataset per xarray-compatible Zarr group, named <org>/<repo>/<group_path>.
  • A Schema panel listing every coordinate and data variable with units and CF descriptions where available.
  • A clickable "View in Source" link that lands on the Arraylake page for that group, where authentication and querying happen.

Tested against acryl-datahub 0.15.x on Python 3.10–3.13.

Arraylake Support

Email: support@earthmover.io

Email us with any questions, bug reports, or feature requests.

License

Apache-2.0
