
A format-agnostic framework for cataloging and querying bioimaging data (Zarr, OME-TIFF) with Apache Iceberg

iceberg-bioimage

iceberg-bioimage is a Python package for cataloging bioimaging metadata with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.

It is designed for teams that want:

  • Iceberg as the control plane for cataloging, schemas, joins, and snapshots.
  • Cytomining-compatible Parquet warehouses as a first-class export target.
  • Flexible image data planes, including Zarr, OME-TIFF, and OME-Arrow-centered workflows.
  • Adapters that normalize source formats into a single ScanResult model.
  • Integration with external execution/query tools such as DuckDB, xarray, and tifffile.

Key capabilities

  • Scan supported source stores, including Zarr and OME-TIFF, into canonical ScanResult objects.
  • Summarize scanned datasets into user-facing DatasetSummary objects.
  • Publish image_assets and chunk_index metadata tables with PyIceberg.
  • Ingest one or more datasets into Cytotable-compatible Iceberg warehouses.
  • Export new or existing datasets into Cytomining-compatible Parquet warehouses.
  • Validate profile tables against the microscopy join contract.
  • Join scanned image metadata to profile tables through a simple top-level API.
  • Query canonical metadata through optional DuckDB helpers.
  • Load catalog-backed metadata tables into Arrow for downstream joins.

Project layout

src/iceberg_bioimage/
  __init__.py
  api.py
  cli.py
  adapters/
  integrations/
  models/
  publishing/
  validation/

Dependencies

Core runtime dependencies include:

  • pyarrow for Arrow/Parquet table operations
  • pyiceberg for catalog/table publishing
  • tifffile for OME-TIFF metadata scanning when OME-TIFF sources are used
  • zarr for Zarr metadata scanning

Optional integration groups:

  • duckdb for query helpers and examples
  • ome-arrow for Arrow-native tabular image payloads and lazy image access
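
The base package itself installs from PyPI:

pip install iceberg-bioimage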

Getting started

  • If you want a catalog-free first run, start with Cytomining export: iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
  • If you want Iceberg-backed publishing, configure a PyIceberg catalog first; a minimal configuration sketch follows this list.
  • For step-by-step setup, see docs/src/getting-started.md and docs/src/catalog-setup.md.
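
For the Iceberg-backed path, the name passed as --catalog must resolve to a configured PyIceberg catalog. A minimal sketch, assuming a local SQLite-backed catalog (the URIs here are placeholders; docs/src/catalog-setup.md covers the canonical setup):

from pyiceberg.catalog import load_catalog

# Resolve a catalog named "default" backed by a local SQLite database;
# this name is what the CLI's --catalog flag refers to.
catalog = load_catalog(
    "default",
    type="sql",
    uri="sqlite:////tmp/warehouse/pyiceberg_catalog.db",
    warehouse="file:///tmp/warehouse",
)
print(catalog.list_namespaces())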

Zarr support

iceberg-bioimage keeps the user-facing API simple: use scan_store(...) for both local Zarr v2 stores and local Zarr v3 metadata stores.

  • Zarr v2 arrays are scanned through the zarr Python package
  • Local Zarr v3 stores are scanned from zarr.json metadata without requiring a separate API
  • Summaries report the storage variant as zarr-v2 or zarr-v3
  • The base package allows either Zarr 2 or Zarr 3 runtimes so that optional forward-facing integrations can coexist in the same environment
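
A minimal scan, assuming scan_store is importable from the package root as the prose above suggests:

from iceberg_bioimage import scan_store

# One call covers both storage variants; the result is the canonical
# ScanResult model described under "Key capabilities".
result = scan_store("data/experiment.zarr")
print(result)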

Quickstart

from iceberg_bioimage import (
    export_store_to_cytomining_warehouse,
    ingest_stores_to_warehouse,
    join_profiles_with_store,
    register_store,
    summarize_store,
    validate_microscopy_profile_table,
)

registration = register_store(
    "data/experiment.zarr",
    "default",
    "bioimage.cytotable",
)
print(registration.to_dict())

summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())

contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)

# Requires the optional DuckDB integration:
#   pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)

warehouse = ingest_stores_to_warehouse(
    ["data/experiment-a.zarr", "data/experiment-b.zarr"],
    "default",
    "bioimage.cytotable",
)
print(warehouse.to_dict())

cytomining_export = export_store_to_cytomining_warehouse(
    "data/experiment-a.zarr",
    "warehouse-root",
    profiles="data/cells.parquet",
    profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())

The same workflows are available from the CLI:

iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet

Example scripts:

  • examples/quickstart.py for a minimal scan, publish, and validation script
  • examples/catalog_duckdb.py for a catalog-backed query workflow
  • examples/synthetic_workflow.py for a self-contained local workflow

Install optional integrations with:

pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'

DuckDB helpers

DuckDB is supported as an optional integration layer, not as a required engine. The join helpers also accept common pycytominer and coSMicQC-style Metadata_* aliases for dataset_id, image_id, plate_id, well_id, and site_id. If a profile table is missing dataset_id but all rows belong to one dataset, pass profile_dataset_id=... to the high-level join helpers.

import pyarrow as pa

from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table

image_assets = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "array_path": ["0"],
        "uri": ["data/example.zarr"],
    }
)
profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_image_assets_with_profiles(image_assets, profiles)
filtered = query_metadata_table(
    joined,
    filters=[("cell_count", ">", 10)],
)
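
Continuing the example above, the alias handling means pycytominer-style tables can join without renaming. A sketch, where the Metadata_*-prefixed spellings are illustrative assumptions rather than a documented list:

# Hypothetical Metadata_* aliases for the join keys; the helpers are
# documented to accept pycytominer/coSMicQC-style spellings.
aliased_profiles = pa.table(
    {
        "Metadata_dataset_id": ["ds-1"],
        "Metadata_image_id": ["img-1"],
        "cell_count": [42],
    }
)
joined_from_aliases = join_image_assets_with_profiles(image_assets, aliased_profiles)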

Install the optional integration with uv sync --group duckdb.

Cytomining warehouse export

The package supports Cytomining interoperability as a primary workflow. Besides publishing canonical metadata to Iceberg, it can materialize a Parquet-backed warehouse root that tools like pycytominer can consume directly.

from iceberg_bioimage import export_store_to_cytomining_warehouse

result = export_store_to_cytomining_warehouse(
    "data/experiment.zarr",
    "warehouse-root",
    profiles="data/profiles.parquet",
    profile_dataset_id="experiment",
)
print(result.to_dict())

This writes one or more of:

  • images/image_assets/
  • images/chunk_index/
  • profiles/joined_profiles/

It can also append downstream Cytomining tables into the same warehouse root, using namespaces that match table semantics, for example:

  • profiles/pycytominer_profiles/
  • quality_control/cosmicqc_profiles/
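
Because the export is plain Parquet, any Parquet-aware reader can consume the result. A minimal sketch with pyarrow.dataset, assuming the joined_profiles path from the layout above:

import pyarrow.dataset as ds

# Read the exported joined profiles back as a single Arrow table.
joined = ds.dataset("warehouse-root/profiles/joined_profiles").to_table()
print(joined.num_rows)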

OME-Arrow helpers

OME-Arrow is available as an optional forward-facing integration for tabular image payloads stored in Arrow-compatible formats. Projects may also choose an OME-Arrow-first workflow for source image handling.

from iceberg_bioimage import create_ome_arrow, scan_ome_arrow

oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")

Install it with uv sync --group ome-arrow or pip install 'iceberg-bioimage[ome-arrow]'.

Local synthetic workflow

For a catalog-free onboarding path, examples/synthetic_workflow.py creates a small Zarr store and profile table, validates the join contract, derives canonical metadata rows, and joins them with the optional DuckDB helpers.

Run it with:

uv run --group duckdb python examples/synthetic_workflow.py

Catalog-backed query workflow

If you already published canonical metadata tables, you can read them from a catalog and join them to analysis outputs directly:

import pyarrow as pa

from iceberg_bioimage import join_catalog_image_assets_with_profiles

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_catalog_image_assets_with_profiles(
    "default",
    "bioimage.cytotable",
    profiles,
    chunk_index_table="chunk_index",
)
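
The joined Arrow table can then be filtered with the same optional query helper shown under DuckDB helpers (requires the duckdb group):

from iceberg_bioimage import query_metadata_table

# Keep only rows with more than 10 cells, reusing the tuple filter syntax.
high_count = query_metadata_table(joined, filters=[("cell_count", ">", 10)])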

Documentation

  • docs/src/getting-started.md for first-time setup
  • docs/src/catalog-setup.md for catalog configuration
  • docs/src/cytomining.md for warehouse export workflows
  • docs/src/warehouse-spec.md for the warehouse interoperability specification
  • docs/src/workflow.md for CLI-driven end-to-end examples

Troubleshooting

  • DuckDB helpers require the optional duckdb dependency group: install with pip install 'iceberg-bioimage[duckdb]' or uv sync --group duckdb.
  • Profiles do not satisfy the microscopy join contract: run iceberg-bioimage validate-contract ... and pass --profile-dataset-id when dataset_id is missing but implied.
  • Errors like Missing table: ... on catalog-backed paths: verify the catalog configuration, namespace, and table names.

Architecture note

The package focuses on metadata scanning, publishing, Cytomining warehouse export, validation, and joins. OME-Arrow remains the place for Arrow-native image payload handling and lazy image access.
