A format-agnostic framework for cataloging and querying bioimaging data (Zarr, OME-TIFF) with Apache Iceberg

iceberg-bioimage

iceberg-bioimage is a Python package for cataloging bioimaging metadata with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.

It is designed for teams that want:

  • Iceberg as the control plane for cataloging, schemas, joins, and snapshots.
  • Cytomining-compatible Parquet warehouses as a first-class export target.
  • Flexible image data planes, including Zarr, OME-TIFF, and OME-Arrow-centered workflows.
  • Adapters that normalize source formats into a single ScanResult model.
  • Integration with external execution/query tools such as DuckDB, xarray, and tifffile.

Key capabilities

  • Scan supported source stores, including Zarr and OME-TIFF, into canonical ScanResult objects.
  • Summarize scanned datasets into user-facing DatasetSummary objects.
  • Publish image_assets and chunk_index metadata tables with PyIceberg.
  • Ingest one or more datasets into Cytotable-compatible Iceberg warehouses.
  • Export new or existing datasets into Cytomining-compatible Parquet warehouses.
  • Validate profile tables against the microscopy join contract.
  • Join scanned image metadata to profile tables through a simple top-level API.
  • Query canonical metadata through optional DuckDB helpers.
  • Load catalog-backed metadata tables into Arrow for downstream joins.

Project layout

src/iceberg_bioimage/
  __init__.py
  api.py
  cli.py
  adapters/
  integrations/
  models/
  publishing/
  validation/

Dependencies

Core runtime dependencies include:

  • pyarrow for Arrow/Parquet table operations
  • pyiceberg for catalog/table publishing
  • tifffile for OME-TIFF metadata scanning when OME-TIFF sources are used
  • zarr for Zarr metadata scanning

Optional integration groups:

  • duckdb for query helpers and examples
  • ome-arrow for Arrow-native tabular image payloads and lazy image access

Getting started

  • If you want a catalog-free first run, start with Cytomining export: iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
  • If you want Iceberg-backed publishing, configure a PyIceberg catalog first.
  • For step-by-step setup, see docs/src/getting-started.md and docs/src/catalog-setup.md.

Zarr support

iceberg-bioimage keeps the user-facing API simple: use scan_store(...) for both local Zarr v2 stores and local Zarr v3 metadata stores.

  • Zarr v2 arrays are scanned through the zarr Python package
  • Local Zarr v3 stores are scanned from zarr.json metadata without requiring a separate API
  • Summaries report the storage variant as zarr-v2 or zarr-v3
  • The base package accepts either a Zarr 2 or a Zarr 3 runtime, so optional forward-facing integrations can coexist in the same environment
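
As a rough illustration of the variant reporting, here is a stdlib-only sketch that classifies a local store the way the summaries label it. The helper name detect_zarr_variant is hypothetical; scan_store and summarize_store do the real work:

```python
import json
from pathlib import Path

def detect_zarr_variant(store_path):
    """Hypothetical sketch: classify a local store as zarr-v2 or zarr-v3.

    Zarr v3 stores declare zarr_format inside a zarr.json file, while
    v2 stores use .zgroup/.zarray metadata files.
    """
    root = Path(store_path)
    v3_meta = root / "zarr.json"
    if v3_meta.is_file():
        meta = json.loads(v3_meta.read_text())
        if meta.get("zarr_format") == 3:
            return "zarr-v3"
    if (root / ".zgroup").is_file() or (root / ".zarray").is_file():
        return "zarr-v2"
    return "unknown"
```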

Quickstart

from iceberg_bioimage import (
    export_store_to_cytomining_warehouse,
    ingest_stores_to_warehouse,
    join_profiles_with_store,
    register_store,
    summarize_store,
    validate_microscopy_profile_table,
)

registration = register_store(
    "data/experiment.zarr",
    "default",
    "bioimage.cytotable",
)
print(registration.to_dict())

summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())

contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)

# Requires the optional DuckDB integration:
#   pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)

warehouse = ingest_stores_to_warehouse(
    ["data/experiment-a.zarr", "data/experiment-b.zarr"],
    "default",
    "bioimage.cytotable",
)
print(warehouse.to_dict())

cytomining_export = export_store_to_cytomining_warehouse(
    "data/experiment-a.zarr",
    "warehouse-root",
    profiles="data/cells.parquet",
    profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())

The equivalent CLI workflow:

iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet

For complete scripts, see:

  • examples/quickstart.py for a minimal scan, publish, and validation script
  • examples/catalog_duckdb.py for a catalog-backed query workflow
  • examples/synthetic_workflow.py for a self-contained local workflow

Install optional integrations with:

pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'

DuckDB helpers

DuckDB is supported as an optional integration layer, not as a required engine. The join helpers also accept common pycytominer and coSMicQC-style Metadata_* aliases for dataset_id, image_id, plate_id, well_id, and site_id. If a profile table is missing dataset_id but all rows belong to one dataset, pass profile_dataset_id=... to the high-level join helpers.

import pyarrow as pa

from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table

image_assets = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "array_path": ["0"],
        "uri": ["data/example.zarr"],
    }
)
profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_image_assets_with_profiles(image_assets, profiles)
filtered = query_metadata_table(
    joined,
    filters=[("cell_count", ">", 10)],
)

Install the optional integration with uv sync --group duckdb.

Cytomining warehouse export

The package supports Cytomining interoperability as a primary workflow. Besides publishing canonical metadata to Iceberg, it can materialize a Parquet-backed warehouse root that tools like pycytominer can consume directly.

from iceberg_bioimage import export_store_to_cytomining_warehouse

result = export_store_to_cytomining_warehouse(
    "data/experiment.zarr",
    "warehouse-root",
    profiles="data/profiles.parquet",
    profile_dataset_id="experiment",
)
print(result.to_dict())

This writes one or more of:

  • images/image_assets/
  • images/chunk_index/
  • profiles/joined_profiles/

It can also append downstream Cytomining tables into the same warehouse root, using namespaces that match table semantics, for example:

  • profiles/pycytominer_profiles/
  • quality_control/cosmicqc_profiles/
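
To make the layout concrete, a small stdlib sketch that reports which of these table directories exist under a warehouse root (the directory names are the ones listed above; which ones are actually written depends on the inputs you pass):

```python
from pathlib import Path

# Table directories named in this README; the downstream namespaces are
# only appended by later Cytomining steps and may be absent.
EXPECTED_TABLES = [
    "images/image_assets",
    "images/chunk_index",
    "profiles/joined_profiles",
    "profiles/pycytominer_profiles",
    "quality_control/cosmicqc_profiles",
]

def present_tables(warehouse_root):
    """Return the subset of expected table directories that exist."""
    root = Path(warehouse_root)
    return [rel for rel in EXPECTED_TABLES if (root / rel).is_dir()]
```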

OME-Arrow helpers

OME-Arrow is available as an optional forward-facing integration for tabular image payloads stored in Arrow-compatible formats. Projects may also choose an OME-Arrow-first workflow for source image handling.

from iceberg_bioimage import create_ome_arrow, scan_ome_arrow

oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")

Install it with uv sync --group ome-arrow or pip install 'iceberg-bioimage[ome-arrow]'.

Local synthetic workflow

For a catalog-free onboarding path, examples/synthetic_workflow.py creates a small Zarr store and profile table, validates the join contract, derives canonical metadata rows, and joins them with the optional DuckDB helpers.

Run it with:

uv run --group duckdb python examples/synthetic_workflow.py

Catalog-backed query workflow

If you already published canonical metadata tables, you can read them from a catalog and join them to analysis outputs directly:

import pyarrow as pa

from iceberg_bioimage import join_catalog_image_assets_with_profiles

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_catalog_image_assets_with_profiles(
    "default",
    "bioimage.cytotable",
    profiles,
    chunk_index_table="chunk_index",
)

Documentation

  • docs/src/getting-started.md for first-time setup
  • docs/src/catalog-setup.md for catalog configuration
  • docs/src/cytomining.md for warehouse export workflows
  • docs/src/warehouse-spec.md for the warehouse interoperability specification
  • docs/src/workflow.md for CLI-driven end-to-end examples

Troubleshooting

  • DuckDB helpers require the optional duckdb dependency group: install with pip install 'iceberg-bioimage[duckdb]' or uv sync --group duckdb.
  • Profiles do not satisfy the microscopy join contract: run iceberg-bioimage validate-contract ... and pass --profile-dataset-id when dataset_id is missing but implied.
  • Missing table: ... for catalog-backed paths: verify catalog configuration, namespace, and table names.

Architecture note

The package focuses on metadata scanning, publishing, Cytomining warehouse export, validation, and joins. OME-Arrow remains the place for Arrow-native image payload handling and lazy image access.
