A format-agnostic framework for cataloging and querying bioimaging data (Zarr, OME-TIFF) with Apache Iceberg

iceberg-bioimage

iceberg-bioimage is a Python package for cataloging bioimaging metadata with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.

It is designed for teams that want:

  • Iceberg as the control plane for cataloging, schemas, joins, and snapshots.
  • Cytomining-compatible Parquet warehouses as a first-class export target.
  • Flexible image data planes, including Zarr, OME-TIFF, and OME-Arrow-centered workflows.
  • Adapters that normalize source formats into a single ScanResult model.
  • Integration with external execution/query tools such as DuckDB, xarray, and tifffile.

Key capabilities

  • Scan supported source stores, including Zarr and OME-TIFF, into canonical ScanResult objects.
  • Summarize scanned datasets into user-facing DatasetSummary objects.
  • Publish image_assets and chunk_index metadata tables with PyIceberg.
  • Ingest one or more datasets into Cytotable-compatible Iceberg warehouses.
  • Export new or existing datasets into Cytomining-compatible Parquet warehouses.
  • Validate profile tables against the microscopy join contract.
  • Join scanned image metadata to profile tables through a simple top-level API.
  • Query canonical metadata through optional DuckDB helpers.
  • Load catalog-backed metadata tables into Arrow for downstream joins.

Project layout

src/iceberg_bioimage/
  __init__.py
  api.py
  cli.py
  adapters/
  integrations/
  models/
  publishing/
  validation/

Dependencies

Core runtime dependencies include:

  • pyarrow for Arrow/Parquet table operations
  • pyiceberg for catalog/table publishing
  • tifffile for OME-TIFF metadata scanning when OME-TIFF sources are used
  • zarr for Zarr metadata scanning

Optional integration groups:

  • duckdb for query helpers and examples
  • ome-arrow for Arrow-native tabular image payloads and lazy image access

Getting started

  • If you want a catalog-free first run, start with Cytomining export: iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
  • If you want Iceberg-backed publishing, configure a PyIceberg catalog first.
  • For step-by-step setup, see docs/src/getting-started.md and docs/src/catalog-setup.md.

Zarr support

iceberg-bioimage keeps the user-facing API simple: use scan_store(...) for both local Zarr v2 stores and local Zarr v3 metadata stores.

  • Zarr v2 arrays are scanned through the zarr Python package
  • Local Zarr v3 stores are scanned from zarr.json metadata without requiring a separate API
  • Summaries report the storage variant as zarr-v2 or zarr-v3
  • The base package accepts either a Zarr 2 or a Zarr 3 runtime, so optional forward-facing integrations can coexist in the same environment
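
As a rough illustration of the variant reporting, here is a stdlib-only sketch that classifies a local store the way the summaries label it. The helper name detect_zarr_variant is hypothetical; scan_store and summarize_store do the real work:

```python
import json
from pathlib import Path

def detect_zarr_variant(store_path):
    """Hypothetical sketch: classify a local store as zarr-v2 or zarr-v3.

    Zarr v3 stores declare zarr_format inside a zarr.json file, while
    v2 stores use .zgroup/.zarray metadata files.
    """
    root = Path(store_path)
    v3_meta = root / "zarr.json"
    if v3_meta.is_file():
        meta = json.loads(v3_meta.read_text())
        if meta.get("zarr_format") == 3:
            return "zarr-v3"
    if (root / ".zgroup").is_file() or (root / ".zarray").is_file():
        return "zarr-v2"
    return "unknown"
```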

Quickstart

from iceberg_bioimage import (
    export_store_to_cytomining_warehouse,
    ingest_stores_to_warehouse,
    join_profiles_with_store,
    register_store,
    summarize_store,
    validate_microscopy_profile_table,
)

registration = register_store(
    "data/experiment.zarr",
    "default",
    "bioimage.cytotable",
)
print(registration.to_dict())

summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())

contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)

# Requires the optional DuckDB integration:
#   pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)

warehouse = ingest_stores_to_warehouse(
    ["data/experiment-a.zarr", "data/experiment-b.zarr"],
    "default",
    "bioimage.cytotable",
)
print(warehouse.to_dict())

cytomining_export = export_store_to_cytomining_warehouse(
    "data/experiment-a.zarr",
    "warehouse-root",
    profiles="data/cells.parquet",
    profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())

The equivalent CLI workflow:

iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet

For complete scripts, see:

  • examples/quickstart.py for a minimal scan, publish, and validation script
  • examples/catalog_duckdb.py for a catalog-backed query workflow
  • examples/synthetic_workflow.py for a self-contained local workflow

Install optional integrations with:

pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'

DuckDB helpers

DuckDB is supported as an optional integration layer, not as a required engine. The join helpers also accept common pycytominer and coSMicQC-style Metadata_* aliases for dataset_id, image_id, plate_id, well_id, and site_id. If a profile table is missing dataset_id but all rows belong to one dataset, pass profile_dataset_id=... to the high-level join helpers.

import pyarrow as pa

from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table

image_assets = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "array_path": ["0"],
        "uri": ["data/example.zarr"],
    }
)
profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_image_assets_with_profiles(image_assets, profiles)
filtered = query_metadata_table(
    joined,
    filters=[("cell_count", ">", 10)],
)

Install the optional integration with uv sync --group duckdb.

Cytomining warehouse export

The package supports Cytomining interoperability as a primary workflow. Besides publishing canonical metadata to Iceberg, it can materialize a Parquet-backed warehouse root that tools like pycytominer can consume directly.

from iceberg_bioimage import export_store_to_cytomining_warehouse

result = export_store_to_cytomining_warehouse(
    "data/experiment.zarr",
    "warehouse-root",
    profiles="data/profiles.parquet",
    profile_dataset_id="experiment",
)
print(result.to_dict())

This writes one or more of:

  • images/image_assets/
  • images/chunk_index/
  • profiles/joined_profiles/

It can also append downstream Cytomining tables into the same warehouse root, using namespaces that match table semantics, for example:

  • profiles/pycytominer_profiles/
  • quality_control/cosmicqc_profiles/
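
To make the layout concrete, a small stdlib sketch that reports which of these table directories exist under a warehouse root (the directory names are the ones listed above; which ones are actually written depends on the inputs you pass):

```python
from pathlib import Path

# Table directories named in this README; the downstream namespaces are
# only appended by later Cytomining steps and may be absent.
EXPECTED_TABLES = [
    "images/image_assets",
    "images/chunk_index",
    "profiles/joined_profiles",
    "profiles/pycytominer_profiles",
    "quality_control/cosmicqc_profiles",
]

def present_tables(warehouse_root):
    """Return the subset of expected table directories that exist."""
    root = Path(warehouse_root)
    return [rel for rel in EXPECTED_TABLES if (root / rel).is_dir()]
```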

OME-Arrow helpers

OME-Arrow is available as an optional forward-facing integration for tabular image payloads stored in Arrow-compatible formats. Projects may also choose an OME-Arrow-first workflow for source image handling.

from iceberg_bioimage import create_ome_arrow, scan_ome_arrow

oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")

Install it with uv sync --group ome-arrow or pip install 'iceberg-bioimage[ome-arrow]'.

Local synthetic workflow

For a catalog-free onboarding path, examples/synthetic_workflow.py creates a small Zarr store and profile table, validates the join contract, derives canonical metadata rows, and joins them with the optional DuckDB helpers.

Run it with:

uv run --group duckdb python examples/synthetic_workflow.py

Catalog-backed query workflow

If you already published canonical metadata tables, you can read them from a catalog and join them to analysis outputs directly:

import pyarrow as pa

from iceberg_bioimage import join_catalog_image_assets_with_profiles

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_catalog_image_assets_with_profiles(
    "default",
    "bioimage.cytotable",
    profiles,
    chunk_index_table="chunk_index",
)

Documentation

  • docs/src/getting-started.md for first-time setup
  • docs/src/catalog-setup.md for catalog configuration
  • docs/src/cytomining.md for warehouse export workflows
  • docs/src/warehouse-spec.md for the warehouse interoperability specification
  • docs/src/workflow.md for CLI-driven end-to-end examples

Troubleshooting

  • DuckDB helpers require the optional duckdb dependency group: install with pip install 'iceberg-bioimage[duckdb]' or uv sync --group duckdb.
  • Profiles do not satisfy the microscopy join contract: run iceberg-bioimage validate-contract ... and pass --profile-dataset-id when dataset_id is missing but implied.
  • Missing table: ... for catalog-backed paths: verify catalog configuration, namespace, and table names.

Architecture note

The package focuses on metadata scanning, publishing, Cytomining warehouse export, validation, and joins. OME-Arrow remains the place for Arrow-native image payload handling and lazy image access.
