A format-agnostic framework for cataloging and querying bioimaging data (Zarr, OME-TIFF) with Apache Iceberg
Project description
iceberg-bioimage
iceberg-bioimage is a Python package for cataloging bioimaging metadata with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.
It is designed for teams that want:
- Iceberg is the control plane for cataloging, schemas, joins, and snapshots.
- Cytomining-compatible Parquet warehouses are a first-class export target.
- Flexible image data planes, including Zarr, OME-TIFF, and OME-Arrow-centered workflows.
- Adapters that normalize source formats into a single
ScanResultmodel. - Integration with external execution/query tools such as DuckDB, xarray, and tifffile.
Key capabilities
- Scan supported source stores, including Zarr and OME-TIFF, into canonical
ScanResultobjects. - Summarize scanned datasets into user-facing
DatasetSummaryobjects. - Publish
image_assetsandchunk_indexmetadata tables with PyIceberg. - Ingest one or more datasets into Cytotable-compatible Iceberg warehouses.
- Export new or existing datasets into Cytomining-compatible Parquet warehouses.
- Validate profile tables against the microscopy join contract.
- Join scanned image metadata to profile tables through a simple top-level API.
- Query canonical metadata through optional DuckDB helpers.
- Load catalog-backed metadata tables into Arrow for downstream joins.
Project layout
src/iceberg_bioimage/
__init__.py
api.py
cli.py
adapters/
integrations/
models/
publishing/
validation/
Dependencies
Core runtime dependencies include:
pyarrowfor Arrow/Parquet table operationspyicebergfor catalog/table publishingtifffilefor OME-TIFF metadata scanning when OME-TIFF sources are usedzarrfor Zarr metadata scanning
Optional integration groups:
duckdbfor query helpers and examplesome-arrowfor Arrow-native tabular image payloads and lazy image access
Getting started
- If you want a catalog-free first run, start with Cytomining export:
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr - If you want Iceberg-backed publishing, configure a PyIceberg catalog first.
- For step-by-step setup, see
docs/src/getting-started.mdanddocs/src/catalog-setup.md.
Zarr support
iceberg-bioimage keeps the user-facing API simple: use scan_store(...) for
both local Zarr v2 stores and local Zarr v3 metadata stores.
- Zarr v2 arrays are scanned through the
zarrPython package - Local Zarr v3 stores are scanned from
zarr.jsonmetadata without requiring a separate API - Summaries report the storage variant as
zarr-v2orzarr-v3 - The base package allows either Zarr 2 or Zarr 3 runtimes so that optional forward-facing integrations can coexist in the same environment
Quickstart
from iceberg_bioimage import (
export_store_to_cytomining_warehouse,
ingest_stores_to_warehouse,
join_profiles_with_store,
register_store,
summarize_store,
validate_microscopy_profile_table,
)
registration = register_store(
"data/experiment.zarr",
"default",
"bioimage.cytotable",
)
print(registration.to_dict())
summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())
contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)
# Requires the optional DuckDB integration:
# pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)
warehouse = ingest_stores_to_warehouse(
["data/experiment-a.zarr", "data/experiment-b.zarr"],
"default",
"bioimage.cytotable",
)
print(warehouse.to_dict())
cytomining_export = export_store_to_cytomining_warehouse(
"data/experiment-a.zarr",
"warehouse-root",
profiles="data/cells.parquet",
profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())
iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet
examples/quickstart.pyfor a minimal scan, publish, and validation scriptexamples/catalog_duckdb.pyfor a catalog-backed query workflowexamples/synthetic_workflow.pyfor a self-contained local workflow
Install optional integrations with:
pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'
DuckDB helpers
DuckDB is supported as an optional integration layer, not as a required engine.
The join helpers also accept common pycytominer and coSMicQC-style
Metadata_* aliases for dataset_id, image_id, plate_id, well_id, and
site_id. If a profile table is missing dataset_id but all rows belong to
one dataset, pass profile_dataset_id=... to the high-level join helpers.
import pyarrow as pa
from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table
image_assets = pa.table(
{
"dataset_id": ["ds-1"],
"image_id": ["img-1"],
"array_path": ["0"],
"uri": ["data/example.zarr"],
}
)
profiles = pa.table(
{
"dataset_id": ["ds-1"],
"image_id": ["img-1"],
"cell_count": [42],
}
)
joined = join_image_assets_with_profiles(image_assets, profiles)
filtered = query_metadata_table(
joined,
filters=[("cell_count", ">", 10)],
)
Install the optional integration with uv sync --group duckdb.
Cytomining warehouse export
The package supports Cytomining interoperability as a primary workflow.
Besides publishing canonical metadata to Iceberg, it can materialize a
Parquet-backed warehouse root that tools like pycytominer can consume
directly.
from iceberg_bioimage import export_store_to_cytomining_warehouse
result = export_store_to_cytomining_warehouse(
"data/experiment.zarr",
"warehouse-root",
profiles="data/profiles.parquet",
profile_dataset_id="experiment",
)
print(result.to_dict())
This writes one or more of:
images/image_assets/images/chunk_index/profiles/joined_profiles/
It can also append downstream Cytomining tables into the same warehouse root, using namespaces that match table semantics, for example:
profiles/pycytominer_profiles/quality_control/cosmicqc_profiles/
OME-Arrow helpers
OME-Arrow is available as an optional forward-facing integration for tabular image payloads stored in Arrow-compatible formats. Projects may also choose an OME-Arrow-first workflow for source image handling.
from iceberg_bioimage import create_ome_arrow, scan_ome_arrow
oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")
Install it with uv sync --group ome-arrow or
pip install 'iceberg-bioimage[ome-arrow]'.
Local synthetic workflow
For a catalog-free onboarding path, examples/synthetic_workflow.py creates a
small Zarr store and profile table, validates the join contract, derives
canonical metadata rows, and joins them with the optional DuckDB helpers.
Run it with:
uv run --group duckdb python examples/synthetic_workflow.py
Catalog-backed query workflow
If you already published canonical metadata tables, you can read them from a catalog and join them to analysis outputs directly:
import pyarrow as pa
from iceberg_bioimage import join_catalog_image_assets_with_profiles
profiles = pa.table(
{
"dataset_id": ["ds-1"],
"image_id": ["img-1"],
"cell_count": [42],
}
)
joined = join_catalog_image_assets_with_profiles(
"default",
"bioimage.cytotable",
profiles,
chunk_index_table="chunk_index",
)
Documentation
docs/src/getting-started.mdfor first-time setupdocs/src/catalog-setup.mdfor catalog configurationdocs/src/cytomining.mdfor warehouse export workflowsdocs/src/warehouse-spec.mdfor the warehouse interoperability specificationdocs/src/workflow.mdfor CLI-driven end-to-end examples
Troubleshooting
DuckDB helpers require the optional duckdb dependency group: install withpip install 'iceberg-bioimage[duckdb]'oruv sync --group duckdb.Profiles do not satisfy the microscopy join contract: runiceberg-bioimage validate-contract ...and pass--profile-dataset-idwhendataset_idis missing but implied.Missing table: ...for catalog-backed paths: verify catalog configuration, namespace, and table names.
Architecture note
The package focuses on metadata scanning, publishing, Cytomining warehouse export, validation, and joins. OME-Arrow remains the place for Arrow-native image payload handling and lazy image access.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file iceberg_bioimage-0.0.3.tar.gz.
File metadata
- Download URL: iceberg_bioimage-0.0.3.tar.gz
- Upload date:
- Size: 528.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b17f9c1447d2db9497bc74e8aa577430dd84649f3116fb2460aca5c480bf5c35
|
|
| MD5 |
f6e0a18b4042fb2c981ea1d08f16e4d2
|
|
| BLAKE2b-256 |
e20aead0110545b29d0b1eff71ae9c6790806e1ddcb2fced4559758d16eb725a
|
Provenance
The following attestation bundles were made for iceberg_bioimage-0.0.3.tar.gz:
Publisher:
publish-pypi.yml on WayScience/iceberg-bioimage
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
iceberg_bioimage-0.0.3.tar.gz -
Subject digest:
b17f9c1447d2db9497bc74e8aa577430dd84649f3116fb2460aca5c480bf5c35 - Sigstore transparency entry: 1353558940
- Sigstore integration time:
-
Permalink:
WayScience/iceberg-bioimage@071e6d2d0e92f929fe2c889a9b7e6532e16ff0a7 -
Branch / Tag:
refs/tags/v0.0.3 - Owner: https://github.com/WayScience
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@071e6d2d0e92f929fe2c889a9b7e6532e16ff0a7 -
Trigger Event:
release
-
Statement type:
File details
Details for the file iceberg_bioimage-0.0.3-py3-none-any.whl.
File metadata
- Download URL: iceberg_bioimage-0.0.3-py3-none-any.whl
- Upload date:
- Size: 37.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5190f20ca925106e104aebb6384699baf8b17928f7d62d11263f5b612d7ba3cb
|
|
| MD5 |
f784c2f9898a1c73cbb05eed063da5a4
|
|
| BLAKE2b-256 |
84b41d72619d96790325b102552dc76657a8ff683a2d6d428878f6ff756f396e
|
Provenance
The following attestation bundles were made for iceberg_bioimage-0.0.3-py3-none-any.whl:
Publisher:
publish-pypi.yml on WayScience/iceberg-bioimage
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
iceberg_bioimage-0.0.3-py3-none-any.whl -
Subject digest:
5190f20ca925106e104aebb6384699baf8b17928f7d62d11263f5b612d7ba3cb - Sigstore transparency entry: 1353559060
- Sigstore integration time:
-
Permalink:
WayScience/iceberg-bioimage@071e6d2d0e92f929fe2c889a9b7e6532e16ff0a7 -
Branch / Tag:
refs/tags/v0.0.3 - Owner: https://github.com/WayScience
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@071e6d2d0e92f929fe2c889a9b7e6532e16ff0a7 -
Trigger Event:
release
-
Statement type: