Skip to main content

GPU connected-component labeling for 2D/3D CuPy and DLPack arrays via the Block-based Komura Equivalence algorithm

Project description

bkeccl

GPU-accelerated connected-component labeling for 2D and 3D arrays using the Block-based Komura Equivalence (BKE) algorithm (IEEE TPDS 2019). bkeccl is a pure-Python package built on CuPy: the CUDA kernels are compiled on the device at runtime, so there is no compiled extension and no build step.

  • 2D 8-connectivity (2x2 blocks) and 3D 26-connectivity (2x2x2 blocks)
  • Works with cupy.ndarray or any CUDA DLPack tensor (PyTorch, JAX, Triton, ...) imported zero-copy, so it drops into an existing GPU pipeline without a host round-trip
  • uint8 and bool inputs used directly; other dtypes reduced to their non-zero mask
  • Dense (1..N) or raw (native sparse) label output
  • Per-component boolean mask stacks in one call
  • A shape-specialized runner for locked-down hot loops

When to use bkeccl

bkeccl is optimized for one workflow: full-connectivity binary CCL on a single GPU array, kept on-device, called repeatedly at a fixed shape. Inside that workflow it is very fast. It deliberately omits the flexibility that workflow does not need (4-connectivity, batching, CPU arrays, on-the-fly shapes).

If you need that flexibility, use cuCIM's cucim.skimage.measure.label: it is a more general, scikit-image-compatible GPU labeler (selectable connectivity, n-dimensional, broader dtype/array handling). It is slightly slower than bkeccl on bkeccl's specialized path but still a fast GPU implementation, and the better choice when you need the generality.

Installation

bkeccl does not depend on a pinned CuPy. CuPy ships as a CUDA-version-specific wheel, so you install the one matching your CUDA toolkit first, then install bkeccl.

Requirements

  • Python >= 3.10
  • An NVIDIA GPU and a working CUDA runtime
  • A CuPy wheel matching your CUDA major version

Step 1: install the CuPy wheel for your CUDA

With pip:

pip install cupy-cuda13x      # CUDA 13.x
# or
pip install cupy-cuda12x      # CUDA 12.x

For a uv-managed project:

uv add cupy-cuda13x      # CUDA 13.x
# or
uv add cupy-cuda12x      # CUDA 12.x

For other CUDA versions or the bundled-toolkit [ctk] extra, see the CuPy install guide. Importing bkeccl without CuPy raises an ImportError that names this step.

Step 2: install bkeccl

pip install bkeccl

For a uv-managed project:

uv add bkeccl

Install from source

git clone https://github.com/belfner/bkeccl
cd bkeccl
pip install .

Usage

Goal Function
Label a 2D image ccl(img)
Label a 3D volume ccl(vol, dimensionality="3D")
Native sparse labels (skip the densify pass) ccl(img, output_mode="raw")
One boolean mask per component ccl_masks(img)
Masks from a label map you already have masks_from_labels(labels, count)
Skip the per-call output allocation ccl(img, label_output=buf)
Repeated fixed-shape calls in a hot loop make_ccl_2D / make_ccl_3D

Core workflow

Labeling an array and reading the result. Most code needs only this section.

Label a 2D image

import cupy
from bkeccl import ccl

# Binary image on the GPU. Any non-zero value is foreground; 0 is background.
img = cupy.asarray(
    [[1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 1, 1, 1]],
    dtype=cupy.uint8,
)

labels, count = ccl(img)
# labels: int32 (H, W), background 0, components renumbered 1..N (dense).
# count:  0-dim int32 device array holding N (read with int(count)).

ccl returns (labels, count) in the default output_mode="dense". count is kept on the device; calling int(count) is the one host synchronization.

Label a 3D volume

Pass dimensionality="3D" (26-connectivity on 2x2x2 blocks):

vol = cupy.asarray(volume, dtype=cupy.uint8)        # (D, H, W)
labels, count = ccl(vol, dimensionality="3D")

Use a tensor from another framework

Any GPU-resident object exposing the DLPack protocol is imported zero-copy, so a PyTorch/JAX/Triton CUDA tensor can be passed directly:

import torch
from bkeccl import ccl

t = torch.zeros((512, 512), dtype=torch.uint8, device="cuda")
t[100:200, 100:200] = 1

labels, count = ccl(t)        # zero-copy import, no host round-trip

Host-resident tensors are rejected with a TypeError rather than silently staged onto the device.

Inspecting components

Native (raw) labels

output_mode="raw" skips the densify pass and returns the labeler's native sparse, root-derived IDs (positive, not a contiguous 1..N range), as a single array:

raw = ccl(img, output_mode="raw")     # int32 (H, W), sparse positive IDs

Use raw when you will reduce or threshold the labels yourself and do not need contiguous IDs; use the default dense mode when you want 1..N.

Component masks

from bkeccl import ccl_masks

masks = ccl_masks(img)                 # bool (N, H, W), one plane per component

ccl_masks labels and expands in one call, staying on the GPU; the only host sync is the count read needed to size the stack. For a label map you already have, expand it directly:

from bkeccl import ccl, masks_from_labels

labels, count = ccl(img)
masks = masks_from_labels(labels, count)   # bool (N, H, W)

For thousands of components the dense (N, *spatial) stack is large; keep the label map instead when you do not need per-component planes.

Performance and reuse

Opt in once the basic workflow works and you are calling it repeatedly.

Buffer reuse

Supply a pre-allocated int32 output buffer to skip the per-call output allocation. It is keyword-only and the result aliases it:

import cupy
from bkeccl import ccl

buf = cupy.empty((512, 512), dtype=cupy.int32)

results = []
for img in image_batch:
    labels, count = ccl(img, label_output=buf)
    # `labels` IS `buf` (aliased). Retaining it across iterations requires a
    # copy, or every stored result points at the last image.
    results.append(labels.copy())

The buffer must be C-contiguous int32 with the same shape as the input.

Shape-specialized runner

For a hot loop with a fixed shape, make_ccl_2D / make_ccl_3D bake the launch grid, kernel handles, and all scratch once and return a reusable callable. The factory is memoized per argument set:

from bkeccl import make_ccl_2D

runner = make_ccl_2D((512, 512), algorithm="bke", output_mode="dense")
for img in image_batch:
    labels, count = runner(img)        # consume or copy before the next call

The runner performs no input validation and reuses one process-owned output buffer: the caller must uphold the shape/dtype/device contract, and must consume or copy the result before the next call. It is not thread- or stream-safe. Use ccl unless you can guarantee that contract. Full contract in REFERENCE.md.

Algorithm variants

labels, count = ccl(img, algorithm="bke")      # default, standard BKE
labels, count = ccl(img, algorithm="bke_ic")   # inline path compression

Both produce the same partition. bke is the default; benchmark bke_ic against it on your workload if variant choice matters.

API Reference

Full per-function signatures, parameters, return types, and raised exceptions are in REFERENCE.md.

Algorithm

bkeccl implements the Block-based Komura Equivalence (BKE) algorithm:

Stefano Allegretti, Federico Bolelli, Michele Cancilla, Costantino Grana. "Optimized Block-Based Algorithms to Label Connected Components on GPUs." IEEE Transactions on Parallel and Distributed Systems, 31(2):423-438, 2019. IEEE Xplore - doi:10.1109/TPDS.2019.2934683 - (open-access PDF)

BKE operates on 2x2 blocks (2D) or 2x2x2 blocks (3D) rather than individual elements, reducing memory accesses and atomic operations. The pipeline is Init (block connectivity) -> Compress (path compression) -> Reduction (union for remaining connections) -> Compress -> FinalLabel (block labels to elements). The bke_ic variant updates the parent at each traversal step (inline compression) for faster convergence. The block structure makes 8-connectivity (2D) and 26-connectivity (3D) the native, fixed connectivity. The 3D variant follows Section 5.3 ("3D extension") of the same paper.

Limitations

  • CUDA-only. Inputs must be GPU-resident; host arrays are rejected with a TypeError, with no CPU fallback.
  • 2D or 3D single array. No batched API; iterate in Python for batches.
  • Fixed connectivity. 2D is 8-connected and 3D is 26-connected; lower connectivities are not provided (the block structure is intrinsically fully-connected).
  • int32 label space. Index arithmetic is int32; arrays with more than ~2^31 elements are unsupported.
  • Non-contiguous input is copied. A contiguous uint8 temporary is allocated internally, so buffer reuse is allocation-free only for contiguous input.
  • Concurrency. The shape-specialized runners and their reused buffers are single-context: one per device, per stream, per in-flight op.

Semantics

  • dimensionality must match ndim. A ValueError is raised on mismatch.
  • Dense vs. raw labels. The default output_mode="dense" returns a deterministic 1..N renumbering. output_mode="raw" IDs are positive but root-derived, not contiguous; use dense mode when you need 1..N.
  • Supplied buffers. label_output must be C-contiguous int32 with the input's shape; the return aliases it.
  • Stream/device. Launches use the current CUDA stream and device; correctness under CUDA graph capture or multi-stream pipelines requires the caller to manage stream/event ordering.
  • Determinism. For a given input and algorithm the output is reproducible run-to-run, and the partition into components is stable. Dense labels are a deterministic 1..N renumbering of that partition. Raw label values are union-find-root-derived: sparse, and specific to the device and algorithm, so they can differ across devices and between bke/bke_ic.

Efficiency Tips

  • Reuse a buffer with label_output= for repeated same-shape calls; the return aliases the buffer, so .copy() any result you retain.
  • Keep inputs contiguous (cupy.ascontiguousarray(x) once, upstream) so buffer reuse actually avoids all allocations.
  • Use a shape-specialized runner for a fixed-shape hot loop; one runner per stream/device.
  • Read count once. int(count) (and ccl_masks / masks_from_labels) forces a host sync; avoid it inside tight inner loops where you only need the label map.
  • Prefer output_mode="raw" when you do not need contiguous IDs; it skips the densify pass.
  • Prefer keeping the label map over ccl_masks when you do not need per-component planes; the (N, *spatial) stack memory dominates for many components.

Measuring on your workload

Performance depends on array size, component count and size, label-field contiguity, allocation mode, and GPU architecture. Benchmark with your data and a warmed device:

import cupy
from bkeccl import make_ccl_2D

img = your_binary_image_cupy
runner = make_ccl_2D(img.shape)
for _ in range(10):                       # warm up (includes kernel compile)
    runner(img)
cupy.cuda.Device().synchronize()
start = cupy.cuda.Event(); end = cupy.cuda.Event()
start.record()
for _ in range(100):
    runner(img)
end.record(); end.synchronize()
print(cupy.cuda.get_elapsed_time(start, end) / 100, "ms/call")

Compare against scipy.ndimage.label or cc3d on your inputs for a meaningful baseline.

Development

Development uses uv. The dev dependency group pins a CuPy wheel so uv sync and the GPU test suite run locally:

uv sync
uv run ruff check src/
uv run ruff format src/
uv run mypy
uv run pytest -q

The test suite requires a usable CUDA device; it aborts with a clear error if none is visible.

License

MIT License - see LICENSE.

Citation

@article{allegretti2019optimized,
  title={Optimized Block-Based Algorithms to Label Connected Components on GPUs},
  author={Allegretti, Stefano and Bolelli, Federico and Cancilla, Michele and Grana, Costantino},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  volume={31},
  number={2},
  pages={423--438},
  year={2019},
  publisher={IEEE},
  doi={10.1109/TPDS.2019.2934683}
}

Contributing

Contributions are welcome. Please open a Pull Request.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bkeccl-0.1.0.tar.gz (30.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bkeccl-0.1.0-py3-none-any.whl (35.7 kB view details)

Uploaded Python 3

File details

Details for the file bkeccl-0.1.0.tar.gz.

File metadata

  • Download URL: bkeccl-0.1.0.tar.gz
  • Upload date:
  • Size: 30.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.4

File hashes

Hashes for bkeccl-0.1.0.tar.gz
Algorithm Hash digest
SHA256 177d5867c05b36b2c3a7658f24ce1fb2896c1af68385cee1026f4df78de5fa3f
MD5 cd1e95d2c664cbf92fa9e18c76469e81
BLAKE2b-256 9ea40e56cc8f8e8e26b09d829f7cb76352a644de486a1e0c7215402c1f902f32

See more details on using hashes here.

File details

Details for the file bkeccl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: bkeccl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 35.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.4

File hashes

Hashes for bkeccl-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b2aeece10befa9be65bb557ae565717ac9205d3aef68b29b4ccfa6a97f2b5d97
MD5 24320e47abad1e5d960476d8e4b0546a
BLAKE2b-256 f962689ec5f022d8344f0cc1455eaa2c31cb502ed2ac093e46f45543727bb274

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page