GPU connected-component labeling for 2D/3D CuPy and DLPack arrays via the Block-based Komura Equivalence algorithm

These details have not been verified by PyPI

Project links

Project description

bkeccl

GPU-accelerated connected-component labeling for 2D and 3D arrays using the Block-based Komura Equivalence (BKE) algorithm (IEEE TPDS 2019). bkeccl is a pure-Python package built on CuPy: the CUDA kernels are compiled on the device at runtime, so there is no compiled extension and no build step.

2D 8-connectivity (2x2 blocks) and 3D 26-connectivity (2x2x2 blocks)
Works with cupy.ndarray or any CUDA DLPack tensor (PyTorch, JAX, Triton, ...) imported zero-copy, so it drops into an existing GPU pipeline without a host round-trip
uint8 and bool inputs used directly; other dtypes reduced to their non-zero mask
Dense (1..N) or raw (native sparse) label output
Per-component boolean mask stacks in one call
A shape-specialized runner for locked-down hot loops

When to use bkeccl

bkeccl is optimized for one workflow: full-connectivity binary CCL on a single GPU array, kept on-device, called repeatedly at a fixed shape. Inside that workflow it is very fast. It deliberately omits the flexibility that workflow does not need (4-connectivity, batching, CPU arrays, on-the-fly shapes).

If you need that flexibility, use cuCIM's cucim.skimage.measure.label: it is a more general, scikit-image-compatible GPU labeler (selectable connectivity, n-dimensional, broader dtype/array handling). It is slightly slower than bkeccl on bkeccl's specialized path but still a fast GPU implementation, and the better choice when you need the generality.

Installation

bkeccl does not depend on a pinned CuPy. CuPy ships as a CUDA-version-specific wheel, so you install the one matching your CUDA toolkit first, then install bkeccl.

Requirements

Python >= 3.10
An NVIDIA GPU and a working CUDA runtime
A CuPy wheel matching your CUDA major version

Step 1: install the CuPy wheel for your CUDA

With pip:

pip install cupy-cuda13x      # CUDA 13.x
# or
pip install cupy-cuda12x      # CUDA 12.x

For a uv-managed project:

uv add cupy-cuda13x      # CUDA 13.x
# or
uv add cupy-cuda12x      # CUDA 12.x

For other CUDA versions or the bundled-toolkit [ctk] extra, see the CuPy install guide. Importing bkeccl without CuPy raises an ImportError that names this step.

Step 2: install bkeccl

pip install bkeccl

For a uv-managed project:

uv add bkeccl

Install from source

git clone https://github.com/belfner/bkeccl
cd bkeccl
pip install .

Usage

Goal	Function
Label a 2D image	`ccl(img)`
Label a 3D volume	`ccl(vol, dimensionality="3D")`
Native sparse labels (skip the densify pass)	`ccl(img, output_mode="raw")`
One boolean mask per component	`ccl_masks(img)`
Masks from a label map you already have	`masks_from_labels(labels, count)`
Skip the per-call output allocation	`ccl(img, label_output=buf)`
Repeated fixed-shape calls in a hot loop	`make_ccl_2D` / `make_ccl_3D`

Core workflow

Labeling an array and reading the result. Most code needs only this section.

Label a 2D image

import cupy
from bkeccl import ccl

# Binary image on the GPU. Any non-zero value is foreground; 0 is background.
img = cupy.asarray(
    [[1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 1, 1, 1]],
    dtype=cupy.uint8,
)

labels, count = ccl(img)
# labels: int32 (H, W), background 0, components renumbered 1..N (dense).
# count:  0-dim int32 device array holding N (read with int(count)).

ccl returns (labels, count) in the default output_mode="dense". count is kept on the device; calling int(count) is the one host synchronization.

Label a 3D volume

Pass dimensionality="3D" (26-connectivity on 2x2x2 blocks):

vol = cupy.asarray(volume, dtype=cupy.uint8)        # (D, H, W)
labels, count = ccl(vol, dimensionality="3D")

Use a tensor from another framework

Any GPU-resident object exposing the DLPack protocol is imported zero-copy, so a PyTorch/JAX/Triton CUDA tensor can be passed directly:

import torch
from bkeccl import ccl

t = torch.zeros((512, 512), dtype=torch.uint8, device="cuda")
t[100:200, 100:200] = 1

labels, count = ccl(t)        # zero-copy import, no host round-trip

Host-resident tensors are rejected with a TypeError rather than silently staged onto the device.

Inspecting components

Native (raw) labels

output_mode="raw" skips the densify pass and returns the labeler's native sparse, root-derived IDs (positive, not a contiguous 1..N range), as a single array:

raw = ccl(img, output_mode="raw")     # int32 (H, W), sparse positive IDs

Use raw when you will reduce or threshold the labels yourself and do not need contiguous IDs; use the default dense mode when you want 1..N.

Component masks

from bkeccl import ccl_masks

masks = ccl_masks(img)                 # bool (N, H, W), one plane per component

ccl_masks labels and expands in one call, staying on the GPU; the only host sync is the count read needed to size the stack. For a label map you already have, expand it directly:

from bkeccl import ccl, masks_from_labels

labels, count = ccl(img)
masks = masks_from_labels(labels, count)   # bool (N, H, W)

For thousands of components the dense (N, *spatial) stack is large; keep the label map instead when you do not need per-component planes.

Performance and reuse

Opt in once the basic workflow works and you are calling it repeatedly.

Buffer reuse

Supply a pre-allocated int32 output buffer to skip the per-call output allocation. It is keyword-only and the result aliases it:

import cupy
from bkeccl import ccl

buf = cupy.empty((512, 512), dtype=cupy.int32)

results = []
for img in image_batch:
    labels, count = ccl(img, label_output=buf)
    # `labels` IS `buf` (aliased). Retaining it across iterations requires a
    # copy, or every stored result points at the last image.
    results.append(labels.copy())

The buffer must be C-contiguous int32 with the same shape as the input.

Shape-specialized runner

For a hot loop with a fixed shape, make_ccl_2D / make_ccl_3D bake the launch grid, kernel handles, and all scratch once and return a reusable callable. The factory is memoized per argument set:

from bkeccl import make_ccl_2D

runner = make_ccl_2D((512, 512), algorithm="bke", output_mode="dense")
for img in image_batch:
    labels, count = runner(img)        # consume or copy before the next call

The runner performs no input validation and reuses one process-owned output buffer: the caller must uphold the shape/dtype/device contract, and must consume or copy the result before the next call. It is not thread- or stream-safe. Use ccl unless you can guarantee that contract. Full contract in REFERENCE.md.

Algorithm variants

labels, count = ccl(img, algorithm="bke")      # default, standard BKE
labels, count = ccl(img, algorithm="bke_ic")   # inline path compression

Both produce the same partition. bke is the default; benchmark bke_ic against it on your workload if variant choice matters.

API Reference

Full per-function signatures, parameters, return types, and raised exceptions are in REFERENCE.md.

Algorithm

bkeccl implements the Block-based Komura Equivalence (BKE) algorithm:

Stefano Allegretti, Federico Bolelli, Michele Cancilla, Costantino Grana. "Optimized Block-Based Algorithms to Label Connected Components on GPUs." IEEE Transactions on Parallel and Distributed Systems, 31(2):423-438, 2019. IEEE Xplore - doi:10.1109/TPDS.2019.2934683 - (open-access PDF)

BKE operates on 2x2 blocks (2D) or 2x2x2 blocks (3D) rather than individual elements, reducing memory accesses and atomic operations. The pipeline is Init (block connectivity) -> Compress (path compression) -> Reduction (union for remaining connections) -> Compress -> FinalLabel (block labels to elements). The bke_ic variant updates the parent at each traversal step (inline compression) for faster convergence. The block structure makes 8-connectivity (2D) and 26-connectivity (3D) the native, fixed connectivity. The 3D variant follows Section 5.3 ("3D extension") of the same paper.

Limitations

CUDA-only. Inputs must be GPU-resident; host arrays are rejected with a TypeError, with no CPU fallback.
2D or 3D single array. No batched API; iterate in Python for batches.
Fixed connectivity. 2D is 8-connected and 3D is 26-connected; lower connectivities are not provided (the block structure is intrinsically fully-connected).
int32 label space. Index arithmetic is int32; arrays with more than ~2^31 elements are unsupported.
Non-contiguous input is copied. A contiguous uint8 temporary is allocated internally, so buffer reuse is allocation-free only for contiguous input.
Concurrency. The shape-specialized runners and their reused buffers are single-context: one per device, per stream, per in-flight op.

Semantics

dimensionality must match ndim. A ValueError is raised on mismatch.
Dense vs. raw labels. The default output_mode="dense" returns a deterministic 1..N renumbering. output_mode="raw" IDs are positive but root-derived, not contiguous; use dense mode when you need 1..N.
Supplied buffers. label_output must be C-contiguous int32 with the input's shape; the return aliases it.
Stream/device. Launches use the current CUDA stream and device; correctness under CUDA graph capture or multi-stream pipelines requires the caller to manage stream/event ordering.
Determinism. For a given input and algorithm the output is reproducible run-to-run, and the partition into components is stable. Dense labels are a deterministic 1..N renumbering of that partition. Raw label values are union-find-root-derived: sparse, and specific to the device and algorithm, so they can differ across devices and between bke/bke_ic.

Efficiency Tips

Reuse a buffer with label_output= for repeated same-shape calls; the return aliases the buffer, so .copy() any result you retain.
Keep inputs contiguous (cupy.ascontiguousarray(x) once, upstream) so buffer reuse actually avoids all allocations.
Use a shape-specialized runner for a fixed-shape hot loop; one runner per stream/device.
Read count once. int(count) (and ccl_masks / masks_from_labels) forces a host sync; avoid it inside tight inner loops where you only need the label map.
Prefer output_mode="raw" when you do not need contiguous IDs; it skips the densify pass.
Prefer keeping the label map over ccl_masks when you do not need per-component planes; the (N, *spatial) stack memory dominates for many components.

Measuring on your workload

Performance depends on array size, component count and size, label-field contiguity, allocation mode, and GPU architecture. Benchmark with your data and a warmed device:

import cupy
from bkeccl import make_ccl_2D

img = your_binary_image_cupy
runner = make_ccl_2D(img.shape)
for _ in range(10):                       # warm up (includes kernel compile)
    runner(img)
cupy.cuda.Device().synchronize()
start = cupy.cuda.Event(); end = cupy.cuda.Event()
start.record()
for _ in range(100):
    runner(img)
end.record(); end.synchronize()
print(cupy.cuda.get_elapsed_time(start, end) / 100, "ms/call")

Compare against scipy.ndimage.label or cc3d on your inputs for a meaningful baseline.

Development

Development uses uv. The dev dependency group pins a CuPy wheel so uv sync and the GPU test suite run locally:

uv sync
uv run ruff check src/
uv run ruff format src/
uv run mypy
uv run pytest -q

The test suite requires a usable CUDA device; it aborts with a clear error if none is visible.

License

MIT License - see LICENSE.

Citation

@article{allegretti2019optimized,
  title={Optimized Block-Based Algorithms to Label Connected Components on GPUs},
  author={Allegretti, Stefano and Bolelli, Federico and Cancilla, Michele and Grana, Costantino},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  volume={31},
  number={2},
  pages={423--438},
  year={2019},
  publisher={IEEE},
  doi={10.1109/TPDS.2019.2934683}
}

Contributing

Contributions are welcome. Please open a Pull Request.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

May 18, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bkeccl-0.1.0.tar.gz (30.5 kB view details)

Uploaded May 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

bkeccl-0.1.0-py3-none-any.whl (35.7 kB view details)

Uploaded May 18, 2026 Python 3

File details

Details for the file bkeccl-0.1.0.tar.gz.

File metadata

Download URL: bkeccl-0.1.0.tar.gz
Upload date: May 18, 2026
Size: 30.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.4

File hashes

Hashes for bkeccl-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`177d5867c05b36b2c3a7658f24ce1fb2896c1af68385cee1026f4df78de5fa3f`
MD5	`cd1e95d2c664cbf92fa9e18c76469e81`
BLAKE2b-256	`9ea40e56cc8f8e8e26b09d829f7cb76352a644de486a1e0c7215402c1f902f32`

See more details on using hashes here.

File details

Details for the file bkeccl-0.1.0-py3-none-any.whl.

File metadata

Download URL: bkeccl-0.1.0-py3-none-any.whl
Upload date: May 18, 2026
Size: 35.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.4

File hashes

Hashes for bkeccl-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b2aeece10befa9be65bb557ae565717ac9205d3aef68b29b4ccfa6a97f2b5d97`
MD5	`24320e47abad1e5d960476d8e4b0546a`
BLAKE2b-256	`f962689ec5f022d8344f0cc1455eaa2c31cb502ed2ac093e46f45543727bb274`

See more details on using hashes here.

bkeccl 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

bkeccl

When to use bkeccl

Installation

Requirements

Step 1: install the CuPy wheel for your CUDA

Step 2: install bkeccl

Install from source

Usage

Core workflow

Label a 2D image

Label a 3D volume

Use a tensor from another framework

Inspecting components

Native (raw) labels

Component masks

Performance and reuse

Buffer reuse

Shape-specialized runner

Algorithm variants

API Reference

Algorithm

Limitations

Semantics

Efficiency Tips

Measuring on your workload

Development

License

Citation

Contributing

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes