GPU connected-component labeling for 2D/3D CuPy and DLPack arrays via the Block-based Komura Equivalence algorithm
Project description
bkeccl
GPU-accelerated connected-component labeling for 2D and 3D arrays using the Block-based Komura Equivalence (BKE) algorithm (IEEE TPDS 2019). bkeccl is a pure-Python package built on CuPy: the CUDA kernels are compiled on the device at runtime, so there is no compiled extension and no build step.
- 2D 8-connectivity (2x2 blocks) and 3D 26-connectivity (2x2x2 blocks)
- Works with
cupy.ndarrayor any CUDA DLPack tensor (PyTorch, JAX, Triton, ...) imported zero-copy, so it drops into an existing GPU pipeline without a host round-trip uint8andboolinputs used directly; other dtypes reduced to their non-zero mask- Dense (
1..N) or raw (native sparse) label output - Per-component boolean mask stacks in one call
- A shape-specialized runner for locked-down hot loops
When to use bkeccl
bkeccl is optimized for one workflow: full-connectivity binary CCL on a single GPU array, kept on-device, called repeatedly at a fixed shape. Inside that workflow it is very fast. It deliberately omits the flexibility that workflow does not need (4-connectivity, batching, CPU arrays, on-the-fly shapes).
If you need that flexibility, use
cuCIM's
cucim.skimage.measure.label:
it is a more general, scikit-image-compatible GPU labeler (selectable
connectivity, n-dimensional, broader dtype/array handling). It is slightly
slower than bkeccl on bkeccl's specialized path but still a fast GPU
implementation, and the better choice when you need the generality.
Installation
bkeccl does not depend on a pinned CuPy. CuPy ships as a CUDA-version-specific wheel, so you install the one matching your CUDA toolkit first, then install bkeccl.
Requirements
- Python >= 3.10
- An NVIDIA GPU and a working CUDA runtime
- A CuPy wheel matching your CUDA major version
Step 1: install the CuPy wheel for your CUDA
With pip:
pip install cupy-cuda13x # CUDA 13.x
# or
pip install cupy-cuda12x # CUDA 12.x
For a uv-managed project:
uv add cupy-cuda13x # CUDA 13.x
# or
uv add cupy-cuda12x # CUDA 12.x
For other CUDA versions or the bundled-toolkit [ctk] extra, see the
CuPy install guide. Importing
bkeccl without CuPy raises an ImportError that names this step.
Step 2: install bkeccl
pip install bkeccl
For a uv-managed project:
uv add bkeccl
Install from source
git clone https://github.com/belfner/bkeccl
cd bkeccl
pip install .
Usage
| Goal | Function |
|---|---|
| Label a 2D image | ccl(img) |
| Label a 3D volume | ccl(vol, dimensionality="3D") |
| Native sparse labels (skip the densify pass) | ccl(img, output_mode="raw") |
| One boolean mask per component | ccl_masks(img) |
| Masks from a label map you already have | masks_from_labels(labels, count) |
| Skip the per-call output allocation | ccl(img, label_output=buf) |
| Repeated fixed-shape calls in a hot loop | make_ccl_2D / make_ccl_3D |
Core workflow
Labeling an array and reading the result. Most code needs only this section.
Label a 2D image
import cupy
from bkeccl import ccl
# Binary image on the GPU. Any non-zero value is foreground; 0 is background.
img = cupy.asarray(
[[1, 1, 0, 0],
[0, 0, 0, 1],
[0, 1, 1, 1]],
dtype=cupy.uint8,
)
labels, count = ccl(img)
# labels: int32 (H, W), background 0, components renumbered 1..N (dense).
# count: 0-dim int32 device array holding N (read with int(count)).
ccl returns (labels, count) in the default output_mode="dense". count
is kept on the device; calling int(count) is the one host synchronization.
Label a 3D volume
Pass dimensionality="3D" (26-connectivity on 2x2x2 blocks):
vol = cupy.asarray(volume, dtype=cupy.uint8) # (D, H, W)
labels, count = ccl(vol, dimensionality="3D")
Use a tensor from another framework
Any GPU-resident object exposing the DLPack protocol is imported zero-copy, so a PyTorch/JAX/Triton CUDA tensor can be passed directly:
import torch
from bkeccl import ccl
t = torch.zeros((512, 512), dtype=torch.uint8, device="cuda")
t[100:200, 100:200] = 1
labels, count = ccl(t) # zero-copy import, no host round-trip
Host-resident tensors are rejected with a TypeError rather than silently
staged onto the device.
Inspecting components
Native (raw) labels
output_mode="raw" skips the densify pass and returns the labeler's native
sparse, root-derived IDs (positive, not a contiguous 1..N range), as a single
array:
raw = ccl(img, output_mode="raw") # int32 (H, W), sparse positive IDs
Use raw when you will reduce or threshold the labels yourself and do not need
contiguous IDs; use the default dense mode when you want 1..N.
Component masks
from bkeccl import ccl_masks
masks = ccl_masks(img) # bool (N, H, W), one plane per component
ccl_masks labels and expands in one call, staying on the GPU; the only host
sync is the count read needed to size the stack. For a label map you already
have, expand it directly:
from bkeccl import ccl, masks_from_labels
labels, count = ccl(img)
masks = masks_from_labels(labels, count) # bool (N, H, W)
For thousands of components the dense (N, *spatial) stack is large; keep the
label map instead when you do not need per-component planes.
Performance and reuse
Opt in once the basic workflow works and you are calling it repeatedly.
Buffer reuse
Supply a pre-allocated int32 output buffer to skip the per-call output
allocation. It is keyword-only and the result aliases it:
import cupy
from bkeccl import ccl
buf = cupy.empty((512, 512), dtype=cupy.int32)
results = []
for img in image_batch:
labels, count = ccl(img, label_output=buf)
# `labels` IS `buf` (aliased). Retaining it across iterations requires a
# copy, or every stored result points at the last image.
results.append(labels.copy())
The buffer must be C-contiguous int32 with the same shape as the input.
Shape-specialized runner
For a hot loop with a fixed shape, make_ccl_2D / make_ccl_3D bake the launch
grid, kernel handles, and all scratch once and return a reusable callable. The
factory is memoized per argument set:
from bkeccl import make_ccl_2D
runner = make_ccl_2D((512, 512), algorithm="bke", output_mode="dense")
for img in image_batch:
labels, count = runner(img) # consume or copy before the next call
The runner performs no input validation and reuses one process-owned
output buffer: the caller must uphold the shape/dtype/device contract, and must
consume or copy the result before the next call. It is not thread- or
stream-safe. Use ccl unless you can guarantee that contract. Full contract in
REFERENCE.md.
Algorithm variants
labels, count = ccl(img, algorithm="bke") # default, standard BKE
labels, count = ccl(img, algorithm="bke_ic") # inline path compression
Both produce the same partition. bke is the default; benchmark bke_ic
against it on your workload if variant choice matters.
API Reference
Full per-function signatures, parameters, return types, and raised exceptions
are in REFERENCE.md.
Algorithm
bkeccl implements the Block-based Komura Equivalence (BKE) algorithm:
Stefano Allegretti, Federico Bolelli, Michele Cancilla, Costantino Grana. "Optimized Block-Based Algorithms to Label Connected Components on GPUs." IEEE Transactions on Parallel and Distributed Systems, 31(2):423-438, 2019. IEEE Xplore - doi:10.1109/TPDS.2019.2934683 - (open-access PDF)
BKE operates on 2x2 blocks (2D) or 2x2x2 blocks (3D) rather than
individual elements, reducing memory accesses and atomic operations. The
pipeline is Init (block connectivity) -> Compress (path compression) ->
Reduction (union for remaining connections) -> Compress -> FinalLabel (block
labels to elements). The bke_ic variant updates the parent at each traversal
step (inline compression) for faster convergence. The block structure makes
8-connectivity (2D) and 26-connectivity (3D) the native, fixed connectivity.
The 3D variant follows Section 5.3 ("3D extension") of the same paper.
Limitations
- CUDA-only. Inputs must be GPU-resident; host arrays are rejected with a
TypeError, with no CPU fallback. - 2D or 3D single array. No batched API; iterate in Python for batches.
- Fixed connectivity. 2D is 8-connected and 3D is 26-connected; lower connectivities are not provided (the block structure is intrinsically fully-connected).
- int32 label space. Index arithmetic is
int32; arrays with more than ~2^31 elements are unsupported. - Non-contiguous input is copied. A contiguous
uint8temporary is allocated internally, so buffer reuse is allocation-free only for contiguous input. - Concurrency. The shape-specialized runners and their reused buffers are single-context: one per device, per stream, per in-flight op.
Semantics
dimensionalitymust matchndim. AValueErroris raised on mismatch.- Dense vs. raw labels. The default
output_mode="dense"returns a deterministic1..Nrenumbering.output_mode="raw"IDs are positive but root-derived, not contiguous; use dense mode when you need1..N. - Supplied buffers.
label_outputmust be C-contiguousint32with the input's shape; the return aliases it. - Stream/device. Launches use the current CUDA stream and device; correctness under CUDA graph capture or multi-stream pipelines requires the caller to manage stream/event ordering.
- Determinism. For a given input and algorithm the output is reproducible
run-to-run, and the partition into components is stable. Dense labels are a
deterministic
1..Nrenumbering of that partition. Raw label values are union-find-root-derived: sparse, and specific to the device and algorithm, so they can differ across devices and betweenbke/bke_ic.
Efficiency Tips
- Reuse a buffer with
label_output=for repeated same-shape calls; the return aliases the buffer, so.copy()any result you retain. - Keep inputs contiguous (
cupy.ascontiguousarray(x)once, upstream) so buffer reuse actually avoids all allocations. - Use a shape-specialized runner for a fixed-shape hot loop; one runner per stream/device.
- Read
countonce.int(count)(andccl_masks/masks_from_labels) forces a host sync; avoid it inside tight inner loops where you only need the label map. - Prefer
output_mode="raw"when you do not need contiguous IDs; it skips the densify pass. - Prefer keeping the label map over
ccl_maskswhen you do not need per-component planes; the(N, *spatial)stack memory dominates for many components.
Measuring on your workload
Performance depends on array size, component count and size, label-field contiguity, allocation mode, and GPU architecture. Benchmark with your data and a warmed device:
import cupy
from bkeccl import make_ccl_2D
img = your_binary_image_cupy
runner = make_ccl_2D(img.shape)
for _ in range(10): # warm up (includes kernel compile)
runner(img)
cupy.cuda.Device().synchronize()
start = cupy.cuda.Event(); end = cupy.cuda.Event()
start.record()
for _ in range(100):
runner(img)
end.record(); end.synchronize()
print(cupy.cuda.get_elapsed_time(start, end) / 100, "ms/call")
Compare against scipy.ndimage.label or cc3d on your inputs for a meaningful
baseline.
Development
Development uses uv. The dev dependency group
pins a CuPy wheel so uv sync and the GPU test suite run locally:
uv sync
uv run ruff check src/
uv run ruff format src/
uv run mypy
uv run pytest -q
The test suite requires a usable CUDA device; it aborts with a clear error if none is visible.
License
MIT License - see LICENSE.
Citation
@article{allegretti2019optimized,
title={Optimized Block-Based Algorithms to Label Connected Components on GPUs},
author={Allegretti, Stefano and Bolelli, Federico and Cancilla, Michele and Grana, Costantino},
journal={IEEE Transactions on Parallel and Distributed Systems},
volume={31},
number={2},
pages={423--438},
year={2019},
publisher={IEEE},
doi={10.1109/TPDS.2019.2934683}
}
Contributing
Contributions are welcome. Please open a Pull Request.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file bkeccl-0.1.0.tar.gz.
File metadata
- Download URL: bkeccl-0.1.0.tar.gz
- Upload date:
- Size: 30.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
177d5867c05b36b2c3a7658f24ce1fb2896c1af68385cee1026f4df78de5fa3f
|
|
| MD5 |
cd1e95d2c664cbf92fa9e18c76469e81
|
|
| BLAKE2b-256 |
9ea40e56cc8f8e8e26b09d829f7cb76352a644de486a1e0c7215402c1f902f32
|
File details
Details for the file bkeccl-0.1.0-py3-none-any.whl.
File metadata
- Download URL: bkeccl-0.1.0-py3-none-any.whl
- Upload date:
- Size: 35.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b2aeece10befa9be65bb557ae565717ac9205d3aef68b29b4ccfa6a97f2b5d97
|
|
| MD5 |
24320e47abad1e5d960476d8e4b0546a
|
|
| BLAKE2b-256 |
f962689ec5f022d8344f0cc1455eaa2c31cb502ed2ac093e46f45543727bb274
|