numbarrow

Numba adapters for PyArrow and PySpark.

numbarrow lets you work with Apache Arrow arrays directly inside Numba @njit-compiled functions. It converts PyArrow arrays into NumPy views (zero-copy where possible) and extracts validity bitmaps for null handling, so that PySpark's Arrow-based batch processing can feed JIT-compiled code.

Installation

pip install numbarrow

Optional dependencies for PySpark and pandas support:

pip install numbarrow[test]       # adds pyspark
pip install numbarrow[mapinarrow] # adds pandas

Quick Start

import pyarrow as pa
from numba import njit
from numbarrow.core.adapters import arrow_array_adapter
from numbarrow.core.is_null import is_null

# Convert a PyArrow array to NumPy for use in @njit
arrow_array = pa.array([10, None, 30, 40], type=pa.int32())
bitmap, data = arrow_array_adapter(arrow_array)

@njit
def sum_non_null(data, bitmap):
    total = 0
    for i in range(len(data)):
        if bitmap is None or not is_null(i, bitmap):
            total += data[i]
    return total

result = sum_non_null(data, bitmap)  # 80
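
The null check above relies on the Arrow validity bitmap, which stores one bit per element, packed least-significant-bit first, with a set bit meaning "valid". The following is a minimal pure-Python sketch of that layout for the same four-element array. The helper name is_valid_bit is hypothetical and is not part of numbarrow; the sketch also assumes the array starts at offset 0 (an unsliced array).

```python
def is_valid_bit(i, bitmap):
    """Return True if element i is marked valid (non-null) in an Arrow-style bitmap.

    Hypothetical helper for illustration; not numbarrow's is_null.
    """
    # Byte i // 8 holds the bit; within that byte, bits are numbered from the
    # least significant end, so element i maps to bit i % 8.
    return ((bitmap[i >> 3] >> (i & 7)) & 1) == 1


# Validity bits for [10, None, 30, 40] are 1, 0, 1, 1 for elements 0..3.
# Packed least-significant-bit first, that is 0b00001101.
bitmap = bytes([0b00001101])
# The slot under a null holds an unspecified value; 0 is used here.
data = [10, 0, 30, 40]

valid_flags = [is_valid_bit(i, bitmap) for i in range(len(data))]
total = sum(v for i, v in enumerate(data) if is_valid_bit(i, bitmap))
print(valid_flags)  # [True, False, True, True]
print(total)        # 80
```

An array with no nulls may carry no validity buffer at all, which is why the Quick Start function also checks whether bitmap is None.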

Supported Types

PyArrow Type | NumPy Result | Copy?
Int32Array, Int64Array, DoubleArray | Matching dtype | No (view)
BooleanArray | bool_ | Yes (bit-unpacking)
Date32Array | datetime64[D] | Yes (int32 → int64)
Date64Array | datetime64[ms] | No (view)
TimestampArray | datetime64[unit] | No (view)
StringArray | Fixed-width Unicode (bitmap not returned) | Yes (repacking)
StructArray | Tuple of two dicts: (bitmaps, data) per field | Per-field
ListArray (of structs) | Tuple of two dicts: (bitmaps, data) per field | Per-field
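
To illustrate why BooleanArray cannot be exposed as a view: Arrow packs booleans one bit per value, whereas a NumPy bool_ array uses one byte per value, so the data has to be expanded into a new buffer. The sketch below shows that expansion in pure Python. The function name unpack_bits is hypothetical and this is not numbarrow's implementation.

```python
def unpack_bits(packed, length):
    """Expand an LSB-first packed bit buffer into one bool per element."""
    return [((packed[i >> 3] >> (i & 7)) & 1) == 1 for i in range(length)]


# Ten booleans occupy two bytes when bit-packed.
values = [True, False, True, True, False, False, True, False, True, True]
packed = bytearray((len(values) + 7) // 8)
for i, v in enumerate(values):
    if v:
        # Set bit i % 8 of byte i // 8, matching Arrow's bit numbering.
        packed[i >> 3] |= 1 << (i & 7)

unpacked = unpack_bits(bytes(packed), len(values))
print(len(packed), len(unpacked))  # 2 10
print(unpacked == values)          # True
```

On a NumPy uint8 array, np.unpackbits with bitorder="little" performs the same expansion.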

PySpark Integration

Use make_mapinarrow_func to create functions compatible with PySpark's mapInArrow:

from numbarrow.core.mapinarrow_factory import make_mapinarrow_func

def compute(data_dict, bitmap_dict, broadcasts):
    # data_dict: {col_name: np.ndarray}
    # bitmap_dict: {col_name: uint8 bitmap array}
    result = data_dict["value"] * broadcasts["scale"]
    return {"output": result}

udf = make_mapinarrow_func(compute, broadcasts={"scale": 2.0})
df_out = df_in.mapInArrow(udf, output_schema)

See test/demo_map_in_arrow.py for a complete runnable example.
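
For orientation, mapInArrow calls its function with an iterator of Arrow record batches and expects an iterator of batches back. The sketch below shows the general shape of a factory that wraps a per-batch compute function in that iterator protocol. It is an assumption-laden illustration, not numbarrow's implementation: make_batch_func is a hypothetical name, and plain (data_dict, bitmap_dict) tuples of Python lists stand in for Arrow record batches and NumPy arrays.

```python
def make_batch_func(compute, broadcasts):
    """Hypothetical stand-in for a factory such as make_mapinarrow_func."""
    def batch_func(batches):
        # Consume batches lazily and yield one result per input batch.
        for data_dict, bitmap_dict in batches:
            yield compute(data_dict, bitmap_dict, broadcasts)
    return batch_func


def compute(data_dict, bitmap_dict, broadcasts):
    # Same contract as the example above, using lists instead of arrays.
    scale = broadcasts["scale"]
    return {"output": [v * scale for v in data_dict["value"]]}


# Stand-in batches: (data_dict, bitmap_dict); a None bitmap means no nulls.
batches = [
    ({"value": [1.0, 2.0]}, {"value": None}),
    ({"value": [3.0]}, {"value": None}),
]

udf = make_batch_func(compute, broadcasts={"scale": 2.0})
results = list(udf(iter(batches)))
print(results)  # [{'output': [2.0, 4.0]}, {'output': [6.0]}]
```

Keeping the wrapper a generator means each batch is processed as it arrives rather than being collected in memory first.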

Compatibility

Dependency | Versions
Python | 3.10 – 3.12
numba | 0.60 – 0.63
pyarrow | 14 – 18
pyspark | 3.3 – 3.x (optional)
pandas | 1.5+ (optional)

Documentation

Full API documentation: numbarrow docs

License

See LICENSE.

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

numbarrow-0.2.0-py3-none-any.whl (12.8 kB)

Uploaded Python 3

File details

Details for the file numbarrow-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: numbarrow-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 12.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for numbarrow-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ccc6953b31085a6ae0f437b798923ce0d3383e847cba925bdf8765bc9b6d73f9
MD5 53da3cd5c189c8d86ccd53c5ff2b17b9
BLAKE2b-256 cf5375346944f84cacbdfb059ccd3bec9aff38afb2095f1e6a44e7dcd03f07de

Provenance

The following attestation bundles were made for numbarrow-0.2.0-py3-none-any.whl:

Publisher: numbarrow_release.yml on Goykhman/numbarrow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
