numbarrow
Numba adapters for PyArrow and PySpark.
numbarrow lets you work with Apache Arrow arrays directly inside Numba @njit compiled functions. It converts PyArrow arrays into NumPy views (zero-copy where possible) and extracts validity bitmaps for null handling — bridging PySpark's Arrow-based batch processing with high-performance JIT-compiled code.
Installation
pip install numbarrow
Optional dependencies for PySpark and pandas support:
pip install "numbarrow[test]"        # adds pyspark
pip install "numbarrow[mapinarrow]"  # adds pandas
Quick Start
import pyarrow as pa
from numba import njit
from numbarrow.core.adapters import arrow_array_adapter
from numbarrow.core.is_null import is_null
# Convert a PyArrow array to NumPy for use in @njit
arrow_array = pa.array([10, None, 30, 40], type=pa.int32())
bitmap, data = arrow_array_adapter(arrow_array)
@njit
def sum_non_null(data, bitmap):
    total = 0
    for i in range(len(data)):
        if bitmap is None or not is_null(i, bitmap):
            total += data[i]
    return total
result = sum_non_null(data, bitmap) # 80
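The null check above relies on Arrow's validity bitmap, in which bit `i` (least-significant bit first within each byte) is set when slot `i` holds a value. The pure-Python sketch below illustrates that layout only; `is_valid` is a hypothetical helper written for this example and is not numbarrow's `is_null` implementation.

```python
def is_valid(index: int, bitmap: bytes, offset: int = 0) -> bool:
    """Return True when the validity bit for `index` is set (slot is not null)."""
    bit = index + offset
    # Byte bit // 8 holds the flag; bits are numbered from the least significant.
    return ((bitmap[bit >> 3] >> (bit & 7)) & 1) == 1


# Validity bits for [10, None, 30, 40], LSB first: 1, 0, 1, 1 -> 0b00001101
bitmap = bytes([0b00001101])
data = [10, 0, 30, 40]  # the null slot holds an arbitrary placeholder value

flags = [is_valid(i, bitmap) for i in range(len(data))]
total = sum(value for value, ok in zip(data, flags) if ok)
print(flags, total)
```

The placeholder in the null slot is never read, which is why the data buffer can be exposed as a plain NumPy view while nulls are tracked separately.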
Supported Types
| PyArrow Type | NumPy Result | Copy? |
|---|---|---|
| Int32Array, Int64Array, DoubleArray | Matching dtype | No (view) |
| BooleanArray | bool_ | Yes (bit-unpacking) |
| Date32Array | datetime64[D] | Yes (int32 → int64) |
| Date64Array | datetime64[ms] | No (view) |
| TimestampArray | datetime64[unit] | No (view) |
| StringArray | Fixed-width Unicode (bitmap not returned) | Yes (repacking) |
| StructArray | Tuple of two dicts: (bitmaps, data) per field | Per-field |
| ListArray (of structs) | Tuple of two dicts: (bitmaps, data) per field | Per-field |
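The "repacking" entry for StringArray follows from a layout mismatch: Arrow stores strings as variable-length UTF-8 bytes plus an offsets buffer, while a NumPy fixed-width Unicode array stores every element as the same number of 4-byte code points, padded with NUL. The standard-library sketch below illustrates that conversion under those assumptions; `repack_fixed_width` is a hypothetical name, and this is not how numbarrow itself is implemented.

```python
def repack_fixed_width(offsets, data):
    """Repack Arrow-style variable-length UTF-8 strings into fixed-width UTF-32 cells."""
    strings = [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]
    # Every cell gets the width of the longest string, in code points.
    width = max((len(s) for s in strings), default=0)
    # Pad with NUL code points; UTF-32 uses exactly 4 bytes per code point.
    cells = b"".join(s.ljust(width, "\0").encode("utf-32-le") for s in strings)
    return width, cells


# Arrow-style buffers for ["ab", "cde", ""]
offsets = [0, 2, 5, 5]
data = b"abcde"

width, cells = repack_fixed_width(offsets, data)
cell_size = width * 4
second = cells[cell_size:2 * cell_size].decode("utf-32-le").rstrip("\0")
print(width, len(cells), repr(second))
```

Because every element must be rewritten into a cell of uniform size, the result cannot alias the original Arrow buffers, so a copy is unavoidable for this type.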
PySpark Integration
Use make_mapinarrow_func to create functions compatible with PySpark's mapInArrow:
from numbarrow.core.mapinarrow_factory import make_mapinarrow_func
def compute(data_dict, bitmap_dict, broadcasts):
    # data_dict: {col_name: np.ndarray}
    # bitmap_dict: {col_name: uint8 bitmap array}
    result = data_dict["value"] * broadcasts["scale"]
    return {"output": result}
udf = make_mapinarrow_func(compute, broadcasts={"scale": 2.0})
df_out = df_in.mapInArrow(udf, output_schema)
See test/demo_map_in_arrow.py for a complete runnable example.
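Since the callback takes plain dictionaries, it can be called directly with small NumPy inputs before it is wired into Spark. The sketch below assumes NumPy is installed and that a `None` bitmap stands for "no nulls", mirroring the `bitmap is None` check in Quick Start; what `bitmap_dict` actually contains under `mapInArrow` may differ.

```python
import numpy as np


def compute(data_dict, bitmap_dict, broadcasts):
    # Same callback shape as in the example above.
    result = data_dict["value"] * broadcasts["scale"]
    return {"output": result}


# Hand-built inputs standing in for one Arrow batch (assumed shapes).
data_dict = {"value": np.array([1.0, 2.0, 3.0])}
bitmap_dict = {"value": None}  # assumption: None means the column has no nulls

out = compute(data_dict, bitmap_dict, {"scale": 2.0})
print(out["output"])
```

A check like this exercises only the callback's arithmetic; schema handling and batching are still left to `make_mapinarrow_func` and Spark.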
Compatibility
| Dependency | Versions |
|---|---|
| Python | 3.10 – 3.12 |
| numba | 0.60 – 0.63 |
| pyarrow | 14 – 18 |
| pyspark | 3.3 – 3.x (optional) |
| pandas | 1.5+ (optional) |
Documentation
Full API documentation: numbarrow docs
License
See LICENSE.