Ragged array library, complying with the Python Array API specification.

Ragged

Introduction

Ragged is a library for manipulating ragged arrays as though they were NumPy or CuPy arrays, following the Array API specification.

For example, this is a ragged/jagged array:

>>> import ragged
>>> a = ragged.array([[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]])
>>> a
ragged.array([
    [[1.1, 2.2, 3.3], []],
    [[4.4]],
    [],
    [[5.5, 6.6, 7.7, 8.8], [9.9]]
])

The values are all floating-point numbers, so a.dtype is float64,

>>> a.dtype
dtype('float64')

but a.shape has non-integer dimensions to account for the fact that some of its list lengths are non-uniform:

>>> a.shape
(4, None, None)

In general, a ragged.array can have any mixture of regular and irregular dimensions, though shape[0] (the length) is always an integer. This convention follows the Array API's specification for array.shape, which must be a tuple of int or None:

array.shape: Tuple[Optional[int], ...]

(Our use of None to indicate a dimension without a single-valued size differs from the Array API's intention of specifying dimensions of unknown size, but it follows the technical specification. Array API-consuming libraries can try using Ragged to find out if they are ragged-ready.)
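
To make the convention concrete, here is a minimal pure-Python sketch that derives a shape tuple of this form from nested lists. The helper name nested_shape is hypothetical and is not part of Ragged; it only illustrates how an integer marks a dimension whose lists all share one length and None marks one whose lengths vary. Ragged derives its own shape from the array's internal type, so its result may differ for data that merely happens to be uniform.

```python
def nested_shape(data):
    """Describe nested lists as a tuple of int (uniform length) or None (varying length)."""
    shape = [len(data)]  # the outermost length is always a plain integer
    level = data
    # Descend one nesting level at a time while every item is itself a list.
    while level and all(isinstance(item, list) for item in level):
        lengths = {len(item) for item in level}
        shape.append(lengths.pop() if len(lengths) == 1 else None)
        # Concatenate all lists at this depth to inspect the next depth.
        level = [inner for item in level for inner in item]
    return tuple(shape)


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(nested_shape(a))                 # (4, None, None)
print(nested_shape([[1, 2], [3, 4]]))  # (2, 2)
```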

All of the normal elementwise and reducing functions apply, as well as slices:

>>> ragged.sqrt(a)
ragged.array([
    [[1.05, 1.48, 1.82], []],
    [[2.1]],
    [],
    [[2.35, 2.57, 2.77, 2.97], [3.15]]
])

>>> ragged.sum(a, axis=0)
ragged.array([
    [11, 8.8, 11, 8.8],
    [9.9]
])

>>> ragged.sum(a, axis=-1)
ragged.array([
    [6.6, 0],
    [4.4],
    [],
    [28.6, 9.9]
])
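
The two reductions above can be reproduced on plain nested lists, which may help in reading the output. This is only a sketch of the semantics for the three-dimensional example; sum_last_axis and sum_first_axis are hypothetical helpers, not Ragged functions, and they loop in Python rather than using Awkward Array as Ragged does.

```python
from itertools import zip_longest


def sum_last_axis(nested, ndim):
    """Sum over the innermost lists; ndim is the nesting depth of `nested`."""
    if ndim == 1:
        return sum(nested)  # an empty innermost list sums to 0
    return [sum_last_axis(item, ndim - 1) for item in nested]


def sum_first_axis(nested):
    """Sum a depth-3 nested list over its outermost axis.

    Items are aligned by position; missing lists and missing values
    contribute nothing to the sum.
    """
    return [
        [sum(values) for values in zip_longest(*rows, fillvalue=0)]
        for rows in zip_longest(*nested, fillvalue=[])
    ]


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(sum_first_axis(a))    # compare with ragged.sum(a, axis=0)
print(sum_last_axis(a, 3))  # compare with ragged.sum(a, axis=-1)
```

The printed values agree with the Ragged output above up to floating-point rounding in the last digits.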

>>> a[-1, 0, 2]
ragged.array(7.7)

>>> a[a * 10 % 2 == 0]
ragged.array([
    [[2.2], []],
    [[4.4]],
    [],
    [[6.6, 8.8], []]
])
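
The boolean-mask slice keeps the nesting structure and drops only the innermost values that fail the test, so lists can become empty but never disappear. A minimal pure-Python sketch of that behavior follows, assuming a hypothetical helper filter_innermost (not part of Ragged) and using round to sidestep floating-point noise in x * 10:

```python
def filter_innermost(nested, predicate, ndim):
    """Keep only innermost values that satisfy predicate, preserving all outer lists."""
    if ndim == 1:
        return [x for x in nested if predicate(x)]
    return [filter_innermost(item, predicate, ndim - 1) for item in nested]


def tenfold_is_even(x):
    # round() guards against results such as 77.00000000000001
    return round(x * 10) % 2 == 0


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(filter_innermost(a, tenfold_is_even, 3))
# [[[2.2], []], [[4.4]], [], [[6.6, 8.8], []]]
```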

All of the methods, attributes, and functions in the Array API will be implemented for Ragged, as well as conveniences that are not required by the Array API. See open issues marked "todo" for Array API functions that still need to be written (out of 120 in total).

Ragged has two device values, "cpu" (backed by NumPy) and "cuda" (backed by CuPy). Eventually, all operations will be identical for CPU and GPU.

Implementation

Ragged is implemented using Awkward Array (code, docs), which is an array library for arbitrary tree-like (JSON-like) data. Because of its generality, Awkward Array cannot follow the Array API—in fact, its array objects can't have separate dtype and shape attributes (the array type can't be factorized). Ragged is therefore

  • a specialization of Awkward Array for numeric data in fixed-length and variable-length lists, and
  • a formalization to adhere to the Array API and its fully typed protocols.

See Why does this library exist? under the Discussions tab for more details.

Ragged is a thin wrapper around Awkward Array, restricting it to ragged arrays and transforming its function arguments and return values to fit the specification.

Awkward Array, in turn, is time- and memory-efficient, ready for big datasets. Consider the following:

import gc      # control for garbage collection
import psutil  # measure process memory
import time    # measure time

import math
import ragged

this_process = psutil.Process()

def measure_memory(task):
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    out = task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    print(f"memory: {(stop_memory - start_memory) * 1e-9:.3f} GB")
    return out

def measure_time(task):
    gc.disable()
    start_time = time.perf_counter()
    out = task()
    stop_time = time.perf_counter()
    gc.enable()
    print(f"time: {stop_time - start_time:.3f} sec")
    return out

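def make_big_python_object():
    # 10 million lists whose lengths cycle through 0..9
    out = []
    for i in range(10000000):
        out.append([j * 1.1 for j in range(i % 10)])
    return out

def make_ragged_array():
    # reads the global pyobj, which must be assigned before this is called
    return ragged.array(pyobj)

def compute_on_python_object():
    # square root of every value, one Python-level call per number
    out = []
    for row in pyobj:
        out.append([math.sqrt(x) for x in row])
    return out

def compute_on_ragged_array():
    # reads the global arr; one vectorized call over the whole array
    return ragged.sqrt(arr)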

The ragged.array is 3 times smaller:

>>> pyobj = measure_memory(make_big_python_object)
memory: 2.687 GB

>>> arr = measure_memory(make_ragged_array)
memory: 0.877 GB

and a sample calculation on it (square root of each value) is 50 times faster:

>>> result = measure_time(compute_on_python_object)
time: 4.180 sec

>>> result = measure_time(compute_on_ragged_array)
time: 0.082 sec

Awkward Array and Ragged are generally smaller and faster than their Python equivalents for the same reasons that NumPy is smaller and faster than Python lists. See Awkward Array papers and presentations for more.
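
One reason for the savings is layout: instead of one Python object per number and per list, ragged data can be stored as a flat buffer of values plus a buffer of offsets marking where each list starts and stops. The sketch below illustrates that idea with the standard-library array module. The function names are hypothetical, and this is a simplified picture of a list-offset layout, not Awkward Array's actual implementation.

```python
import math
from array import array


def to_offsets_and_content(lists):
    """Pack a list of float lists into (offsets, content) typed buffers."""
    offsets = array("q", [0])  # 64-bit integers; row i spans offsets[i]:offsets[i + 1]
    content = array("d")       # 64-bit floats, stored contiguously
    for row in lists:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


def get_row(offsets, content, i):
    """Recover row i as a Python list."""
    return content[offsets[i]:offsets[i + 1]].tolist()


def sqrt_all(offsets, content):
    """Elementwise square root: only the flat content changes; offsets are reused."""
    return offsets, array("d", map(math.sqrt, content))


offsets, content = to_offsets_and_content([[1.0, 4.0, 9.0], [], [16.0]])
print(list(offsets))  # [0, 3, 3, 4]

new_offsets, new_content = sqrt_all(offsets, content)
print(get_row(new_offsets, new_content, 0))  # [1.0, 2.0, 3.0]
```

An elementwise operation touches only the flat buffer, which is why it can run as a single tight loop instead of visiting each nested list separately.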

Installation

Ragged is on PyPI:

pip install ragged

and will someday be on conda-forge.

ragged is a pure-Python library that only depends on awkward (which, in turn, only depends on numpy and a compiled extension). In principle (i.e. eventually), ragged can be loaded into Pyodide and JupyterLite.
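
After installing, one way to confirm which versions of ragged and its awkward dependency are present is to query the package metadata with the standard library. This is a small sketch; installed_version is a hypothetical helper name, and it returns None for packages that are not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("ragged", "awkward", "numpy"):
    print(name, installed_version(name))
```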

Acknowledgements

Support for this work was provided by NSF grant OAC-2103945 and the gracious help of Awkward Array contributors.
