Out-of-core NumPy arrays
Wendelin.core allows you to work with arrays bigger than RAM and local disk. Bigarrays are persisted to storage and can be changed in a transactional manner.
In other words, bigarrays are something like numpy.memmap for numpy.ndarray and OS files, but they support transactions and files bigger than disk. A whole bigarray cannot generally be used as a drop-in replacement for a NumPy array, but bigarray slices are real ndarrays and can be used everywhere an ndarray can be used, including in C/Cython/Fortran code. Slice size is limited by the virtual address-space size, which is at most ~127TB on Linux/amd64.
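For comparison, here is a minimal sketch of the plain numpy.memmap behaviour that the analogy refers to. It uses only NumPy and a temporary file; the file name and sizes are arbitrary choices made for illustration and nothing in it comes from wendelin.core. It shows that a file-backed array is sliced like an ndarray and that changes reach the file after a flush. Unlike a bigarray, it offers no transactions and cannot exceed local disk.

```python
import os
import tempfile

import numpy as np

# Illustrative file name and size; nothing here comes from wendelin.core.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")

# Create a file-backed array of 1000 float64 values (zero-filled).
m = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000,))

# A slice is a view onto the same mapping, so writes go through to it.
window = m[10:20]
window[:] = np.arange(10)

# Plain NumPy functions accept the mapped data directly.
total = float(np.sum(m))

# Push modified pages to the file, then reopen it read-only to check.
m.flush()
reopened = np.memmap(path, dtype=np.float64, mode="r", shape=(1000,))
persisted = reopened[10:20].tolist()
```

With numpy.memmap the only way to "save" is to flush, and there is no way to discard changes already written through the mapping; that gap is what the transactional interface described below is meant to cover.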
The main class to work with is ZBigArray, which is used like NumPy's ndarray:
    from wendelin.bigarray.array_zodb import ZBigArray
    import transaction

    # root is connected to opened database
    root['A'] = A = ZBigArray(shape=..., dtype=...)
    transaction.commit()
view array as a real ndarray:
    a = A[:]        # view which covers all array, if it fits into address-space
    b = A[10:100]
data for views will be loaded lazily on memory access.
work with views, including using C/Cython/Fortran functions from NumPy and other libraries to read/modify data:
    a[2] = 1                        # set one element in place
    a[10:20] = numpy.arange(10)
    numpy.mean(a)

the amount of modifications in one transaction should be less than available RAM.
the amount of data read is limited only by virtual address-space size.
data can be appended to array in O(δ) time:
    values                  # ndarray to append of shape (δ,)
    A.append(values)
and array itself can be resized in O(1) time:

    A.resize(new_shape)
changes to array data can be either discarded or saved back to DB:
    transaction.abort()     # discard all made changes
    transaction.commit()    # atomically save all changes
Please see demo/demo_zbigarray.py for a complete example.
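The steps above can be pictured with a toy model. The sketch below is not how wendelin.core is implemented; `ToyPagedArray` and everything in it are hypothetical names chosen for illustration. It keeps committed data in fixed-size blocks, holds modified blocks in RAM until commit (one way to see why the amount of modifications per transaction is bounded by available RAM), appends by touching only the tail blocks, and resizes by changing a recorded length.

```python
class ToyPagedArray:
    """Hypothetical toy model of a block-backed array with commit/abort."""

    BLOCK = 4  # elements per block (tiny on purpose)

    def __init__(self):
        self._stored = {}      # block index -> list: stand-in for the database
        self._stored_len = 0
        self._dirty = {}       # block index -> list: modified blocks kept in RAM
        self._len = 0

    def __len__(self):
        return self._len

    def _block_for_write(self, i):
        # Copy a block into the dirty set the first time it is modified.
        if i not in self._dirty:
            self._dirty[i] = list(self._stored.get(i, [0] * self.BLOCK))
        return self._dirty[i]

    def __getitem__(self, idx):
        if not 0 <= idx < self._len:
            raise IndexError(idx)
        i, off = divmod(idx, self.BLOCK)
        block = self._dirty.get(i, self._stored.get(i))
        return 0 if block is None else block[off]

    def __setitem__(self, idx, value):
        if not 0 <= idx < self._len:
            raise IndexError(idx)
        i, off = divmod(idx, self.BLOCK)
        self._block_for_write(i)[off] = value

    def resize(self, new_len):
        # Only the recorded length changes; no data is copied.
        # (A real implementation would also handle data past the new end.)
        self._len = new_len

    def append(self, values):
        # Touches only the blocks covering the new tail: O(len(values)).
        start = self._len
        self.resize(start + len(values))
        for k, v in enumerate(values):
            self[start + k] = v

    def commit(self):
        # Save all dirty blocks and the length together.
        self._stored.update(self._dirty)
        self._stored_len = self._len
        self._dirty.clear()

    def abort(self):
        # Drop every uncommitted change.
        self._dirty.clear()
        self._len = self._stored_len


A = ToyPagedArray()
A.append([1, 2, 3, 4, 5])
A.commit()

A[0] = 100           # modify and append, then change our mind
A.append([6, 7])
A.abort()
after_abort = [A[k] for k in range(len(A))]

A[0] = 10
A.commit()
after_commit = [A[k] for k in range(len(A))]
```

In this toy, the memory held between commits grows with the number of distinct blocks modified, not with the total array size, which mirrors the constraint stated in the steps above.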
Current state and Roadmap
Wendelin.core works in real life for workloads Nexedi is using in production, including 24/7 projects. We are, however, aware of the following limitations and things that need to be improved:
wendelin.core is currently not very fast
there are large temporary array allocations, proportional in size to the input, in third-party libraries (NumPy, scikit-learn, …), which in practice might prevent processing out-of-core arrays, depending on the functionality used.
we are currently working on an improved wendelin.core design and implementation, which will use the kernel virtual memory manager (instead of the one implemented in userspace), with the array backend presented to the kernel via FUSE as a virtual filesystem implemented in Go.
In parallel we will also:
try wendelin.core 1.0 on large data sets
identify and incrementally fix big-temporaries allocation issues in NumPy and scikit-learn
We are open to community help with the above.
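To make the temporary-allocation concern concrete, here is a small NumPy-only sketch. An expression such as `((a - a.mean()) ** 2).mean()` allocates temporaries as large as `a`, whereas processing fixed-size chunks keeps temporaries bounded by the chunk size. `chunked_variance` and `chunk` are hypothetical names introduced here; this is not code from wendelin.core, NumPy, or scikit-learn, and the two-pass formula is just one straightforward choice.

```python
import numpy as np


def chunked_variance(a, chunk=1 << 16):
    """Population variance of a 1-D array using chunk-sized temporaries."""
    n = len(a)
    if n == 0:
        raise ValueError("empty input")

    # Pass 1: mean, accumulated chunk by chunk.
    total = 0.0
    for start in range(0, n, chunk):
        total += float(np.sum(a[start:start + chunk], dtype=np.float64))
    mean = total / n

    # Pass 2: sum of squared deviations; each temporary is chunk-sized.
    sq = 0.0
    for start in range(0, n, chunk):
        d = a[start:start + chunk] - mean
        sq += float(np.dot(d, d))
    return sq / n


a = np.arange(1_000_000, dtype=np.float64)
v = chunked_variance(a)
```

The same access pattern applies to any object whose slices are ndarrays, so in principle it could be used on bigarray slices as well, though that has not been checked here.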
Wendelin.core change history
Add support for Python 3.7 (commit).
Fix logic to keep ZBigFileH in sync with ZODB connection (commit).
Fix crash on PyVMA deallocation (commit).
Enhance t/qemu-runlinux - utility that is used to work on combined kernel/user-space workloads (commit 1, 2, 3, 4, 5, 6). This was in particular useful to develop Linux kernel fixes that are needed for Wendelin.core 2.0 (kernel commit 1, 2, 3, 4, 5, 6, 7).
Add ArrayRef utility to find out for a NumPy array its top-level root parent and how to recreate the array as some view of the root; this builds the foundation for e.g. sending arrays as references without copy in CMFActivity joblib backend (commit 1, 2, 3).
Don’t crash on writeout if previously storeblk() resulted in error (commit).
Use zodbtools/zodburi, if available, to open database by URL (commit).
Switch back to using ZBlk0 format by default (commit)
Tell the world dtype=object is not supported (commit)
Do not leak memory when loading data in ZBlk1 format (commit).
Add support for O(δ) in-place BigArray.append() (commit)
Implement proper multithreading support (commit)
Add support for automatic BigArray -> ndarray conversion, so that e.g. the following:
    A = BigArray(...)
    numpy.mean(A)       # passing BigArray to plain NumPy function

either succeeds, or raises MemoryError if not enough address space is available to cover the whole of A (the current limit is ~127TB on Linux/amd64).
Various bugfixes (build-fixes, crashes, overflows, etc)
Add support for O(1) in-place BigArray.resize() (commit)
Various build bugfixes (older systems, non-std python, etc)