Memory-mapped numeric arrays, based on a format that is self-explanatory and tool-independent
Darr is a Python library for disk-based NumPy arrays that persist in a format that is simple, self-documented and tool-independent. It keeps your data easily accessible from a wide range of computing environments. Every array is automatically documented, and includes code to read itself in languages such as R, Julia, IDL, Matlab, Maple, and Mathematica, or in Python/Numpy without Darr (see example). Keeping data universally readable and documented is a pillar of good scientific practice. More rationale for this approach is provided here.
Under the hood, Darr uses NumPy memory-mapped arrays, which is a widely established, tested and trusted way of working with disk-based numerical arrays, and which makes Darr fully NumPy compatible. It enables efficient out-of-core read/write access to potentially very large arrays. What Darr adds is that it automatically keeps your arrays fully documented, open, and thus widely readable. Further, Darr adds functionality to make your life easier in other ways, such as the support for ragged arrays, the ability to create arrays from iterators, append and truncate functionality, and the easy use of metadata.
Flat binary files and (JSON) text files are accompanied by a README text file that explains how the array and metadata are stored. It is trivially easy to share your data with others or with yourself when working in different computing environments because it always contains clear documentation, including code to read it. Does your colleague want to try out an interesting algorithm in R or Matlab on your array data? No need to export anything or to provide elaborate explanation. A copy-paste of a few lines of code from the documentation stored with the data is sufficient. No dependence on complicated formats or specialized libraries. Self-documentation and code examples are automatically updated as you change your arrays when working with them.
Darr is currently pre-1.0, still undergoing significant development. It is open source and freely available under the New BSD License terms.
Data is stored purely based on flat binary and text files, maximizing universal readability.
Automatic self-documention, including copy-paste ready code snippets for reading the array in a number of popular data analysis environments, such as Python (without Darr), R, Julia, Octave/Matlab, GDL/IDL, and Mathematica (see example array).
Disk-persistent array data is directly accessible through NumPy indexing and may be larger than RAM and that is easily appendable.
Supports ragged arrays.
Easy use of metadata, stored in a widely readable separate JSON text file.
Many numeric types are supported: (u)int8-(u)int64, float16-float64, complex64, complex128.
Integrates easily with the Dask library for out-of-core computation on very large arrays.
Minimal dependencies, only NumPy.
See the documentation for more information.
Release history Release notifications | RSS feed
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.