Tools for analyzing hierarchically nested, columnar data without deserialization.
Data analysts are often faced with a choice between speed and flexibility. Tabular data, such as SQL tables, can be processed rapidly enough for a truly interactive analysis session, but hierarchically nested data, such as JSON, is better at representing relationships within complex data models. In some domains (such as particle physics), we want to perform calculations on JSON-like structures at the speed of SQL.
The key to high throughput for large datasets, particularly ones with many more attributes than are typically accessed in a single pass, is laying out the data in “columns.” All values of an attribute should be contiguous in disk or memory because data are paged from one cache to the next in locally contiguous blocks. The ROOT and Parquet file formats represent hierarchically nested data in a columnar form on disk, and Apache Arrow is an emerging standard for sharing hierarchically nested data in memory. However, data from these formats are usually deserialized into conventional data structures before processing, which limits performance (see this talk and this paper).
OAMap is a toolset for computing arbitrary functions on hierarchical, columnar data without deserialization. The name stands for Object Array Mapping, in analogy with the Object Relational Mapping (ORM) interface to some databases. Users (data analysts) write functions on JSON-like objects that OAMap compiles into operations on the underlying arrays, similar to the way that ORM converts object-oriented code into SQL. The difference is that mapping objects to non-relational arrays permits bare-metal performance (giving up some traditional database features).
OAMap is a Python library on top of which high-level analysis software may be built. It focuses on mapping an object-oriented view of data onto columnar arrays. Numpy is OAMap’s only strict dependency, though OAMap objects are also implemented as Numba extensions, so they may be used in Numba’s JIT-compiled functions at speeds that match or exceed hand-written C code. OAMap is unopinionated about the source of its columnar arrays, allowing for a variety of backends. See the walkthrough for more.
Also, a similar object array mapping could be implemented in any language— Python was chosen only for its popularity among data analysts.