Numpy columnar storage for IceCube data

Project description

i3cols

Convert IceCube i3 files to columnar Numpy arrays & operate on them.

Motivation

IceCube .i3 files are formulated for arbitrary event-by-event processing of "events." The information that comprises an "event" can span multiple frames in the file, and the file must be read and written sequentially like linear tape storage (i.e., processing requires a finite state machine). This is well-suited to "online" and "real-time" processing of events.

Additionally, the IceTray, which is meant to read, process, and produce .i3 files, can process multiple files but is unwaware of file boundaries, despite the fundamental role that "number of files" plays in normalizing IceCube Monte Carlo simulation data.

Beyond collecting data, performing event splitting, and real-time algorithms, analysis often is most efficient and straightforward to work with specific features of events atomically: i.e., in a columnar fashion.

Two existing options are writing data to ROOT files and HDF5 files. The former requires a learning curve of its own outside of Python/Numpy, though possibly through projects like Uproot, such files could be more practical to use for Python-based analysis. HDF5 files suffer from the inability to be manipulated in parallel from Python and present an unessential layer of complexity around what we actually want and use: Numpy arrays of data.

i3cols allows working with IceCube data in columnar fashion without added complexity of libraries beyond Numpy, and both the columnar nature and the simplicity of the data representation allows more straightforward as well as new and novel ways of interacting with data which should be more natural and efficient for many common use-cases in Python/Numpy:

Basics

Apply numpy operations directly to data arrays
Extract data columns to pass to machine learning tools directly or at most with simple indexing; e.g., to extract the azimuthal direction of all particles: particles["dir"]["azimuth"].
Memory mapping allows chunked operations and arbitrary reading and/or writing to the same arrays from multiple processes (just make sure the processes don't try to write to the same elements).
New levels of processing entail adding new columns, without the need to duplicate all data that came before. This has the disadvantage that cuts that remove the vast majority of events result in columns that have the same number of rows as the pre-cut. However, the advantage to working this way is that it is trivial to go back to the lowest-level of processing and also to inspect how the cuts affect all variables contained at any level of processing. (A future goal is to also accommodate efficiently storing a subset of rows by creating a "subscalar" column that only contains rows where its or another column's "valid" array is True.)
Numpy arrays with structured dtypes can be passed directly to Numba for efficient looping with similar performance to compiled C++ code, as Numba is just-in-time (JIT)-compiled to machine code.
There is no dependency upon IceCube software once data has been extracted. This can be seen as a positive (portability) and a negative (IceCube software has been developed, in part, to perform analysis tasks).
If you think of a new item you want to extract from the source i3 files after already performing an extraction, it is much faster and the process only yields the new column (rather than an entirely new file, as is the case with HDF5 extraction).
Basic operations like transferring specific frame items can be achieved with simple UNIX utilities (cp, rsync, etc.)
A future goal is to implement versioning with each column that is either explicitly accessible (if desired) or transparent (if desired) to users, such that different processing versions can live side-by-side without affecting one another.

Flattening hierarchies for fast analysis without losing hierarchies

Source i3 files are invisible to analysis, if you want (the fact that data came from hundreds of thousands of small i3 files does not slow down analysis or grind cluster storage to a halt). If the concatenated arrays are too large to fit into memory, you can operate on arrays in arbitrary chunks of events and/or work directly on the data on disk via Numpy's built-in memory mapping (which is transparent to Numpy operations).
Source i3 files can be explicitly known to analysis if you want (via the Numpy arrays called in i3cols "category indexes").
Flattened datastructures allow efficient operations on them. E.g., looking at all pulses is trivial without needing to traverse a hierarchy. But information about the hierarchy is preserved, so operating in that manner is still possible (and still very fast with Numpy and/or Numba).

Data storage

There are two kinds of data currently supported:

Scalar data: One item per event. This includes arbitrarily complex dtypes, including a vector of N of a thing. The only requirement is that every event must also have exactly N of those things. This requires a data array which has one entry per event.
Vector data: Zero, one, or more of an item per event. E.g., pulses. This requires both data and index arrays. The index array has one entry per event which indicates the start and stop indices of that event's data. Meanwhile, data can be arbitrarily long.

To accomodate arbitrary missing or invalid data, there can be an optional valid array alongside the other arrays.

One column consists of the container (either a directory or .npz file) and the set of the above arrays it contains. The directory version of this looks on disk like, for example for InIcePulses (which is vector data) with a valid array:

InIcePulses/
  |
  +-- data.npy
  |
  +-- valid.npy
  |
  +-- index.npy

If the above is compressed, it becomes a single .npz file: InIcePulses.npz. Within the file, the arrays can be accessed via their names "data", "valid", and/or "index". Here, we load all arrays that are present into a dictionary:

with np.load("InIcePulses.npz") as npz:
    array_d = {name: npz.get(name, None) for name in ["data", "valid", "index"]}

Note there is a convenience function to do this (and load any category indices) for a single item or for an entire directory, transparent to compression: arrays, category_indexes = i3cols.cols.load(dirpath).

Unsupported (or awkwardly supported) data types

Mixture of types: Scalar data where the item in one event has one type while another event (for that same item) has another type; similarly for vector data where dtype changes either within or across events. E.g., if the datatype contains a different-length-per-event vector as one field within the type (while all other parts of the type are constant-length). This can be accommodated (to varying degrees of ineffficiency) by creating a type that contains all fields and just one of the vector type, and then duplicating the fields that remain constant for every value in the vector.

Installation

pip install i3cols

For developers

Clone the repository and then perform an editable (pip install -e) installation:

git clone git@github.com:jllanfranchi/i3cols.git
pip install -e ./i3cols

Examples

Extracting data from I3 files

All command-line examples assume you are using BASH; adapt as necessary for your favorite shell.

Extract a few items from all Monte Carlo run 160000 files (nutau GENIE simulation performed for oscNext), concatenating into single column per item:

find /tmp/i3/genie/level7_v01.04/160000/ -name "oscNext*.i3*" | \
    sort -V | \
    i3cols extract_files_separately \
        --keys I3EventHeader I3MCTree I3MCWeightDict \
        --index-and-concatenate \
        --category-xform subrun \
        --procs 20 \
        --overwrite \
        --outdir /tmp/columnar/genie/level7_v01.04/160000

If you completed the above and realize you also want the I3GENIEResultDict, then you can re-run the above but just specify that key. The process should be much faster:

find /tmp/i3/genie/level7_v01.04/160000/ -name "oscNext*.i3*" | \
    sort -V | \
    i3cols extract_files_separately \
        --keys I3GENIEResultDict \
        --index-and-concatenate \
        --category-xform subrun \
        --procs 20 \
        --overwrite \
        --outdir /tmp/columnar/genie/level7_v01.04/160000

Extract all keys from IC86.11 season. All subrun files for a given run are combined transparently into one and then all runs are combined in the end into monolithic columns, with a run__category_index.npy created in outdir that indexes the columns by run:

i3cols extract_season \
    /tmp/i3/data/level7_v01.04/IC86.11/ \
    --index-and-concatenate \
    --gcd /data/icecube/gcd/ \
    --overwrite \
    --outdir /tmp/columnar/data/level7_v01.04/IC86.11 \
    --compress

Things to note in the above:

You can specify paths on the command line, or you can pipe the paths to the function. The former is simple for specifying one or a few paths, but UNIX command lines are limited in total length, so the latter can be the only way to successfully pass all paths to i3cols (and UNIX pipes are nice ayway; note the numerical-version-sorting performed inline via find <...> | sort -V | i3cols ... in the first example).
Optional compression of the resulting column directories (a directory + 1 or more npy arrays within) can be performed after the extraction. Memory mapping is not possible with the compressed files, but significant compression ratios are achievable.
Extraction is performed in parallel where possible. Specify --procs to limit the number of subroccesses; otherwise, extraction (and compression/decompression) will attempt to use all cores (or hyperthreads, where available) on a machine.

Working with the extracted data

Extracted data is output to Numpy arrays, possibly with structured Numpy dtypes.

import numba

from i3cols import cols, phys

@numba.njit(fastmath=True, error_model="numpy")
def get_tau_info(data, index):
    """Return indices of events which exhibit nutau regeneration and return a
    dict of decay products of primary nutau.
    """

    tau_regen_evt_indices = numba.typed.List()
    tau_decay_products = numba.typed.Dict()
    for evt_idx, index_ in enumerate(index):
        flat_particles = data[index_["start"] : index_["stop"]]
        for flat_particle in flat_particles:
            if flat_particle["level"] == 0:
                if flat_particle["particle"]["pdg_encoding"] not in phys.NUTAUS:
                    break
            else:
                pdg = flat_particle["particle"]["pdg_encoding"]
                if flat_particle["level"] == 1:
                    if pdg not in tau_decay_products:
                        tau_decay_products[pdg] = 0
                    tau_decay_products[pdg] += 1
                if pdg in phys.NUTAUS:
                    tau_regen_evt_indices.append(evt_idx)
    return tau_regen_evt_indices, tau_decay_products


# Load just the I3MCTree (regardless of presence other columns), memory-mapped 
arrays, category_indexes = cols.load("/tmp/columnar/genie/level7_v01.04/160000", keys="I3MCTree", mmap=True)

# Get the info!
tau_regen_evt_indices, tau_decay_products = get_tau_info_nb(**arrays["I3MCTree"])

Project details

Release history Release notifications | RSS feed

0.1.4

Apr 28, 2020

This version

0.1.3

Apr 14, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

i3cols-0.1.3.tar.gz (62.9 kB view hashes)

Uploaded Apr 14, 2020 Source

Built Distribution

i3cols-0.1.3-py3-none-any.whl (59.7 kB view hashes)

Uploaded Apr 14, 2020 Python 3

Hashes for i3cols-0.1.3.tar.gz

Hashes for i3cols-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`a27f5775ff352630d2b2020ac1946f81843f34af736970cb3dff7f6f786c787e`
MD5	`44f8802b0db68557110281e8bd9835c2`
BLAKE2b-256	`4704b8f0791a5eb29fb25544ee6c5de3fc9f46199bd086dedb837192a8fe638f`

Hashes for i3cols-0.1.3-py3-none-any.whl

Hashes for i3cols-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`14721c86dbbcec9257cbbbba10d16d4563bec9dcf700b114013b0c7cef705eb1`
MD5	`db4dbefdc60aa365e6c85a516fbab00d`
BLAKE2b-256	`db82e98285d7e2d381ba49b6182739442f2b4b39d8823163d8b3c25f0e009e2f`