Innovative package to optimize HEP analyses based on ROOT RDataFrame
Cut-based analyses in HEP involve a series of fairly common operations:
- a dataset is accessed;
- cuts are performed on one or more variables;
- the variables of interest are plotted for the events that pass the above-mentioned selections.
This series of operations makes up what we can call a minimal analysis flow unit. In a typical HEP analysis, given the number of datasets, cuts and plots that we want to produce, we have to handle hundreds of thousands of these units. Two basic examples of analysis units are shown here. To produce their results, many physicists use the so-called TTree::Draw approach: it is simple because the event loop does not have to be written explicitly, but it has the drawback that the analysis units are run sequentially. This leads to the following weak spots:
- the same datasets (e.g. Dataset-X above) are fetched and decompressed multiple times;
- the same subsets of cuts (e.g. Selection-A above) are applied many times on different datasets;
- the event loop is run once per histogram.
NTupro provides a clean and friendly API on top of RDataFrame: the user writes minimal analysis flow units and the package optimizes them automatically, performing common operations only once, as in the following.
NTupro is divided into three main parts, briefly summarized here and described in detail in the following subsections:
- Book Results: for every histogram that we want to produce, we declare the initial dataset, cuts, weights and systematic variations that need to be applied;
- Optimize Computations: datasets, selections and histograms are treated as nodes of a graph; the common ones are merged so that every action is performed only once;
- Run Computations: the previous graphs are converted to the language of RDataFrame and the event loop is run; multiprocessing and multithreading facilities are implemented here.
In this first abstraction layer, the idea is to book all the histograms that we want to produce, declaring for each of them the set of ntuples from which it is taken, the set of cuts and weights applied and the systematic variations. To do so, the following classes are introduced:
Dataset: structure containing a set of ntuple names (see class Ntuple below) where we expect to fetch the variable we want to plot;

    class Dataset:
        name = 'dataset_name'
        ntuples = [ntuple1, ntuple2, ...]

    class Ntuple:
        path = 'path_to_file.root'
        directory = 'tree_name'
        # Friends are other instances of Ntuple
        friends = [friend1, friend2, ...]

Selection: structure containing two lists, one for the cuts we want to apply to the dataset and one for the weights applied to the histogram; organizing cuts and weights in these logical blocks makes sense from the analysis point of view, since it makes it possible to encode the physics knowledge about channels, processes, etc.; more than one selection can be applied;

    class Selection:
        name = 'selection_name'
        cuts = [('cut1_name', 'cut1_expression'), ...]
        weights = [('weight1_name', 'weight1_expression'), ...]

Action: structure that represents the results that can be extracted when the event loop is run; the two classes that inherit from it are Histogram and Count, the former containing the variable we want to plot and the list of edges of the histogram, the latter representing the sum of weights.

    class Action:
        name
        variable

    class Count(Action):
        pass

    class Histogram(Action):
        name = 'histogram_name'
        variable = 'variable_to_plot'
        edges = [edge1, edge2, ...]

Instances of the above-mentioned classes are passed as arguments to the class Unit, which represents a minimal analysis flow unit, i.e. the dataset where the events are stored, the selections applied and the actions we want to perform.

    class Unit:
        dataset = dataset_object
        selections = [selection1, selection2, ...]
        actions = [histogram1, histogram2, ...]

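The booking classes above can be sketched as plain Python dataclasses. This is only an illustration under stated assumptions, not the package's actual implementation: the use of `dataclass(frozen=True)` and tuples (chosen here so that identical blocks compare equal and are hashable) and all example values such as `pt > 20` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ntuple:
    path: str                            # ROOT file that stores the tree
    directory: str                       # name of the tree inside the file
    friends: Tuple["Ntuple", ...] = ()   # friends are other Ntuple instances


@dataclass(frozen=True)
class Dataset:
    name: str
    ntuples: Tuple[Ntuple, ...]


@dataclass(frozen=True)
class Selection:
    name: str
    cuts: Tuple[Tuple[str, str], ...] = ()      # (cut_name, cut_expression)
    weights: Tuple[Tuple[str, str], ...] = ()   # (weight_name, weight_expression)


@dataclass(frozen=True)
class Action:
    name: str
    variable: str


@dataclass(frozen=True)
class Count(Action):
    pass


@dataclass(frozen=True)
class Histogram(Action):
    edges: Tuple[float, ...]


@dataclass(frozen=True)
class Unit:
    dataset: Dataset
    selections: Tuple[Selection, ...]
    actions: Tuple[Action, ...]


# Two units that share the dataset and the selection but plot different variables.
ntuple = Ntuple("path_to_file.root", "tree_name")
dataset_x = Dataset("dataset_x", (ntuple,))
selection_a = Selection(
    "selection_a",
    cuts=(("pt_cut", "pt > 20"),),
    weights=(("gen_weight", "genWeight"),),
)
unit_1 = Unit(dataset_x, (selection_a,), (Histogram("h_pt", "pt", (0.0, 50.0, 100.0)),))
unit_2 = Unit(dataset_x, (selection_a,), (Histogram("h_eta", "eta", (-2.5, 0.0, 2.5)),))
```

Because the blocks are value objects, the shared `Dataset` and `Selection` of the two units compare equal, which is the property a later merging step can rely on.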
The systematic variations are also booked in this step. The base class is Variation, from which the following classes inherit:
ChangeDataset: create a copy of the target Unit object with a different Dataset block;

    class ChangeDataset:
        name = 'variation_name'
        folder_name = 'new_dataset_name'

        def create(target_unit):
            return Unit(new_dataset, target_unit.selections, target_unit.actions)

RemoveWeight: create a copy of the target Unit object with a different (or one more) Selection block containing a different list of weights;
RemoveCut: create a copy of the target Unit object with a different (or one more) Selection block containing a different list of cuts.
In general, systematic variations operate by creating copies of Unit objects with some differences in the blocks they are made of. Units are managed and booked, along with the systematic variations, by setting a UnitManager:

    class UnitManager:
        booked_units = []

        def book_units(units, variations):
            # Book units and apply variations

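To make the "copy with a difference" idea concrete, here is a minimal sketch of a cut-removing variation and of a booking function. It is an assumption-laden illustration, not the package's code: the simplified `Unit` (dataset and actions reduced to strings), the `cut_name` attribute and the free function `book_units` are hypothetical, and this sketch replaces the cut list in every selection rather than adding a new Selection block.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Selection:
    name: str
    cuts: Tuple[Tuple[str, str], ...] = ()
    weights: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Unit:
    dataset: str                 # simplified: only the dataset name
    selections: Tuple[Selection, ...]
    actions: Tuple[str, ...]     # simplified: only the action names


@dataclass(frozen=True)
class RemoveCut:
    name: str       # label of the variation
    cut_name: str   # hypothetical attribute: name of the cut to drop

    def create(self, target_unit: Unit) -> Unit:
        # Copy every selection, leaving out the cut with the given name.
        new_selections = tuple(
            replace(sel, cuts=tuple(c for c in sel.cuts if c[0] != self.cut_name))
            for sel in target_unit.selections
        )
        # The target unit is left untouched; a modified copy is returned.
        return replace(target_unit, selections=new_selections)


def book_units(units, variations):
    """Return the nominal units followed by one varied copy per (unit, variation)."""
    booked = list(units)
    for unit in units:
        for variation in variations:
            booked.append(variation.create(unit))
    return booked


nominal = Unit(
    "dataset_x",
    (Selection("selection_a", cuts=(("pt_cut", "pt > 20"), ("eta_cut", "abs(eta) < 2.4"))),),
    ("h_pt",),
)
booked = book_units([nominal], [RemoveCut("no_eta_cut", "eta_cut")])
```

The varied unit shares the dataset and actions with the nominal one, so most of its blocks remain candidates for merging in the optimization stage.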
In this stage, the goal is to merge the Units (paths) into directed graphs. The blocks that make up the Units introduced in the previous part (i.e. Datasets, Selections and Actions) are treated as nodes of a graph. The common ones are merged in order to perform every action only once. At the end of this step, we end up with a set of trees. It is worth pointing out that there is a one-to-one relationship between graphs and datasets at the end of this step, i.e. we do not have two graphs with the same Dataset.
Three levels of optimization are implemented:
- optimization 0: no optimization is implemented and the new software behaves like the current one;
- optimization 1: only Dataset nodes are merged;
- optimization 2: both Dataset and Selection nodes are merged.
Each level brings a different amount of improvement.
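The effect of the three levels can be illustrated with a toy merger. This is a sketch under assumptions, not the package's algorithm: units are reduced to tuples of names, and `Node`, `build_graphs` and `count_nodes` are hypothetical helpers introduced only for this example.

```python
class Node:
    def __init__(self, kind, name):
        self.kind = kind        # "dataset", "selection" or "action"
        self.name = name
        self.children = []

    def child(self, kind, name, merge):
        """Return a matching child if merging is allowed, else append a new one."""
        if merge:
            for node in self.children:
                if (node.kind, node.name) == (kind, name):
                    return node
        node = Node(kind, name)
        self.children.append(node)
        return node


def build_graphs(units, level):
    """Merge units given as (dataset, selections, actions) into trees.

    level 0: one graph per unit; level 1: shared Dataset roots;
    level 2: shared Dataset roots and shared Selection prefixes.
    """
    holder = Node("holder", "")   # artificial parent of all dataset roots
    for dataset, selections, actions in units:
        node = holder.child("dataset", dataset, merge=level >= 1)
        for selection in selections:
            node = node.child("selection", selection, merge=level >= 2)
        for action in actions:
            node.child("action", action, merge=False)
    return holder.children


def count_nodes(graphs, kind):
    """Count the nodes of a given kind over all graphs."""
    total = 0
    stack = list(graphs)
    while stack:
        node = stack.pop()
        total += node.kind == kind
        stack.extend(node.children)
    return total


units = [
    ("dataset_x", ["selection_a"], ["h_pt"]),
    ("dataset_x", ["selection_a", "selection_b"], ["h_eta"]),
    ("dataset_y", ["selection_a"], ["h_pt"]),
]
graphs_0 = build_graphs(units, level=0)   # 3 graphs, 4 selection nodes
graphs_1 = build_graphs(units, level=1)   # 2 graphs, 4 selection nodes
graphs_2 = build_graphs(units, level=2)   # 2 graphs, 3 selection nodes
```

In this toy input, level 1 reduces the number of graphs (and therefore event loops) from three to two, while level 2 additionally shares `selection_a` between the two units on `dataset_x`. The number of action nodes is the same at every level.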
In this stage the ROOT facilities come into play. The optimized graphs created in the previous stage are converted into RDataFrame computational graphs. More specifically, each node of an abstract graph corresponds to an RDataFrame node type (e.g. Histo1D, etc.). A recursive function returns a list of pointers to the histograms for each graph. The event loop is run only at the end, once for each graph.
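The recursive conversion can be sketched as follows. The function only assumes that the frame object offers `Filter(expression, name)` and `Histo1D(model, variable)`, which are the names of the corresponding RDataFrame methods; everything else is hypothetical: the tuple-based graph layout, the uniform binning, and the `RecordingFrame` stand-in, which replaces a real RDataFrame so that the example runs without ROOT.

```python
def convert(node, frame, results):
    """Recursively map a graph node onto `frame` and collect the booked results."""
    kind, payload, children = node
    if kind == "selection":
        name, expression = payload
        frame = frame.Filter(expression, name)
    elif kind == "histogram":
        name, variable, nbins, low, high = payload
        results.append(frame.Histo1D((name, name, nbins, low, high), variable))
    # Children are attached to the (possibly filtered) frame of their parent.
    for child in children:
        convert(child, frame, results)
    return results


class RecordingFrame:
    """Stand-in for an RDataFrame node: it only records the calls it receives."""

    def __init__(self, log, cuts=()):
        self.log = log
        self.cuts = tuple(cuts)

    def Filter(self, expression, name):
        self.log.append(("Filter", name))
        return RecordingFrame(self.log, self.cuts + (expression,))

    def Histo1D(self, model, variable):
        self.log.append(("Histo1D", model[0]))
        return (model[0], variable, self.cuts)


graph = ("dataset", "dataset_x", [
    ("selection", ("selection_a", "pt > 20"), [
        ("histogram", ("h_pt", "pt", 10, 0.0, 100.0), []),
        ("selection", ("selection_b", "abs(eta) < 2.4"), [
            ("histogram", ("h_eta", "eta", 10, -2.5, 2.5), []),
        ]),
    ]),
])
log = []
results = convert(graph, RecordingFrame(log), [])
```

`selection_a` is translated into a single `Filter` call even though both histograms depend on it. With a real RDataFrame the booked results are lazy, so the event loop would start only when the first of them is accessed.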
In this stage two parallelization techniques are introduced:
- multithreading is enabled with a call to the function ROOT.EnableImplicitMT;
- multiprocessing is enabled with the Python package of the same name: a pool of workers is created and the RDataFrame objects on which the event loop has to be run are sent to them one by one; when a worker is done, it gets the next object in the buffer.
In the following, we report a simple (and completely unrealistic) example that produces three histograms after the application of two systematic variations.

    from ntuple_processor import Dataset, Unit, UnitManager, GraphManager, RunManager

    # Create a Dataset
    # my_ntuples is a list of Ntuple objects
    my_dataset = Dataset('my_dataset', my_ntuples)

    # Create a Unit
    # Remember: a Unit is made by the following elements:
    # Dataset - [Selections] - [Histograms]
    my_unit = Unit(my_dataset, [selection1, selection2], [histo_var1])

    # Set a Unit manager
    um = UnitManager()

    # Book Units and apply systematic variations
    um.book([my_unit], [sys_variation1, sys_variation2])

    # Create graphs from Units
    graph_manager = GraphManager(um.booked_units)
    graph_manager.optimize()

    # Run - Convert to RDataFrame
    run_manager = RunManager(graph_manager.graphs)
    run_manager.run_locally('file.root', nworkers=1, nthreads=2)

Before merging, check that all the tests are green by running
$ python -m unittest -v