Columnar analysis utils.
F9 Columnar
🚧 Work in Progress ⚠️
Setup
User install
With PyTorch GPU
pip install f9columnar[torch]
With PyTorch CPU (recommended)
pip install f9columnar
pip install torch --index-url https://download.pytorch.org/whl/cpu
Without PyTorch
pip install f9columnar
Development install
Use poetry to install the required packages:
poetry config cache-dir $PWD
poetry config virtualenvs.in-project true
poetry install -E torch
This environment is duplicated for batch processing on dCache.
aCT
ARC Control Tower (aCT) is a system for submitting and managing payloads on ARC (and other) Computing Elements. It is used to submit jobs on sites in Slovenia.
Installation
Install the aCT client from its repository into the virtual environment with the following command (or use poetry):
pip install "git+https://github.com/ARCControlTower/aCT.git@test#subdirectory=src/act/client/aCT-client"
The command act is available in PATH once the virtual environment is activated. See the scripts in the submit directory for further details.
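Whether the client is reachable from the active environment can also be checked from Python. The following is a minimal sketch using only the standard library; the helper name command_available is hypothetical and not part of aCT or f9columnar.

```python
import shutil


def command_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


# "act" is the aCT client command mentioned above
if command_available("act"):
    print("act found at", shutil.which("act"))
else:
    print("act not found; is the virtual environment activated?")
```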
VOMS proxy setup
It is recommended to be a member of the /atlas/si group and to create the proxy with it. Activate it using (in a separate terminal):
setupATLAS
lsetup emi
voms-proxy-init --valid 96:0 --voms atlas:/atlas/si
To propagate the proxy to the system, use:
act proxy
At this point you are ready to use aCT.
Getting started example
The code is written in a way that it can be used standalone or as a part of an analysis containing data and MC samples. The following example shows the standalone usage.
Basic example
The main idea is a columnar event loop that returns arrays of events. The code and its usage follow a standard torch training loop, except that instead of iterating over epochs we iterate over batches of events.
from f9columnar.root_dataloader import get_root_dataloader


def filter_branch(branch):
    # select only these two branches
    return branch.name == "tau_p4" or branch.name == "lephad_p4"


# root_dataloader is an instance of a torch DataLoader that uses an IterableDataset
root_dataloader, total = get_root_dataloader(
    ntuple_path,  # path to the root file (file, list of paths or directory)
    name="data",  # name identifier
    chunks="auto",  # number of chunks to split the root file(s) into
    setup_workers=4,  # number of workers for initial setup
    step_size="15 MB",  # size of the step for the worker to read the root file
    postfix="NOMINAL",  # root file postfix
    filter_branch=filter_branch,  # filter branches
    processors=None,  # arbitrary calculations on arrays
    num_workers=12,  # number of workers for parallel processing
)

# loop over batches of events from .root file(s), each batch is an awkward array
for events in root_dataloader:
    arrays, report = events
    # ... do something with the arrays
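To illustrate the idea of a columnar event loop independently of f9columnar, the sketch below splits plain Python columns into batches and iterates over them. It is not how get_root_dataloader is implemented; the function iterate_batches and the toy column names are assumptions made for illustration only.

```python
def iterate_batches(columns, batch_size):
    """Yield dictionaries of column slices, each holding up to batch_size events."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of events")
    n_events = lengths.pop()

    for start in range(0, n_events, batch_size):
        stop = start + batch_size
        yield {name: values[start:stop] for name, values in columns.items()}


# toy data: 5 events with two columns (hypothetical names)
columns = {
    "pt": [25.0, 40.0, 31.5, 60.2, 18.9],
    "eta": [0.1, -1.2, 2.0, 0.7, -0.3],
}

batch_sizes = []
for arrays in iterate_batches(columns, batch_size=2):
    # each iteration sees whole columns of one batch, not single events
    batch_sizes.append(len(arrays["pt"]))

print(batch_sizes)
```

The loop body has the same shape as the event loop above: each iteration receives a batch of columns rather than one event at a time.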
Calculations on arrays inside of workers can be done using a Processor. Many processors can be chained together into a ProcessorsGraph (DAG) to perform more complex calculations. Processors are applied to the arrays in the order given by the topological sort of the DAG. Note that each worker runs the same processor graph on batches of array events and returns the result to the event loop when done. So in the above example there would be 12 (num_workers) processor graphs running in parallel on small batches of events. An example of calculating tau visible mass and then applying a cut on this variable is shown below.
from f9columnar.processors import ProcessorsGraph, CheckpointProcessor
from f9columnar.object_collections import Variable, VariableCollection, Cut, CutCollection
from f9columnar.histograms import HistogramProcessor


class VisibleMass(Variable):  # Variable is a Processor
    name = "vis_mass"  # processor name
    branch_name = "lephad_p4"  # name of the branch in the .root file

    def __init__(self):
        super().__init__()

    def run(self, arrays):  # each processor must implement a run method
        lephad_p4 = arrays[self.branch_name]  # branch_name is the name of the field in the ak array
        v = get_kinematics_vector(lephad_p4)  # use vector with px, py, pz and E
        arrays["tau_vis_mass"] = v.m  # add a new field to the arrays

        return {"arrays": arrays}  # return the arrays (can also return None if no changes are made)


class CutVisibleMass(Cut):  # Cut is a Processor
    name = "vis_mass_cut"
    branch_name = None  # is not a branch in ntuples but was defined in the VisibleMass processor

    def __init__(self, cut_lower, cut_upper):  # arguments of the processor
        super().__init__()
        self.cut_lower = cut_lower
        self.cut_upper = cut_upper

    def run(self, arrays):
        mask = (arrays["tau_vis_mass"] > self.cut_lower) & (arrays["tau_vis_mass"] < self.cut_upper)
        arrays = arrays[mask]  # apply the cut

        return {"arrays": arrays}  # return must be a dictionary with key name for the argument of the next processor


class Histograms(HistogramProcessor):
    def __init__(self, name="histograms"):
        super().__init__(name)

        self.make_hist1d("tau_vis_mass", 20, 80.0, 110.0)  # make a histogram with 20 bins from 80 to 110 GeV

    def run(self, arrays):
        return super().run(arrays)  # auto fills histograms if array names match histogram names


var_collection = VariableCollection(VisibleMass, init=False)  # will initialize later
cut_collection = CutCollection(CutVisibleMass, init=False)

collection = var_collection + cut_collection  # add collections of objects together
branch_filter = collection.branch_name_filter  # defines the branches that the processors depend on

graph = ProcessorsGraph()  # graph has a fit method that gets called inside the root_dataloader

# add nodes to the graph
graph.add(
    CheckpointProcessor("input"),  # input node
    var_collection["vis_mass"](),  # initialize the processor
    cut_collection["vis_mass_cut"](cut_lower=90.0, cut_upper=100.0),
    CheckpointProcessor("output", save_arrays=True),  # saves final arrays
    Histograms(),
)

# build a processor graph
graph.connect(
    [
        ("input", "vis_mass"),
        ("vis_mass", "vis_mass_cut"),
        ("vis_mass_cut", "output"),
        ("output", "histograms"),
    ]
)

# plot the graph
graph.draw("graph.pdf")

# ... pass into the root_dataloader with the processors argument (e.g. processors=graph)

# in this case the dataloader will return a fitted graph
for processed_graph in root_dataloader:
    histograms = processed_graph["histograms"].hists
    arrays = processed_graph["output"].arrays
    # ... do something with the histograms and arrays
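The statement that processors run in the order given by a topological sort of the DAG can be sketched with the standard library alone. The example below is not the ProcessorsGraph implementation: it assumes a simple linear chain in which every node receives and returns one dictionary of columns, and the functions add_vis_mass and cut_vis_mass as well as the toy "mass" column are hypothetical stand-ins for the processors above.

```python
from graphlib import TopologicalSorter


def add_vis_mass(arrays):
    # toy "variable": copy an existing column under a new name
    arrays = dict(arrays)
    arrays["tau_vis_mass"] = list(arrays["mass"])
    return arrays


def cut_vis_mass(arrays, lower=90.0, upper=100.0):
    # toy "cut": keep events with lower < tau_vis_mass < upper in every column
    keep = [lower < m < upper for m in arrays["tau_vis_mass"]]
    return {name: [v for v, k in zip(values, keep) if k] for name, values in arrays.items()}


nodes = {
    "input": lambda arrays: arrays,
    "vis_mass": add_vis_mass,
    "vis_mass_cut": cut_vis_mass,
    "output": lambda arrays: arrays,
}
edges = [("input", "vis_mass"), ("vis_mass", "vis_mass_cut"), ("vis_mass_cut", "output")]

# TopologicalSorter expects a mapping from each node to its predecessors
predecessors = {name: set() for name in nodes}
for source, target in edges:
    predecessors[target].add(source)

order = list(TopologicalSorter(predecessors).static_order())

# run the nodes in topological order on a toy batch of four events
arrays = {"mass": [85.0, 92.5, 99.0, 104.0]}
for name in order:
    arrays = nodes[name](arrays)

print(order)
print(arrays["tau_vis_mass"])
```

Because the order is derived from the edges, the nodes can be declared in any order, which mirrors the separation between graph.add and graph.connect above.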
ROOT DataLoader schema
Development
Making a portable venv with conda
Make sure you have Miniconda installed:
mkdir miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda3/miniconda.sh
bash miniconda3/miniconda.sh -b -u -p miniconda3
rm -rf miniconda3/miniconda.sh
miniconda3/bin/conda init bash
The init command adds some path variables to your ~/.bashrc that you can delete when done.
To test the conda installation, use:
conda -V
Next, make a virtual environment:
conda create -n batch_venv python=3.12.4
source activate batch_venv
Install the required packages:
pip install f9columnar
pip install torch --index-url https://download.pytorch.org/whl/cpu
In order to make this environment portable use conda-pack:
conda install conda-pack
conda pack
conda deactivate
On the remote machine, unpack the environment into its own directory:
mkdir -p batch_venv
tar -xzf batch_venv.tar.gz -C batch_venv
source batch_venv/bin/activate
conda-unpack
dCache
Basic instructions can be found here.
To upload the venv described above to dCache, use:
arccp batch_venv.tar.gz davs://dcache.sling.si:2880/atlas/jang/
where you can make your own directory with arcmkdir.
lxplus venv setup
Log into lxplus:
ssh <name>@lxplus.cern.ch
Since we want custom python packages and installing on afs is not recommended, we will use eos:
cd /eos/user/j/jgavrano
Source an LCG release to use as base:
setupATLAS
lsetup "views LCG_105b x86_64-el9-gcc13-opt"
Set up the venv and install the required packages from requirements:
PYTHONUSERBASE=/eos/user/j/jgavrano/F9Columnar/ pip3 install --user --no-cache-dir -r requirements.txt
Test with the libraries in eos:
PYTHONPATH=/eos/user/j/jgavrano/F9Columnar/lib/python3.9/site-packages/:$PYTHONPATH python3 <script_name>.py
Set up python with the custom venv:
export PYTHONPATH=/eos/user/j/jgavrano/F9Columnar/lib/python3.9/site-packages/:$PYTHONPATH
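After changing PYTHONPATH it can be useful to confirm which location a package is actually resolved from. The snippet below is a small sketch using only the standard library; the helper name module_origin is hypothetical, and whether f9columnar is found depends on your environment.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# a standard library module is always found
print(module_origin("json"))
# prints a path inside the eos site-packages directory if the setup above is picked up
print(module_origin("f9columnar"))
```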
To make it public, go to the CERNBox website and share it with atlas-current-physicists.