Standardised ML input processing for particle physics

PyNuML: HDF5 IO and ML processing for neutrino physics

PyNuML is a Python toolkit for processing machine learning (ML) inputs from neutrino physics simulation datasets. It offers efficient MPI parallel processing of datasets, including standardised solutions for generating semantic and instance labels from low-level particle simulation, and constructing PyTorch ML inputs such as pixel maps and graphs. The package uses a modular design to maximise flexibility and extensibility, allowing the user to write custom labelling and/or object formation code in place of the algorithms provided.

Parallel Event IO

HDF5 files produced using the NuML standard contain tabular data structures representing events, simulated particles, energy depositions, detector hits and any other information defined by the user. For large datasets, looking up the rows of a table that belong to a specific event by event index can become prohibitively slow. PyNuML includes a metadata standard for efficient MPI parallel IO with large-scale physics event data. This approach allows datasets to be processed efficiently across MPI ranks on HPC nodes, while also providing a simple interface for interactive analysis.
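The idea behind index metadata can be illustrated with a minimal sketch. It assumes a table whose rows are sorted by event ID, records the starting row and row count for each event once, and then gives each rank a contiguous block of events to read as a single slice. The function names and layout here are hypothetical and are not PyNuML's actual metadata format or API.

```python
import numpy as np


def build_event_index(event_ids):
    """Return (unique_ids, offsets, counts) for a table sorted by event ID."""
    event_ids = np.asarray(event_ids)
    # Contiguous slicing only works if each event's rows are grouped together.
    if np.any(np.diff(event_ids) < 0):
        raise ValueError("table must be sorted by event ID")
    unique_ids, offsets, counts = np.unique(
        event_ids, return_index=True, return_counts=True
    )
    return unique_ids, offsets, counts


def partition_events(n_events, n_ranks, rank):
    """Contiguous block of event positions [start, stop) assigned to one rank."""
    base, extra = divmod(n_events, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def rows_for_rank(offsets, counts, n_ranks, rank):
    """Row range [first, last) that a rank reads in a single slice."""
    start, stop = partition_events(len(offsets), n_ranks, rank)
    if start == stop:
        return 0, 0  # this rank has no events
    return int(offsets[start]), int(offsets[stop - 1] + counts[stop - 1])


# Toy hit table: the event ID of each row (event 2 has no rows).
event_ids = [0, 0, 0, 1, 1, 3, 3, 3, 3, 4]
unique_ids, offsets, counts = build_event_index(event_ids)
row_ranges = [rows_for_rank(offsets, counts, n_ranks=3, rank=r) for r in range(3)]
```

In a real MPI job, the rank and the number of ranks would come from the communicator rather than a loop, and each rank would read only its own row range from the file.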

Semantic and instance labelling

Novel ML techniques developed for particle physics typically conform to one of several standard archetypes: event classification, instance segmentation to cluster detector hits into particles, or semantic segmentation of hits and/or particles into particle types. These applications typically utilise supervised learning, leveraging the detailed simulation already available to produce truth-labelled ML objects for model training.

Most particle physics experiments utilise the same primary workflow: primary particles from a generator are passed into Geant4 to simulate true energy depositions, which are in turn passed through detector simulation to produce simulated raw detector output. Generating ML truth labels for detector objects such as hits typically involves backtracking from detector-level information to access the underlying true particle information, and using that information to construct a semantic or instance label.

Many physicists producing ML inputs develop such a workflow from scratch, unnecessarily re-developing variants on the same basic mechanism over and over again, and often falling into the same pitfalls in the process. For instance, a user producing a CNN pixel map from detector hits will often loop over each hit, query a backtracker to fetch the associated true particle information, and then use that information to categorise that hit according to a user-defined semantic labelling scheme. This approach can become highly inefficient and convoluted as computational cycles are wasted re-categorising hits produced by the same simulated particle, especially if the labelling requires context information from parent or child particles.

PyNuML maximises efficiency by performing a single labelling pass over the true particle table, stepping hierarchically down from primary particles, assigning each particle a semantic and instance label using a standard taxonomy. These labels can then be efficiently propagated to detector objects using Pandas DataFrame merge operations, using the true energy deposition table as an intermediary. This also avoids the double-counting errors that can occur when objects must be aggregated into pixel or voxel maps.
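The following is a minimal sketch of this pattern, not PyNuML's actual implementation. The column names (`g4_id`, `parent_id`, `pdg`, `hit_id`, `energy`), the two-class taxonomy and the rule of taking the label that deposits the most energy in a hit are all assumptions made for illustration. Primaries are assumed to have `parent_id` 0.

```python
from collections import deque

import pandas as pd

SHOWER_PDG = {11, 22}  # toy taxonomy (assumption): electrons and photons shower


def label_particles(particles):
    """Label every particle once, visiting parents before their children."""
    children = particles.groupby("parent_id")["g4_id"].apply(list).to_dict()
    pdg = particles.set_index("g4_id")["pdg"].to_dict()
    # Queue entries: (particle ID, shower instance inherited from parent or None)
    queue = deque((pid, None) for pid in children.get(0, []))
    rows, next_instance = [], 0
    while queue:
        pid, shower_instance = queue.popleft()
        if shower_instance is not None:
            # Descendants of a shower join the same shower instance.
            semantic, instance = "shower", shower_instance
        else:
            semantic = "shower" if abs(pdg[pid]) in SHOWER_PDG else "track"
            instance, next_instance = next_instance, next_instance + 1
        rows.append((pid, semantic, instance))
        inherited = instance if semantic == "shower" else None
        queue.extend((child, inherited) for child in children.get(pid, []))
    return pd.DataFrame(rows, columns=["g4_id", "semantic", "instance"])


def label_hits(hits, edeps, particle_labels):
    """Propagate particle labels to hits through the energy deposition table."""
    merged = edeps.merge(particle_labels, on="g4_id", how="inner")
    summed = merged.groupby(
        ["hit_id", "semantic", "instance"], as_index=False
    )["energy"].sum()
    # Keep one row per hit (the label with the most energy) so no hit is duplicated.
    dominant = summed.sort_values("energy", ascending=False).drop_duplicates("hit_id")
    return hits.merge(
        dominant[["hit_id", "semantic", "instance"]], on="hit_id", how="left"
    )


particles = pd.DataFrame({
    "g4_id": [1, 2, 3, 4, 5],
    "parent_id": [0, 0, 2, 1, 3],
    "pdg": [13, 11, 22, 211, 11],
})
edeps = pd.DataFrame({
    "hit_id": [10, 11, 11, 12, 12, 13],
    "g4_id": [1, 1, 2, 3, 5, 4],
    "energy": [1.0, 0.2, 0.8, 0.5, 0.4, 0.3],
})
hits = pd.DataFrame({"hit_id": [10, 11, 12, 13, 14]})  # hit 14 has no true deposit

particle_labels = label_particles(particles)
labelled_hits = label_hits(hits, edeps, particle_labels)
```

Each particle is categorised exactly once, however many hits it produces, and a hit with no true energy deposition (for example, noise) is left unlabelled by the left merge.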

If the user's simulation includes custom Geant4 physics processes that necessitate modifications to a standard labelling scheme – or if they prefer a different labelling scheme altogether – the user can write their own labelling function to use instead. If the user develops a new labelling function that has general appeal, that function can then be added to the standard labelling options included in PyNuML.
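One way such a plug-in design could look is sketched below, where the labelling function is simply a callable passed to the processing step. Everything here is hypothetical: `apply_labels`, the `end_process` column and the `myCustomDecay` process name are illustrative assumptions and do not describe PyNuML's real interface.

```python
import pandas as pd


def standard_labeller(particles):
    """Toy default scheme (assumption): electrons and photons are showers."""
    is_shower = particles["pdg"].abs().isin([11, 22])
    return is_shower.map({True: "shower", False: "track"})


def custom_labeller(particles):
    """User scheme: reuse the default, then override one custom process."""
    labels = standard_labeller(particles)
    # "myCustomDecay" stands in for a user-defined Geant4 process name.
    return labels.mask(particles["end_process"] == "myCustomDecay", "exotic")


def apply_labels(particles, labeller=standard_labeller):
    """Run any labelling callable and check that it labelled every particle."""
    labels = labeller(particles)
    if len(labels) != len(particles) or labels.isna().any():
        raise ValueError("labeller must return one label per particle")
    return particles.assign(semantic=labels)


particles = pd.DataFrame({
    "g4_id": [1, 2, 3],
    "pdg": [13, 11, 211],
    "end_process": ["Decay", "eBrem", "myCustomDecay"],
})
default_result = apply_labels(particles)
custom_result = apply_labels(particles, labeller=custom_labeller)
```

Because the custom function builds on the default one, a user only has to describe where their scheme differs from the standard taxonomy.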

ML object formation

PyNuML also provides standard tools for the production of ML inputs, taking Pandas DataFrames containing event information and using them to construct a single ML input. A function that produces detector hit graphs for GNN training is provided, with 2D and 3D pixel map production in development. This single-event processing function is nested within an MPI parallel IO infrastructure to efficiently preprocess an entire dataset into ML inputs at scale, storing each object in an individual PyTorch .pt file or (experimentally) storing all inputs as compound data objects within a single HDF5 file.
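As a rough illustration of single-event graph formation, the sketch below turns a hit DataFrame into node features and an edge list by connecting hits that lie within a fixed radius of each other. The column names (`wire`, `time`, `integral`), the radius rule and the function name are assumptions, and the edge construction PyNuML actually uses may differ. The arrays are kept in NumPy so the example has no PyTorch dependency.

```python
import numpy as np
import pandas as pd


def hits_to_graph(hits, radius=1.5):
    """Build node features and a directed edge list from one event's hits."""
    pos = hits[["wire", "time"]].to_numpy(dtype=np.float32)
    x = hits[["wire", "time", "integral"]].to_numpy(dtype=np.float32)
    # Dense pairwise distances: O(N^2) memory, acceptable for a small sketch.
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Connect distinct hits closer than the radius (both directions are kept).
    connected = (dist <= radius) & ~np.eye(len(pos), dtype=bool)
    src, dst = np.nonzero(connected)
    edge_index = np.stack([src, dst]).astype(np.int64)  # shape (2, n_edges)
    return {"x": x, "edge_index": edge_index}


hits = pd.DataFrame({
    "wire": [0, 1, 2, 10],
    "time": [0.0, 1.0, 1.5, 5.0],
    "integral": [12.0, 8.5, 9.1, 3.3],
})
graph = hits_to_graph(hits)
```

The `(2, n_edges)` integer layout matches the edge list convention commonly used by PyTorch graph libraries, so the arrays could be wrapped as tensors before being written to disk.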

