Memory mapping of datasets with arbitrary shapes

Memmpy

Memmpy is a Python library for storing datasets in, and loading datasets from, memory-mapped files. This is particularly useful for large datasets that do not fit into memory and therefore need to be processed in batches. Memmpy is based on the numpy.memmap implementation.
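For context, here is a minimal sketch of the underlying numpy.memmap primitive; the file name and shape are arbitrary examples, not part of memmpy's API:

import numpy as np

# Create an array backed by a file on disk rather than by RAM.
arr = np.memmap("example.dat", dtype=np.float32, mode="w+", shape=(1000, 3))
arr[:] = 0.0  # assignments write through to the file
arr.flush()   # persist all changes to disk

# Reopen read-only; only the pages that are actually accessed are loaded.
view = np.memmap("example.dat", dtype=np.float32, mode="r", shape=(1000, 3))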

Who should use Memmpy?

Memmpy is primarily intended for medium- to large-scale machine learning applications in high energy particle physics, where the whole dataset does not fit into memory at once and iterating over the ROOT files directly is too slow. This can be the case when the data points need to be shuffled, or when only a fraction of the events or stored variables is needed for training.

Memmpy is not intended for use in small applications where the entire dataset fits into memory and can be loaded at once. It is also not intended for use in very large applications where training is massively distributed.

Installation

Memmpy can be installed directly from PyPI using pip; it requires Python 3.10 or higher. Processing .root files additionally requires uproot, which can also be installed using pip (see below).

pip install memmpy
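To process .root files, install the optional uproot dependency as well:

pip install uproot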

Usage

A simple memory mapped file can be created as follows:

import numpy as np
from memmpy import WriteVector, read_vector  # assumed top-level imports

with WriteVector(path="data.mmpy", key="testdata") as memfile:
    # Append a single numpy array.
    # The shape and dtype will be inferred from the array.
    memfile.append(np.array([1, 2, 3]))
    
    # Append another numpy array of the same shape and dtype
    memfile.append(np.array([4, 5, 6]))

    # Extend the file by an array with an additional axis.
    memfile.extend(np.array([[7, 8, 9], [10, 11, 12]]))

memmap_data = read_vector(path="data.mmpy", key="testdata")

The memmpy library also provides functionality to store jagged arrays, and arrays of arbitrary shape, via the WriteJagged, ReadJagged, WriteShaped, and ReadShaped classes; a sketch follows below.
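A minimal sketch of storing jagged data, assuming WriteJagged and ReadJagged mirror the append/read interface shown above (the exact signatures may differ):

import numpy as np
from memmpy import WriteJagged, ReadJagged  # assumed top-level imports

with WriteJagged(path="data.mmpy", key="jagged_data") as memfile:
    # Rows may have different lengths.
    memfile.append(np.array([1, 2, 3]))
    memfile.append(np.array([4, 5]))

jagged = ReadJagged(path="data.mmpy", key="jagged_data")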

Loading

A collection of memory mapped files can be loaded in batches using the SimpleLoader and SplitLoader. The SplitLoader also provides functionality for shuffling the dataset and splitting it into training and validation sets.

from memmpy import SplitLoader  # assumed top-level import

loader = SplitLoader(
    # Dict of memmap, ReadJagged, or ReadShaped objects; add further entries as needed.
    data={"first_memmap": memmap_data},
    batch_size=128,
    shuffle=True,
)

for batch in loader:
    ...

Filtering

Datasets can be filtered using the compute_cut_batched function.

from memmpy import compute_cut_batched  # assumed top-level import

subindices = compute_cut_batched(
    path="data.mmpy",
    expression="testdata > 5",
)

The subindices can be used to load only the filtered dataset by passing them to the SplitLoader, as sketched below. All computed cuts are automatically cached.
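A hypothetical sketch; the actual keyword argument for passing the indices to SplitLoader may be named differently, so check the SplitLoader signature:

loader = SplitLoader(
    data={"first_memmap": memmap_data},
    batch_size=128,
    shuffle=True,
    subindices=subindices,  # hypothetical keyword argument
)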

Processing ROOT files

To use memmpy with ROOT files, the uproot module is required (see Installation). The load_root function provides all-in-one functionality for loading one or more ROOT files into memory-mapped storage. An example is shown below.

from memmpy import load_root, RFileConfig  # assumed top-level imports

loader = load_root(
    root_files=[
        RFileConfig(
            path="ttbb_mc16a.root",
            tree="tree1",
            metadata={"process": "ttbb", "year": "mc16a"},
        ),
        RFileConfig(
            path="ttH_mc16d.root",
            tree="tree2",
            metadata={"process": "ttH", "year": "mc16d"},
        ),
    ],
    path_mmap="data.mmpy",
    keys={"nJets", "nBTags_77"},
    # Variable-length arrays can be padded to a fixed length with a given fill value.
    keys_padded={"jet_pt": (22, float("nan"))},
    batch_size=128,
    tcut="nJets >= 6 & nBTags_77 >= 2",
)

All results are cached: on subsequent calls the dataset is loaded from the cache instead of the ROOT files. The metadata is stored in hashed form, so cuts can also be applied to the metadata. Changing any of the ROOT files on disk invalidates the cache, and the dataset is then automatically reloaded from the ROOT files on the next call.

Metadata

The metadata is stored in a separate JSON file located at `path/metadata_memmpy.json`. It includes the shapes and dtypes of the arrays, as well as checksums or timestamps calculated when the dataset is saved. The supplied ROOT file metadata is also stored, along with its hashes, so that the metadata can be fully reconstructed from the stored memory-mapped files.
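The file can be inspected with the standard json module; a minimal sketch, assuming `path` stands for the directory holding the memory-mapped files:

import json
from pathlib import Path

# "path" is a placeholder for the directory containing the memory maps.
meta = json.loads(Path("path/metadata_memmpy.json").read_text())
print(meta.keys())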
