
Hierarchical numpy memmap datasets for Python

Project description

HMF (Hierarchical Memmap Format) is a Python package that provides a user API similar to that of PyTables but uses NumPy memmap for data storage. It also supports easy data sourcing from Pandas dataframes, as well as parallel writing for fast write speeds.

Install

pip install hierarchical-memmap-format

Getting started

The HMF API is largely inspired by that of PyTables, and hence supports two important features of HDF5: it allows the user to write data that is self-organizing and self-documenting. We will demonstrate these ideas through an example.

First, we need to import the package:

import HMF

To start working with HMF, we must invoke the open_file method, which either creates a new directory or reads from an existing one, depending on the mode argument. Note that even though the method is called open “file”, the word “file” is used loosely to mean “directory”. As such, we must provide the method with the desired path to the root directory, via the root_path argument, where all data will be written:

f = HMF.open_file('myRoot', mode='w+')

Currently, the supported modes are w+ and r+. w+ opens a directory for writing. It creates a new directory if none exists; if one already exists, it erases the contents of that directory. r+ is for reading and writing, and reads the existing directory contents if the directory already exists.
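The directory handling described above can be sketched with the standard library alone. This is only an illustration of the stated behaviour, not HMF's implementation: open_root is a hypothetical helper, and creating a missing directory in r+ mode is an assumption, since the text does not say what r+ does in that case.

```python
import shutil
import tempfile
from pathlib import Path


def open_root(root_path, mode):
    """Hypothetical stand-in for the directory handling described above (not HMF code)."""
    root = Path(root_path)
    if mode == "w+":
        # Start from an empty directory: erase any existing contents first.
        if root.exists():
            shutil.rmtree(root)
        root.mkdir(parents=True)
    elif mode == "r+":
        # Keep whatever is already there. Creating a missing directory is an
        # assumption made for this sketch.
        root.mkdir(parents=True, exist_ok=True)
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return root


base = Path(tempfile.mkdtemp())
root = open_root(base / "myRoot", "w+")
(root / "marker.txt").write_text("hello")

# Reopening with r+ keeps the existing contents.
kept = open_root(base / "myRoot", "r+")
survives_r_plus = (kept / "marker.txt").exists()

# Reopening with w+ wipes them.
wiped = open_root(base / "myRoot", "w+")
survives_w_plus = (wiped / "marker.txt").exists()
```

The practical consequence is that reopening an existing root with w+ discards earlier work, so r+ is the mode to use when you want to add to or read back data written before.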

Once you are done writing data (the reading and writing process is described below), it is very important that you invoke the close method to save all the data to disk:

f.close()

Writing groups and arrays

Here we will demonstrate the self-organizing property of HMF. With a single “file” handle, the user can easily write data using a hierarchical file system. This will be easier to understand if you are already familiar with HDF5.

f.set_group('/groupA')  # the path must start with root "/"

This code will create a “directory”, or “node”, groupA, in which we can write arrays or further groups. The user can also create nested directories at once:

f.set_group('/group1/groupA/groupZ')
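One way to picture nested group creation is as a mapping from a group path to nested directories under the root. The sketch below assumes that mapping purely for illustration; group_dir is a hypothetical helper, and HMF's actual on-disk layout is not described here.

```python
import os
import tempfile


def group_dir(root_dir, group_path):
    """Map a group path such as '/group1/groupA' to a directory under root_dir.

    Hypothetical helper for illustration; not part of HMF.
    """
    if not group_path.startswith("/"):
        raise ValueError("group path must start with root '/'")
    parts = [p for p in group_path.split("/") if p]
    return os.path.join(root_dir, *parts)


root_dir = tempfile.mkdtemp()
target = group_dir(root_dir, "/group1/groupA/groupZ")

# makedirs creates group1, groupA and groupZ in a single call.
os.makedirs(target, exist_ok=True)

created = os.path.relpath(target, root_dir).split(os.sep)
```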

We can write an array using the set_array method:

import numpy as np

array = np.arange(9)
f.set_array('/groupA/array1', array)

You need not create the group ahead of time. If groupA does not exist, the above code will create groupA as well. Most importantly, the above code will create a memory map to the array, which is described in the NumPy memmap documentation.
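For readers new to memory maps, the following sketch uses numpy.memmap directly, without HMF, to show what a file-backed array looks like. The file name array1.dat is an arbitrary choice for the example, and nothing here is claimed about how HMF names or lays out its files.

```python
import os
import tempfile

import numpy as np

array = np.arange(9)
path = os.path.join(tempfile.mkdtemp(), "array1.dat")

# Create a file-backed array with the same dtype and shape, then copy the data in.
mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
mm[:] = array
mm.flush()  # push pending changes to disk
del mm

# Reopen read-only. dtype and shape must be supplied again because the raw
# file stores no metadata about its contents.
reopened = np.memmap(path, dtype=array.dtype, mode="r", shape=array.shape)
total = int(reopened.sum())
```

Because the raw file carries no dtype or shape information, a plain memmap has to be reopened with both supplied by the caller.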

You can retrieve both groups and arrays using the get_group and get_array methods. For example, the code below retrieves the written array data:

memmap_obj = f.get_array('/groupA/array1')

The returned object is the numpy memmap object that was created earlier. Again, once you are done writing data, don’t forget to invoke close!

f.close()

Writing node attributes

Here we will demonstrate the self-documenting property of HMF. This again should be no surprise for those familiar with HDF5. HMF allows the user to attach attributes to each node, whether it is a group node or an array node. Let’s give some attributes to the groupA node from above.

f.set_node_attr('/groupA', key='someAttribute', value='attributeValue')

Both the key and the value of the attribute can be arbitrary Python objects.

You can then retrieve the attributes using the get_node_attr method:

f.get_node_attr('/groupA', key='someAttribute')

Thus, HMF allows the user to write self-describing data by making it easy to read and write the accompanying information associated with each node.
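To make the idea of per-node attributes concrete, here is a minimal sketch that keeps a pickled dictionary next to a node's data. The helpers set_attr and get_attr and the file name attrs.pkl are hypothetical and chosen only for this example; how HMF actually persists attributes is not described here. Because the sketch stores attributes in a dict, keys must be hashable.

```python
import os
import pickle
import tempfile


def set_attr(node_dir, key, value):
    """Store one attribute in a pickled dict inside node_dir (illustrative only)."""
    path = os.path.join(node_dir, "attrs.pkl")
    attrs = {}
    if os.path.exists(path):
        with open(path, "rb") as fh:
            attrs = pickle.load(fh)
    attrs[key] = value
    with open(path, "wb") as fh:
        pickle.dump(attrs, fh)


def get_attr(node_dir, key):
    """Read one attribute back from the pickled dict."""
    with open(os.path.join(node_dir, "attrs.pkl"), "rb") as fh:
        return pickle.load(fh)[key]


node_dir = tempfile.mkdtemp()
set_attr(node_dir, "someAttribute", "attributeValue")
set_attr(node_dir, ("units", 1), {"scale": 0.5})  # non-string key and value

value = get_attr(node_dir, "someAttribute")
nested = get_attr(node_dir, ("units", 1))
```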

Using with Pandas

Lastly, HMF has an API for easily extracting array memmaps from Pandas dataframes. This mode of writing is executed in parallel, i.e. all writable arrays are written in parallel. Let’s look at an example, starting from the beginning.

import numpy as np
import pandas as pd

data = np.arange(10*3).reshape((10, 3))
pdf = pd.DataFrame(data=data, columns=['a', 'b', 'c'])

f = HMF.open_file('pandasExample', mode='w+')

You first introduce the dataframe to HMF like so:

f.from_pandas(pdf)

You can then “register” arrays from the dataframe one by one:

f.register_array('arrayA', ['b', 'c'])
f.register_array('arrayB', ['a', 'b'])
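The general idea of writing several column subsets in parallel can be sketched without HMF. The example below stands in for the dataframe with a plain NumPy array and a list of column names so that it runs without pandas, and it uses a thread pool. Both choices are assumptions for illustration: whether HMF uses threads or processes, and how it names its files, is not stated here.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

data = np.arange(10 * 3).reshape((10, 3))
columns = ["a", "b", "c"]
out_dir = tempfile.mkdtemp()

# Registered arrays: name -> columns to extract (mirrors the calls above).
registered = {"arrayA": ["b", "c"], "arrayB": ["a", "b"]}


def write_array(name, cols):
    """Copy the selected columns into their own file-backed memmap."""
    idx = [columns.index(c) for c in cols]
    block = data[:, idx]  # with a DataFrame this would be pdf[cols].to_numpy()
    path = os.path.join(out_dir, name + ".dat")
    mm = np.memmap(path, dtype=block.dtype, mode="w+", shape=block.shape)
    mm[:] = block
    mm.flush()
    return name, path, block.shape, block.dtype


# Each registered array is written by its own worker.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda item: write_array(*item), registered.items()))

# Reopen every written file read-only to check the contents.
written = {}
for name, path, shape, dtype in results:
    written[name] = np.memmap(path, dtype=dtype, mode="r", shape=shape)
```

Each array goes to its own file, so the workers never write to the same memory map and need no locking.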


