hierarchical-memmap-format

Hierarchical numpy memmap datasets for Python

Project description

HMF (Hierarchical Memmap Format) is a Python package that provides a user API similar to that of PyTables but uses NumPy memmap for data storage. It also supports easy data sourcing from Pandas dataframes, as well as parallel writing for faster writes.

Install

pip install hierarchical-memmap-format

Getting started

First, we need to import the package:

import HMF

To start working with HMF, we invoke the open_file method, which either creates a new directory or reads from an existing one, depending on the mode argument. We pass the method the desired path to the root directory, via the root_path argument, where all data will be written:

f = HMF.open_file('myRoot', mode='w+')

Currently, the supported modes are w+ and r+. w+ opens a directory for writing: it creates the directory if it does not exist, and erases its contents if it does. r+ is for reading and writing, and reads the existing contents of the directory if it already exists.
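The same mode strings are used by numpy.memmap, where they behave analogously for a single file. The sketch below uses numpy.memmap directly rather than HMF (the file name example.dat is only illustrative) to show that w+ starts from empty contents while r+ keeps what is already there:

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dat")  # illustrative file name

    # 'w+' creates the file (or overwrites an existing one), zero-filled.
    mm = np.memmap(path, dtype="int64", mode="w+", shape=(3,))
    mm[:] = [1, 2, 3]
    mm.flush()
    del mm

    # 'r+' opens the existing file for reading and writing; contents are kept.
    mm = np.memmap(path, dtype="int64", mode="r+", shape=(3,))
    kept = mm.tolist()
    del mm

    # Opening with 'w+' again discards the earlier contents.
    mm = np.memmap(path, dtype="int64", mode="w+", shape=(3,))
    erased = mm.tolist()
    del mm

print(kept, erased)
```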

Once you are done writing data (reading and writing are described below), it is very important that you invoke the close method to save all the data to disk:

f.close()

Writing groups and arrays

With a single “file” handler, the user can easily write and read data using a hierarchical, file-system-like path. An example will make this clear:

f.set_group('/groupA')  # the path must start with root "/"

This code creates a “node”, groupA, in which we can write arrays or further groups. The user can also create nested groups in one call:

f.set_group('/group1/groupA/groupZ')

We can write a data array using the set_array method:

import numpy as np

array = np.arange(9).reshape(3, 3)

# array([[0, 1, 2],
#        [3, 4, 5],
#        [6, 7, 8]])

f.set_array('/groupA/array1', array)

You need not create the group ahead of time. If groupA does not exist, the above code will create it as well. Most importantly, the above code creates a memory map to the array; see the NumPy documentation for numpy.memmap to find out more.
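To make the idea of a memory map concrete, here is a minimal sketch that uses numpy.memmap directly, not HMF. The file name array1.dat is an assumption for illustration. It shows that the file on disk holds only the raw array bytes, so the shape and dtype have to be supplied again when the file is reopened:

```python
import os
import tempfile

import numpy as np

array = np.arange(9).reshape(3, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array1.dat")  # illustrative file name

    # Create a disk-backed array and copy the data into it.
    mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()  # push pending changes to disk
    del mm

    # The file stores raw bytes only: 9 items times the item size.
    file_size = os.path.getsize(path)

    # Reopening requires the same dtype and shape to be given again.
    reopened = np.memmap(path, dtype=array.dtype, mode="r", shape=array.shape)
    round_trip = np.array(reopened)  # copy into ordinary memory
    del reopened

print(file_size)
print(round_trip)
```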

Again, once you are done writing data, don’t forget to invoke close!

f.close()

Reading groups and arrays

You can retrieve groups and arrays using the get_group and get_array methods. For example, the code below retrieves the array written earlier:

memmap_obj = f.get_array('/groupA/array1')

# memmap([[0, 1, 2],
#         [3, 4, 5],
#         [6, 7, 8]])

The returned object is a numpy memmap object that was created earlier.

You can also retrieve partial data with a slice or fancy indexing, using the idx parameter:

f.get_array('/groupA/array1', idx=slice(0, 2))

# memmap([[0, 1, 2],
#         [3, 4, 5]])

Slicing returns a view of the memmap.

f.get_array('/groupA/array1', idx=[0, 2])

# array([[0, 1, 2],
#        [6, 7, 8]])

Fancy indexing returns a copy of the memmap.
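The view versus copy distinction matters when you modify the result. The sketch below illustrates it with numpy.memmap directly, on the assumption that the idx parameter is applied like ordinary NumPy indexing; it does not call HMF itself:

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array1.dat")  # illustrative file name
    mm = np.memmap(path, dtype="int64", mode="w+", shape=(3, 3))
    mm[:] = np.arange(9).reshape(3, 3)

    sliced = mm[slice(0, 2)]  # basic slicing: a view onto the same buffer
    fancy = mm[[0, 2]]        # fancy indexing: an independent copy

    slice_shares = np.shares_memory(mm, sliced)
    fancy_shares = np.shares_memory(mm, fancy)

    sliced[0, 0] = 100  # writes through to the memmap
    fancy[0, 1] = -1    # changes only the copy
    first_row = mm[0].tolist()

    del sliced, mm  # release the mapping before the directory is removed

print(slice_shares, fancy_shares, first_row)
```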

Writing node attributes

Here we demonstrate the self-documenting property of HMF. This again should be no surprise to those familiar with HDF5. HMF allows the user to attach attributes to each node, whether it is a group node or an array node. Let’s give an attribute to the groupA node from above.

f.set_node_attr('/groupA', key='someAttribute', value='attributeValue')

Both the key and the value of an attribute can be arbitrary Python objects.

You can then retrieve the attribute using the get_node_attr method:

f.get_node_attr('/groupA', key='someAttribute')

Thus, HMF lets the user write self-describing data by making it easy to read and write accompanying information associated with each node.
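How HMF stores attributes internally is not described here. As a purely hypothetical sketch, one way to persist arbitrary Python objects next to a node is to pickle a dictionary into the node's directory. The helper names save_attrs and load_attrs and the file name attrs.pkl are assumptions for illustration, not part of HMF:

```python
import os
import pickle
import tempfile


def save_attrs(node_dir, attrs):
    """Pickle an attribute dictionary into the node's directory (hypothetical)."""
    with open(os.path.join(node_dir, "attrs.pkl"), "wb") as fh:
        pickle.dump(attrs, fh)


def load_attrs(node_dir):
    """Load the attribute dictionary back from the node's directory."""
    with open(os.path.join(node_dir, "attrs.pkl"), "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as node_dir:
    # Keys only need to be hashable; values can be any picklable object.
    attrs = {"someAttribute": "attributeValue", ("units", 1): {"scale": 0.5}}
    save_attrs(node_dir, attrs)
    restored = load_attrs(node_dir)

print(restored["someAttribute"])
```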

Using with Pandas

Lastly, HMF has an API for easily extracting memmap arrays from Pandas dataframes. This mode of writing is executed in parallel, i.e. all writable arrays are written in parallel. Let’s look at an example, starting from the beginning.

import numpy as np
import pandas as pd

data = np.arange(3*3).reshape((3, 3))
pdf = pd.DataFrame(data=data, columns=['a', 'b', 'c'])

#               a       b       c
#       0       0       1       2
#       1       3       4       5
#       2       6       7       8

f = HMF.open_file('pandasExample', mode='w+')

You first introduce the dataframe to HMF like so:

f.from_pandas(pdf)

You can then “register” arrays from the dataframe one by one:

f.register_array('arrayA', ['b', 'c'])
f.register_array('arrayB', ['a', 'b'])

Finally, call close to save the data:

f.close()

# Progress: |██████████████████████████████████████████████████| 100.0% Completed!

You can now retrieve the memmap object the usual way:

f.get_array('/arrayA')

# memmap([[1, 2],
#         [4, 5],
#         [7, 8]])
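The parallel write mentioned above can be pictured with a small sketch. How HMF schedules its writes is not described here, so this is only an illustration under assumptions: it uses a thread pool from the standard library and numpy.memmap directly, and the helper write_array and the file names are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def write_array(path, array):
    """Write one array to its own memmap file (hypothetical helper)."""
    mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()
    del mm
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Four independent arrays, each filled with its own index.
    arrays = {f"array{i}": np.full((4, 2), i, dtype="int64") for i in range(4)}
    jobs = [(os.path.join(tmp, name + ".dat"), arr) for name, arr in arrays.items()]

    # Each array goes to a separate file, so the writes do not interfere.
    with ThreadPoolExecutor(max_workers=4) as pool:
        written = list(pool.map(lambda job: write_array(*job), jobs))

    # Read each file back and record the sum of its contents.
    sums = {}
    for path in written:
        mm = np.memmap(path, dtype="int64", mode="r", shape=(4, 2))
        sums[os.path.basename(path)] = int(mm.sum())
        del mm

print(sums)
```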

Parallel writing

The power of parallel writing shines when you have many arrays to write at once, which is the case when groups of arrays are determined by the groupby argument. Let’s take another example, a dataframe that has a groups column:

import numpy as np
import pandas as pd

data = np.arange(6*3).reshape((6, 3))
pdf = pd.DataFrame(data=data, columns=['a', 'b', 'c'])

group_col = ['group_1', 'group_1', 'group_2', 'group_2', 'group_3', 'group_3']
pdf['groups'] = group_col

#           a       b       c       groups
#   0       0       1       2       group_1
#   1       3       4       5       group_1
#   2       6       7       8       group_2
#   3       9       10      11      group_2
#   4       12      13      14      group_3
#   5       15      16      17      group_3

f = HMF.open_file('pandasExample', mode='w+')

You can then specify groupby:

f.from_pandas(pdf, groupby='groups')  # you can also specify "orderby" to sort the array by a particular column

f.register_array('arrayA', ['b', 'c'])
f.register_array('arrayB', ['a', 'b'])

f.close()

# Progress: |██████████████████████████████████████████████████| 100.0% Completed!

The groups have now been created automatically, one per distinct value of the groupby column, and you can query them using get_array:

f.get_array('/group_1/arrayA')  # get data array "arrayA" for partition group "group_1"

# memmap([[1, 2],
#         [4, 5]])

f.get_array('/group_3/arrayB')  # get data array "arrayB" for partition group "group_3"

# memmap([[12, 13],
#         [15, 16]])
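To see where these per-group arrays come from, the following sketch reproduces the partitioning with plain Pandas. It does not call HMF; the dictionary keyed by path is only an illustrative stand-in for the groups that HMF creates on disk.

```python
import numpy as np
import pandas as pd

data = np.arange(6 * 3).reshape((6, 3))
pdf = pd.DataFrame(data=data, columns=["a", "b", "c"])
pdf["groups"] = ["group_1", "group_1", "group_2", "group_2", "group_3", "group_3"]

# One entry per (group, registered array), keyed by an HMF-style path.
arrays = {}
for name, part in pdf.groupby("groups"):
    arrays[f"/{name}/arrayA"] = part[["b", "c"]].to_numpy()
    arrays[f"/{name}/arrayB"] = part[["a", "b"]].to_numpy()

print(arrays["/group_1/arrayA"])
print(arrays["/group_3/arrayB"])
```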

Getting back dataframe

What if you want to get a dataframe back instead of a numpy array or memmap? In that case you must register a dataframe instead of an array:

f.register_dataframe('arrayA', ['b', 'c'])
f.register_dataframe('arrayB', ['a', 'b'])

f.close()

As with arrays, make sure that the specified columns have a numeric data type. Not even boolean is allowed; convert boolean columns to 0/1 first.

You can then retrieve the data either as a numpy array (or memmap) or as a dataframe. In both cases, the idx parameter works the same way:

f.get_dataframe('/group_3/arrayB')

#           a       b
#   0       12      13
#   1       15      16

f.get_array('/group_3/arrayB')

# memmap([[12, 13],
#         [15, 16]])
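The numeric-only requirement noted above can be checked before registering columns. The sketch below uses plain Pandas; the helper non_numeric_columns is hypothetical and not part of HMF. Note that Pandas itself counts booleans as numeric, so they are tested for separately.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def non_numeric_columns(df):
    """Return columns that are boolean or otherwise not numeric (hypothetical helper)."""
    return [
        col for col in df.columns
        if is_bool_dtype(df[col]) or not is_numeric_dtype(df[col])
    ]


pdf = pd.DataFrame({"a": [1.5, 2.5], "flag": [True, False]})
before = non_numeric_columns(pdf)

# Convert the boolean column to 0/1 integers.
pdf["flag"] = pdf["flag"].astype("int8")
after = non_numeric_columns(pdf)

print(before, after, pdf["flag"].tolist())
```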

Convenient methods when working with Pandas

The HMF object, namely f in our examples, is meant to be used as a single file handler that can be used alone to write and query data easily. The following methods are provided to further this goal of ease of use when from_pandas is used:

f.has_groups()  # returns boolean flag for presence of groups (True if groupby is not None)

f.get_group_names()  # returns names of the groups

f.get_group_sizes()  # returns the sizes of the groups (i.e. number of rows in each group)

f.get_group_items()  # returns the dict of {name: size}

f.get_sorted_group_names()  # returns names of the groups sorted by the group size

f.get_sorted_group_sizes()  # returns sizes of the groups sorted by the group size

f.get_sorted_group_items()  # returns a list of tuples of (name, size) sorted by the group size
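The relationship between these return values can be sketched with a plain dictionary. The group sizes below are made up for illustration, and ascending order is an assumption, since the sort direction is not stated above:

```python
# Hypothetical {name: size} mapping, shaped like the result of get_group_items().
group_items = {"group_1": 2, "group_2": 5, "group_3": 3}

# (name, size) tuples sorted by size; ascending order is an assumption.
sorted_items = sorted(group_items.items(), key=lambda item: item[1])
sorted_names = [name for name, _ in sorted_items]
sorted_sizes = [size for _, size in sorted_items]

print(sorted_items)
```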

Project details

This version

0.0b39 pre-release

May 6, 2021
