Python library and CLI for storing numeric data frames in HDF5.
Rationale
Pandas has utilities for storing data frames in HDF5, but they use PyTables under the hood, which limits them to frames with a relatively small number of columns (in the low thousands).
This library is intended for storing and querying arbitrarily large numeric matrices which have row and column names. It has a CLI which can export/import to/from delimited text, or it can be used from within Python with tight integration with Pandas.
This library stores only numeric matrices, so it cannot handle data frames with mixed types (e.g., some strings and some numbers).
Installation
From PyPI:
pip install h5df
Latest version:
git clone https://github.com/gilesc/h5df.git
cd h5df
python setup.py install --user
This installs the CLI script “h5df”, and a Python module with the same name.
Usage
$ cat in.txt
A B C
X 1 2 3
Y 4 5 5
$ h5df load foo.h5 /my/path < in.txt
$ h5df dump foo.h5 /my/path
A B C
...
To select an individual row or column, use “h5df row|column”:
$ h5df row foo.h5 X
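The delimited-text layout shown in in.txt above (a header line of column names, then one line per row starting with the row name) can be read with a few lines of plain Python. This is only a sketch of the format, not the parser h5df uses: the function name parse_matrix is hypothetical, and it assumes fields are separated by whitespace, which the example does not confirm.

```python
import io


def parse_matrix(handle):
    """Parse 'header line + named rows' text into (row_names, column_names, values).

    Assumes whitespace-separated fields; the real delimiter may differ.
    """
    column_names = handle.readline().split()
    row_names, values = [], []
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, cells = fields[0], fields[1:]
        if len(cells) != len(column_names):
            raise ValueError(
                f"line {line_no}: expected {len(column_names)} values, got {len(cells)}"
            )
        # float() raises ValueError on non-numeric cells, which mirrors the
        # numeric-only restriction described in the Rationale.
        values.append([float(cell) for cell in cells])
        row_names.append(name)
    return row_names, column_names, values


text = "A B C\nX 1 2 3\nY 4 5 5\n"
rows, cols, matrix = parse_matrix(io.StringIO(text))
print(rows, cols, matrix)
```

Rejecting ragged or non-numeric rows up front keeps bad input from being discovered halfway through a long load.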
CLI flags
Use h5df <cmd> --help for a full listing of options. A few useful ones:
h5df load -v : outputs progress as a matrix is loaded (every 100 rows)
h5df <any output command> -p N : outputs values with decimal precision N
API
The two main classes are h5df.Store and h5df.Frame, representing an HDF5 file and an individual data frame, respectively. Here is some example usage:
>> from h5df import Store
>> import pandas as pd
>> import numpy as np
>> np.random.seed(0)
# Create a Store object; the default mode is read-only.
# See http://docs.h5py.org/en/latest/high/file.html for available modes
>> store = Store("test.h5df", mode="a")
>> index = ["A","B","C"]
>> columns = ["V","W","X","Y","Z"]
>> mkdf = lambda: pd.DataFrame(np.random.random((3,5)), index=index, columns=columns)
>> store.put("/frames/1", mkdf())
>> store.put("/frames/2", mkdf())
# Iterate through HDF5 paths corresponding to Frame objects
>> for key in store: print(key)
>> df1 = store["/frames/1"]
# Various selection options
# returns pandas.Series
>> df1.column("W")
>> df1.row("A")
# returns a pandas.DataFrame
>> df1.rows(["A","C"])
>> df1.columns(["W","Y"])
# Returns the whole Frame as a pandas.DataFrame
>> df1.to_frame()
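For readers who know pandas better than h5df, the selection calls above correspond roughly to the following operations on an ordinary in-memory pandas.DataFrame. This is an analogy for the return types only, using made-up values; it does not touch HDF5 and says nothing about how h5df implements these methods.

```python
import pandas as pd

# Same labels as the example above, with fixed values instead of random ones.
df = pd.DataFrame(
    [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]],
    index=["A", "B", "C"],
    columns=["V", "W", "X", "Y", "Z"],
    dtype=float,
)

row = df.loc["A"]               # like Frame.row("A"): a Series indexed by column name
column = df["W"]                # like Frame.column("W"): a Series indexed by row name
some_rows = df.loc[["A", "C"]]  # like Frame.rows(["A", "C"]): a DataFrame
some_columns = df[["W", "Y"]]   # like Frame.columns(["W", "Y"]): a DataFrame

print(df.shape)  # (rows, columns), the same convention as Frame.shape
```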
The full list of methods supported by h5df.Frame is:
Frame.row(key) and Frame.column(key) - return a pandas.Series corresponding to the row/column
Frame.rows(keys) and Frame.columns(keys) - given a list of row/column index names, return an in-memory pandas.DataFrame corresponding to the subset of the overall Frame containing the desired rows or columns
Frame.shape - returns a tuple of (# rows, # columns)
Frame.to_frame() - return the entire Frame as an in-memory pandas.DataFrame. Make sure you have enough memory!
Frame.add(key, data) - add a new row to the matrix with the given unique key. Due to the way of
Performance notes
Data is indexed row-major. Thus row-based queries will be much faster. Generally you should pre-transpose your matrix before putting it into the Store to ensure that the most frequently queried axis will be on the rows.
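To see why the row axis is the cheap one, it helps to look at where the elements of a row and of a column sit in a row-major layout. The sketch below is a simplified model in plain Python: it treats the matrix as one flat, contiguous array and ignores any chunking HDF5 may apply, and the function names are hypothetical.

```python
def row_offsets(row, n_cols):
    """Flat offsets of one row in a row-major matrix with n_cols columns."""
    start = row * n_cols
    return list(range(start, start + n_cols))


def column_offsets(col, n_rows, n_cols):
    """Flat offsets of one column in a row-major n_rows x n_cols matrix."""
    return [r * n_cols + col for r in range(n_rows)]


n_rows, n_cols = 3, 5
print(row_offsets(1, n_cols))             # [5, 6, 7, 8, 9]: one contiguous run
print(column_offsets(1, n_rows, n_cols))  # [1, 6, 11]: one element per row, n_cols apart
```

Reading a row touches a single contiguous block, while reading a column touches one element from every row, so the cost of a column query grows with the number of rows.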
The h5df.Store() constructor takes a keyword argument, “driver”. The full description of available drivers is at http://docs.h5py.org/en/latest/high/file.html . On Linux systems, the default driver is “sec2”, which uses unbuffered POSIX I/O, whereas “core” holds the whole HDF5 file in memory. If your system supports it (in particular, has enough memory for the file) and the file is frequently used (and therefore will be in your OS page cache), “core” may be faster, especially for reads.
Limitations
Currently there is no way to select rows by numeric index location (i.e., the equivalent to pandas.DataFrame.iloc).
Rows are added one at a time and read through Python’s standard I/O and string manipulation facilities rather than added in batch and using Pandas’ optimized I/O. Since the HDF5 matrix must be resized with every row added, this is quite inefficient for writes.
Iterating through the frames in an HDF5 file (Store.__iter__) is quite inefficient if the file contains a large number of frames.
All indexes are stored as strings, or to be more specific, np.dtype("|S100") encoded as "utf-8". This has several practical consequences:
numeric indices will be cast to strings and must be queried as strings
index and column names are currently limited to 100 bytes of UTF-8, which is fewer than 100 characters if any of them are non-ASCII
UTF-8 encoding is hardcoded and other encodings are not supported (thus, strings that fail str.encode("utf-8") will cause an error)
There are plans to fix these limitations in future versions.
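The index-encoding constraints listed above can be checked before data is handed to the library. The following is a hypothetical pre-check, not part of h5df: the names encode_key and MAX_KEY_BYTES are made up for illustration, and whether h5df itself raises or truncates on over-long names is not stated here.

```python
MAX_KEY_BYTES = 100  # width of the "|S100" dtype mentioned above


def encode_key(key):
    """Return the UTF-8 bytes for a row/column name, or raise if it cannot be stored."""
    text = str(key)  # numeric indices are cast to strings
    raw = text.encode("utf-8")  # raises UnicodeEncodeError, e.g. for lone surrogates
    if len(raw) > MAX_KEY_BYTES:
        raise ValueError(
            f"key {text!r} needs {len(raw)} bytes, limit is {MAX_KEY_BYTES}"
        )
    return raw


print(encode_key(42))                   # b'42'
print(len(encode_key("\u00e9" * 50)))   # 100: fifty two-byte characters just fit
```

Because the limit is counted in bytes, a name made of 51 accented characters already exceeds it even though it is well under 100 characters long.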
License
AGPLv3+