pandas2hdf
Robust round-trip persistence between pandas Series/DataFrame and HDF5 with SWMR (Single Writer Multiple Reader) support.
Features
- Complete round-trip fidelity: Preserves data types, index structure, names, and missing values
- SWMR support: Enables concurrent reading while writing with HDF5's Single Writer Multiple Reader mode
- Flexible write modes: Preallocate, new, update, and append operations
- MultiIndex support: Full support for pandas MultiIndex with proper reconstruction
- Type safety: Comprehensive type hints and strict mypy compliance
- Comprehensive testing: Extensive test suite covering edge cases and real-world scenarios
Installation
```shell
pip install pandas2hdf
```
Quick Start
Basic Series Operations
```python
import pandas as pd
import h5py
from pandas2hdf import save_series_new, load_series

# Create a pandas Series
series = pd.Series([1, 2, 3, None, 5],
                   index=['a', 'b', 'c', 'd', 'e'],
                   name='my_data')

# Save to HDF5 with SWMR support
with h5py.File('data.h5', 'w', libver='latest') as f:
    f.swmr_mode = True
    group = f.create_group('my_series')
    save_series_new(group, series, require_swmr=True)

# Load from HDF5
with h5py.File('data.h5', 'r', swmr=True) as f:
    group = f['my_series']
    loaded_series = load_series(group)
    print(loaded_series)  # Output preserves original data, index, and name
```
DataFrame Operations
```python
import pandas as pd
import h5py
from pandas2hdf import save_frame_new, load_frame

# Create a DataFrame with mixed types
df = pd.DataFrame({
    'integers': [1, 2, 3, None],
    'floats': [1.1, 2.2, 3.3, 4.4],
    'strings': ['apple', 'banana', None, 'date'],
    'booleans': [True, False, True, None],
})

# Save DataFrame
with h5py.File('dataframe.h5', 'w', libver='latest') as f:
    f.swmr_mode = True
    group = f.create_group('my_dataframe')
    save_frame_new(group, df, require_swmr=True)

# Load DataFrame
with h5py.File('dataframe.h5', 'r', swmr=True) as f:
    group = f['my_dataframe']
    loaded_df = load_frame(group)
    print(loaded_df)
```
SWMR Workflow with Incremental Updates
```python
import pandas as pd
import h5py
from pandas2hdf import (
    preallocate_series_layout,
    save_series_new,
    save_series_append,
    load_series,
)

# Writer process
with h5py.File('timeseries.h5', 'w', libver='latest') as f:
    f.swmr_mode = True
    group = f.create_group('data')

    # Preallocate space for efficient appending
    initial_data = pd.Series([1.0, 2.0], name='measurements')
    preallocate_series_layout(group, initial_data, preallocate=10000)

    # Write initial data
    save_series_new(group, initial_data, require_swmr=True)

    # Append new data incrementally
    for i in range(10):
        new_data = pd.Series([float(i + 3)], name='measurements')
        save_series_append(group, new_data, require_swmr=True)
        f.flush()  # Make data visible to readers

# Concurrent reader process
with h5py.File('timeseries.h5', 'r', swmr=True) as f:
    group = f['data']
    current_data = load_series(group)
    print(f"Current length: {len(current_data)}")
```
API Reference
Series Functions
- preallocate_series_layout(): Create resizable datasets without writing data
- save_series_new(): Create new datasets and write Series data
- save_series_update(): Update Series data at a specified position
- save_series_append(): Append Series data to the end of existing datasets
- load_series(): Load a Series from HDF5 storage
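As a rough illustration of what the update and append modes do at the HDF5 level, the sketch below uses plain h5py rather than pandas2hdf itself (whose internals may differ): an update overwrites a slice of an existing resizable dataset in place, while an append resizes the dataset and writes into the new tail.

```python
import numpy as np
import h5py

# Sketch of update vs. append semantics on a resizable HDF5 dataset;
# pandas2hdf wraps this kind of operation with schema validation.
with h5py.File('sketch.h5', 'w', libver='latest') as f:
    values = f.create_dataset('values', data=np.arange(5.0),
                              maxshape=(None,), chunks=(25,))

    # "update": overwrite in place at a given position
    values[1:3] = [10.0, 20.0]

    # "append": grow the dataset, then write into the new tail
    old_len = values.shape[0]
    values.resize((old_len + 2,))
    values[old_len:] = [5.0, 6.0]
    # dataset contents are now: 0, 10, 20, 3, 4, 5, 6
```

Because the dataset was created with `maxshape=(None,)`, appends only extend the chunked storage; nothing already written is rewritten.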
DataFrame Functions
- preallocate_frame_layout(): Create a resizable layout for a DataFrame
- save_frame_new(): Create new datasets and write DataFrame data
- save_frame_update(): Update DataFrame data at a specified position
- save_frame_append(): Append DataFrame data to existing datasets
- load_frame(): Load a DataFrame from HDF5 storage
Utility Functions
- assert_swmr_on(): Assert that SWMR mode is enabled on a file
Data Type Handling
Values
- Numeric types (int, float): Stored as float64 with NaN for missing values
- Boolean: Converted to float64 (True=1.0, False=0.0) with NaN for missing
- Strings: Stored as UTF-8 variable-length strings with separate mask for missing values
Index
- All index types: Converted to UTF-8 strings for consistent storage
- MultiIndex: Each level stored separately with proper reconstruction metadata
- Missing values: Handled via mask arrays for all index levels
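The value conversions above can be previewed in plain pandas/NumPy before anything touches HDF5. This is only a sketch of the mapping the library describes, not its internal code:

```python
import numpy as np
import pandas as pd

# Numeric and boolean data become float64 with NaN for missing entries;
# strings keep a separate boolean mask marking the missing positions.
ints = pd.Series([1, 2, None])
bools = pd.Series([True, False, None])
strings = pd.Series(['apple', None, 'date'])

int_values = ints.astype('float64').to_numpy()    # 1.0, 2.0, nan
bool_values = bools.astype('float64').to_numpy()  # 1.0, 0.0, nan

string_mask = strings.isna().to_numpy()           # False, True, False
string_values = strings.fillna('').to_numpy(dtype=object)
```

Note that the float64 representation means integer columns round-trip as floats when they contain missing values, which matches standard pandas behavior.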
SWMR (Single Writer Multiple Reader) Support
pandas2hdf is designed for SWMR workflows where one process writes data while multiple processes read concurrently:
```python
# Writer process
with h5py.File('data.h5', 'w', libver='latest') as f:
    f.swmr_mode = True  # Enable SWMR mode
    ...  # write operations with require_swmr=True

# Reader processes
with h5py.File('data.h5', 'r', swmr=True) as f:
    ...  # read operations (readers see new data after the writer flushes)
```
SWMR Requirements
- Use libver='latest' when creating files
- Set swmr_mode = True on the writer file handle
- Use require_swmr=True for write operations (validates that SWMR is enabled)
- Call file.flush() after writes to make data visible to readers
- Open reader files with swmr=True
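The checklist above can be exercised in a single process for testing purposes. The following is a plain-h5py sketch (in production the writer and readers would be separate processes); note that a reader which is already open must call `refresh()` on a dataset to see data flushed afterwards:

```python
import numpy as np
import h5py

# Writer: create the objects, then enable SWMR
writer = h5py.File('swmr_demo.h5', 'w', libver='latest')
dset = writer.create_dataset('values', data=np.arange(3.0),
                             maxshape=(None,), chunks=(25,))
writer.swmr_mode = True

# Reader: opens with swmr=True while the writer is still open
reader = h5py.File('swmr_demo.h5', 'r', swmr=True)
r_dset = reader['values']

# Writer appends and flushes
dset.resize((5,))
dset[3:] = [3.0, 4.0]
writer.flush()

# The already-open reader refreshes to see the new extent
r_dset.refresh()
print(r_dset.shape)  # (5,)

reader.close()
writer.close()
```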
Error Handling
The library provides specific exception types:
- SWMRModeError: SWMR mode required but not enabled
- SchemaMismatchError: Data doesn't match the existing schema
- ValidationError: General data validation errors
Performance Considerations
- Chunking: Default chunk size is (25,) - adjust based on access patterns
- Compression: gzip compression enabled by default
- Preallocation: Specify expected size to avoid frequent resizing
- SWMR: Minimal overhead for concurrent reading
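The storage defaults described above correspond to h5py dataset options like the following. This is a plain-h5py sketch of the layout, not the library's actual creation code, and the exact defaults pandas2hdf applies internally may differ:

```python
import h5py

# Sketch of the layout described above: chunked, gzip-compressed,
# unlimited along the growth axis so appends only resize, never rewrite.
with h5py.File('layout_demo.h5', 'w', libver='latest') as f:
    d = f.create_dataset(
        'values',
        shape=(0,),          # start empty; preallocation would use a larger shape
        maxshape=(None,),    # unlimited, so the dataset can grow on append
        chunks=(25,),        # default chunk size mentioned above
        compression='gzip',  # default compression mentioned above
        dtype='float64',
    )
    print(d.chunks, d.compression)  # (25,) gzip
```

Larger chunks amortize per-chunk overhead for sequential scans, while smaller chunks reduce read amplification for point lookups, so tune the chunk size to the dominant access pattern.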
Testing
Run the comprehensive test suite:
```shell
pytest tests/
```
The tests cover:
- Round-trip fidelity for all supported data types
- MultiIndex handling
- All write modes (preallocate, new, update, append)
- SWMR workflows and concurrent access
- Error conditions and edge cases
- Performance with large datasets
Requirements
- Python ≥ 3.10
- pandas ≥ 1.5.0
- h5py ≥ 3.7.0
- numpy ≥ 1.21.0
License
This project is licensed under the MIT License - see the LICENSE file for details.