
pandas2hdf

Robust round-trip persistence between pandas Series/DataFrame and HDF5 with SWMR (Single Writer Multiple Reader) support.

Features

  • Round-trip fidelity: Preserves values, index structure, names, and missing values (dtypes are normalized as described under Data Type Handling)
  • SWMR support: Enables concurrent reading while writing with HDF5's Single Writer Multiple Reader mode
  • Flexible write modes: Preallocate, new, update, and append operations
  • MultiIndex support: Full support for pandas MultiIndex with proper reconstruction
  • Type safety: Comprehensive type hints and strict mypy compliance
  • Comprehensive testing: Extensive test suite covering edge cases and real-world scenarios

Installation

pip install pandas2hdf

Quick Start

Basic Series Operations

import pandas as pd
import h5py
from pandas2hdf import save_series_new, load_series

# Create a pandas Series
series = pd.Series([1, 2, 3, None, 5], 
                  index=['a', 'b', 'c', 'd', 'e'], 
                  name='my_data')

# Save to HDF5 with SWMR support
with h5py.File('data.h5', 'w', libver='latest') as f:
    group = f.create_group('my_series')  # create objects before enabling SWMR
    f.swmr_mode = True
    save_series_new(group, series, require_swmr=True)

# Load from HDF5
with h5py.File('data.h5', 'r', swmr=True) as f:
    group = f['my_series']
    loaded_series = load_series(group)

print(loaded_series)
# Output preserves original data, index, and name

DataFrame Operations

import pandas as pd
import h5py
from pandas2hdf import save_frame_new, load_frame

# Create a DataFrame with mixed types
df = pd.DataFrame({
    'integers': [1, 2, 3, None],
    'floats': [1.1, 2.2, 3.3, 4.4],
    'strings': ['apple', 'banana', None, 'date'],
    'booleans': [True, False, True, None]
})

# Save DataFrame
with h5py.File('dataframe.h5', 'w', libver='latest') as f:
    group = f.create_group('my_dataframe')  # create objects before enabling SWMR
    f.swmr_mode = True
    save_frame_new(group, df, require_swmr=True)

# Load DataFrame
with h5py.File('dataframe.h5', 'r', swmr=True) as f:
    group = f['my_dataframe']
    loaded_df = load_frame(group)

print(loaded_df)

SWMR Workflow with Incremental Updates

import pandas as pd
import h5py
from pandas2hdf import (
    preallocate_series_layout, 
    save_series_new, 
    save_series_append,
    load_series
)

# Writer process
with h5py.File('timeseries.h5', 'w', libver='latest') as f:
    group = f.create_group('data')

    # Preallocate space for efficient appending; all datasets must exist
    # before SWMR mode is enabled
    initial_data = pd.Series([1.0, 2.0], name='measurements')
    preallocate_series_layout(group, initial_data, preallocate=10000)

    # Enable SWMR only after every group and dataset has been created
    f.swmr_mode = True

    # Write initial data
    save_series_new(group, initial_data, require_swmr=True)

    # Append new data incrementally
    for i in range(10):
        new_data = pd.Series([float(i + 3)], name='measurements')
        save_series_append(group, new_data, require_swmr=True)
        f.flush()  # Make data visible to readers

# Concurrent reader process
with h5py.File('timeseries.h5', 'r', swmr=True) as f:
    group = f['data']
    current_data = load_series(group)
    print(f"Current length: {len(current_data)}")

API Reference

Series Functions

  • preallocate_series_layout(): Create resizable datasets without writing data
  • save_series_new(): Create new datasets and write Series data
  • save_series_update(): Update Series data at specified position
  • save_series_append(): Append Series data to end of existing datasets
  • load_series(): Load Series from HDF5 storage

DataFrame Functions

  • preallocate_frame_layout(): Create resizable layout for DataFrame
  • save_frame_new(): Create new datasets and write DataFrame
  • save_frame_update(): Update DataFrame data at specified position
  • save_frame_append(): Append DataFrame data to existing datasets
  • load_frame(): Load DataFrame from HDF5 storage

Utility Functions

  • assert_swmr_on(): Assert that SWMR mode is enabled on a file

Data Type Handling

Values

  • Numeric types (int, float): Stored as float64 with NaN for missing values
  • Boolean: Converted to float64 (True=1.0, False=0.0) with NaN for missing
  • Strings: Stored as UTF-8 variable-length strings with separate mask for missing values
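These rules can be reproduced with plain pandas/NumPy. Below is a minimal sketch of the documented coercions; it is illustrative, not pandas2hdf's internal code:

```python
import numpy as np
import pandas as pd

# Integers with a missing value: pandas itself promotes to float64 + NaN,
# matching how numeric values are stored.
s_int = pd.Series([1, 2, 3, None])

# Booleans: map to 1.0 / 0.0, with NaN where missing (None is absent from the map).
s_bool = pd.Series([True, False, None])
bool_as_float = s_bool.map({True: 1.0, False: 0.0}).astype("float64")

# Strings: keep values as UTF-8 text plus a separate missing-value mask.
s_str = pd.Series(["apple", None, "date"])
str_mask = s_str.isna().to_numpy()
str_values = np.array(
    ["" if m else v for v, m in zip(s_str, str_mask)], dtype=object
)
```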

Index

  • All index types: Converted to UTF-8 strings for consistent storage
  • MultiIndex: Each level stored separately with proper reconstruction metadata
  • Missing values: Handled via mask arrays for all index levels
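The same idea applied to indexes, under the assumption that each level is stored as strings plus a mask (again illustrative, not the library's actual on-disk layout). Note the documented consequence: non-string levels come back as strings after a round trip:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 2), (None, 3)], names=["letter", "number"]
)

# Each level becomes a string array plus a missing-value mask.
levels = []
for i in range(mi.nlevels):
    vals = mi.get_level_values(i)
    mask = [pd.isna(v) for v in vals]
    as_str = ["" if m else str(v) for v, m in zip(vals, mask)]
    levels.append((as_str, mask))

# Reconstruction restores missing entries from the mask; the integer
# level is now the string level ("1", "2", "3").
rebuilt = pd.MultiIndex.from_arrays(
    [[None if m else s for s, m in zip(a, mk)] for a, mk in levels],
    names=mi.names,
)
```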

SWMR (Single Writer Multiple Reader) Support

pandas2hdf is designed for SWMR workflows where one process writes data while multiple processes read concurrently:

# Writer process
with h5py.File('data.h5', 'w', libver='latest') as f:
    # ... create all groups and datasets first, then:
    f.swmr_mode = True  # Enable SWMR mode
    # ... write operations with require_swmr=True

# Reader processes
with h5py.File('data.h5', 'r', swmr=True) as f:
    # ... read operations (new data becomes visible after the writer flushes)

SWMR Requirements

  • Use libver='latest' when creating files
  • Create all groups and datasets before enabling SWMR mode (HDF5 does not allow creating new objects once SWMR is on)
  • Set swmr_mode = True on the writer file handle
  • Use require_swmr=True for write operations (validates that SWMR is enabled)
  • Call file.flush() after writes to make data visible to readers
  • Open reader files with swmr=True
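The checklist can be exercised with h5py alone. The sketch below appends to a resizable dataset under SWMR without any pandas2hdf calls (the file path and dataset name are arbitrary); note that an h5py reader must call refresh() on a dataset to see data the writer has flushed:

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "swmr_demo.h5")

# Writer: create every group/dataset first, then switch to SWMR mode.
with h5py.File(path, "w", libver="latest") as writer:
    dset = writer.create_dataset(
        "values", shape=(0,), maxshape=(None,), chunks=(25,), dtype="f8"
    )
    writer.swmr_mode = True  # no new objects may be created after this point

    # A reader can attach while the writer still holds the file.
    with h5py.File(path, "r", swmr=True) as reader:
        rset = reader["values"]
        for i in range(3):
            dset.resize((dset.shape[0] + 1,))
            dset[-1] = float(i)
            writer.flush()   # make the appended row visible to readers
            rset.refresh()   # reader picks up the new extent
        data = rset[...]
```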

Error Handling

The library provides specific exception types:

  • SWMRModeError: SWMR mode required but not enabled
  • SchemaMismatchError: Data doesn't match existing schema
  • ValidationError: General data validation errors

Performance Considerations

  • Chunking: Default chunk size is (25,); tune chunks to match your access patterns
  • Compression: gzip compression is enabled by default
  • Preallocation: Specify expected size to avoid frequent resizing
  • SWMR: Minimal overhead for concurrent reading
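The defaults above map onto h5py dataset-creation options roughly as follows; this is a sketch of equivalent settings, not pandas2hdf's internals, and the file path and dataset name are illustrative:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "perf_demo.h5")

with h5py.File(path, "w", libver="latest") as f:
    # Resizable, chunked, gzip-compressed dataset preallocated to 10_000 rows
    dset = f.create_dataset(
        "measurements",
        shape=(10_000,),
        maxshape=(None,),   # allow later resizing/appends
        chunks=(25,),       # the default chunk size described above
        compression="gzip",
        dtype="f8",
    )
    dset[:3] = np.array([1.0, 2.0, 3.0])
    chunk_shape = dset.chunks
    compression = dset.compression
```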

Testing

Run the comprehensive test suite:

pytest tests/

The tests cover:

  • Round-trip fidelity for all supported data types
  • MultiIndex handling
  • All write modes (preallocate, new, update, append)
  • SWMR workflows and concurrent access
  • Error conditions and edge cases
  • Performance with large datasets

Requirements

  • Python ≥ 3.10
  • pandas ≥ 1.5.0
  • h5py ≥ 3.7.0
  • numpy ≥ 1.21.0

License

This project is licensed under the MIT License; see the LICENSE file for details.

Download files

Source Distribution

pandas2hdf-1.0.0.tar.gz (17.8 kB)

Built Distribution

pandas2hdf-1.0.0-py3-none-any.whl (10.5 kB)

File details

Details for the file pandas2hdf-1.0.0.tar.gz.

File metadata

  • Download URL: pandas2hdf-1.0.0.tar.gz
  • Size: 17.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pandas2hdf-1.0.0.tar.gz:

  • SHA256: 3030dff3bda50c69602d4db57be4b48f8b65c2c319ef5719378ecd97ab4344ce
  • MD5: 2993be674624ce52a23c013e922e9d54
  • BLAKE2b-256: df5e712134bd2281654509c2b1953239dde50cb0dd306ba1d39134ceaa81a83c

Provenance

Attestation bundles for pandas2hdf-1.0.0.tar.gz were published via publish.yml on Xander-git/pandas2hdf.

File details

Details for the file pandas2hdf-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: pandas2hdf-1.0.0-py3-none-any.whl
  • Size: 10.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pandas2hdf-1.0.0-py3-none-any.whl:

  • SHA256: ca3bf89ad5fe0db432e0da87881b9ca846e7e1c8f2b944f2b3793f52003e1e6e
  • MD5: 83fa686d359ec79ae0d77d9b35f080b1
  • BLAKE2b-256: 45055e08279a19c190d0c98dd03079fab84543c2aeff95253f9a50c5f3680105

Provenance

Attestation bundles for pandas2hdf-1.0.0-py3-none-any.whl were published via publish.yml on Xander-git/pandas2hdf.
