Library for maintaining evolving tabular data sets
History Store (HISTORE) is a Python package for maintaining snapshots of evolving data sets. It reimplements the core functionality of the XML Archiver (XArch) as a lightweight package intended for data set snapshots that are represented as pandas data frames.
HISTORE is based on a nested merge approach that efficiently stores multiple dataset snapshots in a compact archive [Buneman, Khanna, Tajima, Tan. 2004]. The library allows one to create new archives, to merge new data set snapshots into an existing archive, and to retrieve data set snapshots from the archive.
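To make the idea of a compact merged archive concrete, here is a minimal sketch in plain Python. It is a toy illustration only and not HISTORE's actual data structures or algorithm: the class `ToyArchive`, its attributes, and the snapshot format (a dict mapping row keys to records) are all assumptions made for this example. Each cell value is stored once, together with the list of versions in which it occurs, so unchanged values are not duplicated across snapshots.

```python
# Toy illustration only: ToyArchive is a hypothetical class, not part of histore.
class ToyArchive:
    """Keep all snapshots in one merged structure keyed by row key."""

    def __init__(self):
        # row key -> {column name -> [(value, [versions with that value])]}
        self.rows = {}
        # Number of snapshots committed so far.
        self.versions = 0

    def commit(self, snapshot):
        """Merge a snapshot given as {row_key: {column: value}}."""
        version = self.versions
        for key, record in snapshot.items():
            cells = self.rows.setdefault(key, {})
            for column, value in record.items():
                history = cells.setdefault(column, [])
                if history and history[-1][0] == value:
                    # Unchanged value: only record the additional version.
                    history[-1][1].append(version)
                else:
                    # New or changed value: start a new history entry.
                    history.append((value, [version]))
        self.versions += 1
        return version

    def checkout(self, version):
        """Rebuild the snapshot that was committed as the given version."""
        snapshot = {}
        for key, cells in self.rows.items():
            record = {}
            for column, history in cells.items():
                for value, versions in history:
                    if version in versions:
                        record[column] = value
            if record:
                snapshot[key] = record
        return snapshot


archive = ToyArchive()
archive.commit({'Alice': {'Age': 32}, 'Bob': {'Age': 45}})
archive.commit({'Alice': {'Age': 33}, 'Bob': {'Age': 45}})
print(archive.rows['Bob'])   # Bob's unchanged age is stored only once
print(archive.checkout(0))
```

In this sketch the storage cost grows with the number of changes rather than with the number of snapshots, which is the property the nested merge approach aims for.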
Installation
Install histore from the Python Package Index (PyPI) using pip with:
pip install histore
Examples
HISTORE maintains data set versions (snapshots) in an archive. A separate archive is created for each data set. The package currently provides two types of archive: a volatile archive that maintains all data set snapshots in main memory, and a persistent archive that writes data set snapshots to disk.
Example using Volatile Archive
Start by creating a new archive. At creation time, a primary key (a column name or list of column names) can be specified. If a primary key is given, the values in the key attributes are used as row keys when data set snapshots are merged into the archive. If no primary key is specified, the row index of the data frame is used to match rows during the merge phase.
# Create a new archive that merges snapshots
# based on a primary key attribute
import histore as hs
archive = hs.Archive(primary_key='Name')
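The two row-matching modes described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not HISTORE's code: the helper `row_keys` is hypothetical, rows are plain lists, and the "row index" is assumed to be the default positional index 0, 1, 2, and so on.

```python
# Hypothetical helper for illustration; not part of the histore API.
def row_keys(rows, columns, primary_key=None):
    """Return one key per row, used to match rows across snapshots."""
    if primary_key is None:
        # No primary key: fall back to the positional row index.
        return list(range(len(rows)))
    if isinstance(primary_key, str):
        # Accept a single column name as well as a list of names.
        primary_key = [primary_key]
    positions = [columns.index(name) for name in primary_key]
    # The key is the tuple of values in the primary key columns.
    return [tuple(row[i] for i in positions) for row in rows]


rows = [['Alice', 32], ['Bob', 45]]
print(row_keys(rows, ['Name', 'Age']))
print(row_keys(rows, ['Name', 'Age'], primary_key='Name'))
```

With a primary key, a row keeps its identity even if its position in the data frame changes between snapshots. With the positional index, reordering rows makes them look like changed rows.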
Add the first two data set versions to the archive:
import pandas as pd
# First version
df = pd.DataFrame(
    data=[['Alice', 32], ['Bob', 45], ['Claire', 27], ['Dave', 23]],
    columns=['Name', 'Age']
)
archive.commit(df, description='First snapshot')
# Second version: Change age for Alice and Bob
df = pd.DataFrame(
    data=[['Alice', 33], ['Bob', 44], ['Claire', 27], ['Dave', 23]],
    columns=['Name', 'Age']
)
archive.commit(df, description='Alice is 33 and Bob 44')
List information about all snapshots in the archive. This also shows how to use the checkout method to retrieve a particular data set version:
# Print all data frame versions
for s in archive.snapshots():
    df = archive.checkout(s.version)
    print('({}) {}\n'.format(s.version, s.description))
    print(df)
    print()
The result should look like this:
(0) First snapshot

     Name  Age
0   Alice   32
1     Bob   45
2  Claire   27
3    Dave   23

(1) Alice is 33 and Bob 44

     Name  Age
0   Alice   33
1     Bob   44
2  Claire   27
3    Dave   23
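To see exactly which cells differ between two checked-out versions, one option is to compare them by primary key. The sketch below uses plain lists of dicts so that it runs without any dependencies; the helper `changed_cells` is hypothetical and not part of histore. With pandas, records in this form can be obtained from a data frame via `df.to_dict('records')`.

```python
# Hypothetical helper for illustration; not part of the histore API.
def changed_cells(old_rows, new_rows, key):
    """List (row key, column, old value, new value) for rows present in both."""
    old_by_key = {row[key]: row for row in old_rows}
    changes = []
    for row in new_rows:
        previous = old_by_key.get(row[key])
        if previous is None:
            # Row does not exist in the old snapshot; skip it here.
            continue
        for column, value in row.items():
            if previous.get(column) != value:
                changes.append((row[key], column, previous.get(column), value))
    return changes


v0 = [{'Name': 'Alice', 'Age': 32}, {'Name': 'Bob', 'Age': 45},
      {'Name': 'Claire', 'Age': 27}, {'Name': 'Dave', 'Age': 23}]
v1 = [{'Name': 'Alice', 'Age': 33}, {'Name': 'Bob', 'Age': 44},
      {'Name': 'Claire', 'Age': 27}, {'Name': 'Dave', 'Age': 23}]
for change in changed_cells(v0, v1, key='Name'):
    print(change)
```

For the two snapshots in this example, only the `Age` cells of Alice and Bob are reported.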
Example using Persistent Archive
To create a persistent archive that maintains all data on disk, use the PersistentArchive class:
archive = hs.PersistentArchive(basedir='path/to/archive/dir', primary_key=['Name'])
The persistent archive maintains the data set snapshots in two files that are created in the directory that is given as the basedir argument.
For more examples see the notebooks in the examples folder.