
Library for maintaining evolving tabular data sets

Project description

History Store

History Store (HISTORE) is a Python package for maintaining snapshots of evolving data sets. The package provides an implementation of the core functionality of the XML Archiver (XArch). It is a lightweight implementation intended for maintaining data set snapshots that are represented as pandas data frames.

HISTORE is based on a nested merge approach that efficiently stores multiple dataset snapshots in a compact archive [Buneman, Khanna, Tajima, Tan. 2004]. The library allows one to create new archives, to merge new data set snapshots into an existing archive, and to retrieve data set snapshots from the archive.

Installation

Install histore from the Python Package Index (PyPI) using pip with:

pip install histore

Examples

HISTORE maintains data set versions (snapshots) in an archive. A separate archive is created for each data set. The package currently provides two different types of archives: a volatile archive that maintains all data set snapshots in main memory and a persistent archive that writes data set snapshots to disk.

Example using Volatile Archive

Start by creating a new archive. For each archive, an optional primary key (a list of column names) can be specified. If a primary key is given, the values in the key attributes are used as row keys when data set snapshots are merged into the archive. If no primary key is specified, the row index of the data frame is used to match rows during the merge phase (a sketch of this case appears at the end of this example).

For archives that have a primary key, the initial data set snapshot (or at least the data set schema) needs to be given when the archive is created.

# Create a new archive that merges snapshots
# based on a primary key attribute.

import histore as hs
import pandas as pd

# First version
df = pd.DataFrame(
    data=[['Alice', 32], ['Bob', 45], ['Claire', 27], ['Dave', 23]],
    columns=['Name', 'Age']
)
archive = hs.Archive(doc=df, primary_key='Name', descriptor=hs.Descriptor('First snapshot'))

Merge the second data set version into the archive (the first version was added when the archive was created):

# Second version: Change age for Alice and Bob
df = pd.DataFrame(
    data=[['Alice', 33], ['Bob', 44], ['Claire', 27], ['Dave', 23]],
    columns=['Name', 'Age']
)
archive.commit(df, descriptor=hs.Descriptor('Alice is 33 and Bob 44'))

The following lists information about all snapshots in the archive and shows how to use the checkout method to retrieve a particular data set version:

# Print all data frame versions
for s in archive.snapshots():
    df = archive.checkout(s.version)
    print('({}) {}\n'.format(s.version, s.description))
    print(df)
    print()

The result should look like this:

(0) First snapshot

     Name  Age
0   Alice   32
1     Bob   45
2  Claire   27
3    Dave   23

(1) Alice is 33 and Bob 44

     Name  Age
0   Alice   33
1     Bob   44
2  Claire   27
3    Dave   23
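
For comparison, here is a minimal sketch of an archive without a primary key. It assumes that hs.Archive() can be created without an initial snapshot and that rows are then matched by the data frame's row index when commit is called; treat the exact calls as an illustration rather than the definitive API.

import histore as hs
import pandas as pd

# Archive without a primary key (assumption: an empty archive can be
# created and rows are matched by the data frame's row index).
archive = hs.Archive()

# First version.
df = pd.DataFrame(
    data=[['Alice', 32], ['Bob', 45]],
    columns=['Name', 'Age']
)
archive.commit(df, descriptor=hs.Descriptor('First snapshot'))

# Second version: rows keep their index positions, so the merge matches
# them by index rather than by the Name column.
df = pd.DataFrame(
    data=[['Alice', 33], ['Bob', 44]],
    columns=['Name', 'Age']
)
archive.commit(df, descriptor=hs.Descriptor('Alice is 33 and Bob 44'))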

Example using Persistent Archive

To create a persistent archive that maintains all data on disk, use the PersistentArchive class:

archive = hs.PersistentArchive(basedir='path/to/archive/dir', create=True, doc=df, primary_key=['Name'])

The persistent archive maintains the data set snapshots in two files that are created in the directory given as the basedir argument.
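
To access an archive again in a later session, the same class can be pointed at the existing directory. The following is a sketch that assumes passing create=False re-opens the archive already stored in basedir instead of creating a new, empty one:

import histore as hs

# Re-open the existing archive (assumption: create=False opens the
# archive already stored in basedir rather than creating a new one).
archive = hs.PersistentArchive(basedir='path/to/archive/dir', create=False)

# List and check out the stored snapshots, as in the volatile example.
for s in archive.snapshots():
    df = archive.checkout(s.version)
    print('({}) {}'.format(s.version, s.description))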

For more examples see the notebooks in the examples folder.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

histore-0.4.1.tar.gz (76.5 kB)


Built Distribution

histore-0.4.1-py3-none-any.whl (109.8 kB)


File details

Details for the file histore-0.4.1.tar.gz.

File metadata

  • Download URL: histore-0.4.1.tar.gz
  • Upload date:
  • Size: 76.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for histore-0.4.1.tar.gz

  • SHA256: c4321f95a04b86d920eb3a80f2d4196002222e547d7c38a70ed857e5e5fbbef6
  • MD5: fd6daa9dfa8c60b4f754b2bc60b61a72
  • BLAKE2b-256: b89d1415d1e0a0458dc4783b1fe4aa4251249641ea5cf7d5f0bea03b8caed1f9

See more details on using hashes here.
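
For illustration, the published SHA256 digest can be compared against a locally downloaded copy of the source distribution using Python's standard hashlib module; the file path below is a placeholder for wherever the file was saved:

import hashlib

# Placeholder path to the downloaded source distribution.
path = 'histore-0.4.1.tar.gz'

# SHA256 digest as published above.
expected = 'c4321f95a04b86d920eb3a80f2d4196002222e547d7c38a70ed857e5e5fbbef6'

with open(path, 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print('OK' if digest == expected else 'MISMATCH')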

File details

Details for the file histore-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: histore-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 109.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for histore-0.4.1-py3-none-any.whl

  • SHA256: df832c76c66e59090ffaed440ced9018c22b245426760d5e3ddcbabb42174a6f
  • MD5: 0221c49f3a0fd0d30098e84f6c0cbc78
  • BLAKE2b-256: 968fb8b0c8119c2139d399756b9c1128f6c8c81f96f2a3d87370e1e8a3ea5d6f

See more details on using hashes here.
