
Project description

dirdb

A very primitive "database" interface using the file system directly. Databases are directories and each object is a pickled file in that directory together with an optional JSON file for metadata.

Use Case

Generally: you have a few (hundreds of) datasets, but each one is potentially large (up to the memory limit).

My use case was for some simple machine learning. Workers were generating large in-memory numpy arrays. Each worker dumped this training data in a common place, where the model generator could iterate over them and update the metadata with statistics.

This is obviously not a good choice if you have millions of datasets or if you need any actual database features.

Example

import numpy as np
from dirdb import dirdb

# A database is simply a valid directory path. It will be created if
# it doesn't exist. Here we create the directory `./testdb`.
db = dirdb('testdb')

# It acts a lot like a dict() and returns a "database entry", which is
# a proxy object that lets you inspect the entry's JSON metadata and
# load the associated data set.

assert "foo" not in db
entry = db["foo"]

# Note that entry also boolean-evaluates to False because it has no
# data associated with it.
assert not entry

# Once we put some data into it, the data will go into
# `./testdb/foo.pickle` and the `.meta` attribute (if not None) will
# be stored in `./testdb/foo.json`.
entry.put_data(np.random.rand(5,5), meta={'shape': (5,5)})

assert (entry.name in db) and entry

# These entry objects can be used in a `with:` statement to lock them
# (optional, but recommended).
with entry:
  # Deletes the meta data. (This action will be flushed to disk.)
  entry.meta = None

  # This is also flushed.
  entry.meta = {'test': [1,2]}

  # The meta dict attribute acts sort of like a JavaScript object.
  # Metadata is not flushed here.
  entry.meta.test.append(3)

  # But it will be flushed upon exit of the with: block --v

# Saves some data without .json metadata.
db['bar'].put_data([1] * 1000)

# Later, reloading the data:
with db['foo'] as e:
  # Inspect the metadata (loads the .json file):
  print("the updated meta:", e.meta)

  # Retrieve the data:
  print("data was:\n", e.get_data())

# Iterate over elements in the directory.
for entry in db:
  with entry:
    print(f"{entry.name} w/ meta data {str(entry.meta)}")

How It Works

Databases are file system directories. Each dataset consists of one to three files: the pickled data, an optional metadata JSON file, and a lock file.
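As an illustration of that layout, here it is reproduced by hand with plain `pickle` and `json` (this is not dirdb's internal code; the `<name>.pickle`/`<name>.json` naming follows the example above, and the lock file would only appear while an entry is locked):

```python
import json
import os
import pickle
import tempfile

# Recreate the presumed on-disk layout by hand: the data goes in
# <db>/<name>.pickle and the metadata in <db>/<name>.json.
dbdir = tempfile.mkdtemp()

with open(os.path.join(dbdir, "foo.pickle"), "wb") as f:
    pickle.dump([[0.1, 0.2], [0.3, 0.4]], f)

with open(os.path.join(dbdir, "foo.json"), "w") as f:
    json.dump({"shape": [2, 2]}, f)

print(sorted(os.listdir(dbdir)))  # ['foo.json', 'foo.pickle']
```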

Datasets are loaded and saved in full. There's no slicing. It uses pickle to save and load data.
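A minimal sketch of what full-file save and load could look like with pickle (the function names here are illustrative, not dirdb's API):

```python
import os
import pickle
import tempfile

# Save and load an entire object at once; there is no way to read
# just a slice out of the pickled file.
def save_data(path, obj):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_data(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "bar.pickle")
save_data(path, [1] * 1000)
data = load_data(path)
print(len(data), data[0])  # 1000 1
```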

Use `with` on an entry to lock it (via file locks). This is optional, but it can be used to block other processes from touching the entry. There's no need to lock the database itself.
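For illustration, here is one way such per-entry file locking can be done on Unix with `fcntl.flock` (whether dirdb uses this exact mechanism is an assumption):

```python
import fcntl
import os
import tempfile

# Illustrative entry locking with fcntl.flock: the lock file sits
# next to the entry's data file, and a second process calling
# flock(LOCK_EX) on it would block until we release the lock.
lock_path = os.path.join(tempfile.mkdtemp(), "foo.lock")
lock_file = open(lock_path, "w")
fcntl.flock(lock_file, fcntl.LOCK_EX)
try:
    # ... read/write foo.pickle and foo.json safely here ...
    held = True
finally:
    fcntl.flock(lock_file, fcntl.LOCK_UN)
    lock_file.close()
print("released")
```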

Each dataset can have a metadata dictionary associated with it. Accessing entry.meta automatically creates such a dictionary. This object behaves like a dict() and is saved as a .json file. The purpose is to have a small object that can be loaded and inspected without loading the data itself.
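A minimal sketch of a dict with attribute access, giving the "JavaScript object" feel used in the example above (dirdb's real meta class is surely more involved, e.g. it would also track whether it needs flushing):

```python
import json

# Sketch: a dict whose keys can also be read and written as
# attributes, and which serializes straight to JSON.
class AttrDict(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

meta = AttrDict({"test": [1, 2]})
meta.test.append(3)          # attribute access into the underlying dict
meta.note = "stats go here"
print(json.dumps(meta, sort_keys=True))
# {"note": "stats go here", "test": [1, 2, 3]}
```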

Why?

Twice I had large HDF5 databases become corrupted by unexpected reboots, costing me hours or even days of work. This can easily happen if something is written to the HDF5 file but the contents aren't flushed; it's an easy way to lose your entire dataset, because there are no good tools for repairing broken files. It was incredibly frustrating.

I wrote this since I didn't need most of the functionality of HDF5 anyway.

Todo

  • deletion
  • stat()
  • more consistent API
  • sorting?

