
Project description

Content-addressable data lake

WIP

Storing any kind of metadata in the name or path of a file brings mostly disadvantages. Let's put all files in one directory, each named with a hash of its content (content-addressable storage). Then let's keep all related metadata in a database, where it is easy to manipulate.
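
To make this concrete, here is a minimal sketch of the scheme in plain Python (hashlib + sqlite3). The store directory and the catalog columns are made up for illustration; this is not caslake's actual API.

    import hashlib
    import shutil
    import sqlite3
    from pathlib import Path

    STORE = Path("store")  # one flat directory holding every file
    STORE.mkdir(exist_ok=True)

    db = sqlite3.connect("catalog.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        "  hash    TEXT NOT NULL,"   # content address of the file
        "  dataset TEXT NOT NULL,"   # metadata dimensions live side by side,
        "  date    TEXT NOT NULL)"   # with no ordering imposed, unlike a path
    )

    def put(src: Path, dataset: str, date: str) -> str:
        """Store a file under its content hash and record its metadata."""
        digest = hashlib.sha256(src.read_bytes()).hexdigest()
        blob = STORE / digest
        if not blob.exists():  # identical content is stored exactly once
            shutil.copyfile(src, blob)
        db.execute(
            "INSERT INTO catalog (hash, dataset, date) VALUES (?, ?, ?)",
            (digest, dataset, date),
        )
        db.commit()
        return digest

    # the same content may appear under several rows of metadata
    put(Path("measurements.csv"), dataset="temperatures", date="2023-10-01")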

The problems with storing metadata as paths:

  1. Paths are hierarchical while metadata are multidimensional; one has to choose an order for the dimensions.
  2. Paths map one-to-one to content, but the same content can correspond to multiple sets of metadata.
  3. Paths are either too long to be pleasant to write by hand (hive partitioning) or opaque, requiring additional knowledge of how to interpret them.
  4. Paths are string-typed.
  5. Querying the available files is tedious: it requires listing and parsing the paths.
  6. Any kind of change is extremely tedious: it requires moving files around, with either destructive operations or duplication.
  7. When two files match the same metadata, it is usually solved with an ugly suffix like _2. When new versions of files appear, the old ones are irreversibly overwritten.

Data lakes partially solve this, because with huge numbers of files they need to avoid expensive file listing operations. But they only go half-way: all the metadata are stored in a database, yet the paths still carry some semantic content, usually a date, the name of the dataset, or similar. This solves some of the points above but not others (1, 6, 7), brings additional problems with transactionality, and leaves the metadata in two places that are impossible to keep in sync.

I find point 6 the most pressing. Keeping any order in your data is only as hard as making changes to its structure. The ever-repeating story is that you

  1. gather some data, name it somehow, partition it somehow;
  2. realize that there are additional parameters you did not consider, the partitioning is not exhaustive, you chose the name of the dataset wrongly, etc.;
  3. either put the data with the other values in an extra dataset alongside the first one, OR go through the pain of restructuring the first dataset to accommodate this newly discovered feature of the data, knowing that you will do it again the moment you discover another parameter;
  4. keep around many GBs of backups in case you made a mistake during the restructuring, or pray that you didn't.

Being able to restructure the data easily is mandatory if you want to keep any order in it. You must be able to present the new structure while keeping backwards compatibility, without duplication.

We only got used to not having these perks because, decades ago, people decided that paths are the perfect level of abstraction at which to interact with the filesystem. Anything below that is flat and addressed by arbitrary numbers. Let's move this up one level and make paths flat and addressed by uninformative numbers too.

In the scenario above, had we content-addressed our files, we could have written one SQL statement to transfer our metadata into a new catalog while keeping the old one, basically for free.
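
Continuing the toy schema from the sketch above, discovering a new dimension becomes a single non-destructive statement; no file is touched, and the old catalog keeps working:

    # Present a new structure with an extra "sensor" column; rows gathered
    # before the dimension existed get a default value. The files themselves
    # never move, so the old catalog remains valid alongside the new one.
    db.execute(
        "CREATE TABLE catalog_v2 AS "
        "SELECT hash, dataset, date, 'unknown' AS sensor FROM catalog"
    )
    db.commit()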

Most of the advantages would already come with arbitrary addresses, but deriving the address from a hash of the content brings further advantages for free: identical files are automatically deduplicated, and any file can be verified against its name.

There are many technologies that use content addressing, but it is always under the hood. If they are file-oriented, they still present a path-based interface; or they are very domain-specific and implement most interaction in a special-purpose client (photo managers, bibliography managers). The data catalog and the hashed filenames are never exposed to the user.

For use-cases where the user is computer-literate, such as data science, content-addressability brings mostly joy. When the user is no longer looking at icons in windows and clicking on them, the convenience of informative paths pales in comparison with easy manipulation of metadata in a format made for it -- databases.

Of course, there are some disadvantages. The filename -> metadata mapping is no longer obvious, so deleting old files becomes problematic and needs a dedicated garbage collector.
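
Such a garbage collector can be a few lines; with the toy store and catalog from the sketches above, it deletes every stored file whose hash no longer appears in any catalog row:

    def collect_garbage() -> None:
        """Delete stored files that no catalog row references anymore."""
        referenced = {
            row[0] for row in db.execute("SELECT DISTINCT hash FROM catalog")
        }
        for blob in STORE.iterdir():
            if blob.name not in referenced:
                blob.unlink()  # concurrent writers would need locking here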

How to use this package

This project is a simple Python implementation meant for local, medium-sized data, although it is ready for any fsspec + sqlalchemy backend (e.g. S3 + Postgres). I have not investigated any scaling limits, but I see no reason the same principle could not be used for bigger data.
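
As a rough illustration of what the same principle looks like over such a remote backend, here is the write path from the first sketch redone with fsspec and SQLAlchemy; the bucket, connection string, and schema are placeholders, and none of this is caslake's actual API:

    import hashlib

    import fsspec
    from sqlalchemy import create_engine, text

    fs = fsspec.filesystem("s3")  # backed by s3fs
    engine = create_engine("postgresql://user:pass@host/lake")  # placeholder

    def put_remote(data: bytes, dataset: str, date: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = f"my-bucket/store/{digest}"  # placeholder bucket
        if not fs.exists(path):
            with fs.open(path, "wb") as f:
                f.write(data)
        with engine.begin() as conn:  # metadata insert is transactional
            conn.execute(
                text("INSERT INTO catalog (hash, dataset, date) "
                     "VALUES (:h, :d, :t)"),
                {"h": digest, "d": dataset, "t": date},
            )
        return digest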

TODO

Acknowledgement

I don't think any of these ideas are new, though I could not find anyone talking about them. The biggest inspiration comes from Nix. I recommend reading Nix – taming Unix with functional programming and the linked article Imposing a Memory Management Discipline on Software Deployment. How about imposing a memory management discipline on user data?



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

caslake-0.1.0.tar.gz (5.1 kB)


Built Distribution

caslake-0.1.0-py3-none-any.whl (5.8 kB)


File details

Details for the file caslake-0.1.0.tar.gz.

File metadata

  • Download URL: caslake-0.1.0.tar.gz
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.3 Darwin/23.6.0

File hashes

Hashes for caslake-0.1.0.tar.gz

  • SHA256: fdd0b7dc9c983258e441c87a40d2a7655b9c805f0783c5e38d1db86be686e676
  • MD5: 01756647ed64fe3e8ac81b9d37a0f24e
  • BLAKE2b-256: fdaeb9f8be2cb76e1b94893ef34b05936fd38c004c4218787a556186947d8748


File details

Details for the file caslake-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: caslake-0.1.0-py3-none-any.whl
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.3 Darwin/23.6.0

File hashes

Hashes for caslake-0.1.0-py3-none-any.whl

  • SHA256: 7df00487f2e1e22ef0130f28e811229fab618b0494161417129b46f92e3b69e3
  • MD5: 0836fc424d8755514e791d826226edaf
  • BLAKE2b-256: e21a9bde36095b0ab3c28ee0d0eb1daf5f607d67ba67ec40b0771e1d52ef4b10

