
Project description

Content-addressable data lake

WIP

Storing any kind of metadata in the name or path of a file brings mostly disadvantages. Let's put all files in one directory, each named with a hash of its content (content-addressable storage). Then let's keep all related metadata in a database, where it is easy to manipulate.
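
To make this concrete, here is a minimal sketch of the scheme in plain Python (hashlib + sqlite3). The store directory and the catalog columns are made up for illustration; this is not caslake's actual API.

    import hashlib
    import shutil
    import sqlite3
    from pathlib import Path

    STORE = Path("store")  # one flat directory holding every file
    STORE.mkdir(exist_ok=True)

    db = sqlite3.connect("catalog.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        "  hash    TEXT NOT NULL,"   # content address of the file
        "  dataset TEXT NOT NULL,"   # metadata dimensions live side by side,
        "  date    TEXT NOT NULL)"   # with no ordering imposed, unlike a path
    )

    def put(src: Path, dataset: str, date: str) -> str:
        """Store a file under its content hash and record its metadata."""
        digest = hashlib.sha256(src.read_bytes()).hexdigest()
        blob = STORE / digest
        if not blob.exists():  # identical content is stored exactly once
            shutil.copyfile(src, blob)
        db.execute(
            "INSERT INTO catalog (hash, dataset, date) VALUES (?, ?, ?)",
            (digest, dataset, date),
        )
        db.commit()
        return digest

    # the same content may appear under several rows of metadata
    put(Path("measurements.csv"), dataset="temperatures", date="2023-10-01")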

The problems with storing metadata as paths:

  1. Paths are hierarchical while metadata are multidimensional; one has to choose an order for the dimensions.
  2. Paths map one-to-one to content, but the same content can correspond to multiple sets of metadata.
  3. Paths are either too long to be pleasant to write by hand (hive partitioning) or opaque, requiring additional knowledge of how to interpret them.
  4. Paths are string-typed.
  5. Querying the available files is tedious: it requires listing and parsing the paths.
  6. Any kind of change is extremely tedious: it requires moving files around, with either destructive operations or duplication.
  7. When two files match the same metadata, it is usually solved with an ugly suffix like _2. When new versions of files appear, the old ones are irreversibly overwritten.

Data lakes partially solve this, because with huge numbers of files they need to avoid expensive file listing operations. But they only go half-way: all the metadata are stored in a database, yet the paths still carry some semantic content, usually a date, the name of the dataset, or similar. This solves some of the points above but not others (1, 6, 7), brings additional problems with transactionality, and leaves the metadata in two places that are impossible to keep in sync.

I find point 6 the most pressing. Keeping any order in your data is only as hard as making changes to its structure. The ever-repeating story is that you

  1. gather some data, name it somehow, partition it somehow;
  2. realize that there are additional parameters you did not consider, the partitioning is not exhaustive, you chose the name of the dataset wrongly, etc.;
  3. either put the data with the other values in an extra dataset alongside the first one, OR go through the pain of restructuring the first dataset to accommodate this newly discovered feature of the data, knowing that you will do it again the moment you discover another parameter;
  4. keep around many GBs of backups in case you made a mistake during the restructuring, or pray that you didn't.

Being able to restructure the data easily is mandatory if you want to keep any order in it. You must be able to present the new structure while keeping backwards compatibility, without duplication.

We only got used to not having these perks because, decades ago, people decided that paths are the perfect level of abstraction at which to interact with the filesystem. Anything below that is flat and addressed by arbitrary numbers. Let's move this up one level and make paths flat and addressed by uninformative numbers too.

In the scenario above, had we content-addressed our files, we could have written one SQL statement to transfer our metadata into a new catalog while keeping the old one, basically for free.
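
Continuing the toy schema from the sketch above, discovering a new dimension becomes a single non-destructive statement; no file is touched, and the old catalog keeps working:

    # Present a new structure with an extra "sensor" column; rows gathered
    # before the dimension existed get a default value. The files themselves
    # never move, so the old catalog remains valid alongside the new one.
    db.execute(
        "CREATE TABLE catalog_v2 AS "
        "SELECT hash, dataset, date, 'unknown' AS sensor FROM catalog"
    )
    db.commit()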

Most of the advantages would already come with arbitrary addresses, but deriving the address from a hash of the content brings further advantages for free: identical files are automatically deduplicated, and any file can be verified against its name.

There are many technologies that use content addressing, but it is always under the hood. If they are file-oriented, they still present a path-based interface; or they are very domain-specific and implement most interaction in a special-purpose client (photo managers, bibliography managers). The data catalog and the hashed filenames are never exposed to the user.

For use-cases where the user is computer-literate, such as data science, content-addressability brings mostly joy. When the user is no longer looking at icons in windows and clicking on them, the convenience of informative paths pales in comparison with easy manipulation of metadata in a format made for it -- databases.

Of course, there are some disadvantages. The filename -> metadata mapping is no longer obvious, so deleting old files becomes problematic and needs a dedicated garbage collector.
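
Such a garbage collector can be a few lines; with the toy store and catalog from the sketches above, it deletes every stored file whose hash no longer appears in any catalog row:

    def collect_garbage() -> None:
        """Delete stored files that no catalog row references anymore."""
        referenced = {
            row[0] for row in db.execute("SELECT DISTINCT hash FROM catalog")
        }
        for blob in STORE.iterdir():
            if blob.name not in referenced:
                blob.unlink()  # concurrent writers would need locking here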

How to use this package

This project is a simple Python implementation meant for local, medium-sized data, although it is ready for any fsspec + sqlalchemy backend (e.g. S3 + Postgres). I have not investigated any scaling limits, but I see no reason the same principle could not be used for bigger data.
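
As a rough illustration of what the same principle looks like over such a remote backend, here is the write path from the first sketch redone with fsspec and SQLAlchemy; the bucket, connection string, and schema are placeholders, and none of this is caslake's actual API:

    import hashlib

    import fsspec
    from sqlalchemy import create_engine, text

    fs = fsspec.filesystem("s3")  # backed by s3fs
    engine = create_engine("postgresql://user:pass@host/lake")  # placeholder

    def put_remote(data: bytes, dataset: str, date: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = f"my-bucket/store/{digest}"  # placeholder bucket
        if not fs.exists(path):
            with fs.open(path, "wb") as f:
                f.write(data)
        with engine.begin() as conn:  # metadata insert is transactional
            conn.execute(
                text("INSERT INTO catalog (hash, dataset, date) "
                     "VALUES (:h, :d, :t)"),
                {"h": digest, "d": dataset, "t": date},
            )
        return digest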

TODO

Acknowledgement

I don't think any of these ideas are new, though I could not find anyone talking about them. The biggest inspiration comes from Nix. I recommend reading Nix – taming Unix with functional programming and the linked article Imposing a Memory Management Discipline on Software Deployment. How about imposing a memory management discipline on user data?



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

caslake-0.1.0.tar.gz (5.1 kB)


Built Distribution

caslake-0.1.0-py3-none-any.whl (5.8 kB)


File details

Details for the file caslake-0.1.0.tar.gz.

File metadata

  • Download URL: caslake-0.1.0.tar.gz
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.3 Darwin/23.6.0

File hashes

Hashes for caslake-0.1.0.tar.gz

  • SHA256: fdd0b7dc9c983258e441c87a40d2a7655b9c805f0783c5e38d1db86be686e676
  • MD5: 01756647ed64fe3e8ac81b9d37a0f24e
  • BLAKE2b-256: fdaeb9f8be2cb76e1b94893ef34b05936fd38c004c4218787a556186947d8748


File details

Details for the file caslake-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: caslake-0.1.0-py3-none-any.whl
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.3 Darwin/23.6.0

File hashes

Hashes for caslake-0.1.0-py3-none-any.whl

  • SHA256: 7df00487f2e1e22ef0130f28e811229fab618b0494161417129b46f92e3b69e3
  • MD5: 0836fc424d8755514e791d826226edaf
  • BLAKE2b-256: e21a9bde36095b0ab3c28ee0d0eb1daf5f607d67ba67ec40b0771e1d52ef4b10

