Project description

About

ElliptIO is a small Python library for storing and accessing files in data lakes in a data science context. It stores files, together with automatically generated metadata, on any file system and inserts the metadata into a database. A lot of inspiration is drawn from Weights & Biases.

It is named after the Elliptio mussel genus which lives in freshwater lakes.

Problems and solution approach

Particularly in data science, you often find data lakes where...

Problem	Solution approach
data cannot be reproduced	Automatically log required information.
data lineage is unknown	Automatically track lineage between files.
data is accidentally modified	Lock files using S3 lock.
data has no metadata	Users can specify custom metadata when saving files.
directory structure is chaotic	Simply save files by date and user. A good metadata search makes structure much less important.
data is duplicated	Automatically replace duplicated files with references (not yet implemented)
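
To make the first and fifth rows more concrete, here is a minimal sketch of what "automatically log required information" and "save files by date and user" could look like. It is not ElliptIO's implementation: the function names `collect_automatic_metadata` and `build_remote_path`, the chosen metadata fields and the `<date>/<user>/<filename>` layout are all assumptions made for illustration.

```python
import getpass
import platform
import sys
from datetime import datetime, timezone
from pathlib import PurePosixPath


def collect_automatic_metadata(now=None):
    """Gather run information that helps to reproduce a file later (hypothetical fields)."""
    now = now or datetime.now(timezone.utc)
    try:
        user = getpass.getuser()
    except Exception:  # no resolvable user name, e.g. in some containers
        user = "unknown"
    return {
        "created_at": now.isoformat(),
        "user": user,
        "hostname": platform.node(),
        "python_version": platform.python_version(),
        "argv": list(sys.argv),
    }


def build_remote_path(metadata, filename):
    """Place the file under <date>/<user>/<filename> (assumed layout)."""
    date = metadata["created_at"][:10]  # YYYY-MM-DD part of the ISO timestamp
    return str(PurePosixPath(date) / metadata["user"] / filename)


meta = collect_automatic_metadata(datetime(2024, 2, 5, tzinfo=timezone.utc))
path = build_remote_path(meta, "train.txt")
print(path)
```

Because the path carries no meaning beyond date and user, finding a file later relies on searching the metadata rather than on browsing directories.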

Existing solutions

I find Weights & Biases a great app and have drawn a lot of inspiration from it. However, it can be rather expensive and covers much more than data storage, so it can easily be overkill.

Object stores such as S3 or Ceph already provide the option to store metadata. However, this does not cover all the information required for reproducibility. Also, querying metadata in an object store is not as efficient as querying a database.
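
As a rough illustration of the database side of that comparison, the sketch below keeps metadata rows in an in-memory SQLite table and finds all files for a ticket with one indexed query. The table layout and the `find_by_ticket` helper are assumptions for this example, not ElliptIO's actual schema.

```python
import json
import sqlite3

# In-memory database standing in for the metadata store (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (file_id TEXT PRIMARY KEY, ticket TEXT, project TEXT, config TEXT)"
)
# An index lets the database answer ticket lookups without scanning every row.
conn.execute("CREATE INDEX idx_files_ticket ON files (ticket)")

rows = [
    ("f1", "abc-123", "my_project", json.dumps({"example": "value"})),
    ("f2", "abc-123", "my_project", json.dumps({"example": "other"})),
    ("f3", "xyz-999", "other_project", json.dumps({})),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)


def find_by_ticket(ticket):
    """Return the IDs of all files logged for the given ticket."""
    cursor = conn.execute(
        "SELECT file_id FROM files WHERE ticket = ? ORDER BY file_id", (ticket,)
    )
    return [row[0] for row in cursor]


matches = find_by_ticket("abc-123")
print(matches)
```

With metadata stored only on the objects themselves, the same lookup would typically mean listing the objects and reading each one's metadata individually.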

How to use

import json
import pandas as pd
from elliptio import get_default_handler, ManualMetadata

# set up manual metadata (optional) and handler
metadata = ManualMetadata(
    ticket="abc-123",
    project="my_project",
    config=json.dumps({"example": "value"}),
    description="lorem ipsum",
)
h = get_default_handler(dirpath="/tmp/my_data_lake", manual_metadata=metadata)

# save file directly to remote
df = pd.DataFrame({"a": [1], "b": [2]})
with h.create("train.txt") as f:
    df.to_csv(f.remote_url)

# load file. Its file_id will be added to every new file in this session.
train_file = h.load(f.file_id)

# upload an existing local file
# model.train(train_file)
with h.create("model.pickle") as model:
    model.upload("/tmp/my_data_lake/best_model.pickle")
assert model.based_on == [train_file.file_id]

# querying the database
df = h.query({"ticket": "abc-123"})
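
The lineage behaviour in the example above (a loaded file's ID being attached to every file created later in the same session) can be pictured with the following toy tracker. This is only a sketch of the idea: `SessionTracker` and `FileRecord` are hypothetical names, and the library's real tracker may work differently.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    name: str
    file_id: str
    based_on: list = field(default_factory=list)


class SessionTracker:
    """Remembers which files were loaded and links new files to them."""

    def __init__(self):
        self._records = {}
        self._loaded_ids = []

    def create(self, name):
        # Every new file inherits the IDs of all files loaded so far.
        record = FileRecord(name, uuid.uuid4().hex, list(self._loaded_ids))
        self._records[record.file_id] = record
        return record

    def load(self, file_id):
        record = self._records[file_id]
        if file_id not in self._loaded_ids:
            self._loaded_ids.append(file_id)
        return record


tracker = SessionTracker()
train = tracker.create("train.txt")
tracker.load(train.file_id)
model = tracker.create("model.pickle")
print(model.based_on == [train.file_id])
```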

How to install

Simply run pip install elliptio.

Tips

You can easily pass custom filesystem, database, tracker and id_creator classes to get_default_handler.
The current filesystem class is based on fsspec and thus should support all their filesystem implementations (S3, Azure Blob service, Google Cloud Storage, etc.). See example below.
To create a nice GUI for your database, I can recommend Metabase. Without the enterprise features, Metabase is AGPL licensed. You have to be careful when modifying the code or incorporating it into your application, but running the unmodified app internally in "vanilla mode" seems to be fine according to them.
The terraform/ directory contains example Terraform code to set up S3 and a free MongoDB on AWS. However, there's currently no MongoDB implementation for the DatabaseInterface.

# Example for passing custom FileSystemInterfaces like S3
from elliptio.adapters import db, fs
from elliptio import get_default_handler

h = get_default_handler(
    fs=fs.FsspecFilesystem(prefix="some/prefix/", protocol="s3", storage_options={}),
    db=db.SqlDatabase("db.sqlite"),
)

TODOs

compare with other data versioning tools from https://github.com/EthicalML/awesome-production-machine-learning/ (and other reproducibility tools)
automatic metadata
- automatically log git-hash and git diff (also from new untracked files)
- does argv work with Jupyter notebooks?!
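
For the git-related TODO item, one possible approach is to shell out to the git CLI. The sketch below is an assumption about how this could be done, not existing ElliptIO code, and `collect_git_state` is a hypothetical helper. It records the commit hash, the diff of tracked files against HEAD, and the names of untracked files. `git diff` does not cover untracked files, so their contents would have to be read separately.

```python
import subprocess


def _git(args, cwd):
    """Run a git command and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git not installed, bad directory, or not a repository
    return result.stdout


def collect_git_state(cwd="."):
    """Return commit hash, diff and untracked files, or None outside a repository."""
    commit = _git(["rev-parse", "HEAD"], cwd)
    if commit is None:
        return None
    diff = _git(["diff", "HEAD"], cwd)  # staged and unstaged changes to tracked files
    untracked = _git(["ls-files", "--others", "--exclude-standard"], cwd)
    return {
        "commit": commit.strip(),
        "diff": diff or "",
        "untracked_files": (untracked or "").splitlines(),
    }


state = collect_git_state()
print("no git repository found" if state is None else state["commit"])
```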

Release history

0.2.1 (Feb 5, 2024)
0.2.0 (Feb 5, 2024)
0.1.0 (Feb 5, 2024, this version)
