About

ElliptIO is a small Python library for storing and accessing files in data lakes in a data science context. It stores files, together with automatically generated metadata, on any file system and inserts the metadata into a database. A lot of inspiration is drawn from Weights & Biases.

It is named after the Elliptio mussel genus which lives in freshwater lakes.

Problems and solution approach

Particularly in data science, you often find data lakes where...

Problem                          Solution approach
data cannot be reproduced        Automatically log required information.
data lineage is unknown          Automatically track lineage between files.
data is accidentally modified    Lock files using S3 lock.
data has no metadata             Users can specify custom metadata when saving files.
directory structure is chaotic   Simply save files by date and user. A good metadata search makes structure much less important.
data is duplicated               Automatically replace duplicated files with references (not yet implemented).
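As an illustration of the first and fifth rows, here is a minimal sketch of what automatically logged information and a date-and-user path layout could look like. This is not ElliptIO's actual implementation: `auto_metadata`, `default_path` and all field names are hypothetical and only show the idea.

```python
import os
import platform
import sys
import uuid
from datetime import datetime, timezone


def auto_metadata(user=None, based_on=None):
    """Collect information that helps reproduce a file later (hypothetical fields)."""
    if user is None:
        # Fall back to common environment variables; "unknown" if neither is set.
        user = os.environ.get("USER") or os.environ.get("USERNAME") or "unknown"
    return {
        "file_id": uuid.uuid4().hex,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "python_version": platform.python_version(),
        "argv": list(sys.argv),
        "based_on": list(based_on or []),  # ids of files this one was derived from
    }


def default_path(metadata, filename):
    """Build a path of the form <date>/<user>/<file_id>/<filename>."""
    date = metadata["created_at"][:10]  # YYYY-MM-DD part of the ISO timestamp
    return f"{date}/{metadata['user']}/{metadata['file_id']}/{filename}"


meta = auto_metadata(user="alice", based_on=["parent-file-id"])
print(default_path(meta, "train.txt"))
```

Because the path contains a unique id, two files with the same name never collide, and finding a file becomes a matter of searching the metadata rather than browsing directories.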

Existing solutions

I find Weights & Biases a great app, and a lot of inspiration is drawn from it. However, it can be rather expensive and covers much more than just data storage, so it can easily be overkill.

Object stores such as S3 or Ceph already provide the option to store metadata. However, this does not cover all the information required for reproducibility. Also, querying object metadata is not as efficient as querying a database.
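To illustrate why a database helps here, the following sketch stores one metadata row per file in an in-memory SQLite database and filters by ticket using an index, instead of reading the metadata of every object. The table layout, the column names and the `query` helper are assumptions made for this example and do not reflect ElliptIO's real schema.

```python
import sqlite3

# Hypothetical schema: only these columns may be used as filters.
ALLOWED_COLUMNS = {"file_id", "ticket", "project", "username"}


def query(conn, filters):
    """Return rows whose columns equal the given values, e.g. {"ticket": "abc-123"}."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # Column names come from the allowlist; values are passed as parameters.
    where = " AND ".join(f"{col} = ?" for col in filters) or "1 = 1"
    sql = (
        "SELECT file_id, ticket, project, username "
        f"FROM files WHERE {where} ORDER BY file_id"
    )
    return conn.execute(sql, list(filters.values())).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files "
    "(file_id TEXT PRIMARY KEY, ticket TEXT, project TEXT, username TEXT)"
)
conn.execute("CREATE INDEX idx_files_ticket ON files (ticket)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("f1", "abc-123", "my_project", "alice"),
        ("f2", "abc-123", "my_project", "bob"),
        ("f3", "xyz-999", "other", "alice"),
    ],
)
print(query(conn, {"ticket": "abc-123"}))
```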

How to use

import json
import pandas as pd
from elliptio import get_default_handler, ManualMetadata

# setup manual metadata (optional) and handler
metadata = ManualMetadata(
    ticket="abc-123",
    project="my_project",
    config=json.dumps({"example": "value"}),
    description="lorem ipsum",
)
h = get_default_handler(dirpath="/tmp/my_data_lake", manual_metadata=metadata)

# save file directly to remote
df = pd.DataFrame({"a": [1], "b": [2]})
with h.create("train.txt") as f:
    df.to_csv(f.remote_url)

# load file. Its file_id will be added to every new file in this session.
train_file = h.load(f.file_id)

# upload an existing local file
# model.train(train_file)
with h.create("model.pickle") as model:
    model.upload("/tmp/my_data_lake/best_model.pickle")
assert model.based_on == [train_file.file_id]

# querying the database
df = h.query({"ticket": "abc-123"})
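The usage example relies on loaded files being recorded as parents of every file created later in the same session. One way such session-based lineage tracking could work is sketched below. `LineageSession` is a hypothetical stand-in written for this illustration; it is not ElliptIO's handler and does not touch any storage.

```python
import uuid


class LineageSession:
    """Remember loaded file ids and attach them to files created afterwards."""

    def __init__(self):
        self._loaded = []  # ids of files loaded so far, in load order

    def load(self, file_id):
        # Record each loaded file once; a real handler would also fetch it.
        if file_id not in self._loaded:
            self._loaded.append(file_id)
        return file_id

    def create(self, name):
        # New files get a fresh id and a copy of the current lineage list.
        return {
            "file_id": uuid.uuid4().hex,
            "name": name,
            "based_on": list(self._loaded),
        }


session = LineageSession()
train_id = session.load("train-file-id")
model = session.create("model.pickle")
assert model["based_on"] == [train_id]
```

Copying the list on `create` means later loads do not change the lineage of files that were already created.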

How to install

Simply run pip install elliptio.

Tips

  • You can easily pass custom filesystem, database, tracker and id_creator classes to get_default_handler.
  • The current filesystem class is based on fsspec and thus should support all fsspec filesystem implementations (S3, Azure Blob Storage, Google Cloud Storage, etc.). See example below.
  • To create a nice GUI for your database, I can recommend Metabase. Without the enterprise features, Metabase is AGPL-licensed. You have to be careful when modifying the code or incorporating it into your application, but running the unmodified app internally in "vanilla mode" seems to be fine according to them.
  • The terraform/ directory contains example Terraform code to set up S3 and a free MongoDB on AWS. However, there is currently no MongoDB implementation of the DatabaseInterface.

# Example for passing custom FileSystemInterfaces like S3
from elliptio.adapters import db, fs
from elliptio import get_default_handler

h = get_default_handler(
    fs=fs.FsspecFilesystem(prefix="some/prefix/", protocol="s3", storage_options={}),
    db=db.SqlDatabase("db.sqlite"),
)

TODOs
