About

ElliptIO is a small Python library for storing and accessing files in data lakes in a data science context. It stores files, together with automatically generated metadata, on any file system and inserts that metadata into a database. A lot of inspiration is drawn from Weights & Biases.

It is named after the Elliptio mussel genus which lives in freshwater lakes.

Problems and solution approach

Particularly in data science, you often find data lakes where...

Problem                        | Solution approach
data cannot be reproduced      | Automatically log required information.
data lineage is unknown        | Automatically track lineage between files.
data is accidentally modified  | Lock files using S3 lock.
data has no metadata           | Users can specify custom metadata when saving files.
directory structure is chaotic | Simply save files by date and user. A good metadata search makes structure much less important.
data is duplicated             | Automatically replace duplicated files with references (not yet implemented).
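
As a rough illustration of the "log required information" and "save files by date and user" rows, the sketch below builds a storage path from a date and a user name and collects a few metadata fields that need no user input. The helper names build_path and collect_metadata, the path layout and the chosen fields are hypothetical; they are not ElliptIO's actual API or schema.

```python
import hashlib
from datetime import datetime, timezone


def build_path(root: str, user: str, filename: str, now: datetime) -> str:
    """Return a path of the form <root>/<YYYY>/<MM>/<DD>/<user>/<filename>."""
    return f"{root.rstrip('/')}/{now:%Y/%m/%d}/{user}/{filename}"


def collect_metadata(content: bytes, user: str, now: datetime) -> dict:
    """Collect metadata that can be derived without asking the user."""
    return {
        "user": user,
        "created_at": now.isoformat(),
        "size_bytes": len(content),
        # A content hash makes it possible to detect identical files later.
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Fixed timestamp so the example is reproducible (assumed values).
now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
path = build_path("/tmp/my_data_lake", "alice", "train.txt", now)
meta = collect_metadata(b"a,b\n1,2\n", "alice", now)
print(path)
print(meta)
```

Because the path carries only date and user, everything else a reader might search for (ticket, project, content hash) has to live in the metadata, which is why the table stresses a good metadata search.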

Existing solutions

I find Weights & Biases a great app and have drawn a lot of inspiration from it. However, it can be rather expensive and covers much more than data storage, so it can easily be overkill.

Object stores such as S3 or Ceph already provide the option to store metadata. However, this metadata does not cover everything required for reproducibility. Also, querying object metadata is not as efficient as querying a database.
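
To make the database argument concrete, here is a minimal sketch using Python's built-in sqlite3 module. It stores one metadata row per file and filters by column, roughly in the spirit of a dictionary-based query. The table layout, the column names and the query helper are assumptions made for illustration and do not reflect ElliptIO's real schema or implementation.

```python
import json
import sqlite3

# In-memory database standing in for the metadata store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    "file_id TEXT PRIMARY KEY, ticket TEXT, project TEXT, config TEXT)"
)
# An index lets the database answer ticket lookups without scanning all rows.
conn.execute("CREATE INDEX idx_files_ticket ON files (ticket)")

rows = [
    ("f1", "abc-123", "my_project", json.dumps({"example": "value"})),
    ("f2", "abc-123", "my_project", json.dumps({"example": "other"})),
    ("f3", "xyz-999", "other_project", json.dumps({})),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)


def query(filters: dict) -> list:
    """Return the file_ids of rows whose columns equal all filter values."""
    allowed = {"file_id", "ticket", "project"}
    unknown = set(filters) - allowed
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # Column names come from the allow-list; values are bound as parameters.
    where = " AND ".join(f"{col} = ?" for col in filters) or "1"
    sql = f"SELECT file_id FROM files WHERE {where} ORDER BY file_id"
    return [row[0] for row in conn.execute(sql, tuple(filters.values()))]


matches = query({"ticket": "abc-123"})
print(matches)
```

With an object store alone, the same lookup would typically mean listing objects and reading each object's metadata one by one.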

How to use

import json
import pandas as pd
from elliptio import get_handler, ManualMetadata

# setup manual metadata (optional) and handler
metadata = ManualMetadata(
    ticket="abc-123",
    project="my_project",
    config=json.dumps({"example": "value"}),
    description="lorem ipsum",
)
h = get_handler(dirpath="/tmp/my_data_lake", manual_metadata=metadata)

# save file directly to remote
df = pd.DataFrame({"a": [1], "b": [2]})
with h.create("train.txt") as f:
    df.to_csv(f.remote_url)

# load file. Its file_id will be added to every new file in this session.
train_file = h.load(f.file_id)

# upload an existing local file as a new file
# model.train(train_file)
with h.create("model.pickle") as model:
    model.upload("/tmp/my_data_lake/best_model.pickle")
assert model.based_on == [train_file.file_id]

# querying the database
df = h.query({"ticket": "abc-123"})
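
The lineage behaviour in the example (a loaded file's file_id ends up in based_on of every file created afterwards) can be pictured with the following toy model. It is a simplified sketch, not ElliptIO's implementation: the Session and FileRecord classes and their in-memory storage are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical metadata record for one stored file."""
    name: str
    file_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    based_on: list = field(default_factory=list)


class Session:
    """Hypothetical session that remembers which files were loaded."""

    def __init__(self):
        self._store = {}       # file_id -> FileRecord
        self._loaded_ids = []  # ids of files loaded so far in this session

    def create(self, name: str) -> FileRecord:
        # A new file is recorded as based on everything loaded before it.
        record = FileRecord(name=name, based_on=list(self._loaded_ids))
        self._store[record.file_id] = record
        return record

    def load(self, file_id: str) -> FileRecord:
        record = self._store[file_id]
        if file_id not in self._loaded_ids:
            self._loaded_ids.append(file_id)
        return record


session = Session()
train = session.create("train.txt")
train_file = session.load(train.file_id)
model = session.create("model.pickle")
print(model.based_on == [train_file.file_id])
```

In this model, lineage is a side effect of loading, so users do not have to declare inputs by hand.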

How to install

Simply run pip install elliptio.

Tips

  • You can easily pass custom filesystem, database, tracker and id_creator classes to get_handler
  • The current filesystem class is based on fsspec and thus should support all of its filesystem implementations (S3, Azure Blob Storage, Google Cloud Storage, etc.). See example below.
  • To create a nice GUI for your database, I can recommend Metabase. Metabase, without the enterprise features, is AGPL-licensed. You have to be careful when modifying the code or incorporating it into your application, but running the unmodified app internally in "vanilla mode" seems to be fine according to them.
  • The terraform/ directory contains example Terraform code to set up S3 and a free MongoDB on AWS. However, there is currently no MongoDB implementation of the DatabaseInterface.
# Example for passing custom FileSystemInterfaces like S3
from elliptio.adapters import fs, db
from elliptio import get_handler

h = get_handler(
    fs=fs.FsspecFilesystem(prefix="some/prefix/", protocol="s3", storage_options={}),
    db=db.SqlDatabase("db.sqlite"),
)

TODOs
