articat: data artifact catalog

articat

Minimal metadata catalog to store and retrieve metadata about data artifacts.

Getting started

At a high level, articat is simply a key-value store. The value is the Artifact metadata. The key, a.k.a. the "Artifact Spec", consists of:

  • globally unique id
  • optional timestamp: partition
  • optional arbitrary string: version
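
As a rough illustration of the key structure listed above, here is a minimal sketch that models the key as a frozen dataclass and the catalog as a plain dict. The `ArtifactSpec` class and the `catalog` dict are hypothetical stand-ins written for this example; they are not articat's own classes or storage.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ArtifactSpec:
    """Hypothetical stand-in for the catalog key; not articat's own class."""

    id: str
    partition: Optional[datetime] = None
    version: Optional[str] = None


# A plain dict plays the role of the key-value store in this sketch.
catalog = {}

key = ArtifactSpec("foo", partition=datetime(1643, 1, 4))
catalog[key] = {"description": "example metadata"}

# An equal spec built elsewhere resolves to the same entry, because frozen
# dataclasses compare and hash by field values.
same_key = ArtifactSpec("foo", partition=datetime(1643, 1, 4))
found = catalog[same_key]

# The same id with a version instead of a partition is a different key.
versioned_key = ArtifactSpec("foo", version="0.1.0")
is_known = versioned_key in catalog
```

The point of the sketch is only that id, partition and version together identify one metadata record.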

To publish a file system Artifact (FSArtifact):

from articat import FSArtifact
from pathlib import Path
from datetime import date

# Apart from being metadata containers, Artifact classes have optional
# convenience methods to help with the data publishing flow:

with FSArtifact.partitioned("foo", partition=date(1643, 1, 4)) as fsa:
    # To create a new Artifact, always use a `with` statement and
    # either the `partitioned` or the `versioned` method. Use:
    # * `partitioned(...)` for Artifacts with an explicit `datetime` partition
    # * `versioned(...)` for Artifacts with an explicit `str` version

    # Next we produce some local data. This could be the output of a
    # Spark job, an ML model etc.
    data_path = Path("/tmp/data")
    data_path.write_text("42")

    # Now let's stage that data. The temporary and final data
    # directories/buckets are configurable (see below).
    fsa.stage(data_path)

    # Additionally let's provide some description, here we could also
    # save some extra arbitrary metadata like model metrics, hyperparameters etc.
    fsa.metadata.description = "Answer to the Ultimate Question of Life, the Universe, and Everything"

To retrieve the metadata about the Artifact above:

from articat.fs_artifact import FSArtifact
from datetime import date
from pathlib import Path

# To retrieve the metadata, use the Artifact object and its `fetch` method:
fsa = FSArtifact.partitioned("foo", partition=date(1643, 1, 4)).fetch()

fsa.id # "foo"
fsa.created # <CREATION-TIMESTAMP>
fsa.partition # the partition given above (1643-01-04)
fsa.metadata.description # "Answer to the Ultimate Question of Life, the Universe, and Everything"
fsa.main_dir # Data directory, this is where the data was stored after staging
Path(fsa.joinpath("data")).read_text() # "42"
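
The publish example insists on a `with` statement and mentions separate temporary and final data locations. One way such a flow could work is sketched below with the standard library only: files are written to a staging directory and moved to the final location only if the `with` body finishes without an exception. The `publish` helper and the directory layout are assumptions made for this illustration; this is not a description of articat's internals.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def publish(final_dir: Path):
    """Hypothetical helper: stage files in a temp dir, promote on success."""
    staging = Path(tempfile.mkdtemp(prefix="staging-"))
    try:
        yield staging
        # Reached only if the `with` body raised no exception.
        final_dir.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(staging), str(final_dir))
    finally:
        # Clean up leftovers after a failure. After a successful move the
        # staging directory no longer exists, so this does nothing.
        shutil.rmtree(staging, ignore_errors=True)


root = Path(tempfile.mkdtemp())

# Successful run: the data ends up in the final directory.
with publish(root / "foo" / "1643-01-04") as staging_dir:
    (staging_dir / "data").write_text("42")
published = (root / "foo" / "1643-01-04" / "data").read_text()

# Failed run: nothing is promoted to the final directory.
try:
    with publish(root / "bar") as staging_dir:
        (staging_dir / "data").write_text("partial")
        raise RuntimeError("job failed")
except RuntimeError:
    pass
failed_run_published = (root / "bar").exists()
```

With this shape, readers of the final location never see partially written data.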

Features

  • store and retrieve metadata about your data artifacts
  • no long running services (low maintenance)
  • data publishing utils builtin
  • IO/data format agnostic
  • immutable metadata
  • development mode

Artifact flavours

Currently available Artifact flavours:

  • FSArtifact: metadata/utils for files or objects (supports: local FS, GCS, S3 and more)
  • BQArtifact: metadata/utils for BigQuery tables
  • NotebookArtifact: metadata/utils for Jupyter Notebooks

Development mode

To ease development of Artifacts, articat supports a development (dev) mode. A development Artifact is indicated by the dev parameter (preferred) or by a _dev prefix in the Artifact id. Dev mode supports:

  • overwriting Artifact metadata
  • configuring separate locations (e.g. dev_prefix for FSArtifact), potentially with different retention periods etc.
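
To make the contrast between immutable metadata and dev mode concrete, here is a small in-memory sketch. The `Catalog` class, its `save` and `get` methods, and the choice to keep dev records under a separate key are all hypothetical and exist only for this example; articat's backends may behave differently in detail.

```python
class Catalog:
    """Hypothetical in-memory catalog; not articat's implementation."""

    def __init__(self):
        self._records = {}

    def save(self, artifact_id, metadata, dev=False):
        # Assumption for this sketch: dev records live in their own key
        # space, loosely mirroring the separate dev locations listed above.
        key = (artifact_id, dev)
        if key in self._records and not dev:
            raise ValueError(f"{artifact_id!r} already exists and is immutable")
        self._records[key] = dict(metadata)

    def get(self, artifact_id, dev=False):
        return self._records[(artifact_id, dev)]


catalog = Catalog()
catalog.save("foo", {"description": "v1"})

# A second non-dev write to the same id is rejected.
error = None
try:
    catalog.save("foo", {"description": "v2"})
except ValueError as exc:
    error = str(exc)

# Dev writes can be repeated; the last one wins.
catalog.save("foo", {"description": "draft"}, dev=True)
catalog.save("foo", {"description": "draft 2"}, dev=True)
```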

Backend

  • local: mostly for testing/demo, metadata is stored locally (configurable, default: ~/.config/articat/local)
  • gcp_datastore: metadata is stored in the Google Cloud Datastore

Configuration

articat configuration can be provided through the API or through configuration files. By default, configuration is loaded from ~/.config/articat/articat.cfg and from articat.cfg in the current working directory. You can also point at a configuration file via the environment variable ARTICAT_CONFIG.

You can use local mode without a configuration file. Available options:

[main]
# local or gcp_datastore, default: local
# mode =

# local DB directory, default: ~/.config/articat/local
# local_db_dir =

[fs]
# temporary directory/prefix
# tmp_prefix =
# development data directory/prefix
# dev_prefix =
# production data directory/prefix
# prod_prefix =

[gcp]
# GCP project
# project =

[bq]
# development data BigQuery dataset
# dev_dataset =
# production data BigQuery dataset
# prod_dataset =
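
The option listing above is INI-style, so a file in this shape can be read with Python's standard `configparser`. The sketch below shows one way to collect the three locations named earlier and apply the documented defaults. The helper names `candidate_paths` and `load_config`, the ordering of the files, and the project value `demo-project` are assumptions for illustration; how articat itself loads and prioritises these files is not specified here.

```python
import configparser
import os
import tempfile
from pathlib import Path

# Defaults taken from the comments in the option listing.
DEFAULTS = {"main": {"mode": "local", "local_db_dir": "~/.config/articat/local"}}


def candidate_paths(env=os.environ, cwd=None):
    """Config file locations named above. The ordering is an assumption."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    paths = [Path.home() / ".config" / "articat" / "articat.cfg", cwd / "articat.cfg"]
    if env.get("ARTICAT_CONFIG"):
        paths.append(Path(env["ARTICAT_CONFIG"]))
    return paths


def load_config(paths):
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    # ConfigParser.read skips missing files; later files override earlier ones.
    parser.read([str(p) for p in paths])
    return parser


# Demo against a throwaway file rather than the real user configuration.
with tempfile.TemporaryDirectory() as tmp:
    cfg_file = Path(tmp) / "articat.cfg"
    cfg_file.write_text("[main]\nmode = gcp_datastore\n\n[gcp]\nproject = demo-project\n")
    config = load_config([cfg_file])
    mode = config.get("main", "mode")
    project = config.get("gcp", "project")
    local_db_dir = config.get("main", "local_db_dir")

# With no files at all, the defaults apply.
default_mode = load_config([]).get("main", "mode")
```

Calling `load_config(candidate_paths())` would read whichever of the three files exist on the current machine.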

Our/example setup

Below is a diagram of our setup. Articat is just one piece of our system and solves a specific problem. The diagram should give you an idea of where it might fit into your environment:

Our setup diagram
